
The growing entanglement of AI in our everyday lives has reached a new peak with the recent arrival of increasingly capable chatbots, such as ChatGPT. These helpers seem to offer Iron Man-like assistance for all sorts of tasks, and the applications appear endless (for a list of creative use cases, look here).

In this blog post, we focus on ChatGPT's potential in software engineering. We compare its approach to a series of increasingly challenging coding problems with that of another advanced AI tool, GitHub Copilot, and a team of professional human software engineers.

Many people see the breakthroughs in machine intelligence as the first step toward the obsolescence of human-generated code, while others claim that AI still lacks the creative problem-solving skills of a real engineer. While we do not claim to settle this debate in this article, we hope to shed some light on similarities, differences, and synergies among the approaches.

Let's first meet the contestants

The three competitors are GitHub Copilot, an AI code-completion tool from GitHub, the well-known code-hosting platform; ChatGPT, the most recent addition to OpenAI's product portfolio; and a group of human software engineers from our LeanIX staff.

1. ChatGPT

ChatGPT is an advanced chatbot developed by OpenAI, a research lab focused on developing artificial intelligence that is safe and beneficial for humanity. OpenAI was founded in 2015 by a group of entrepreneurs and researchers, including Elon Musk and Sam Altman, to advance and promote AI research.
The bot aims to answer questions, take context and prompts into account, and remember earlier answers to sustain an entire dialogue.

The training process was enhanced with active human feedback, a technique known as reinforcement learning from human feedback (RLHF), to improve efficiency, quality, and convergence rate. The training data set includes a large corpus of text from various sources, making the model knowledgeable across diverse topics.

The chatbot sparked a lot of interest due to its unparalleled ability to engage in a dialogue, while also kindling controversy. Stack Overflow, the leading forum for technical questions (often used by programmers to ask peers for help), has, for example, banned ChatGPT answers because they tend to sound correct while failing in production. Andrew Ng, one of the most influential figures in machine learning, highlights another anecdotal example of the bot confidently proclaiming an apparently wrong answer here.

[Screenshot of Andrew Ng's tweet]

2. GitHub Copilot

GitHub Copilot is a tool that uses artificial intelligence (AI) to help developers write code faster and more efficiently. Its key feature is AI-powered code completion, which suggests code snippets and completions as developers type, based on the context of the code and the developer's previous coding patterns and preferences.

GitHub Copilot was introduced in 2021 and has gained popularity among developers. The AI code completion is a significant advantage: it saves developers time and effort by automatically suggesting and completing code snippets, reducing the need for manual typing and making it easier to write complex code.

However, GitHub Copilot has sparked controversy and raised concerns among some developers. One concern is that it may encourage a dependent and less creative way of working: developers may rely too much on AI code completion and lose the ability to think through and write code independently.

Another concern is that the AI code completion may not always suggest the best or most appropriate code snippets and may even introduce errors or inconsistencies in the code. This quality loss could make it harder for teams to maintain and manage their code and could lead to more bugs and issues in the long run.

3. The human software engineers

The two software engineers are senior developers with several years of experience across the frontend and backend. They are part of the LeanIX staff and code daily for a living.

The challenges & solutions

The challenge consists of two rounds: first, a stereotypical coding challenge for which all three contestants should be able to produce a solution; second, a more involved task for which we focus on ChatGPT and the human engineers only. Why do we exclude GitHub Copilot from this part of the challenge? Because it cannot easily create code from prompts longer than one or two lines. It mainly aims to support the developer during the coding process rather than to write out solutions to complex problems by itself.

First challenge setup

For the first challenge, the contestants implement the BubbleSort algorithm in JavaScript. This method is often the first sorting algorithm taught at university due to its simple logic. The prompt provided to GitHub Copilot was:

/*
Implement a bubble sort algorithm in Javascript
@param {Array} arr - An array of numbers
*/

The multi-line prompt provided to the more context-aware contestants was:

Bubble sort is an algorithm that sorts a list of items by repeatedly comparing adjacent elements and swapping them in place if they are in the wrong order. This continues until the list is sorted. Equal items are not swapped.
Example: bubbleSort([5, 3, 2, 4, 1]) -> [1, 2, 3, 4, 5], bubbleSort([]) -> [], bubbleSort([1, 1, 1]) -> [1, 1, 1]

First challenge solutions

GitHub Copilot

GitHub Copilot only took part in the BubbleSort problem since it mainly assists with code-completion tasks. It could not handle the long prompt of the second challenge or its follow-up questions.

It solved BubbleSort without problems, which is not surprising: the AI's training data is based on all public repositories on GitHub, where this problem has been solved countless times, so the model should have all the information needed to produce an adequate approach.

/*
Implement a bubble sort algorithm in Javascript
@param {Array} arr - An array of numbers
*/

function bubbleSort(arr) {
  // Loop through the array
  for (let i = 0; i < arr.length; i++) {
    // Loop through the array again
    for (let j = 0; j < arr.length; j++) {
      // Check if the current element is greater than the next element
      if (arr[j] > arr[j + 1]) {
        // Swap the elements
        let temp = arr[j];
        arr[j] = arr[j + 1];
        arr[j + 1] = temp;
      }
    }
  }
  // Return the sorted array
  return arr;
}
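One quirk worth noting in Copilot's output: the inner loop runs up to arr.length, so the final comparison reads arr[arr.length], which is undefined. Because arr[j] > undefined evaluates to false in JavaScript, the code still sorts correctly, but it does redundant work. A tightened version (our own hand-written sketch, not Copilot output) stops one element earlier and skips the already-sorted tail:

```javascript
// Tightened bubble sort: after pass i, the last i elements are in place
function bubbleSortTight(arr) {
  for (let i = 0; i < arr.length; i++) {
    // Stop before the sorted tail and never read past the end of the array
    for (let j = 0; j < arr.length - 1 - i; j++) {
      if (arr[j] > arr[j + 1]) {
        [arr[j], arr[j + 1]] = [arr[j + 1], arr[j]];
      }
    }
  }
  return arr;
}

console.log(bubbleSortTight([5, 3, 2, 4, 1])); // [1, 2, 3, 4, 5]
```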

ChatGPT

ChatGPT solved the problem without issues and added explanatory comments on efficiency and its solution approach. It seems well-suited for this kind of problem:

function bubbleSort(arr) {
  // If the input array is empty or has only one element, return it as is
  if (arr.length <= 1) {
    return arr;
  }

  // Repeat the following steps until the array is sorted:
  while (true) {
    // Initialize a flag to track if any elements were swapped during this iteration
    let swapped = false;

    // Iterate over the array, starting from the second element (index 1)
    for (let i = 1; i < arr.length; i++) {
      // If the current element is smaller than the previous element, swap them
      if (arr[i] < arr[i - 1]) {
        [arr[i], arr[i - 1]] = [arr[i - 1], arr[i]];
        // Set the flag to indicate that a swap occurred
        swapped = true;
      }
    }

    // If no elements were swapped during this iteration, the array is sorted
    // so we can break out of the loop
    if (!swapped) {
      break;
    }
  }

  // Return the sorted array
  return arr;
}

Human Engineers

The two engineers, solving the challenges as a team, took about ten minutes to write the BubbleSort code from scratch. They worked iteratively (drawing feedback from incomplete solutions on the first try) and fine-tuned quickly. Follow-up questions about time complexity showed a strong understanding of the problem.

function bubbleSort(arr) {
  for (let j = 0; j < arr.length; j++) {
    for (let i = 0; i < arr.length - 1; i++) {
      let tmp;
      if (arr[i] > arr[i + 1]) {
        tmp = arr[i];
        arr[i] = arr[i + 1];
        arr[i + 1] = tmp;
      }
    }
  }
  return arr;
}
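The follow-up answer on time complexity can be made concrete: the two nested loops perform on the order of n^2 comparisons regardless of the input order. A small instrumented sketch (our own hypothetical countComparisons helper, not part of the engineers' submission) makes the quadratic growth visible:

```javascript
// Count how many comparisons this unoptimized bubble sort makes on n random numbers
function countComparisons(n) {
  const arr = Array.from({ length: n }, () => Math.random());
  let comparisons = 0;
  for (let j = 0; j < arr.length; j++) {
    for (let i = 0; i < arr.length - 1; i++) {
      comparisons++;
      if (arr[i] > arr[i + 1]) {
        [arr[i], arr[i + 1]] = [arr[i + 1], arr[i]];
      }
    }
  }
  return comparisons; // always n * (n - 1) for this variant
}

console.log(countComparisons(10));  // 90
console.log(countComparisons(100)); // 9900: 10x the input, roughly 100x the work
```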

Second challenge setup

The second challenge is a so-called kata (Codewars' martial-arts-themed name for a coding challenge) from Codewars. Its difficulty is rated 6 kyu on a scale where 8 kyu is the easiest and 1 kyu the hardest. As mentioned before, only ChatGPT and the human engineers go toe-to-toe on this problem due to the more detailed instructions.

Its prompt reads as:

Implement a function called 'Eureka' that returns a sorted list (ascending order) of all numbers that fulfill the property described below in a range of numbers. It takes in the 'start' and 'end' parameters as input and returns a list of numbers.

The number 89 is the first integer with more than one digit that fulfills the property partially introduced in the title of this kata. What's the use of saying "Eureka"? Because this sum gives the same number. In effect: 89 = 8^1 + 9^2. The next number having this property is 135. See this property again: 135 = 1^1 + 3^2 + 5^3.

We need a function to collect these numbers, that may receive two integers a, b that define the range [a, b] (inclusive) and outputs a list of the sorted numbers in the range that fulfill the property described above.

Let's see some cases (input -> output):
1, 10 -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
1, 100 -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 89]

If there are no numbers of this kind in the range [a, b], the function should output an empty list. 90, 100 -> []
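To make the property concrete before looking at the solutions, the digit-power sum can be computed in a few lines (our own illustrative helper, named digitPowerSum here):

```javascript
// Sum each digit of n raised to the power of its 1-based position
function digitPowerSum(n) {
  return n
    .toString()
    .split("")
    .reduce((sum, digit, index) => sum + Number(digit) ** (index + 1), 0);
}

console.log(digitPowerSum(89));  // 8^1 + 9^2 = 89, so 89 qualifies
console.log(digitPowerSum(135)); // 1^1 + 3^2 + 5^3 = 135, so 135 qualifies
console.log(digitPowerSum(90));  // 9^1 + 0^2 = 9, so 90 does not
```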

Second challenge solutions

ChatGPT

The Chatbot solved the more involved Eureka function in seconds. Its solution included an iterative and approachable coding style with explanatory comments:

function eureka(start, end) {
  // Create an empty array to store the numbers that fulfill the property
  const numbers = [];
  // Loop through the range of numbers
  for (let i = start; i <= end; i++) {
    // Convert the number to a string and split it into individual digits
    const digits = i.toString().split("");
    // Calculate the sum of the digits raised to their own power
    let sum = 0;
    for (let j = 0; j < digits.length; j++) {
      sum += Math.pow(parseInt(digits[j]), j + 1);
    }
    // If the sum is equal to the original number, add it to the array
    if (sum === i) {
      numbers.push(i);
    }
  }
  // Return the array of numbers
  return numbers;
}

The code leans toward readability rather than conciseness; the repeated for-loops resemble the style of more junior-level code.

Interestingly, ChatGPT provided controversial, borderline wrong answers to the follow-up questions. While it computed the time complexity correctly as O(n^2), its proposal for improving the algorithm's relatively slow runtime was:

There are several ways to improve the big O efficiency of the Eureka function. One way is to use a more efficient algorithm for calculating the sum of the digits raised to their own power. For example, instead of using a loop to calculate the sum, we could use a mathematical formula that directly calculates the sum based on the number of digits. This would reduce the time complexity of the function from O(n^2) to O(n), which is a significant improvement. Another way to improve the efficiency of the function is to use a faster sorting algorithm to sort the array of numbers that fulfill the property. For example, instead of using the built-in Array.sort() method, which has a time complexity of O(n log n), we could use a faster sorting algorithm such as quicksort, which has a time complexity of O(n log n) on average and O(n^2) in the worst case. This would improve the overall time complexity of the function from O(n^2) to O(n log n), which is still an improvement over the original function. Additionally, we could optimize the function by using a more efficient data structure to store the numbers that fulfill the property. For example, instead of using an array, we could use a binary search tree or a hash table, which would allow us to insert and retrieve numbers more quickly. This would reduce the time complexity of the function from O(n^2) to O(n log n) or O(n), depending on the data structure used. Overall, there are several ways to improve the big O efficiency of the Eureka function, and the specific improvements will depend on the requirements and constraints of the problem.

There are fundamental problems with this answer. First, the solution always returns an ordered array without explicitly sorting, because it iterates through the range in ascending order. Sorting as a last step would therefore add unnecessary complexity, probably triggered by the word "sorted" in the prompt, and all comments about improving the sorting approach are moot. Second, the recommendations about data structures for storing and retrieving the numbers would only be meaningful if the data were unsorted, so they do not fit an optimized algorithm either. Yet the model presented these approaches as absolute truths, with perfect certainty.

Human engineers

Writing the code for the Eureka function took a bit longer but was also finished in about twenty minutes.

function eureka(start, end) {
  const numbers = new Array(end + 1 - start)
    .fill(0) // fill the holes so that map() actually visits every index
    .map((_value, i) => i + start)
    .filter((value) => {
      const result = value
        .toString()
        .split("")
        .reduce((prev, digit, index) => {
          prev = prev + Math.pow(digit, index + 1);
          return prev;
        }, 0);
      return value === result;
    });
  return numbers;
}

The solution takes a more functional approach using the reduce Array method. Once again, the engineers solved the problem iteratively, fine-tuning mistakes throughout the coding process. Initial pitfalls included trying to build the exponent of value with the ^ operator (which in JavaScript is bitwise XOR, not exponentiation) and starting the exponent at index 0.
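The ^ pitfall is easy to demonstrate: in JavaScript, exponentiation is written as ** (or Math.pow), while ^ performs a bitwise XOR on the integer representations of its operands:

```javascript
console.log(2 ** 3);         // 8 (exponentiation)
console.log(Math.pow(2, 3)); // 8 (equivalent to the line above)
console.log(2 ^ 3);          // 1 (bitwise XOR: 0b10 ^ 0b11 = 0b01)
```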

The engineers could answer follow-up questions about time complexity and potential improvements to the algorithmic efficiency in depth. Their answers included ideas such as assigning variables to reduce space complexity inside the loop and deriving an insightful mathematical relationship between a number and its digit exponents.

In conclusion

The challenge made a couple of points abundantly clear: code-completion assistance can be a helpful tool in the belt of an educated developer, and GitHub Copilot can speed up coding for well-known use cases. If a developer wants to determine the Euclidean distance between two points, Copilot will spit out the function within seconds. The tool is less useful for large production-scale codebases with unique coding conventions.
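For illustration, the kind of boilerplate Copilot completes instantly looks like this (a hand-written sketch, not actual Copilot output):

```javascript
// Euclidean distance between two 2D points given as {x, y} objects
function euclideanDistance(p1, p2) {
  return Math.hypot(p2.x - p1.x, p2.y - p1.y);
}

console.log(euclideanDistance({ x: 0, y: 0 }, { x: 3, y: 4 })); // 5
```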

While searching for a suitable problem, we investigated another classic programming problem, the Josephus problem, as an alternative, for which GitHub Copilot provided a nonsensical response:

function josephus(per, k) {
  // Create an array of numbers from 1 to per
  let arr = Array.from(Array(per), (_, i) => i + 1);
  // Create a counter variable
  let count = 0;
  // Loop through the array
  for (let i = 0; i < arr.length; i++) {
    // Increment the counter
    count++;
    // Check if the counter is equal to k
    if (count === k) {
      // Remove the current element from the array
      arr.splice(i, 1);
      // Reset the counter
      count = 0;
      // Decrement the index
      i--;
    }
  }
  // Return the last element in the array
  return arr[0];
}
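Copilot's version walks the array once from left to right instead of treating it as a circle, so it stops eliminating people after a single pass. A correct version (our own sketch, using the standard recurrence J(1) = 0, J(i) = (J(i-1) + k) mod i) would be:

```javascript
// 1-indexed position of the survivor when every k-th of n people
// standing in a circle is eliminated
function josephusSurvivor(n, k) {
  let survivor = 0; // J(1) = 0 in 0-indexed terms
  for (let i = 2; i <= n; i++) {
    survivor = (survivor + k) % i;
  }
  return survivor + 1; // convert back to a 1-indexed position
}

console.log(josephusSurvivor(7, 3));  // 4
console.log(josephusSurvivor(41, 3)); // 31 (the classic Josephus setup)
```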

While the usefulness of GitHub's AI becomes apparent quickly, comparing human-generated code with ChatGPT is more intricate.

ChatGPT presents its solutions with equal confidence regardless of how certain the answer actually is. Interestingly, it solved the BubbleSort algorithm differently than the engineers or the GitHub AI. This represents a unique chance for human engineers to learn from the AI and investigate problems from a different angle, potentially sparking new insights. Similar behavior has been observed before in AIs trained with reinforcement learning: most notably, OpenAI's bot beating the world-champion team in Dota 2, a complex multiplayer video game (reference here), and DeepMind's AlphaGo dominating the world champion in Go, often considered the most complex board game in the world (full documentary here). In both cases, the AI's approach was novel and exotic because it trained only partially on historical data and mainly by playing against itself.

Nevertheless, the incorrect follow-up answers, especially given their elaborate presentation, showcase a high risk of these new AI tools: blindly relying on answers without double-checking their correctness can lead to fundamentally wrong results dressed up nicely. Once again, the AI's benefit will strongly depend on how informed the audience is. Sam Altman, CEO of OpenAI, the creators of ChatGPT, addressed this critically in a Twitter post:

[Screenshot of Sam Altman's tweet]

While the human engineers took longer to develop a solution, the deep understanding they gained in the process makes them invaluable. They spotted ChatGPT's incorrect answers within seconds after having spent a longer time solving the challenge themselves, a nuance that a less technical audience might have missed.

As usual, the right time to reach for AI-assisted tools depends on the situation. Speeding up prototyping, asking narrowly defined questions, or reviewing code snippets to find bugs are already valuable use cases that free up engineers' hands to spend their time on what they do best: figuring out the hard stuff.

Published by Marc Luettecke, Nikolas Rist, and Tom Gehder.