The promise & peril of Code Interpreter
With great power comes great responsibility: data analysis is nuanced and complex. What happens when a tool makes it too easy?
OpenAI’s Code Interpreter plugin for GPT-4, widely available to ChatGPT+ subscribers as of this week, is probably the most important and powerful plugin for GPT-4. It allows anyone with a subscription to feed GPT-4 a natural language data query, and have GPT-4 write the relevant Python code and execute the query. Further, it can provide the user with a descriptive or visual analysis of the findings.
In practice this means that non-technical people can feed GPT-4 some data, ask a natural language question about it, and sit back while GPT-4 writes a script to manipulate the data per those instructions, executes the script, and returns an analysis.
Because Code Interpreter is such a powerful tool, and because it can be used by anyone with a subscription to ChatGPT+, it’s worth thinking through some unintended consequences. In our pre-GPT world, a non-technical executive who needed some analysis done on data would email her data science team and ask for the relevant analysis. The data science team would perform the requested analysis, and present our executive with a summary. If the data science team were good, the analysis performed would account for limitations inherent in the data. All of this complexity and nuance would be abstracted away from the executive, whose only interest in the data would be to present some analytical conclusion about it to other executives or external parties.
But because we now have Code Interpreter, much of the data scientist’s function as a kind of blocker between executive and data disappears. The executive now has the ability to interrogate data directly. And, if the executive doesn’t care about, or understand, the limitations of the underlying data, the executive may draw erroneous conclusions from the output ChatGPT has provided. [1]
Let’s make this more concrete. A few months ago, a hapless lawyer used ChatGPT to perform what he thought were AI-powered Google-style searches for precedential cases that supported his client’s claims. Of course, as you hopefully understand, ChatGPT’s “search” functionality isn’t really search as we have come to understand that term. ChatGPT hallucinated the cases it returned to the lawyer, and the lawyer did not understand this. The lawyer submitted these non-existent cases as part of a filing made to the court adjudicating his client’s matter, and, well, hilarity ensued, as they say.
And the same thing will happen with Code Interpreter. Even though, as we will see later in this post, Code Interpreter provides great guidance to users about nuance and limitations in a given data analysis, we can be assured that hapless and busy executives will only pay attention to the output, and not the details about the output’s limitations.
Understanding the limitations of any analysis done on a set of data is of course important, but that’s nuance, and nuance is often lost, especially by those who don’t understand the technology they’re using.
So, given all of these considerations, let’s take a look at how Code Interpreter works. We’ll assume that we’re a Hapless Non-Quantitative Marketing Executive (HNQME) looking to analyze some customer segmentation data [2]. HNQME finds a dataset relevant to her interests, fires up ChatGPT, and uploads the file to Code Interpreter.
When you enable the Code Interpreter plugin, a + icon appears to the left of ‘Send a message’. Click it to upload a file to ChatGPT. It helps to include a prompt with the upload, such as: “Review this file, and provide a narrative description of its contents. Then propose analyses to be done for a customer segmentation.”
What ChatGPT returns is truly impressive. We quickly learn a lot about the data [3]:
AI OUTPUT
The provided file appears to be a dataset with customer data for segmentation purposes. The dataset contains the following columns:
ID: This seems to be a unique identifier for each customer.
Gender: The gender of the customer. The options appear to be 'Male' or 'Female'.
Ever_Married: This column indicates whether the customer has ever been married. The options appear to be 'Yes' or 'No'.
Age: The age of the customer.
Graduated: This column indicates whether the customer has graduated from an educational institution. The options appear to be 'Yes' or 'No'.
Profession: The profession of the customer. This could include professions like 'Healthcare', 'Engineer', 'Lawyer', 'Entertainment', and possibly others.
Work_Experience: The number of years of work experience that the customer has.
Spending_Score: This could be a measure of the customer's spending habits or financial capacity. The options seem to be 'Low', 'Average', or 'High'.
Family_Size: The number of family members that the customer has.
Var_1: This seems to be a categorical variable with categories like 'Cat_4', 'Cat_6', and possibly others. The meaning of this column is not immediately clear.
Segmentation: This is the customer segmentation label. This appears to be the target variable for a classification task.
Possible analyses for customer segmentation could include:
Descriptive statistics and data visualization: This can provide insights into the distribution of the data and potential relationships between variables. For example, it would be useful to examine the distribution of customers across the different segmentation labels, as well as how variables like age, work experience, and spending score differ between segments.
Correlation analysis: This can provide insights into how different variables are related to each other. For instance, we can check if there's a relationship between age and spending score, or between profession and segmentation.
Missing data analysis: It's important to check if there are missing values in the data and consider appropriate strategies for handling them.
Predictive modeling: If the aim is to be able to predict the segmentation of new customers, machine learning algorithms can be used to build a model based on the existing data. This could involve algorithms like decision trees, random forests, or gradient boosting for this classification task.
Feature importance analysis: Once a predictive model is built, feature importance analysis can show which variables were most influential in predicting the customer segmentation. This can provide insights into what factors are most important in determining the segmentation of a customer.
Cluster analysis: If the aim is to discover new segments in the data, unsupervised learning techniques like K-means or hierarchical clustering can be used. These methods could reveal patterns in the data that weren't captured by the existing segmentation.
HNQME is presented with a lot of information about the data, how it is structured, and possible analytical techniques she can request.
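For readers curious what a first pass like this looks like in code, here is a minimal sketch. It is not the script ChatGPT wrote. The column names come from the output above, but the sample rows are invented for illustration, and the helper names (`load_rows`, `summarize_columns`, `segment_counts`) are my own. It uses only Python's standard library; Code Interpreter itself would likely reach for a data library instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: the column names match the output above,
# but every value here is invented for illustration.
SAMPLE_CSV = """ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
1,Male,No,22,No,Healthcare,1,Low,4,Cat_4,D
2,Female,Yes,38,Yes,Engineer,,Average,3,Cat_4,A
3,Female,Yes,67,Yes,Lawyer,0,High,1,Cat_6,B
4,Male,Yes,40,Yes,Entertainment,,High,6,Cat_6,A
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per customer."""
    return list(csv.DictReader(io.StringIO(text)))


def summarize_columns(rows):
    """For each column, count missing (empty) cells and distinct values."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


def segment_counts(rows):
    """Count how many customers carry each Segmentation label."""
    return Counter(row["Segmentation"] for row in rows)


rows = load_rows(SAMPLE_CSV)
summary = summarize_columns(rows)
for column, stats in summary.items():
    print(f"{column}: {stats['missing']} missing, {stats['distinct']} distinct")
print(dict(segment_counts(rows)))
```

Even this toy version surfaces the thing a busy reader is most likely to skip: some columns have gaps, and any later analysis has to decide what to do about them.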
Earlier I mentioned that understanding limitations of data is an important part of any analysis, and here’s an example of how ChatGPT provides these details. This dataset contains rows with missing data. Missing data obviously can skew calculations made on the dataset.
AI OUTPUT
Before any further analysis or model training, we should decide how to handle these missing values. Options include deleting rows with missing values, imputing missing values with a statistic like the mean or median, or using more complex imputation methods. The choice depends on the specific context and the proportion of missing values. For example, if a column has a very high proportion of missing values, it might not be very useful for analysis or prediction and could potentially be dropped. On the other hand, if only a small proportion of values are missing, imputation might be a good strategy.
We should also consider the distribution of each variable, as this might influence the choice of models or preprocessing steps. For example, if a variable has a skewed distribution, it might be beneficial to apply a transformation to make it more symmetric. If a categorical variable has a large number of categories, it might be beneficial to combine some of the categories or to use a method that can handle high cardinality.
Data analysis is complex, and the decisions about what to do with imperfect data have downstream effects which can affect analytical quality. Paying attention to these details is of course par for the course for any data scientist, but it’s likely beyond the ken of your standard non-quant marketing executive. Hopefully our hapless marketing executive heeds ChatGPT’s warnings.
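To see why the choice matters, here is a small sketch comparing two of the options ChatGPT listed: dropping missing values versus filling them with the median. The numbers are hypothetical stand-ins for a column like Work_Experience, not values from the actual dataset, and the function names are mine.

```python
from statistics import mean, median

# Hypothetical Work_Experience values (in years); None marks a missing cell.
work_experience = [1, None, 0, None, 9, 2, 0, 14]


def drop_missing(values):
    """Deletion: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Replace each missing value with the median of the observed values."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


dropped = drop_missing(work_experience)
imputed = impute_median(work_experience)

print(f"dropped: n={len(dropped)}, mean={mean(dropped):.3f}")
print(f"imputed: n={len(imputed)}, mean={mean(imputed):.3f}")
```

On these made-up numbers the two strategies disagree about the average: the observed values are skewed by a couple of long-tenured customers, so filling gaps with the median pulls the mean down. Neither answer is "right"; the point is that the choice changes the result, which is exactly the kind of detail an executive reading only the final chart will never see.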
I told ChatGPT to remove the rows with missing values. It informs me that 1403 rows were removed, which is approximately 17% of the original dataset. While this is a significant portion of data, it also means that the remaining rows have all their data, and ensures that we won’t be working with incomplete data for subsequent analyses.
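The step I asked for can be sketched roughly as follows. The real dataset isn't reproduced here, so the rows are hypothetical (only a few of the column names are used, and all values are invented), and the helpers are my own rather than what ChatGPT generated. Besides reporting how many rows were removed, the sketch also compares the share of each segment before and after, as one simple check on whether dropping rows changed the makeup of the data.

```python
from collections import Counter

# Hypothetical rows using a few of the dataset's column names; values are invented.
# An empty string marks a missing cell.
rows = [
    {"ID": "1", "Age": "22", "Work_Experience": "1", "Segmentation": "D"},
    {"ID": "2", "Age": "38", "Work_Experience": "", "Segmentation": "A"},
    {"ID": "3", "Age": "67", "Work_Experience": "0", "Segmentation": "B"},
    {"ID": "4", "Age": "40", "Work_Experience": "3", "Segmentation": "A"},
    {"ID": "5", "Age": "56", "Work_Experience": "5", "Segmentation": "C"},
    {"ID": "6", "Age": "32", "Work_Experience": "1", "Segmentation": "C"},
]


def is_complete(row):
    """A row is complete when none of its cells are empty."""
    return all(value != "" for value in row.values())


def drop_incomplete_rows(rows):
    """Return the complete rows and the number of rows removed."""
    kept = [row for row in rows if is_complete(row)]
    return kept, len(rows) - len(kept)


def segment_shares(rows):
    """Fraction of rows belonging to each Segmentation label."""
    counts = Counter(row["Segmentation"] for row in rows)
    return {segment: count / len(rows) for segment, count in counts.items()}


kept, removed = drop_incomplete_rows(rows)
percent_removed = 100 * removed / len(rows)

print(f"Removed {removed} of {len(rows)} rows ({percent_removed:.1f}%)")
print("Segment shares before:", segment_shares(rows))
print("Segment shares after: ", segment_shares(kept))
```

In this toy example the one dropped row comes from segment A, so that segment's share shrinks after cleaning. If missing values cluster in particular kinds of customers, removing them quietly changes who the analysis is about.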
This post is already getting long, so I am going to conclude it with a finishing thought: ChatGPT’s introduction of Code Interpreter to the world means that data analysis and data science are now widely accessible through a natural language interface. This is an extraordinarily powerful tool which, when users consider the nuances inherent in data analysis, will unlock untold opportunities for executives and other non-technical employees to essentially operate as their own independent data science team. But it is also a tool fraught with peril: those who ignore the nuance and complexity of data science and analysis may well find themselves proceeding down a path whose characteristics prove illusory.
[1] By the way, I would not conclude that this spells doom for data scientists. Rather, I think that Code Interpreter will free up data scientists from responding to executives’ ad hoc analytical requests. I see this as akin to the effect that the introduction of Excel had on accountants: it freed up their time to do accounting work, and not respond to executives’ requests for accounting analyses. We can safely assume that many executives in the late ’80s drew erroneous conclusions based on the analytical power that Excel provided them.
[3] Due to Substack’s formatting limitations I am putting ChatGPT’s output between section dividers. Hopefully this formatting convention will make the post easier to read.
Like with ChatGPT, there seems to be a learning curve at first. Both tools are incredibly powerful, but you have to understand that they're tools first, not some kind of "friend" you're working with (not yet, anyway!). I've been pretty disappointed so far, but I'm also just experimenting right now, and I know it'll improve quickly.