Using ChatGPT's Code Interpreter as a tutor
Does this augur the end of teachers? Not quite, but education will (probably?) change.
Today we’re going to explore how to use ChatGPT’s Code Interpreter plugin as a kind of tutor. To help us set the appropriate context for this exercise, consider the conventional human tutor: you pose a question to the tutor about how to do a certain task, she provides you with the answer, and you rephrase the concept in your own words. Either you have understood the concept or more explanation is warranted. This iterative process is a great way to climb learning curves quickly.
How can we apply this method to ChatGPT’s Code Interpreter? Consider the following prompt: Pick a complicated thing to do in Python, write a script for it, and then provide me with an explanation of how the script works and what its function is.
Obviously our contrived example here is a little bit different from a conventional tutor construct, in that we’re explicitly asking the tutor to choose what to teach us. But for the purpose of understanding how to use Code Interpreter as a tutor, this will suffice.
Since this post is long, here’s a handy outline of what will happen:
Give ChatGPT, with the Code Interpreter plugin enabled, the prompt above.
Code Interpreter writes the script.
Code Interpreter provides an explanation of what the script does.
I provide a rephrasing of the explanation, intentionally misunderstanding it. The goal here is to determine whether ChatGPT really understands the words used in prompts, or whether it is just hallucinating.
Run the script and find an error in it.
Have Code Interpreter provide the fix to the script (even though it can’t run the script in its environment).
So let’s dive in. Here’s what ChatGPT picks for us:
Sure, let’s create a script for sentiment analysis using the Naive Bayes Classifier algorithm from NLTK, a popular Natural Language Processing library in Python. Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral. It’s often used to determine the attitudes, opinions, and emotions of a speaker or a writer with respect to some topic or the overall contextual polarity of a document.
The script is quite involved, and it will:
Load a labeled dataset of movie reviews from the NLTK corpus.
Preprocess the data (tokenization, removing stop words, and stemming).
Divide the dataset into a training set and a testing set.
Train a Naive Bayes Classifier on the training set.
Test the classifier on the testing set and print its accuracy.
Use the trained classifier to classify a new review.
Here is the script:
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')
# Initialize the stemmer
stemmer = PorterStemmer()
# Initialize the stop words
stop_words = set(stopwords.words("english"))
# Preprocessing function
def preprocess(words):
    return [stemmer.stem(word) for word in words if word not in stop_words]
# Load movie reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# Preprocess the reviews
documents = [(preprocess(doc), category) for doc, category in documents]
# Shuffle the documents
random.shuffle(documents)
# Get all words from all movie_reviews and put into a frequency distribution
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]
# The find_features function will determine which of the 2000 word features are contained in the review
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
# Create the feature sets for all reviews
featuresets = [(find_features(doc), category) for (doc, category) in documents]
# Split the featuresets into training and testing datasets
training_set = featuresets[:1800]
testing_set = featuresets[1800:]
# Train a Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(training_set)
# Test the classifier and display its accuracy
print("Classifier accuracy:", (accuracy(classifier, testing_set)))
# Use the classifier to classify a new review
new_review = "This movie was terrible. The plot was dull and the acting was atrocious."
new_features = find_features(preprocess(word_tokenize(new_review)))
print("New review classification:", classifier.classify(new_features))
ChatGPT then provides some more helpful information about the script:
This script performs sentiment analysis on movie reviews. It first prepares the data by preprocessing and splitting it into training and testing sets. The Naive Bayes Classifier is then trained on the training set. The classifier’s accuracy is tested on the testing set and finally, the classifier is used to classify a new review. Please note that this is a relatively simple sentiment analysis algorithm and may not be accurate for complex sentences or sarcasm.
As a dutiful student, I ask: Can you explain what sentiment analysis is and how it uses Python?
Here’s the tutor’s response:
Sentiment analysis, also known as opinion mining, is a subfield of natural language processing (NLP) that builds systems that try to identify and extract opinions within text. It involves determining the attitude, sentiments, evaluations, appraisals, and emotions of a speaker/writer based on the computational treatment of subjectivity in a text.
In the context of machine learning and natural language processing, sentiment analysis is a type of data mining that measures the inclination of people's opinions. For example, it can be used on social media feeds to determine public opinion regarding a specific topic or product, or to assess customer sentiment from reviews about products or services.
Python is a popular language for sentiment analysis because of its simplicity and the wide range of libraries it offers for NLP and machine learning tasks. These libraries provide functionalities for many tasks involved in sentiment analysis, such as keyword extraction, topic identification, and sentiment determination.
Here is how sentiment analysis works in general:
Data Collection: This is the first step where you collect the data to be analyzed. This could be data from social media, customer reviews, or any other text data. Python has several libraries that can help with this step, such as BeautifulSoup, Scrapy, or Tweepy.
Preprocessing: This step involves cleaning and formatting the data before feeding it into a machine learning algorithm. Common steps include removing stop words (commonly used words such as 'is', 'the', 'and' that do not carry important meaning), stemming and lemmatization (reducing words to their root form), and removing punctuation and special characters. Python's NLTK and spaCy libraries are commonly used for this step.
Vectorization: Machine learning algorithms work with numeric data, and vectorization is the process of converting text data into numeric vectors. There are several ways to do this, such as Bag of Words, TF-IDF, and Word2Vec. Python's Scikit-learn and Gensim libraries provide functionality for this step.
Model Training: This is the step where a machine learning algorithm is trained on preprocessed, vectorized data. The model tries to learn the correlation between the input vectors and the sentiment labels in the training data. Python's Scikit-learn, Tensorflow, and PyTorch libraries provide a wide range of algorithms for this step, from logistic regression to deep learning.
Prediction and Analysis: After the model is trained, it can be used to predict the sentiment of new, unseen data. These predictions can then be analyzed to derive insights.
In the script I provided in the previous message, we are performing sentiment analysis using the Naive Bayes Classifier provided by the NLTK library. The Naive Bayes Classifier is a popular algorithm for text classification and is often used in sentiment analysis.
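Before going on, it may help to make steps 3 through 5 of that pipeline concrete. Here’s a minimal sketch of vectorization, training, and prediction using scikit-learn; note this is my own illustration, not ChatGPT’s output, and the toy reviews and labels are made up:

# A minimal sketch of steps 3-5: vectorization, model training, prediction.
# The toy reviews and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great movie, loved it",
         "terrible plot and awful acting",
         "wonderful, moving performances",
         "boring, dull, a waste of time"]
labels = ["pos", "neg", "pos", "neg"]

# Vectorization: turn each text into a bag-of-words count vector
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Model training: fit a multinomial Naive Bayes classifier on the vectors
model = MultinomialNB()
model.fit(X, labels)

# Prediction: classify new, unseen text
print(model.predict(vectorizer.transform(["dull and terrible"])))
# expect ['neg'] on this toy data

The shape is the same as in ChatGPT’s NLTK script: convert text to features, train a classifier, then classify unseen text.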
We learn quite a lot from ChatGPT’s verbose response. Here I am going to play the dumb student: I will intentionally rephrase ChatGPT’s tutelage poorly. The goal is to see whether ChatGPT understands that my rephrasing of its explanation indicates that I don’t understand what it has told me. Recall that a common complaint about ChatGPT, especially in its early days, was its tendency towards hallucinatory bullshit. It would frequently agree with its interlocutor, as if it were trying to curry favor with the user, and would happily spout nonsense that seemed to confirm the user’s intent.
As we will see, this behavior seems to have improved. Here’s my attempt at poorly rephrasing what the tutor has told me: So let me rephrase this in my own words to see if I understand what you’re talking about. You have a bunch of naive opinions in a bunch of computer generated text and you want to tell people how to feel about the text so you run this script so that they get an idea of the sentiment. Is that right?
Hopefully you can see why my attempt at an explanation is wrong. Here is ChatGPT’s helpful explanation:
Not exactly. Let’s clarify some of the terminology and the overall process.
In sentiment analysis, the term ‘naive’ is not used to refer to ‘naive opinions.’ It’s part of the name of a specific classification algorithm: the ‘Naive Bayes Classifier.’ This algorithm is ‘naive’ because it makes the assumption that all features (in our case, words) in the data are independent of each other, which is usually not the case in real-world text data. Despite this simplifying assumption, the Naive Bayes Classifier often performs surprisingly well and is widely used in text classification tasks, including sentiment analysis.
The goal of sentiment analysis is not to tell people how to feel about a piece of text, but rather to determine how the author of the text likely felt about the topic they were writing about. This is typically categorized as positive, negative, or neutral sentiment.
For example, if we have a large number of product reviews, a sentiment analysis algorithm could automatically classify them into positive and negative reviews, saving a company the time of manually reading and categorizing each one. This could then be used to get a general sense of how customers feel about a product, based on the sentiment of their reviews.
So, in the context of the script provided earlier, we’re not generating text nor telling people how to feel about it. Instead, we’re taking existing text (movie reviews in this case), and using a machine learning model to classify the sentiment expressed in each review as either positive or negative. The goal is to automate the process of understanding the sentiment expressed in large amounts of text.
It’s important to take a moment and understand what happened here. ChatGPT provided me with an explanation of what sentiment analysis is, I purposely misconstrued its meaning of “naive” in my rephrasing of its explanation, it caught that misinterpretation, and it tried a different explanation. This is very much in line with what a human tutor would do. Further, it suggests that GPT-4 has made great strides in understanding meaning and context from words.
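Incidentally, the “naive” independence assumption ChatGPT described is easy to see in a toy computation. This is my own hand-rolled illustration, with made-up per-word probabilities, not part of the tutoring session:

# A toy illustration of the "naive" independence assumption:
# P(label | words) is proportional to P(label) * product of P(word | label).
# The per-word probabilities here are made up for illustration.
p_word_given = {
    "neg": {"dull": 0.08, "terrible": 0.10},
    "pos": {"dull": 0.01, "terrible": 0.01},
}
priors = {"neg": 0.5, "pos": 0.5}  # assume equal class priors

review = ["dull", "terrible"]
scores = {}
for label in ("neg", "pos"):
    score = priors[label]
    for word in review:
        # the "naive" step: multiply as if each word were independent
        score *= p_word_given[label][word]
    scores[label] = score

print(max(scores, key=scores.get))  # prints "neg"

Multiplying the per-word probabilities as if each word were independent of its neighbors is exactly the simplification that makes the algorithm “naive.”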
Next I asked it to run the script and show me its output. Due to the limitations of the computing environment that OpenAI has created for Code Interpreter, it can’t actually run the script that it writes: its sandbox can neither install NLTK nor download the movie review corpus. This tells us something interesting: it “knows” how to write scripts that it can’t execute. As we’ll see below, apart from one simple error, the script is flawless.
Here’s its verbose explanation:
I'm sorry for any confusion, but I'm unable to execute the Python script directly here as my current environment does not have the ability to download and import external Python libraries such as NLTK, nor does it have the capability to access the internet for downloading the movie review dataset from NLTK's corpus.
However, you can run the script on your own local Python environment. You'll need to have the NLTK library installed, which you can do using pip:
pip install nltk
After installing NLTK, you can run the script. It will print out the accuracy of the Naive Bayes Classifier on the testing set (a number between 0 and 1), and then print the sentiment classification of the new review ("This movie was terrible. The plot was dull and the acting was atrocious.") which should be "negative".
If you encounter any issues while running the script, please feel free to ask for help.
When I first ran the script I got an error. If you’re a Python guru you will quickly see it: in the 32nd line of the script, it invokes random.shuffle(documents). (This uses the Python module random, which provides random-number utilities, including shuffling.) But that module is never imported in the script. I don’t know why Code Interpreter did not originally write the script with an instruction to import the random module, but here we are. Impressively, Code Interpreter was able to parse its script and find the error, even though it can’t run the script in its computing environment! Here’s its explanation:
My apologies for the oversight. The error you're encountering indicates that the Python random module hasn't been imported, but it's being used in the script. The random module is used to shuffle the documents before creating the featuresets.
The solution here is obvious: add the import random line at the beginning of the script.
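With that one-line fix applied, the top of the script reads:

import random  # the missing import

import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize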
Once the script is fixed it runs flawlessly. My results were:
Classifier accuracy: 0.755
New review classification: neg
ChatGPT tells me what this means:
The classifier accuracy of 0.755 indicates that the Naive Bayes Classifier correctly predicted the sentiment of the movie reviews in the test set 75.5% of the time. This is a decent accuracy rate, considering the simplicity of the algorithm and the complexity of natural language.
The new review classification of "neg" (short for negative) indicates that the classifier predicted the sentiment of the new review ("This movie was terrible. The plot was dull and the acting was atrocious.") to be negative, which aligns with our human interpretation.
Keep in mind that sentiment analysis can be a complex task, and this script is a simplified example. Real-world applications often involve more complex models and more comprehensive preprocessing to improve accuracy.
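If you want to probe the trained classifier a little further on your own machine, NLTK’s Naive Bayes classifier exposes a couple of handy inspection methods. This sketch assumes the classifier and new_features variables from the script above are in scope:

# List the words whose presence or absence most strongly predicts a label
classifier.show_most_informative_features(10)

# See the probability assigned to each label, not just the winning label
dist = classifier.prob_classify(new_features)
print("P(neg):", round(dist.prob("neg"), 3))
print("P(pos):", round(dist.prob("pos"), 3))

These are exactly the kinds of follow-up questions a student might put to the tutor: not just “what did you predict?” but “how confident are you, and why?”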
As this post is getting long, I am going to stop here. But you can easily see that there are many more questions a student could ask Code Interpreter.