This is a method that you can use to learn about almost anything with ChatGPT. I just chose unstructured data as the vehicle to explain this method.
First, write an open-ended prompt: What are examples of unstructured data? The basic format here is What are examples of [concept]? The concept could be convergent evolution, use cases for the geometric mean, valuable baseball cards, famous doctors, wanted criminals, etc. The list of things to learn about is nearly infinite, and is limited only by your imagination.
Here’s ChatGPT’s answer:
Unstructured data refers to information that does not have a pre-defined data model or is not organized in a specific manner. Some examples of unstructured data include:
Text files: such as emails, documents, and articles
Audio and video files: such as podcasts, interviews, and movie clips
Images: such as photos, drawings, and graphics
Social media posts: such as tweets, Facebook updates, and LinkedIn posts
Web pages: such as blogs, forums, and online news articles
Customer feedback: such as surveys, comments, and reviews
Sensor data: such as temperature readings, GPS coordinates, and accelerometer data from IoT devices
Medical records: such as doctor's notes, X-rays, and MRI scans
These examples demonstrate the diversity and vast amount of information that falls under the category of unstructured data. While this type of data can be difficult to analyze and process, it can also contain valuable insights and information that can be useful for organizations.
Say I want to learn more about how an article’s data can be made structured. Thus, this prompt: How can an article be converted from unstructured to structured?
ChatGPT’s response is:
Converting unstructured data, such as an article, into structured data involves organizing the information in a structured format that can be easily analyzed and processed by computers. The process of converting unstructured data into structured data is often referred to as data extraction or data modeling. Here are some steps for converting an article from unstructured to structured data:
Identify the structure: Determine the important elements of the article, such as the title, author, date, and body text, and decide how they should be organized.
Define a data model: Decide on a format for representing the information, such as a table or a set of key-value pairs, and create a data model that defines the structure of the data.
Extract the data: Use techniques such as natural language processing, regular expressions, or manual data entry to extract the information from the article and populate the data model.
Validate the data: Verify that the extracted data is accurate and complete, and make any necessary corrections.
Store the data: Store the structured data in a database or data lake for easy retrieval and analysis.
By following these steps, an article can be converted from unstructured to structured data, allowing the information to be analyzed and processed more efficiently. The specific tools and techniques used in the process may vary depending on the size and complexity of the data, as well as the intended use case.
Maybe I want to use Python to do all of this for me. So here’s the next prompt: Can you write a Python script which does this? And of course it can. Here’s ChatGPT’s Python script, which converts an unstructured text article into structured data using the steps outlined in its previous response:
import re

# Define the structure of the data model
data_model = {
    'title': '',
    'author': '',
    'date': '',
    'body': ''
}

# Define a function to extract the information from the article
def extract_article_data(article):
    data = data_model.copy()
    lines = article.split('\n')
    for line in lines:
        if 'Title:' in line:
            data['title'] = line.split(':')[1].strip()
        elif 'Author:' in line:
            data['author'] = line.split(':')[1].strip()
        elif 'Date:' in line:
            data['date'] = line.split(':')[1].strip()
        else:
            data['body'] += line + '\n'
    return data

# Example unstructured article
article = '''Title: The Benefits of Structured Data
Author: John Doe
Date: January 1, 2023
Structured data is a format for organizing and representing data in a way that is easily analyzed and processed by computers. In this article, we will discuss the benefits of structured data and how it can be used to improve data analysis and decision making.
One of the key benefits of structured data is that it makes data easier to analyze. With structured data, all of the information is organized in a consistent and predictable manner, which makes it easier to extract insights and make decisions based on the data.
Another benefit of structured data is that it is more efficient to process. Structured data is stored in a database or data lake, which allows for fast and efficient retrieval and processing of the information.
In conclusion, structured data provides many benefits over unstructured data, including easier analysis, improved efficiency, and better decision making. If you are looking to improve your data analysis and decision making, consider converting your unstructured data into structured data.'''
# Extract the data from the article
structured_data = extract_article_data(article)
# Print the structured data
print(structured_data)
ChatGPT provides this explanation of how its code works:
In this example, the script defines a data model with four fields: title, author, date, and body. The function extract_article_data takes an unstructured article as input, extracts the information from the article using string manipulation, and returns the structured data in the form of a dictionary. The script then uses the extract_article_data function to extract the data from the example article, and prints the structured data to the console.
Note that this is just one possible way to convert an unstructured article into structured data, and the specific approach will depend on the format and structure of the data.
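The script covers extraction, but steps 4 and 5 of ChatGPT's earlier list (validate the data, store the data) do not appear in it. What follows is an illustrative sketch of those two steps, not part of ChatGPT's response. The names validate_article, store_article, and the articles table are hypothetical, the date check assumes dates written like "January 1, 2023" in an English locale, and an in-memory SQLite database stands in for whatever store you would really use.

```python
import sqlite3
from datetime import datetime

# Hypothetical record, shaped like the dictionary the script above returns.
record = {
    'title': 'The Benefits of Structured Data',
    'author': 'John Doe',
    'date': 'January 1, 2023',
    'body': 'Structured data is a format for organizing data.\n',
}

def validate_article(data):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in ('title', 'author', 'date', 'body'):
        if not data.get(field, '').strip():
            problems.append(f'missing {field}')
    if data.get('date', '').strip():
        try:
            # Assumption: dates look like "January 1, 2023".
            datetime.strptime(data['date'].strip(), '%B %d, %Y')
        except ValueError:
            problems.append('unparseable date')
    return problems

def store_article(conn, data):
    """Insert one validated record into the (hypothetical) articles table."""
    conn.execute(
        'INSERT INTO articles (title, author, date, body) VALUES (?, ?, ?, ?)',
        (data['title'], data['author'], data['date'], data['body']),
    )
    conn.commit()

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE IF NOT EXISTS articles '
    '(title TEXT, author TEXT, date TEXT, body TEXT)'
)

problems = validate_article(record)
if not problems:
    store_article(conn, record)

rows = conn.execute('SELECT title, author FROM articles').fetchall()
print(problems, rows)
```

Returning a list of problems, rather than raising on the first one, makes it easy to paste the whole list back into ChatGPT and ask what went wrong.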
We can further interrogate ChatGPT regarding its code: What does data['title'] = line.split(':')[1].strip() mean in this code?
Here’s its explanation:
In the line data['title'] = line.split(':')[1].strip(), the variable data is a dictionary that represents the structured data model for an article. The line sets the value of the 'title' key in the dictionary to the title of the article, which is extracted from the line variable.
Here's what each part of the line does:
line.split(':'): This splits the string in line on the colon (:) character, and returns a list of strings. For example, if line is "Title: The Benefits of Structured Data", then line.split(':') would return ["Title", " The Benefits of Structured Data"].
[1]: This accesses the second element in the list returned by line.split(':'). In this case, it would be " The Benefits of Structured Data".
.strip(): This removes any leading or trailing whitespace characters from the string. In this case, it would remove the space character at the beginning of the string, resulting in "The Benefits of Structured Data".
data['title'] =: This sets the value of the 'title' key in the dictionary data to the result of the expression on the right-hand side of the equals sign. In this case, it would be "The Benefits of Structured Data".
So, the line data['title'] = line.split(':')[1].strip() takes the string in line, splits it on the colon (:) character, accesses the second element of the resulting list (the title of the article), removes any leading or trailing whitespace characters, and sets the value of the 'title' key in the data dictionary to the result.
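One way to check an explanation like this is to run the expression on a couple of inputs. The short sketch below does that, and it also surfaces an edge case the explanation does not mention: if the title itself contains a colon, [1] keeps only the text between the first and second colons. The second sample title is a made-up example, and the maxsplit variant at the end is one possible adjustment, not something ChatGPT suggested.

```python
simple = "Title: The Benefits of Structured Data"
tricky = "Title: Python: A Primer"  # hypothetical title containing a colon

# The expression from the script, applied to both lines.
original_simple = simple.split(':')[1].strip()
original_tricky = tricky.split(':')[1].strip()
print(original_simple)  # The Benefits of Structured Data
print(original_tricky)  # Python  (everything after the second colon is lost)

# Passing maxsplit=1 splits on the first colon only, so the rest stays intact.
fixed_tricky = tricky.split(':', 1)[1].strip()
print(fixed_tricky)     # Python: A Primer
```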
So what we have here is a very quick iteration from the definition of a general concept to a concrete application of that concept. I went from asking for a general definition of unstructured data to asking for an example of a Python script which would convert unstructured data into structured data. And this learning method can be used for almost any concept you can think of.
Caveat emptor: I have not tested this code. I am sharing it only to illustrate how to use ChatGPT to learn about a concept. Checking the code, and interrogating ChatGPT about errors it spits out, is the subject of a different post.