Traversing steep learning curves with ChatGPT & Code Interpreter
Let's combine SQL & Python to parse blockchain data!
Let’s say you want to analyze some blockchain data. We’ll pick Ethereum. You could use something like Dune Analytics. Or you could be fancy and use Code Interpreter to whip up a Python script that wraps a SQL query.
My prompt: I want to do analysis of blockchain transactional data with Python. What do you suggest for free data sources?
ChatGPT returns the following suggestions:
Bitcoin Core: This software allows you to download the entire Bitcoin blockchain and extract raw data directly. It requires significant disk space and time to download the whole blockchain.
Blockchair: Blockchair provides API access to Bitcoin, Bitcoin Cash, and Ethereum blockchain data, among others. You can extract transactional data for each of these blockchains. However, keep in mind that free access may have certain limitations on the number of queries per day.
Blockchain.com: This service provides access to the Bitcoin blockchain data. Their API lets you access various details about blocks, transactions, and addresses.
Etherscan: This is a block explorer and analytics platform for Ethereum. They offer an API that provides access to Ethereum blockchain data.
BigQuery public datasets: Google Cloud's BigQuery has a public dataset program that includes Bitcoin blockchain data, as well as Ethereum, Litecoin, Zcash, Dash, and Dogecoin. This can be used for free, but there may be costs associated with querying large amounts of data.
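Before picking one, it’s worth seeing how little code the hosted APIs demand. Here is a minimal sketch of my own (not ChatGPT’s output) against Etherscan’s txlist endpoint; the API key is a placeholder you’d swap in after registering for a free account, and the address is just a well-known example.

import requests

# Placeholder key: register for a free API key at etherscan.io
API_KEY = "YOUR_ETHERSCAN_API_KEY"
ADDRESS = "0xde0B295669a9FD93d5F28D9Ec85E40f4cb697BAe"  # Ethereum Foundation, as an example

# Ask Etherscan for the most recent transactions involving the address
params = {
    "module": "account",
    "action": "txlist",
    "address": ADDRESS,
    "sort": "desc",
    "apikey": API_KEY,
}
response = requests.get("https://api.etherscan.io/api", params=params)

# Print the hash and value (in wei) of the five most recent transactions
for txn in response.json()["result"][:5]:
    print(txn["hash"], txn["value"])

Each of the hosted APIs follows roughly this shape: an HTTP endpoint, a key, and JSON back.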
As these all seem nearly fungible, if not completely so, let’s just pick Google’s BigQuery. Here’s the script that Code Interpreter came up with:
from google.cloud import bigquery

# Instantiate a BigQuery client
client = bigquery.Client()

# Define a SQL query
query = """
SELECT
  DATE(block_timestamp) AS txn_date,
  COUNT(*) AS daily_txn_count
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE block_timestamp >= '2022-01-01'
GROUP BY txn_date
ORDER BY txn_date
"""

# Run the query
query_job = client.query(query)

# Print the results
for row in query_job:
    print(f"Date: {row.txn_date}, Daily Transaction Count: {row.daily_txn_count}")
I want to focus a bit on the SQL query contained in the script. While I understand that you can embed SQL queries in Python scripts, and why Code Interpreter inserted one here, I was still curious to read its explanation. Code Interpreter/ChatGPT4 has a very interesting habit of providing verbose explanations to seemingly simple questions. The explanation may be obvious to anyone with more than a cursory level of experience with Python or SQL, but it’s nonetheless worth parsing:
The script employs SQL because it is the primary language for querying databases, including Google's BigQuery. When working with structured data stored in tables, SQL is an efficient and powerful way to extract, filter, and manipulate the data.
Here's a breakdown of the SQL query in the script:
SELECT: This clause specifies the data you want to retrieve. In this case, it is retrieving the date of the block timestamp (DATE(block_timestamp) AS txn_date) and the count of transactions (COUNT(*) AS daily_txn_count).

FROM: This clause specifies the source of the data. Here, the data is sourced from the Ethereum transactions table in BigQuery's public data: bigquery-public-data.crypto_ethereum.transactions.

WHERE: This clause filters the data based on certain conditions. The script only selects data where the block timestamp is later than or equal to '2022-01-01'.

GROUP BY: This clause groups the data by a specified column or set of columns. Here, it is grouping the data by the transaction date (txn_date). This means that for each unique date, it will aggregate (i.e., count) the transactions.

ORDER BY: This clause sorts the results by a specified column or set of columns. Here, it is sorting the results by the transaction date (txn_date), ensuring the output is in chronological order.

The use of SQL inside a Python script illustrates how the two languages can complement each other. Python is a flexible, general-purpose programming language, and SQL excels at structured data queries. By using them together, you can write a Python script that conducts complex data analysis using SQL queries.
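That last point, that Python and SQL complement each other, is easy to demonstrate. Here’s a sketch of my own (not ChatGPT’s output) that passes the start date from Python into the SQL as a bound query parameter instead of hard-coding it, using the BigQuery client’s query-parameter support:

import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Same daily-count query, but the cutoff date arrives from Python
# as a bound parameter (@start_date) rather than a hard-coded literal
query = """
SELECT
  DATE(block_timestamp) AS txn_date,
  COUNT(*) AS daily_txn_count
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE block_timestamp >= @start_date
GROUP BY txn_date
ORDER BY txn_date
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter(
            "start_date", "TIMESTAMP",
            datetime.datetime(2022, 1, 1, tzinfo=datetime.timezone.utc),
        )
    ]
)

for row in client.query(query, job_config=job_config):
    print(f"Date: {row.txn_date}, Daily Transaction Count: {row.daily_txn_count}")

The SQL stays declarative while Python supplies the moving parts, which is exactly the complementarity ChatGPT described.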
As I have written before, this interactive dynamic between man (or woman) and machine is very powerful. It is an incredibly effective way to rapidly traverse steep knowledge curves. You could easily spend hours pounding ChatGPT and its Code Interpreter plugin with questions about code it generates, and learn something new with every question.
Incidentally, after it wrote the script, it occurred to me that Code Interpreter likely couldn’t execute the script in its code environment. I know this because learning the limitations of its computing environment is one of the knowledge curves I’ve traversed recently. The script calls an external database, and ChatGPT is not connected to the internet. (There is the Web browsing plugin, but access to that has been turned off. Microsoft’s implementation of GPT4, which I call BingBot, is connected to the internet, however.)
Finally, let’s consider whether my approach has any limitations. I am not aware of any, but it turns out that ChatGPT is aware of several. I asked it: What are some limitations to this approach which I am not considering? Given the limitations which you identify, how would you propose overcoming them?
Here’s its output:
While the approach described provides a robust method for analyzing Ethereum blockchain data, it does have several potential limitations:
Cost: Although Google BigQuery provides a certain amount of free data processing each month, extensive use or complex queries on large datasets like the Ethereum blockchain could incur significant costs.
Data freshness: The Ethereum dataset in BigQuery is updated daily, so it may not include the most recent transactions if you need real-time data.
Complexity of blockchain data: Blockchain data is complex and can be difficult to work with. It involves understanding unique concepts such as blocks, transactions, addresses, and smart contracts.
Query limitations: BigQuery has certain limitations on the size and complexity of queries. If your analysis requires very complex queries or queries that process an enormous amount of data, you might encounter these limits.
Data privacy: While the data on the blockchain is pseudonymous, meaning addresses are not directly linked to identities, it is possible in some cases to link addresses to individuals through careful analysis or external data. Be careful not to inadvertently infringe on anyone's privacy.
Here are some strategies to overcome these limitations:
Cost: Monitor your usage and set up alerts or limits to control costs. Optimize your SQL queries to process less data, for example, by filtering on block dates or transaction values. If you only need a sample of the data, consider random sampling techniques.
Data freshness: If you need real-time data, consider using a different tool or API designed for real-time blockchain data, such as Infura for Ethereum.
Complexity of blockchain data: Invest time in learning about blockchain and Ethereum to better understand the data you're working with. Consult the Ethereum blockchain documentation and other educational resources.
Query limitations: If you encounter query limitations, you may need to break your query into smaller parts or consider using a different tool designed for very large-scale data analysis.
Data privacy: Be mindful of privacy concerns when working with blockchain data. Only collect and store the data that you need for your analysis, and avoid trying to link addresses to individuals unless absolutely necessary and ethically justified.
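The cost point, in particular, is easy to act on. BigQuery bills by bytes scanned, and the client library can estimate that figure with a dry run before you execute anything. A quick sketch (again mine, not ChatGPT’s):

from google.cloud import bigquery

client = bigquery.Client()

# A dry run asks BigQuery how many bytes the query would scan,
# without executing it or incurring query charges
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
SELECT DATE(block_timestamp) AS txn_date, COUNT(*) AS daily_txn_count
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE block_timestamp >= '2022-01-01'
GROUP BY txn_date
"""

query_job = client.query(query, job_config=job_config)
print(f"This query would scan about {query_job.total_bytes_processed / 1e9:.1f} GB")

If the estimate looks steep, you can tighten the WHERE clause or sample the table before paying for a full scan.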
Going forward, I am going to make it a habit to ask ChatGPT (whether using Code Interpreter or not) what limitations I am not considering.