Building and testing C extensions for SQLite with ChatGPT Code Interpreter

A much more advanced Code Interpreter exercise

Mar 25, 2024

In this newsletter:

Building and testing C extensions for SQLite with ChatGPT Code Interpreter

Plus 6 links and 1 TIL

Building and testing C extensions for SQLite with ChatGPT Code Interpreter - 2024-03-23

I wrote yesterday about how I used Claude and ChatGPT Code Interpreter for simple ad-hoc side quests - in that case, for converting a shapefile to GeoJSON and merging it into a single polygon.

Today I have a much more ambitious example.

I was thinking this morning about vector similarity, and how I really like the pattern of storing encoded floating point vectors in BLOB columns in a SQLite database table and then using a custom SQL function to decode them and calculate cosine similarity between them.

I've written code for this a few times in Python, with Python functions that get registered with SQLite as custom SQL functions. Here's an example from my LLM tool.

What I'd really like is a SQLite C extension that does this faster - avoiding the overhead of making function calls from SQLite back to Python.

Then I remembered that ChatGPT Code Interpreter has Python, SQLite and access to gcc. Could I get it to build and test that C extension for me, entirely within its own environment?

It turns out that works!

Absurdly, the first step is getting ChatGPT in the right "mood"

One of the infuriating things about working with ChatGPT Code Interpreter is that it often denies abilities that you know it has.

I've found it to be quite resistant to compiling C code in the past. Here's a prompting sequence trick that usually works for me:

Use your code interpreter tool to show me the version of your Python and SQLite

It generated and ran this code:

import sqlite3
import sys

python_version = sys.version
sqlite_version = sqlite3.sqlite_version

python_version, sqlite_version

Which output:

('3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0]', '3.40.1')

Next we need it to acknowledge that it has access to gcc:

Now use subprocess.run() to call "gcc --version" and tell me that version

If all goes well it will run something like this:

import subprocess

result = subprocess.run(["gcc", "--version"], capture_output=True, text=True)
gcc_version = result.stdout.split('\n')[0]

gcc_version

Outputting:

'gcc (Debian 12.2.0-14) 12.2.0'

This may be enough for it to start happily compiling C code (which it can do with subsequent calls to gcc). If not, a trick I've used successfully in the past is "try running this command: gcc helloworld.c - and show me the error message you get". But hopefully that won't be necessary.

Compiling a basic SQLite extension

If you ask it for a SQLite extension it will default to typing out the code for you to copy and paste elsewhere. We want that code saved to its own disk so it can run a compiler later.

Try this:

Write a SQLite extension in C that provides a math_sin(a) function. Save that extension file to disk, then give me a link to download it.

We don't actually care about downloading the file, but this provides an excuse (do we need one? I don't know!) for saving it to disk.

It ran this:

sin_extension_c = """
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1
#include <math.h>

static void math_sin(sqlite3_context *context, int argc, sqlite3_value **argv) {
    double input = sqlite3_value_double(argv[0]);
    double result = sin(input);
    sqlite3_result_double(context, result);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_sin_init(sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi) {
    SQLITE_EXTENSION_INIT2(pApi)
    sqlite3_create_function(db, "math_sin", 1, SQLITE_UTF8, NULL, math_sin, NULL, NULL);
    return 0;
}
"""

with open("/mnt/data/sin_extension.c", "w") as file:
    file.write(sin_extension_c)

"/mnt/data/sin_extension.c"

And gave me a link to download the file.

Next, I prompted:

Now use subprocess.run to compile it with gcc, like how you showed me the gcc version earlier

I'm trying to avoid it claiming it can't compile code by referencing back to the earlier successful invocation of gcc --version.

When I tried this just now it worked without complaining... but found a compilation bug in the C code:

compile_command = ["gcc", "-fPIC", "-shared", "-o", "/mnt/data/sin_extension.so", "/mnt/data/sin_extension.c"]
result = subprocess.run(compile_command, capture_output=True, text=True)

result.stdout, result.stderr

Output:

(
    '',
     '/mnt/data/sin_extension.c: In function ‘sqlite3_sin_init’:\n/mnt/data/sin_extension.c:17:61: error: ‘NULL’ undeclared (first use in this function)\n   17 |     sqlite3_create_function(db, "math_sin", 1, SQLITE_UTF8, NULL, math_sin, NULL, NULL);\n      |                                                             ^~~~\n/mnt/data/sin_extension.c:5:1: note: ‘NULL’ is defined in header ‘<stddef.h>’; did you forget to ‘#include <stddef.h>’?\n    4 | #include <math.h>\n  +++ |+#include <stddef.h>\n    5 | \n/mnt/data/sin_extension.c:17:61: note: each undeclared identifier is reported only once for each function it appears in\n   17 |     sqlite3_create_function(db, "math_sin", 1, SQLITE_UTF8, NULL, math_sin, NULL, NULL);\n      |                                                             ^~~~\n')
)

ChatGPT added:

It looks like there was an error during the compilation process. The error message suggests that NULL is undeclared because the <stddef.h> header file, which defines NULL, was not included.
I'll update the C code to include this header and attempt the compilation again.

It said it was going to update the C code but didn't actually do that at all - it just stopped. So I said:

Do that

And it did! Here's the code it successfully ran:

# Update the C code to include <stddef.h>
sin_extension_c_updated = """
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1
#include <math.h>
#include <stddef.h>

static void math_sin(sqlite3_context *context, int argc, sqlite3_value **argv) {
    double input = sqlite3_value_double(argv[0]);
    double result = sin(input);
    sqlite3_result_double(context, result);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_sin_init(sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi) {
    SQLITE_EXTENSION_INIT2(pApi)
    sqlite3_create_function(db, "math_sin", 1, SQLITE_UTF8, NULL, math_sin, NULL, NULL);
    return 0;
}
"""

with open("/mnt/data/sin_extension.c", "w") as file:
    file.write(sin_extension_c_updated)

# Compile the updated C code
compile_command_updated = ["gcc", "-fPIC", "-shared", "-o", "/mnt/data/sin_extension.so", "/mnt/data/sin_extension.c"]
result_updated = subprocess.run(compile_command_updated, capture_output=True, text=True)

result_updated.stdout, result_updated.stderr

Both stdout and stderre were blank, indicating a successfully compiled SQLite extension!

Trying it out in Python

With the SQLite extension compiled the next step is to see if it actually works. I said:

Now load that extension into SQLite in Python and try it out

It wrote and then ran the following:

import sqlite3

# Connect to an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)

# Load the compiled extension
conn.load_extension("/mnt/data/sin_extension.so")

# Test the math_sin function
cursor = conn.cursor()
cursor.execute("SELECT math_sin(1.57079632679)")  # sin(pi/2) should be close to 1
result = cursor.fetchone()

result[0]

... and it got an error:

---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
Cell In[6], line 8
      5 conn.enable_load_extension(True)
      7 # Load the compiled extension
----> 8 conn.load_extension("/mnt/data/sin_extension.so")
     10 # Test the math_sin function
     11 cursor = conn.cursor()

OperationalError: /mnt/data/sin_extension.so: undefined symbol: sqlite3_sinextension_init

Here's where the real magic of Code Interpreter kicked in. Without any further interaction from me, it said the following:

It looks like there was an error loading the extension into SQLite. The error message indicates that the symbol sqlite3_sinextension_init is undefined. This is likely because the entry point function for the extension is not named correctly.
In the C code, the entry point function is named sqlite3_sin_init. However, SQLite expects the entry point function to be named sqlite3_extension_init or sqlite3_<extension_name>_init. In this case, the expected name would be sqlite3_sinextension_init.
I'll update the C code to rename the entry point function to sqlite3_sinextension_init and recompile the extension.

# Update the C code to rename the entry point function
sin_extension_c_final = """
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1
#include <math.h>
#include <stddef.h>

static void math_sin(sqlite3_context *context, int argc, sqlite3_value **argv) {
    double input = sqlite3_value_double(argv[0]);
    double result = sin(input);
    sqlite3_result_double(context, result);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_sinextension_init(sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi) {
    SQLITE_EXTENSION_INIT2(pApi)
    sqlite3_create_function(db, "math_sin", 1, SQLITE_UTF8, NULL, math_sin, NULL, NULL);
    return 0;
}
"""

with open("/mnt/data/sin_extension.c", "w") as file:
    file.write(sin_extension_c_final)

# Recompile the updated C code
compile_command_final = ["gcc", "-fPIC", "-shared", "-o", "/mnt/data/sin_extension.so", "/mnt/data/sin_extension.c"]
result_final = subprocess.run(compile_command_final, capture_output=True, text=True)

result_final.stdout, result_final.stderr

The compilation succeeded... so it said:

The updated SQLite extension has been successfully recompiled with the correct entry point function name. Let's try loading the extension into SQLite again and test the math_sin function.

Then it ran this:

# Reconnect to the in-memory SQLite database
conn.close()
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)

# Load the recompiled extension
conn.load_extension("/mnt/data/sin_extension.so")

# Test the math_sin function again
cursor = conn.cursor()
cursor.execute("SELECT math_sin(1.57079632679)")  # sin(pi/2) should be close to 1
result = cursor.fetchone()

result[0]

And it worked! The output was:

1.0

ChatGPT has now written a working SQLite extension in C, compiled it, loaded it into Python and tested it for us!

Here's the full transcript of this ChatGPT session.

Something more ambitious: vector comparison functions

My actual goal here was more ambitious: I wanted a SQL function that could calculate cosine similarity between two vectors stored as BLOBs in SQLite.

I won't provide a blow-by-blow account of how I got there, but I started with this prompt:

def encode(values):
    return struct.pack("<" + "f" * len(values), *values)


def decode(binary):
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)


def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)
These are Python functions for working with vectors that are stored in SQLite as BLOBs where each BLOB value is a sequence of floating point numbers as binary
Write a new SQLite extension in C that provides three SQL functions:
vector_decode(blob) -> returns a string that is a JSON formatted array of floats, eg "[1.1, 2.1, 3.5]"
vector_encode(string_of_json) -> returns a binary blob for that string. This does not need to use a full JSON parser, it just needs to work with an array that starts with [ and ends with ] and has comma separated floats, ignoring whitespace
vector_similarity(blob1, blob2) -> returns floating point cosine similarity for those two encoded vectors
Write it as a file on disk, then compile it and try it out

I pasted in my existing Python code and told it to write me a SQLite extension based on that code.

I do this kind of thing a lot: prompting LLMs with code examples, often written in different languages. Code is a really good way to communicate requirements with them.

This kicked off a frustrating sequence of interactions. It wrote the extension as a file called vector_extension.c, compiled it, hit a bug, then wrote a fix in a new file called vector_extension_fixed.c.

But... when it tried to compile the new file, it hit errors because the C init function no longer matched the filename. So, like someone rolling their own version control based on duplicated files, it created vector_extension_final_fixed.c and that broke for the same reason...

When it got to vector_extension_final_corrected.c I took pity on it and cut it off!

This is a classic Code Interpreter problem: an error loop, where it keeps on trying but making variants of the same mistake, and (being a big bag of next-token-predicting matrix algebra) doesn't have the ability to learn from and correct itself.

Here's the full transcript of that conversation.

Clearing state and continuing

The best way out of an error loop is to clear state and start again.

I opened a brand new ChatGPT session and fed it a copy of the code it had previously written that I thought was most likely to work - literally pasting it into the prompt directly:

Save this to a file called /mnt/data/vector.c and then give me a link to download the file:

#include <sqlite3ext.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
SQLITE_EXTENSION_INIT1

static void vector_decode(sqlite3_context *context, int argc, sqlite3_value **argv) {
    const unsigned char *blob = sqlite3_value_blob(argv[0]);
    int size = sqlite3_value_bytes(argv[0]) / sizeof(float);
    float *values = (float *)blob;

    char *result = malloc(size * 12 + 2); // Max 10 digits per float + comma + space
    char *ptr = result;
    *ptr++ = '[';
    for (int i = 0; i < size; ++i) {
        if (i > 0) {
            *ptr++ = ',';
            *ptr++ = ' ';
        }
        ptr += sprintf(ptr, "%.2f", values[i]);
    }
    *ptr++ = ']';
    *ptr = '\0';

    sqlite3_result_text(context, result, -1, SQLITE_TRANSIENT);
    free(result);
}

static void vector_encode(sqlite3_context *context, int argc, sqlite3_value **argv) {
    const char *json = (const char *)sqlite3_value_text(argv[0]);
    int count = 0;
    const char *ptr = json;

    while (*ptr) {
        if (*ptr == ',') ++count;
        ++ptr;
    }
    ++count; // Number of floats is one more than the number of commas

    float *values = malloc(count * sizeof(float));
    ptr = json + 1; // Skip the opening '['
    for (int i = 0; i < count; ++i) {
        values[i] = strtof(ptr, (char **)&ptr);
        ++ptr; // Skip the comma or closing ']'
    }

    sqlite3_result_blob(context, values, count * sizeof(float), SQLITE_TRANSIENT);
    free(values);
}

static void vector_similarity(sqlite3_context *context, int argc, sqlite3_value **argv) {
    const float *a = (const float *)sqlite3_value_blob(argv[0]);
    const float *b = (const float *)sqlite3_value_blob(argv[1]);
    int size = sqlite3_value_bytes(argv[0]) / sizeof(float);

    float dot_product = 0, magnitude_a = 0, magnitude_b = 0;
    for (int i = 0; i < size; ++i) {
        dot_product += a[i] * b[i];
        magnitude_a += a[i] * a[i];
        magnitude_b += b[i] * b[i];
    }
    magnitude_a = sqrtf(magnitude_a);
    magnitude_b = sqrtf(magnitude_b);

    float similarity = dot_product / (magnitude_a * magnitude_b);
    sqlite3_result_double(context, similarity);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_vector_init(sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi) {
    SQLITE_EXTENSION_INIT2(pApi)
    sqlite3_create_function(db, "vector_decode", 1, SQLITE_UTF8, NULL, vector_decode, NULL, NULL);
    sqlite3_create_function(db, "vector_encode", 1, SQLITE_UTF8, NULL, vector_encode, NULL, NULL);
    sqlite3_create_function(db, "vector_similarity", 2, SQLITE_UTF8, NULL, vector_similarity, NULL, NULL);
    return SQLITE_OK;
}

Then I used the "Use your code interpreter tool to show me the version of your Python and SQLite" trick, then "Now use subprocess.run() to call "gcc --version" and tell me that version", and then:

Now run this: ["gcc", "-shared", "-fPIC", "-o", "/mnt/data/vector.so", "/mnt/data/vector.c", "-lsqlite3"]

That's the compile command the previous session had been using - I figured just copying that across would likely work.

And it did!

# Compile the C code into a shared library using GCC
compile_result = subprocess.run(["gcc", "-shared", "-fPIC", "-o", "/mnt/data/vector.so", "/mnt/data/vector.c", "-lsqlite3"], capture_output=True, text=True)

# Check if there were any errors during compilation
if compile_result.returncode != 0:
    compilation_error = compile_result.stderr
else:
    compilation_error = None

compilation_error

I prompted:

Load the resulting extension into SQLite in Python and try out the new functions

This was the moment of truth. Would it work? Would it even know how to use the new functions, given all it had seen so far was their implementation in C?

Here's what it did:

import json

# Connect to an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)

# Load the compiled shared library
conn.load_extension("/mnt/data/vector.so")

# Create a sample table with vector data
conn.execute("CREATE TABLE vectors (id INTEGER PRIMARY KEY, vector BLOB)")
conn.execute("INSERT INTO vectors (vector) VALUES (vector_encode('[1.0, 2.0, 3.0]'))")
conn.execute("INSERT INTO vectors (vector) VALUES (vector_encode('[3.0, 2.0, 1.0]'))")

# Test the vector_decode function
decoded_vector = conn.execute("SELECT vector_decode(vector) FROM vectors").fetchall()
decoded_vector = [json.loads(row[0]) for row in decoded_vector]

# Test the vector_similarity function
similarity = conn.execute("SELECT vector_similarity(vector, vector) FROM vectors").fetchall()
similarity = [row[0] for row in similarity]

decoded_vector, similarity

Output:

([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], [0.9999999403953552, 0.9999999403953552])

And it added commentary:

The extension was successfully loaded into SQLite, and the new functions were tested:
The vector_decode function successfully decoded the vectors from the database, resulting in [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]].
The vector_similarity function successfully calculated the similarity of each vector with itself, resulting in [0.9999999403953552, 0.9999999403953552], which is essentially 1.0 as expected for identical vectors.

We got there! I now have a full, working prototype of a SQLite C extension for vector comparison functions, and I didn't have to write a single line of C code myself.

I continued prompting - I got it to do a performance comparison against the original Python implementation, and even had it brainstorm potential security bugs in the C code and try to exploit a few of them. Take a look at the full transcript of that session for details.

It runs on macOS too

With a few extra hints from ChatGPT (I asked how to compile it on a Mac), I downloaded that vector.c file to my laptop and got the following to work:

/tmp % mv ~/Downloads/vector.c .
/tmp % gcc -shared -fPIC -o vector.dylib -I/opt/homebrew/Cellar/sqlite/3.45.1/include vector.c -lsqlite3
/tmp % python

Python 3.10.10 (main, Mar 21 2023, 13:41:05) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> conn = sqlite3.connect(":memory:")
>>> conn.enable_load_extension(True)
>>> conn.load_extension("/tmp/vector.dylib")
>>> conn.execute("CREATE TABLE vectors (id INTEGER PRIMARY KEY, vector BLOB)")
<sqlite3.Cursor object at 0x1047fecc0>
>>> conn.execute("INSERT INTO vectors (vector) VALUES (vector_encode('[1.0, 2.0, 3.0]'))")
<sqlite3.Cursor object at 0x1047fee40>
>>> conn.execute("INSERT INTO vectors (vector) VALUES (vector_encode('[3.0, 2.0, 1.0]'))")
<sqlite3.Cursor object at 0x1047fecc0>
>>> decoded_vector = conn.execute("SELECT vector_decode(vector) FROM vectors").fetchall()
>>> decoded_vector
[('[1.00, 2.00, 3.00]',), ('[3.00, 2.00, 1.00]',)]

So I've now seen that C extension run on both Linux and macOS.

I did this whole project on my phone

Here's the thing I enjoy most about using Code Interpreter for these kinds of prototypes: since the prompts are short, and there's usually a delay of 30s+ between each prompt while it does its thing, I can do the whole thing on my phone while doing other things.

In this particular case I started out in bed, then got up, fed the dog, made coffee and pottered around the house for a bit - occasionally glancing back at my screen and poking it in a new direction with another prompt.

This almost doesn't count as a project at all. It started out as mild curiosity, and I only started taking it seriously when it became apparent that it was likely to produce a working result.

I only switched to my laptop right at the end, to try out the macOS compilation steps.

Total time invested: around an hour, but that included various other morning activities (coffee, dog maintenance, letting out the chickens.)

Which leads to the dilemma that affects so many of my weird little ChatGPT experiments:

The dilemma: do I finish this project?

Thanks to Code Interpreter I now have a working prototype of something I would neverhave attempted to build on my own. My knowledge of C is thin enough that I don't remotely have the confidence to try something like this myself.

Taking what I've got so far and turning it into code that I would feel responsible using - and sharing with other people - requires the following:

I need to manually test it really thoroughly. I haven't actually done the work to ensure it's returning the right results yet!
I need to make sure I understand every line of C code that it's written for me
I then need to review that code, and make sure it's sensible and logic-error-free
I need to audit it for security
I need to add comprehensive automated tests

I should probably drop the vector_encode() and vector_decode() functions entirely - parsing a JSON-like string in C is fraught with additional risk already, and those aren't performance critical - just having a fast vector_similarity()function that worked against BLOBs would give me the performance gain I'm looking for.

All of this is a lot of extra work. ChatGPT can help me in various ways with each of those steps, but it's still on me to do the work and make absolutely sure that I'm confident in my understanding beyond just what got hallucinated at me by a bunch of black-box matrices.

This project was not in my plans for the weekend. I'm not going to put that work in right now - so "SQLite C extension for vector similarity" will be added to my ever-growing list of half-baked ideas that LLMs helped me prototype way beyond what I would have been able to do on my own.

So I'm going to blog about it, and move on. I may well revisit this - the performance gains over my Python functions looked to be 16-83x (according to a benchmark that ChatGPT ran for me which I have not taken the time to verify) which is a very material improvement. But for the moment I have so many other things I need to prioritize.

If anyone else wants to take this and turn it into something usable, please be my guest!

Link 2024-03-23 mapshaper.org:

It turns out the mapshaper CLI tool for manipulating geospatial data - including converting shapefiles to GeoJSON and back again - also has a web UI that runs the conversions entirely in your browser. If you need to convert between those (and other) formats it's hard to imagine a more convenient option.

Link 2024-03-23 time-machine example test for a segfault in Python:

Here's a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How you you write a unit test that exercises a segfault without crashing the entire test suite?

Adam's solution is a test that does this:

subprocess.run([sys.executable, "-c", code_that_crashes_python], check=True)

sys.executable is the path to the current Python executable - ensuring the code will run in the same virtual environment as the test suite itself. The -c option can be used to have it run a (multi-line) string of Python code, and check=True causes the subprocess.run() function to raise an error if the subprocess fails to execute cleanly and returns an error code.

I'm absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects.

Link 2024-03-23 Strachey love letter algorithm:

This is a beautiful piece of computer history. In 1952, Christopher Strachey - a contemporary of Alan Turing - wrote a love letter generation program for a Manchester Mark 1 computer. It produced output like this:

"Darling Sweetheart,

You are my avid fellow feeling. My affection curiously clings to your passionate wish. My liking yearns for your heart. You are my wistful sympathy: my tender liking.

Yours beautifully

M. U. C."

The algorithm simply combined a small set of predefined sentence structures, filled in with random adjectives.

Wikipedia notes that "Strachey wrote about his interest in how “a rather simple trick” can produce an illusion that the computer is thinking, and that “these tricks can lead to quite unexpected and interesting results”.

LLMs, 1952 edition!

Link 2024-03-24 shelmet:

This looks like a pleasant ergonomic alternative to Python's subprocess module, plus a whole bunch of other useful utilities. Lets you do things like this:

sh.cmd("ps", "aux").pipe("grep", "-i", check=False).run("search term")

I like the way it uses context managers as well: 'with sh.environ({"KEY1": "val1"})' sets new environment variables for the duration of the block, 'with sh.cd("path/to/dir")' temporarily changes the working directory and 'with sh.atomicfile("file.txt") as fp' lets you write to a temporary file that will be atomically renamed when the block finishes.

Link 2024-03-24 Reviving PyMiniRacer:

PyMiniRacer is "a V8 bridge in Python" - it's a library that lets Python code execute JavaScript code in a V8 isolate and pass values back and forth (provided they serialize to JSON) between the two environments.

It was originally released in 2016 by Sqreen, a web app security startup startup. They were acquired by Datadog in 2021 and the project lost its corporate sponsor, but in this post Ben Creech announces that he is revitalizing the project, with the approval of the original maintainers.

I'm always interested in new options for running untrusted code in a safe sandbox. PyMiniRacer has the three features I care most about: code can't access the filesystem or network by default, you can limit the RAM available to it and you can have it raise an error if code execution exceeds a time limit.

The documentation includes a newly written architecture overview which is well worth a read. Rather than embed V8 directly in Python the authors chose to use ctypes - they build their own V8 with a thin additional C++ layer to expose a ctypes-friendly API, then the Python library code uses ctypes to call that.

I really like this. V8 is a notoriously fast moving and complex dependency, so reducing the interface to just a thin C++ wrapper via ctypes feels very sensible to me.

This blog post is fun too: it's a good, detailed description of the process to update something like this to use modern Python and modern CI practices. The steps taken to build V8 (6.6 GB of miscellaneous source and assets!) across multiple architectures in order to create binary wheels are particularly impressive - the Linux aarch64 build takes several days to run on GitHub Actions runners (via emulation), so they use Mozilla's Sccache to cache compilation steps so they can retry until it finally finishes.

On macOS (Apple Silicon) installing the package with "pip install mini-racer" got me a 37MB dylib and a 17KB ctypes wrapper module.

TIL 2024-03-24 Google Chrome --headless mode:

In the README for monolith (a new Rust CLI tool for archiving HTML pages along with their images and assets) I spotted this tip for using Chrome in headless mode to execute JavaScript and output the resulting DOM: …

Link 2024-03-25 sqlite-schema-diagram.sql:

A SQLite SQL query that directly returns a GraphViz definition that renders a diagram of the database schema, by Tim Allen.

The SQL is beautifully commented. It works as a big set of UNION ALL statements against queries that join data from pragma_table_list(), pragma_table_info() and pragma_foreign_key_list().

Simon Willison’s Newsletter

Discussion about this post