Datasette 1.0a8: JavaScript plugins, new plugin hooks and plugin configuration in datasette.yaml
Plus 26 links, 7 quotations and 1 TIL
In this newsletter:
Datasette 1.0a8: JavaScript plugins, new plugin hooks and plugin configuration in datasette.yaml
Plus 26 links, 7 quotations and 1 TIL
Datasette 1.0a8: JavaScript plugins, new plugin hooks and plugin configuration in datasette.yaml - 2024-02-07
I just released Datasette 1.0a8. These are the annotated release notes.
This alpha release continues the migration of Datasette's configuration from
metadata.yaml
to the newdatasette.yaml
configuration file, introduces a new system for JavaScript plugins and adds several new plugin hooks.
My plan is for this to be the last alpha that adds new features - the new plugin hooks, in this case. The next release will focus on wrapping up the stable APIs for 1.0, with a particular focus on template stability (so users can customize Datasette without fear of it breaking in future minor releases) and wrapping up the work on the stable JSON API.
Configuration
Plugin configuration now lives in the datasette.yaml configuration file, passed to Datasette using the -c/--config option. Thanks, Alex Garcia. (#2093)

datasette -c datasette.yaml
Where datasette.yaml contains configuration that looks like this:

plugins:
  datasette-cluster-map:
    latitude_column: xlat
    longitude_column: xlon
Previously plugins were configured in metadata.yaml, which was confusing as plugin settings were unrelated to database and table metadata.
This almost concludes the work (driven mainly by Alex Garcia) to clean up how Datasette is configured prior to the 1.0 release. Moving things that aren't metadata out of the metadata.yaml/json
file is a big conceptual improvement, and one that absolutely needed to happen before 1.0.
The -s/--setting option can now be used to set plugin configuration as well. See Configuration via the command-line for details. (#2252)

The above YAML configuration example using -s/--setting looks like this:

datasette mydatabase.db \
  -s plugins.datasette-cluster-map.latitude_column xlat \
  -s plugins.datasette-cluster-map.longitude_column xlon
This feature is mainly for me. I start new Datasette instances dozens of times a day to try things out, and having to manually edit a datasette.yaml file before trying something new is an annoying little piece of friction.

With the -s option anything that can be represented in JSON or YAML can also be passed on the command-line.

I mainly love this as a copy-and-paste mechanism: my notes are crammed with datasette shell one-liners, and being able to paste something into my terminal to recreate a Datasette instance with a specific configuration is a big win.

The -s option uses dot-notation to specify nested keys, but it has a simple mechanism for representing more complex objects too: you can pass them in as JSON literal strings and Datasette will parse them. The --setting documentation includes this example of configuring datasette-proxy-url:
datasette mydatabase.db \
-s plugins.datasette-proxy-url.paths '[{"path": "/proxy", "backend": "http://example.com/"}]'
Which is equivalent to the following datasette.yaml file:
plugins:
datasette-proxy-url:
paths:
- path: /proxy
backend: http://example.com/
The new /-/config page shows the current instance configuration, after redacting keys that could contain sensitive data such as API keys or passwords. (#2254)
Datasette has a set of introspection endpoints like this - /-/metadata and /-/settings and /-/threads, all of which can have .json added to get back the raw JSON. I find them really useful for debugging instances and understanding how they have been configured.
The redaction is new: previously I had designed a mechanism for passing secrets as environment variables in a way that would avoid them being exposed here, but I realized automated redaction is less likely to cause people to leak secrets by accident.
Existing Datasette installations may already have configuration set in metadata.yaml that should be migrated to datasette.yaml. To avoid breaking these installations, Datasette will silently treat table configuration, plugin configuration and allow blocks in metadata as if they had been specified in configuration instead. (#2247) (#2248) (#2249)
Originally the plan was to have Datasette fail to load if it spotted configuration in metadata.yaml that should have been migrated to datasette.yaml.
I changed my mind about this mainly as I experienced the enormous inconvenience of updating all of my Datasette instances to the new format - including rewriting the automated tests for my plugins.
I think my philosophy on this going forward is going to be that Datasette will take extra effort to keep older things working provided the additional code complexity in doing so is low enough to make it worth the trade-off. In this case I think it is.
Note that the datasette publish command has not yet been updated to accept a datasette.yaml configuration file. This will be addressed in #2195 but for the moment you can include those settings in metadata.yaml instead.
I promised myself I would ship 1.0a8 today no matter what, so I cut this feature at the last moment.
JavaScript plugins
Datasette now includes a JavaScript plugins mechanism, allowing JavaScript to customize Datasette in a way that can collaborate with other plugins.
This provides two initial hooks, with more to come in the future:
makeAboveTablePanelConfigs() can add additional panels to the top of the table page.
makeColumnActions() can add additional actions to the column menu.
Thanks Cameron Yick for contributing this feature. (#2052)
The core problem we are trying to solve here comes from what happens when multiple plugins all try to customize the Datasette instance at the same time.
This is particularly important for visualization plugins.
An example: datasette-cluster-map and datasette-geojson-map both add a map to the top of the table page. This means if you have both plugins installed you can end up with two maps!
The new mechanism allows plugins to collaborate: each plugin can contribute one or more "panels" which will then be shown above the table view in an interface with toggles to switch between them.
The column actions mechanism is similar: it allows plugins to contribute additional actions to the column menu, which appears when you click the cog icon in the header of a table column.
Cameron Yick did a great job with this feature. I've been slow in getting a release out with it though - my hope is that we can iterate more productively on it now that it's in an alpha release.
Plugin hooks
New jinja2_environment_from_request(datasette, request, env) plugin hook, which can be used to customize the current Jinja environment based on the incoming request. This can be used to modify the template lookup path based on the incoming request hostname, among other things. (#2225)
I wrote about my need for this in Page caching and custom templates for Datasette Cloud: I wanted a way to modify the Jinja environment based on the requested HTTP host, and this lets me do that.
New family of template slot plugin hooks: top_homepage, top_database, top_table, top_row, top_query, top_canned_query. Plugins can use these to provide additional HTML to be injected at the top of the corresponding pages. (#1191)
Another long-running need (the issue is from January 2021). Similar to the JavaScript plugin mechanism, this allows multiple plugins to add content to the page without one plugin overwriting the other.
New track_event() mechanism for plugins to emit and receive events when certain events occur within Datasette. (#2240)
Plugins can register additional event classes using register_events(datasette).
They can then trigger those events with the datasette.track_event(event) internal method.
Plugins can subscribe to notifications of events using the track_event(datasette, event) plugin hook.
Datasette core now emits login, logout, create-token, create-table, drop-table, insert-rows, upsert-rows, update-row and delete-row events, documented here.
Another hook inspired by Datasette Cloud. I want better analytics for that product to help track which features are being used, but I also wanted to do that in a privacy-forward manner. I decided to bake it into Datasette core and I intend to make it visible to the administrators of Datasette Cloud instances - so that it doubles as an audit log for what's happening in their instances.
I realized that this has uses beyond analytics: if a plugin wants to do something extra any time a new table is created within Datasette it can use the track_event() plugin hook to listen out for the create-table event and take action when it occurs.
New internal function for plugin authors: await db.execute_isolated_fn(fn), for creating a new SQLite connection, executing code and then closing that connection, all while preventing other code from writing to that particular database. This connection will not have the prepare_connection() plugin hook executed against it, allowing plugins to perform actions that might otherwise be blocked by existing connection configuration. (#2218)
This came about because I was trying to figure out a way to use prepare_connection()
hook to add authorizers that prevent users from deleting certain tables, but found that doing this prevented VACUUM
from working.
The new internal function provides a clean slate for plugins to do anything they like with a SQLite connection, while simultaneously preventing any write operations from other code from executing (even against other connections) until that isolated operation is complete.
Documentation
Documentation describing how to write tests that use signed actor cookies using datasette.client.actor_cookie(). (#1830)

Documentation on how to register a plugin for the duration of a test. (#2234)
The configuration documentation now shows examples of both YAML and JSON for each setting.
I like including links to new documentation in the release notes, to give people a chance to catch useful new documentation that they might otherwise miss.
Minor fixes
Datasette no longer attempts to run SQL queries in parallel when rendering a table page, as this was leading to some rare crashing bugs. (#2189)
Fixed warning: DeprecationWarning: pkg_resources is deprecated as an API (#2057)

Fixed bug where the ?_extra=columns parameter returned an incorrectly shaped response. (#2230)
Surprisingly few bug fixes in this alpha - most of the work in the last few months has been new features. I think this is a good sign in terms of working towards a stable 1.0.
Quote 2024-01-27
If you have had any prior experience with personal computers, what you might expect to see is some sort of opaque code, called a “prompt,” consisting of phosphorescent green or white letters on a murky background. What you see with Macintosh is the Finder. On a pleasant, light background (you can later change the background to any of a number of patterns, if you like), little pictures called “icons” appear, representing choices available to you.
Link 2024-01-27 The Articulation Barrier: Prompt-Driven AI UX Hurts Usability:
Jakob Nielsen: "Generative AI systems like ChatGPT use prose prompts for intent-based outcomes, requiring users to be articulate in writing prose, which is a challenge for half of the population in rich countries."
Quote 2024-01-27
Danielle Del, a spokeswoman for Sasso, said Dudesy is not actually an A.I.
“It’s a fictional podcast character created by two human beings, Will Sasso and Chad Kultgen,” Del wrote in an email. “The YouTube video ‘I’m Glad I’m Dead’ was completely written by Chad Kultgen.”
George Carlin’s Estate Sues Podcasters Over A.I. Episode
Link 2024-01-27 Simon Willison interview: AI software still needs the human touch:
Thomas Claburn interviewed me for The Register. We talked about AI training copyright, applications of AI for programming, AI security and a whole bunch of other topics.
TIL 2024-01-28 Exploring ColBERT with RAGatouille:
I've been trying to get my head around ColBERT. …
Link 2024-01-28 ColBERT query-passage scoring interpretability:
Neat interactive visualization tool for understanding what the ColBERT embedding model does - this works by loading around 50MB of model files directly into your browser and running them with WebAssembly.
Link 2024-01-28 llm-embed-onnx:
I wrote a new plugin for LLM that acts as a thin wrapper around onnx_embedding_models by Benjamin Anderson, providing access to seven embedding models that can run on the ONNX model framework.
The actual plugin is around 50 lines of code, which makes for a nice example of how thin a plugin wrapper can be that adds new models to my LLM tool.
Link 2024-01-29 Observable notebook: URL to download a GitHub repository as a zip file:
GitHub broke the "right click -> copy URL" feature on their Download ZIP button a few weeks ago. I'm still hoping they fix that, but in the meantime I built this Observable Notebook to generate ZIP URLs for any GitHub repo and any branch or commit hash.
Update 30th January 2024: GitHub have fixed the bug now, so right click -> Copy URL works again on that button.
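The URL pattern itself is simple enough to construct by hand - a sketch based on GitHub's standard archive URL layout:

```python
def github_zip_url(owner, repo, ref="main"):
    # GitHub serves repository archives at /archive/{ref}.zip,
    # where ref can be a branch name, tag or commit hash
    return f"https://github.com/{owner}/{repo}/archive/{ref}.zip"


print(github_zip_url("simonw", "datasette", "1.0a8"))
```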
Link 2024-01-29 Getting Started With CUDA for Python Programmers:
If, like me, you've avoided CUDA programming (writing efficient code that runs on NVIDIA GPUs) in the past, Jeremy Howard has a new 1hr17m video tutorial that demystifies the basics. The code is all run using PyTorch in notebooks running on Google Colab, and it starts with a very clear demonstration of how to convert an RGB image to black and white.
Link 2024-01-30 urllib3 2.2.0:
Highlighted feature: "urllib3 now works in the browser" - the core urllib3 library now includes code that can integrate with Pyodide, using the browser's fetch() or XMLHttpRequest APIs to make HTTP requests (to CORS-enabled endpoints).
Link 2024-01-30 pgroll:
"Zero-downtime, reversible, schema migrations for Postgres"
I love this kind of thing. This one has a really interesting design: you define your schema modifications (adding/dropping columns, creating tables etc) using a JSON DSL, then apply them using a Go binary.
When you apply a migration the tool first creates a brand new PostgreSQL schema (effectively a whole new database) which imitates your new schema design using PostgreSQL views. You can then point your applications that have been upgraded to the new schema at it, using the PostgreSQL search_path setting.
Old applications can continue talking to the previous schema design, giving you an opportunity to roll out a zero-downtime deployment of the new code.
Once your application has upgraded and the physical rows in the database have been transformed to the new schema you can run a --continue command to make the final destructive changes and drop the mechanism that simulates both schema designs at once.
Link 2024-01-30 Beej's Guide to Networking Concepts:
Beej's Guide to Network Programming is a legendary tutorial on network programming in C, continually authored and updated by Brian "Beej" Hall since 1995.
This is NOT that. Beej's Guide to Networking Concepts is brand new - started in March 2023 - and illustrates a whole bunch of networking concepts using Python instead of C.
From the foreword: "Is it Beej’s Guide to Network Programming in Python? Well, kinda, actually. The C book is more about how C’s (well, Unix’s) network API works. And this book is more about the concepts underlying it, using Python as a vehicle."
Link 2024-01-31 GitHub Actions: Introducing the new M1 macOS runner available to open source!:
Set "runs-on: macos-14" to run a GitHub Actions workflow on an ARM M1 runner with 7GB of RAM. I have been looking forward to this for ages: it should make it much easier to build releases of both Electron apps and Python binary wheels for Apple Silicon.
Link 2024-01-31 Macaroons Escalated Quickly:
Thomas Ptacek's follow-up on Macaroon tokens, based on a two year project to implement them at Fly.io. The way they let end users calculate new signed tokens with additional limitations applied to them ("caveats" in Macaroon terminology) is fascinating, and allows for some very creative solutions.
Link 2024-01-31 snoop:
Neat Python debugging utility by Alex Hall: snoop lets you "import snoop" and then add "@snoop" as a decorator to any function, which causes that function's source code to be output directly to the console with details of any variable state changes that occur while it's running.
I didn't know you could make a Python module callable like that - turns out it's running "sys.modules['snoop'] = snoop" in the __init__.py module!
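The trick generalizes: any module can swap its own entry in sys.modules for a callable object. A minimal demonstration (not snoop's actual code):

```python
import sys
import types


class CallableModule(types.ModuleType):
    def __call__(self, value):
        # Whatever calling the "module" should do
        return value * 2


# Register a callable instance under a module name - subsequent
# imports of "demo" return this object instead of a plain module
demo = CallableModule("demo")
sys.modules["demo"] = demo

import demo  # fetches the callable object from sys.modules
print(demo(21))  # 42
```

In snoop's case the replacement object also exposes the decorator and context manager forms as attributes.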
Link 2024-01-31 stanchion:
Dan Gallagher's new (under-development) SQLite extension that adds column-oriented tables to SQLite, using a virtual table implemented in Zig that stores records in row groups, where each row group has multiple segments (one for each column) and those segments are stored as SQLite BLOBs.
I'm surprised that this is possible using the virtual table mechanism. It has the potential to bring some of the analytical querying performance we've seen in engines like DuckDB to SQLite itself.
Link 2024-02-01 teknium/OpenHermes-2.5:
The Nous-Hermes and Open Hermes series of LLMs, fine-tuned on top of base models like Llama 2 and Mistral, have an excellent reputation and frequently rank highly on various leaderboards.
The developer behind them, Teknium, just released the full set of fine-tuning data that they curated to build these models. It's a 2GB JSON file with over a million examples of high quality prompts, responses and some multi-prompt conversations, gathered from a number of different sources and described in the data card.
Link 2024-02-02 ChunkViz:
Handy tool by Greg Kamradt to help understand how different text chunking mechanisms work by visualizing them. Chunking is an important part of preparing text to be embedded for semantic search, and thanks to this tool I've finally got a solid mental model of what recursive character text splitting does.
Link 2024-02-02 unstructured:
Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.
I got some good initial results against a PDF by running "pip install 'unstructured[pdf]'" and then using the "unstructured.partition.pdf.partition_pdf(filename)" function.
There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model - but it installed cleanly for me on macOS and worked out of the box.
Quote 2024-02-02
For many people in many organizations, their measurable output is words - words in emails, in reports, in presentations. We use words as proxy for many things: the number of words is an indicator of effort, the quality of the words is an indicator of intelligence, the degree to which the words are error-free is an indicator of care.
[...] But now every employee with Copilot can produce work that checks all the boxes of a formal report without necessarily representing underlying effort.
Link 2024-02-02 Samattical:
Automattic (the company behind WordPress) have a benefit that's provided to all 1,900+ of their employees: a paid three month sabbatical every five years.
CEO Matt Mullenweg is taking advantage of this for the first time, and here shares an Ignite talk in which he talks about the way the benefit encourages the company to plan for 5% of the company to be unavailable at any one time, helping avoid any single employee becoming a bottleneck.
Quote 2024-02-02
LLMs may offer immense value to society. But that does not warrant the violation of copyright law or its underpinning principles. We do not believe it is fair for tech firms to use rightsholder data for commercial purposes without permission or compensation, and to gain vast financial rewards in the process. There is compelling evidence that the UK benefits economically, politically and societally from upholding a globally respected copyright regime.
UK House of Lords report on Generative AI
Link 2024-02-02 Open Language Models (OLMos) and the LLM landscape:
OLMo is a newly released LLM from the Allen Institute for AI (AI2) currently available in 7b and 1b parameters (OLMo-65b is on the way) and trained on a fully openly published dataset called Dolma.
The model and code are Apache 2, while the data is under the "AI2 ImpACT license".
From the benchmark scores shared here by Nathan Lambert it looks like this may be the highest performing model currently available that was built using a fully documented training set.
What's in Dolma? It's mainly Common Crawl, Wikipedia, Project Gutenberg and the Stack.
Link 2024-02-03 The Engineering behind Figma's Vector Networks:
Fascinating post by Alex Harri (in 2019) describing Figma's unique approach to providing an alternative to the classic Bézier curve pen tool. It includes a really clear explanation of Bézier curves, then dives into the alternative, recent field of vector networks which support lines and curves between any two points rather than enforcing a single path.
Link 2024-02-03 Introducing Nomic Embed: A Truly Open Embedding Model:
A new text embedding model from Nomic AI which supports 8192 length sequences, claims better scores than many other models (including OpenAI's new text-embedding-3-small) and is available as both a hosted API and a run-yourself model. The model is Apache 2 licensed and Nomic have released the full set of training data and code.
From the accompanying paper: "Full training of nomic-embed-text-v1 can be conducted in a single week on one 8xH100 node."
Quote 2024-02-04
Rye lets you get from no Python on a computer to a fully functioning Python project in under a minute with linting, formatting and everything in place.
[...] Because it was demonstrably designed to avoid interference with any pre-existing Python configurations, Rye allows for a smooth and gradual integration and the emotional barrier of picking it up even for people who use other tools was shown to be low.
Link 2024-02-04 llm-sentence-transformers 0.2:
I added a new --trust-remote-code option when registering an embedding model, which means LLM can now run embeddings through the new Nomic AI nomic-embed-text-v1 model.
Quote 2024-02-04
Sometimes, performance just doesn't matter. If I make some codepath in Ruff 10x faster, but no one ever hits it, I'm sure it could get some likes on Twitter, but the impact on users would be meaningless.
And yet, it's good to care about performance everywhere, even when it doesn't matter. Caring about performance is cultural and contagious. Small wins add up. Small losses add up even more.
Link 2024-02-05 How does Sidekiq really work?:
I really like this category of blog post: Dan Svetlov took the time to explore the Sidekiq message queue's implementation and then wrote it up in depth.
Link 2024-02-05 shot-scraper 1.4:
I decided to add HTTP Basic authentication support to shot-scraper today and found several excellent pull requests waiting to be merged, by Niel Thiart and mhalle.
1.4 adds support for HTTP Basic auth, custom --scale-factor shots, additional --browser-arg arguments and a fix for --interactive mode.
Link 2024-02-06 scriptisto:
This is really clever. "scriptisto is tool to enable writing one file scripts in languages that require compilation, dependencies fetching or preprocessing."
You start your file with a "#!/usr/bin/env scriptisto" shebang line, then drop in a specially formatted block that tells it which compiler (if any) to use and how to build the tool. The rest of the file can then be written in any of the dozen-plus included languages... or you can create your own template to support something else.
The end result is you can now write a one-off tool in pretty much anything and have it execute as if it was a single built executable.
Link 2024-02-06 The power of two random choices, visualized:
Grant Slatton shares a visualization illustrating "a favorite load balancing technique at AWS": pick two nodes at random and then send the task to whichever of those two has the lowest current load score.
Why just two nodes? "The function grows logarithmically, so it's a big jump from 1 to 2 and then tapers off *real* quick."
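The technique is easy to simulate - a quick sketch, with arbitrary node and task counts:

```python
import random


def two_choices(loads, rng):
    # Pick two distinct nodes at random, assign the task to
    # whichever of the two currently has the lower load
    a, b = rng.sample(range(len(loads)), 2)
    chosen = a if loads[a] <= loads[b] else b
    loads[chosen] += 1
    return chosen


rng = random.Random(42)
loads = [0] * 10
for _ in range(1000):
    two_choices(loads, rng)

# Loads end up clustered tightly around the mean of 100 per node
print(loads)
```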
Link 2024-02-06 SQL for Data Scientists in 100 Queries:
New comprehensive SQLite SQL tutorial from Greg Wilson, author of Teaching Tech Together and founder of The Carpentries.
Quote 2024-02-07
If your only way of making a painting is to actually dab paint laboriously onto a canvas, then the result might be bad or good, but at least it’s the result of a whole lot of micro-decisions you made as an artist. You were exercising editorial judgment with every paint stroke. That is absent in the output of these programs.