Now add a walrus: Prompt engineering in DALL-E 3
Plus executing Jina embeddings with a CLI using llm-embed-jina
In this newsletter:
Now add a walrus: Prompt engineering in DALL-E 3
Execute Jina embeddings with a CLI using llm-embed-jina
Plus 3 links and 2 quotations and 2 TILs
Now add a walrus: Prompt engineering in DALL-E 3 - 2023-10-26
Last year I wrote about my initial experiments with DALL-E 2. I've been having an absurd amount of fun playing with its sequel, DALL-E 3, recently. Here are some notes, including a peek under the hood and some notes on the leaked system prompt.
DALL-E 3 is available via ChatGPT Plus. For $20/month you get GPT-4 access, DALL-E 3, GPT Vision, Code Interpreter (aka Coding Intern), ChatGPT Plugins and the new, weird and slightly dystopian voice chat feature in the ChatGPT mobile apps. It's absolutely worth the money, for the combined entertainment value alone.
A fascinating thing about DALL-E 3 is that you don't prompt it directly: you instead access it via ChatGPT, which prompts the model for you.
ChatGPT then turns your query into several prompts, each producing a different image. The prompts it generates are available if you click on the images (on mobile you have to flip your phone horizontal to see that option).
I started by typing this into ChatGPT with DALL-E enabled:
A super posh pelican with a monocle watching the Monaco F1
It gave me these four images. I've included the prompt it used for each one:
Note how the prompts it generates cover four different styles - photo, illustration, watercolor painting and vector image. They also use pleasing and varied language to describe different aspects of the image.
ChatGPT is prompt engineering DALL-E for us!
The really fun thing though is that you can ask it follow-ups.
I said:
More like the first one please
And got back:
Next I added what is becoming my universal DALL-E 3 follow-up prompt, always guaranteed to produce better results:
Add a walrus
The result:
That second one is amazing - exactly what I want from a photo of a posh pelican and a walrus watching the Monaco Grand Prix.
I also enjoy how the third image ended up as a weird Pelican-Walrus hybrid, with tusks, a monocle and pelican feathers for the hair and whiskers.
Then @BrockMcBreadcat suggested:
Now make them muppets
LOVE that third one!
I mentioned on Twitter that it was a shame you can't reuse elements of an image for further visual consistency... and was pointed to a thread by Paul Couvert explaining how to do exactly that using seeds.
I didn't believe this would work - my mental model of how random number seeds and image generation models work doesn't fit what happened next.
I wrote:
What is the seed for image three?
ChatGPT replied:
The seed for the third image is 1379049893. If you have any other questions or requests, please let me know!
Then I said:
Add bow ties to both characters in seed 1379049893
It generated this prompt for the new image:
Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.
I'm pretty stunned by this. It's not an exact recreation - the boats in the background have changed slightly, and it added a top hat that I wasn't expecting. But it's pretty close!
People have been trying to figure out hacks to get Midjourney to create consistent characters for the past year, and DALL-E 3 apparently has that ability as an undocumented feature!
Peeking under the hood
Let's see if we can find out a bit more about how this works.
I used the Firefox DevTools network inspector to grab the JSON representation of my conversation with ChatGPT. Here's a pretty-printed copy.
At line 1087 you can see the point where I asked it to "Add bow ties to both characters in seed 1379049893".
The next message in the transcript looks like this:
{
  "id": "5300d51c-794d-4d5f-bf75-6ea2dad4027c",
  "author": {
    "role": "assistant",
    "metadata": {}
  },
  "create_time": 1698340593.174454,
  "content": {
    "content_type": "text",
    "parts": [
      "{\n \"prompts\": [\n \"Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.\"\n ],\n \"size\": \"1024x1024\",\n \"seeds\": [1379049893]\n}"
    ]
  },
  "status": "finished_successfully",
  "end_turn": false,
  "weight": 1,
  "metadata": {
    "finish_details": {
      "type": "stop",
      "stop_tokens": [
        100265
      ]
    },
    "is_complete": true,
    "model_slug": "gpt-4-dalle",
    "parent_id": "aaa2c3bc-b670-4a3a-bf9f-46a1f3304266",
    "timestamp_": "absolute"
  },
  "recipient": "dalle.text2im"
}
Note the author is "assistant" - that's ChatGPT - and the recipient is "dalle.text2im".
The "parts" key includes some encoded JSON. Decoded and pretty-printed that looks like this:
{
  "prompts": [
    "Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance."
  ],
  "size": "1024x1024",
  "seeds": [1379049893]
}
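If you want to pull those payloads out of a transcript yourself, it's just JSON-within-JSON. Here's a minimal Python sketch, assuming you've saved the conversation JSON from the network inspector to a file called conversation.json - that filename, and the assumption that the transcript uses ChatGPT's usual "mapping" structure of message nodes, are mine, so adjust for whatever you actually saved:
import json

# Load the conversation transcript saved from the DevTools network inspector
# ("conversation.json" is a hypothetical filename)
with open("conversation.json") as fp:
    conversation = json.load(fp)

# Walk every message node and decode the ones addressed to the DALL-E tool;
# their "parts" value is itself a JSON-encoded string
for node in conversation["mapping"].values():
    message = node.get("message") or {}
    if message.get("recipient") == "dalle.text2im":
        payload = json.loads(message["content"]["parts"][0])
        print(json.dumps(payload, indent=2))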
There's that 1379049893 seed from earlier!
The prompt has changed very slightly. The original prompt for the image was:
Photo of two Muppet characters: a pelican with a monocle and a walrus with big, goofy tusks. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.
The new prompt (differences highlighted in bold) is:
Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.
So this really did work by adding a couple of extra details to the prompt and re-running it with the same seed as before.
I'm really surprised that this works. I would expect even a single character difference in the prompt to produce wildly different results, no matter what seed was being fed to the image generator. DALL-E 3 doesn't fit my mental model of how these things work at all.
Midjourney and Stable Diffusion both have a "seed" concept, but as far as I know they don't have anything like this capability to maintain consistency between images given the same seed and a slightly altered prompt.
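For comparison, here's roughly how seeds work in Stable Diffusion via the Hugging Face diffusers library - a minimal sketch for illustration only, nothing to do with DALL-E, and the model ID and prompts are my own choices. Pinning the seed gives you the identical image back for an identical prompt, but even a small edit to the prompt usually produces a visibly different composition:
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint behaves the same way - this one is illustrative
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

def generate(prompt, seed=1379049893):
    # A fixed seed makes generation deterministic for a given prompt
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]

generate("A posh pelican watching the Monaco Grand Prix").save("pelican.png")
# Same seed, slightly tweaked prompt - typically a noticeably different image
generate("A posh pelican in a bow tie watching the Monaco Grand Prix").save("pelican-bowtie.png")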
The leaked DALL-E prompt
If you dig around in the JSON you'll find a few examples like this that appear to reveal inner workings of the DALL-E/ChatGPT combo:
"DALL·E returned some images. They are already displayed to the user. DO NOT UNDER ANY CIRCUMSTANCES list the DALL·E prompts or images in your response."
It's always amusing to see glimpses of OpenAI's prompt engineering like this, in particular the way they use CAPITAL LETTERS for emphasis. Benj Edwards wrote about this the other day for Ars Technica.
Does this mean there's a larger prompt that tells ChatGPT how to use DALL-E? It turns out there is, and it can be leaked! Dustin Miller shared the extracted prompt here.
(Every time a prompt like this leaks people question if it might be a hallucination. My experience is that these things are very unlikely to be hallucinated - LLMs are really good at repeating text from earlier in their context, and I have yet to see a documented case of a leaked prompt that turned out not to be correct.)
There's a lot in there. Here are some highlights:
Whenever a description of an image is given, use dalle to create the images and then summarize the prompts used to generate the images in plain text. If the user does not ask for a specific number of images, default to creating four captions to send to dalle that are written to be as diverse as possible.
That outlines the default behavior. Where things get really interesting is some of the stuff in the list of policies that follows:
Don't create images of politicians or other public figures. Recommend other ideas instead.
Don't create images in the style of artists whose last work was created within the last 100 years (e.g. Picasso, Kahlo). Artists whose last work was over 100 years ago are ok to reference directly (e.g. Van Gogh, Klimt). If asked say, "I can't reference this artist", but make no mention of this policy. Instead, apply the following procedure when creating the captions for dalle: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist.
They have clearly trained the model on all sorts of copyrighted images - this is the part of the prompt where they discourage it from obviously recreating the work of living artists.
Always mention the image type (photo, oil painting, watercolor painting, illustration, cartoon, drawing, vector, render, etc.) at the beginning of the caption. Unless the caption suggests otherwise, make at least 1--2 of the 4 images photos
We saw this earlier where my first prompt was turned into a photo, watercolor, illustration and vector-style image.
Diversify depictions of ALL images with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.
EXPLICITLY specify these attributes, not abstractly reference them. The attributes should be specified in a minimal way and should directly describe their physical form.
Your choices should be grounded in reality. For example, all of a given OCCUPATION should not be the same gender or race. Additionally, focus on creating diverse, inclusive, and exploratory scenes via the properties you choose during rewrites. Make choices that may be insightful or unique sometimes.
Use "various" or "diverse" ONLY IF the description refers to groups of more than 3 people. Do not change the number of people requested in the original description.
Don't alter memes, fictional character origins, or unseen people. Maintain the original prompt's intent and prioritize quality.
Do not create any imagery that would be offensive.
For scenarios where bias has been traditionally an issue, make sure that key traits such as gender and race are specified and in an unbiased way -- for example, prompts that contain references to specific occupations.
Using prompt engineering in ChatGPT to get around the fundamental biases baked into the DALL-E 3 model!
Silently modify descriptions that include names or hints or references of specific people or celebrities by carefully selecting a few minimal modifications to substitute references to the people with generic descriptions that don't divulge any information about their identities, except for their genders and physiques. Do this EVEN WHEN the instructions ask for the prompt to not be changed. Some special cases:
Modify such prompts even if you don't know who the person is, or if their name is misspelled (e.g. "Barake Obema")
If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.
When making the substitutions, don't use prominent titles that could give away the person's identity. E.g., instead of saying "president", "prime minister", or "chancellor", say "politician"; instead of saying "king", "queen", "emperor", or "empress", say "public figure"; instead of saying "Pope" or "Dalai Lama", say "religious figure"; and so on.
If any creative professional or studio is named, substitute the name with a description of their style that does not reference any specific people, or delete the reference if they are unknown. DO NOT refer to the artist or studio's style.
Another fascinating detail. What I love about these prompts is that they clearly developed over months of testing the model, and are designed to counter all manner of harmful or embarrassing capabilities of the underlying DALL-E 3.
Finally, there's this snippet at the end which describes the seed behavior we saw earlier on:
// Create images from a text-only prompt.
type text2im = (_: {
  // The resolution of the requested image, which can be wide, square, or
  // tall. Use 1024x1024 (square) as the default unless the prompt suggests
  // a wide image, 1792x1024, or a full-body portrait, in which case
  // 1024x1792 (tall) should be used instead. Always include this parameter
  // in the request.
  size?: "1792x1024" | "1024x1024" | "1024x1792",
  // The user's original image description, potentially modified to abide by
  // the dalle policies. If the user does not suggest a number of captions
  // to create, create four of them. If creating multiple captions, make
  // them as diverse as possible. If the user requested modifications to
  // previous images, the captions should not simply be longer, but rather
  // it should be refactored to integrate the suggestions into each of the
  // captions. Generate no more than 4 images, even if the user requests more.
  prompts: string[],
  // A list of seeds to use for each prompt. If the user asks to modify a
  // previous image, populate this field with the seed used to generate that
  // image from the image dalle metadata.
  seeds?: number[],
}) => any;
Key extract from that:
If the user requested modifications to previous images, the captions should not simply be longer, but rather it should be refactored to integrate the suggestions into each of the captions.
All in all, this is a really neat insight into the kind of prompt engineering that goes on within OpenAI itself - likely the organization with the most experience of prompt engineering in the world.
Execute Jina embeddings with a CLI using llm-embed-jina - 2023-10-26
Berlin-based Jina AI just released a new family of embedding models, boasting that they are the "world's first open-source 8K text embedding model" and that they rival OpenAI's text-embedding-ada-002 in quality.
I wrote about embeddings extensively the other day - if you're not familiar with what they are and what you can do with them I suggest reading that first.
This evening I built and released a new plugin for my LLM tool which adds support for Jina's new embedding models.
Trying out llm-embed-jina
The plugin is called llm-embed-jina. Here's the quickest way to get started with it:
First, install LLM if you haven't already. You can use Homebrew on a Mac:
brew install llm
Or pipx:
pipx install llm
Or pip:
pip install llm
Now you can install the llm-embed-jina plugin:
llm install llm-embed-jina
The llm install command ensures it gets installed in the correct virtual environment, no matter how you installed LLM itself.
Run this command to check that it added the models:
llm embed-models
You should see output like this:
ada-002 (aliases: ada, oai)
jina-embeddings-v2-small-en
jina-embeddings-v2-base-en
jina-embeddings-v2-large-en
The jina-embeddings-v2-large-en model isn't available yet, but should work as soon as Jina release it. I expect it will show up at huggingface.co/jinaai/jina-embeddings-v2-large-en (currently a 404).
Now you can run one of the models. The -small-en model is a good starting point - it's only a 65MB download, while the -base-en model is 275MB.
The model will download the first time you try to use it. Run this:
llm embed -m jina-embeddings-v2-small-en -c 'Hello world'
This will return a JSON array of 512 floating point numbers - the embedding vector for the string "Hello world".
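You can run the same embedding from Python using LLM's embedding API - a quick sketch (llm.get_embedding_model() and .embed() are part of LLM's documented Python API; the cosine similarity helper is plain Python I've added for illustration):
import math
import llm

# Load the Jina model through LLM's Python API (it downloads on first use)
model = llm.get_embedding_model("jina-embeddings-v2-small-en")

a = model.embed("Hello world")
b = model.embed("Greetings, planet")
print(len(a))  # 512 dimensions for the -small-en model

def cosine_similarity(x, y):
    # Plain-Python cosine similarity, just for illustration
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y)))

print(cosine_similarity(a, b))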
Embeddings are much more interesting if you store them somewhere and then use them to run comparisons. The llm embed-multi command can do that.
Change directory to a folder that you know contains README.md files (anything with a node_modules folder will do) and run this:
llm embed-multi readmes \
  -m jina-embeddings-v2-small-en \
  --files . '**/README.md' \
  --database readmes.db
This will create a SQLite database called readmes.db, then search for every README.md file in the current directory and all subdirectories, embed the content of each one and store the results in that database.
Those embeddings will live in a collection called readmes.
If you leave off the --database readmes.db option the collection will be stored in a default SQLite database tucked away somewhere on your system.
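If you're curious where that default database lives, LLM's Python API exposes its data directory - a hedged sketch (llm.user_dir() is part of the API; the embeddings.db filename is my assumption, so check the LLM embeddings documentation if it isn't there):
import llm

# llm.user_dir() is LLM's per-user data directory; I believe the default
# collections database is an embeddings.db file inside it (assumption)
print(llm.user_dir() / "embeddings.db")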
Having done this, you can run semantic similarity searches against the new collection like this:
llm similar readmes -d readmes.db -c 'utility functions'
When I ran that in my hmb-map directory I got these:
{"id": "node_modules/@maplibre/maplibre-gl-style-spec/src/feature_filter/README.md", "score": 0.7802185991017785, "content": null, "metadata": null}
{"id": "node_modules/kind-of/README.md", "score": 0.7725600920927725, "content": null, "metadata": null}
{"id": "node_modules/which/README.md", "score": 0.7645426557095619, "content": null, "metadata": null}
{"id": "node_modules/@mapbox/point-geometry/README.md", "score": 0.7636548563018607, "content": null, "metadata": null}
{"id": "node_modules/esbuild/README.md", "score": 0.7633325127194481, "content": null, "metadata": null}
{"id": "node_modules/maplibre-gl/src/shaders/README.md", "score": 0.7614428292518743, "content": null, "metadata": null}
{"id": "node_modules/minimist/README.md", "score": 0.7581314986768929, "content": null, "metadata": null}
{"id": "node_modules/split-string/README.md", "score": 0.7563253351715924, "content": null, "metadata": null}
{"id": "node_modules/assign-symbols/README.md", "score": 0.7555915219064293, "content": null, "metadata": null}
{"id": "node_modules/maplibre-gl/build/README.md", "score": 0.754027372081506, "content": null, "metadata": null}
These are the top ten results by similarity to the string I entered.
You can also pass in the ID of an item in the collection to see other similar items:
llm similar readmes -d readmes.db node_modules/esbuild/README.md | jq .id
I piped it through | jq .id to get back just the IDs. I got this:
"node_modules/@esbuild/darwin-arm64/README.md"
"node_modules/rollup/README.md"
"node_modules/assign-symbols/README.md"
"node_modules/split-string/node_modules/extend-shallow/README.md"
"node_modules/isobject/README.md"
"node_modules/maplibre-gl/build/README.md"
"node_modules/vite/README.md"
"node_modules/nanoid/README.md"
"node_modules/@mapbox/tiny-sdf/README.md"
"node_modules/split-string/node_modules/is-extendable/README.md"
See the LLM embeddings documentation for more details on things you can do with this tool.
How I built the plugin
I built the first version of this plugin in about 15 minutes. It took another hour to iron out a couple of bugs.
I started with this cookiecutter template, then pasted in the recipe from the LLM documentation on writing embedding model plugins, combined with some example code that Jina provided in their model release. Here's their code:
from transformers import AutoModel
from numpy.linalg import norm
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
That numpy and cos_sim bit isn't necessary, so I ignored that.
The first working version of the plugin was a file called llm_embed_jina.py that looked like this:
import llm
from transformers import AutoModel


@llm.hookimpl
def register_embedding_models(register):
    for model_id in (
        "jina-embeddings-v2-small-en",
        "jina-embeddings-v2-base-en",
        "jina-embeddings-v2-large-en",
    ):
        register(JinaEmbeddingModel(model_id))


class JinaEmbeddingModel(llm.EmbeddingModel):
    def __init__(self, model_id):
        self.model_id = model_id
        self._model = None

    def embed_batch(self, texts):
        if self._model is None:
            self._model = AutoModel.from_pretrained(
                "jinaai/{}".format(self.model_id), trust_remote_code=True
            )
        results = self._model.encode(texts)
        return (list(map(float, result)) for result in results)
There's really not a lot to it.
The register_embedding_models() function is a plugin hook that LLM calls to register all of the embedding models.
JinaEmbeddingModel is a subclass of llm.EmbeddingModel. It just needs to implement two things: a constructor and that embed_batch(self, texts) method.
AutoModel.from_pretrained() is provided by Hugging Face Transformers. It downloads and caches the model the first time you call it.
The model returns numpy arrays, but LLM wants a regular Python list of floats - that's what that last return line is doing.
I found a couple of bugs with this. The model didn't like having .encode(texts) called with a generator, so I needed to convert that into a list. Then later I found that text longer than 8192 characters could cause the model to hang in some situations, so I added my own truncation.
The current version (0.1.2) of the plugin, with fixes for both of those issues, looks like this:
import llm
from transformers import AutoModel

MAX_LENGTH = 8192


@llm.hookimpl
def register_embedding_models(register):
    for model_id in (
        "jina-embeddings-v2-small-en",
        "jina-embeddings-v2-base-en",
        "jina-embeddings-v2-large-en",
    ):
        register(JinaEmbeddingModel(model_id))


class JinaEmbeddingModel(llm.EmbeddingModel):
    def __init__(self, model_id):
        self.model_id = model_id
        self._model = None

    def embed_batch(self, texts):
        if self._model is None:
            self._model = AutoModel.from_pretrained(
                "jinaai/{}".format(self.model_id), trust_remote_code=True
            )
        results = self._model.encode([text[:MAX_LENGTH] for text in texts])
        return (list(map(float, result)) for result in results)
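As a quick sanity check you can exercise that class directly, skipping LLM's plugin machinery entirely - a minimal sketch that relies only on the code above (embed_batch() returns a generator of float lists, so I wrap it in list()):
# Assumes the JinaEmbeddingModel class above is importable from llm_embed_jina.py
from llm_embed_jina import JinaEmbeddingModel

model = JinaEmbeddingModel("jina-embeddings-v2-small-en")
vectors = list(model.embed_batch(["Hello world", "Goodbye world"]))
print(len(vectors), len(vectors[0]))  # 2 vectors of 512 dimensions each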
I'm really pleased with how quickly this came together - I think it's a strong signal that the LLM embeddings plugin design is working well.
TIL 2023-10-24 Serving a JavaScript project built using Vite from GitHub Pages:
I figured out how to serve a JavaScript project built using Vite using GitHub Pages and a custom build script that runs using GitHub Actions. …
TIL 2023-10-24 Serving a custom vector web map using PMTiles and maplibre-gl:
Protomaps is "an open source map of the world, deployable as a single static file on cloud storage". It involves some very clever technology, rooted in the PMTiles file format which lets you create a single static file containing vector tile data which is designed to be hosted on static hosting but can then serve vector tiles to clients using HTTP range requests. …
Quote 2023-10-24
The real value in evolving as an engineer isn't solely about amassing a heap of isolated skills but weaving them into an intricate web of abilities that's greater than the sum of its parts.
Link 2023-10-24 chDB: This is a really interesting development: chDB offers "an embedded SQL OLAP Engine" as a Python package, which you can install using "pip install chdb". What you're actually getting is a wrapper around ClickHouse - it's almost like ClickHouse has been repackaged into an embedded database similar to SQLite.
Link 2023-10-25 Web Components Will Outlive Your JavaScript Framework: A really clear explanation of the benefit of Web Components built using dependency-free vanilla JavaScript, specifically for interactive components that you might want to embed in something like a blog post. Includes a very neat minimal example component.
Quote 2023-10-25
If a LLM is like a database of millions of vector programs, then a prompt is like a search query in that database [...] this “program database” is continuous and interpolative — it’s not a discrete set of programs. This means that a slightly different prompt, like “Lyrically rephrase this text in the style of x” would still have pointed to a very similar location in program space, resulting in a program that would behave pretty closely but not quite identically. [...] Prompt engineering is the process of searching through program space to find the program that empirically seems to perform best on your target task.
Link 2023-10-26 Oh-Auth - Abusing OAuth to take over millions of accounts: Describes an attack against vulnerable implementations of OAuth.
Let's say your application uses OAuth against Facebook, and then takes the returned Facebook token and gives it access to the user account with the matching email address passed in the token from Facebook.
It's critical that you also confirm the token was generated for your own application, not something else. Otherwise, any secretly malicious app that uses Facebook login could take one of its stored tokens and use it to hijack the account on your site belonging to that user's email address.