2 Comments

I think your Dual LLM example could still be abused via prompt injection: the output of the quarantined LLM is returned directly to the user. So by sending you a targeted email I could change the summaries of emails that you read and/or act on. Seems dangerous?

Author reply:

That's entirely true: the activities of the quarantined LLM remain susceptible to prompt injection, so things like summaries could be corrupted by an attack.

I don't have a fix for that. I've pretty much decided I'll have to live with it if I want to have an AI-driven system that can summarize text for me.

You can reduce the amount of damage an attack can do if you design for it: for example, summarize each email as a separate task, rather than bundling multiple emails into a single task where the content of just one malicious message could corrupt the summaries of all the others.
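To make that concrete, here's a minimal sketch of the per-email isolation idea. The `quarantined_summarize` helper is a hypothetical name I'm using for the quarantined LLM call, not anything from the original post:

```python
def quarantined_summarize(text: str) -> str:
    """Placeholder for a quarantined-LLM call whose output may be
    attacker-influenced. Swap in a real model call here."""
    return text[:100]  # stub: pretend the first 100 characters are the summary


def summarize_inbox(emails: list[str]) -> list[str]:
    # One quarantined task per email: a prompt injection in one
    # message can only corrupt its own summary, not the whole batch.
    return [quarantined_summarize(email) for email in emails]


def summarize_inbox_risky(emails: list[str]) -> str:
    # Anti-pattern: bundling every email into a single task lets one
    # malicious message influence the summaries of all the others.
    return quarantined_summarize("\n\n---\n\n".join(emails))
```

The point of the first version is blast-radius containment: the summaries are still untrusted, but an attacker who controls one email gains no leverage over how the rest of the inbox is presented.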
