6 Comments

This is so comprehensive. Every time I thought of a question, it was covered in a later section of your post. Thank you!

Hi Simon, a long time ago I saw your post on the NY Times story on Sam Altman's firing, discussing how anonymous sourcing was signaled in language. My area of interest and intervention at the Markkula Center for Applied Ethics (Santa Clara University) is, broadly, journalistic sourcing.

Our 2024 learning: we proposed a new benchmark for LLMs on annotating sourcing (as a route to assessing media) and compared 5 models. The preprint (published Jan 3, 2025) is here. (We posted the dataset and prompts on Hugging Face.) The findings show how LLMs struggle with an area I call source justifications.

https://arxiv.org/abs/2501.00164

Would love your and your readers' feedback.

Will get you Ai2 on the >GPT4 Elo list, stat.

Just a tremendously informative wrap-up, appreciate you taking the time. Do you know of any good resources other than personal testing for creating evals? I've had success doing it myself, but the number of tasks means I don't really have enough examples to generalize good lessons.

I'm still looking for those myself. The best writing I've seen so far on evals has been from Hamel: https://hamel.dev/blog/posts/evals/ and https://hamel.dev/blog/posts/llm-judge/

Fair enough, will keep exploring and let you know if I find anything.
