This is so comprehensive. Every time I thought of a question, it was covered in a later section of your post. Thank you!
Hi Simon, a long time ago I saw your post on the NY Times story on Sam Altman's firing, discussing how anonymous sourcing was signaled in language. My area of interest and intervention at the Markkula Center for Applied Ethics (Santa Clara University) is, broadly, journalistic sourcing.
Our 2024 learning: We proposed a new benchmark for LLMs on annotating sourcing (as a route to assessing media) and compared 5 models. The preprint (published Jan 3, 2025) is linked below. (We posted the dataset and prompts on Hugging Face.) The findings show how LLMs struggle with an area I call source justifications.
https://arxiv.org/abs/2501.00164
Would love your and your readers' feedback.
Will get you Ai2 on the >GPT4 Elo list, stat.
Just a tremendously informative wrap-up, appreciate you taking the time. Do you know of any good resources other than personal testing for creating evals? I've had success doing it myself, but the number of tasks means I don't really have enough examples to generalize good lessons.
I'm still looking for those myself. The best writing I've seen so far on evals has been from Hamel: https://hamel.dev/blog/posts/evals/ and https://hamel.dev/blog/posts/llm-judge/
Fair enough, will keep exploring and let you know if I find anything.