People of ACM - Sewon Min
June 24, 2025
How did you decide to focus on LLMs as a research area?
People in NLP have long aspired to build general-purpose models that can perform a wide range of tasks without task-specific training. When LLMs came out, they seemed like a path toward that goal: train a giant model on massive data in a self-supervised way and eliminate the need for human supervision. It’s a frustratingly simple idea, but it actually teaches us a lot, e.g., the critical role of data quality and quantity, the value of minimizing human priors, and the possibility of removing human labeling. That’s the formal answer. The honest one is that it sounded really exciting and I found it fun to work on. I feel very lucky that this area turned out to have such a broad and meaningful impact.
In a recent talk, you noted that current large language model chatbots make factual errors up to 42% of the time when they are generating biographies about people. Why is this so?
I believe this is related to how current LLMs fundamentally operate: by memorizing facts seen during training. LLMs can generate accurate biographies for well-known individuals because such information appears frequently in their training data and is easy to memorize. But for less represented subjects, they often fail to recall facts precisely and instead generate text that sounds plausible but is in fact incorrect (a phenomenon known as hallucination). This reflects a core limitation of how these models learn from data.
You are recognized for groundbreaking work in “non-parametric” large language models. Will you give an example of the kind of response a non-parametric large language model generates compared with a current (standard) large language model?
Standard LLMs often hallucinate facts. For example, when I asked ChatGPT (without search engine access) “Tell me about restaurants in Seoul with three Michelin stars,” it named two incorrect ones: Gaon, which has two stars, and La Yeon, which is closed, and it even invented opening hours. This likely reflected outdated or missing memorized knowledge. In contrast, a nonparametric LLM would retrieve documents from an up-to-date datastore, such as this 2025 article noting that Seoul has one three-star restaurant, Mingles, and use those documents to identify the correct answer.
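To make the contrast concrete, below is a minimal sketch of the retrieve-then-answer loop that nonparametric LLMs rely on. The toy datastore, the bag-of-words retriever, and the prompt construction are illustrative assumptions for this interview, not the actual system discussed above; a real system would use a learned retriever over millions of documents and pass the retrieved text to the language model.

```python
# Minimal sketch: answer from an external, updatable datastore instead of
# relying only on facts memorized in the model's parameters.
from collections import Counter
import math

# Toy datastore; in practice this holds millions of documents and can be
# refreshed without retraining the model. Contents here are illustrative.
DATASTORE = [
    "2025 guide: Mingles is currently the only restaurant in Seoul with three Michelin stars.",
    "Gaon in Seoul holds two Michelin stars.",
    "La Yeon in Seoul has closed.",
]

def bag_of_words(text):
    """Lowercased unigram counts; a stand-in for a learned dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Return the k datastore documents most similar to the query."""
    q = bag_of_words(query)
    return sorted(DATASTORE, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)[:k]

def build_prompt(query):
    """Condition the answer on retrieved documents (the generation step is omitted)."""
    docs = retrieve(query)
    return "Answer using only these documents:\n" + "\n".join(docs) + "\n\nQuestion: " + query

print(build_prompt("Which restaurants in Seoul have three Michelin stars?"))
```

The key point of the sketch is that the facts live in the datastore, which can be updated (e.g., when a restaurant closes) without touching the model’s parameters.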
An ongoing problem with LLMs is that they often include copyrighted or private information in the responses they generate. For example, they gather information from leading newspapers without providing attribution to the authors. How can non-parametric LLMs address this challenge?
In our ICLR 2024 paper, “SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore,” we proposed to design a nonparametric LLM whose parameters are trained exclusively on permissive data, excluding copyrighted or nonpermissive text. At inference time, the model can access a datastore that includes copyrighted content, allowing it to retrieve and reason over relevant information before making predictions. This approach enables inherent attribution, making it possible to trace which datapoints were used and assign credit to data owners. It also supports data opt-out, which allows the model to honor data owners’ requests without additional training.
Naturally, this setup involves tradeoffs, e.g., excluding all copyrighted text from training could lead to a significant drop in model performance. Addressing this tradeoff is a direction I plan to continue exploring in future research.
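For readers curious how this separation might look in practice, here is a minimal sketch of an inference-time datastore that supports attribution and opt-out, in the spirit of the approach described above. The Document and Datastore classes, the owner names, and the word-overlap retriever are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch: parameters are assumed to be trained only on permissive text;
# higher-risk text lives in this inference-time datastore, which records who
# owns each document so usage can be attributed and data can be withdrawn.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    owner: str  # who should be credited when this document is used

@dataclass
class Datastore:
    docs: list = field(default_factory=list)

    def retrieve(self, query, k=2):
        """Rank by naive word overlap; a real system would use a learned retriever."""
        q = set(query.lower().split())
        ranked = sorted(self.docs, key=lambda d: len(q & set(d.text.lower().split())), reverse=True)
        return ranked[:k]

    def opt_out(self, owner):
        """Honor a data owner's request by dropping their documents; no retraining needed."""
        self.docs = [d for d in self.docs if d.owner != owner]

store = Datastore([
    Document("Background article on topic X.", owner="Newspaper A"),
    Document("In-depth report on topic X with recent figures.", owner="Newspaper B"),
])

# Inherent attribution: we can see exactly which datapoints informed a prediction.
for doc in store.retrieve("What is the latest on topic X?"):
    print(f"credit: {doc.owner} -> {doc.text}")

# Opt-out: Newspaper B withdraws its data; future predictions no longer use it.
store.opt_out("Newspaper B")
```

The design point the sketch illustrates is that opting out only edits the datastore; because the trained parameters never saw the nonpermissive text, no retraining is required.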
Where are LLMs headed in the near future?
I believe making LLMs factual, precise, and controllable will be the most important challenges in the field. Today’s LLMs are remarkable, often answering expert-level questions, but they are not yet reliable enough for high-stakes applications, where even rare errors can carry significant risk. Improving their robustness, reliability, and controllability is likely to remain a central research problem.
Another exciting direction is rethinking LLM architectures to enable more creative and responsible use of data. Data is the new oil, and leading labs invest heavily to acquire high-value data. Yet, many important sources such as knowledge from national labs or up-to-date news articles are inaccessible due to privacy, policy, or ownership constraints.
Can we design models that allow data to be used in a way that aligns with the interests of data owners? This could include architectures that attribute and credit data owners during inference. New models could also support fine-grained control and removal of data or enable learning from private data without requiring it to leave the owners’ infrastructure. Such innovations could open up new forms of collaboration between data owners, model developers, and users, benefiting all parties involved.
Sewon Min is an Assistant Professor at the University of California, Berkeley, and a research scientist at the Allen Institute for AI. At UC Berkeley, she is part of the Berkeley Artificial Intelligence Research Lab (BAIR) as well as the Berkeley NLP Group. Min earned PhD and MSc degrees in Computer Science and Engineering from the University of Washington, as well as a BSc degree in Computer Science and Engineering from Seoul National University.

Min was one of only two people who received an Honorable Mention for the ACM Doctoral Dissertation Award. Her dissertation, “Rethinking Data Use in Large Language Models,” was credited with greatly improving our understanding of how large language models (LLMs) work as well as providing a roadmap for how to build the next generation of these technologies.