Nicholas

Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability

Eric Ho is building Goodfire to solve one of AI’s most critical challenges: understanding what’s actually happening inside neural networks. His team is developing techniques to understand, audit and edit neural networks at the feature level. Eric discusses breakthrough results in resolving superposition through sparse autoencoders, successful model editing demonstrations and real-world applications in genomics with Arc Institute's DNA foundation models. He argues that interpretability will be critical as AI systems become more powerful and take on mission-critical roles in society.

Hosted by Sonya Huang and Roelof Botha, Sequoia Capital

Mentioned in this episode:
Mech interp: Mechanistic interpretability
Phineas Gage: 19th century railway engineer who lost most of his brain’s left frontal lobe in an accident. Became a famous case study in neuroscience.
Human Genome Project: Effort from 1990-2003 to generate the first sequence of the human genome, which accelerated the study of human biology
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Zoom In: An Introduction to Circuits: First important mechanistic interpretability paper from OpenAI in 2020
Superposition: Concept from physics applied to interpretability that allows neural networks to simulate larger networks (e.g. more concepts than neurons)
Apollo Research: AI safety company that designs AI model evaluations and conducts interpretability research

Published: Jul 8, 2025
Uploaded: Jun 11, 2026
File type: Podcast

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:37

[00:00] So Goodfire is an AI interpretability [00:02] research company really trying to answer the question of what's actually going on inside the mind [00:07] of a neural net. [00:08] Um, [00:09] so kind of the ultimate goal and the ultimate [00:12] reason why we started everything was like, [00:15] we just see neural networks kind of going into more and more mission-critical contexts, and I think it's going to be enormously transformative for society. But in order to do so, [00:23] you want to [00:25] build it safely, powerfully, reliably. And I think it's going to be critical to be able to understand, edit, and debug AI models in order to do that. And so that's what we're kind of enabling for the very first time. It's like unlocking the black box of a neural network such that you can intentionally [00:44] design it rather than just kind of like grow it from data. [00:48] *music* [01:03] What if we could crack open the black box of AI and see exactly how it thinks? Today we're joined by Eric Ho, the founder of Goodfire, who's building tools to peer inside neural nets and understand their minds. Eric reveals how his team has successfully disentangled the mysterious phenomenon of superposition, where single neurons encode multiple concepts, and can now steer AI behavior with increasingly surgical precision. [01:27] We explore whether interpretability could help us discover new biological insights, edit out harmful behaviors from large language models, and even understand our own brains better.

1:37-3:15

[01:37] Eric boldly predicts that we'll fully decode neural nets by 2028, transforming AI from black boxes into more intentional design. [01:45] Enjoy the show. [01:47] Eric, thank you so much for joining us today. Of course. Yeah. Happy to be here. [01:51] Thanks for having me. First question. Can we ever trust generative AI if these foundation models are very much black boxes? [01:59] Can we ever trust them if they're black boxes? So I guess like [02:02] Maybe thinking about like [02:04] what would happen if we were to just kind of deploy AI models as black boxes like in perpetuity. So the black box way to do this would be like, [02:15] And I'm kind of assuming that we're playing this forward a few years and we want AI in charge of like really mission critical. [02:21] applications like um [02:23] maybe being in charge of our power grid or making big investment decisions, maybe even for a seed investment at Sequoia or like a really large, you know, like, [02:33] million dollar investment decision. And... [02:38] I think the black box way to make sure that the AI is performing appropriately is you take a look at evals and you run a bunch of evaluations to make sure that it's behaving appropriately in test sets. And then you'd look at its track record and see if it's... [02:54] reliable enough to perform across a wide variety of things. And [02:59] uh, [03:00] I think like... [03:01] The question then is like, why not take all this additional signal that you get from looking inside a neural network and trying to play forward like how it's going to behave in a much wider, broader set of situations? Like, why not?

3:15-5:00

[03:15] like look inside and actually like get a bunch more reliability, certainty about how it's thinking, how it's approaching the problem. And I think like you're just leaving a bunch on the table if you're not like looking for all the signal that you can get. So the way that I think about this is like, I don't know, when you're manufacturing a new drug, it's like you can do the black box way of just like seeing how humans respond to the drug in like a clinical trial. Whereas like you could also just like kind of look inside and like look at biochemically like [03:44] how the drug is, um, [03:46] processed or like drug interactions at the molecular, cellular level. And, uh, [03:54] yeah, I just feel like there's so much to be learned when you actually look inside and deeply understand something. [04:01] How possible do you think it is to look inside and deeply understand a large language model? Do you think it's on the scale from [04:08] hopeless, we can't ever understand it, it's just a black box, too many neurons, to we can actually map out the mind of a neural net? I'm curious where you think the field will be. [04:17] Well, I'm very biased, but I think it's very, very possible. So, I mean, a lot of the people in mech interp come from backgrounds in computational neuroscience or cognitive science. [04:30] For those people, when you're like actually looking inside the brain, you spend so much time like trying to understand like what a single neuron does or just getting any signal whatsoever. [04:41] And in the field of mech interp, you have perfect access to the neurons, the parameters, the weights, the attention patterns of a neural network. So you're coming in with a huge advantage for like, at least like you get all the data that you need. So then the real question is like, how can we make progress? How can we...

5:00-6:32

[05:00] try to understand and seek to understand all of it. [05:05] And [05:07] I think we just got to try. I think it's deeply necessary and critical for the future. And we have like a norm established. We can explain some percentage of the network by reconstructing it and extracting its concepts and the features that it uses in order to generate its response. And once you have at least a baseline, like a [05:25] rudimentary understanding, kind of where we're at right now, you can hill climb on that metric and seek to understand like more and more of the network. [05:32] Do you think it's going to be necessary for us [05:34] to understand neural nets, to really harness them long term? [05:38] Because I think with many other technologies we've invented along the way, humans didn't really understand the underlying physics or chemistry, but still we were able to make good use of medicines or, [05:47] you know, totally basic propulsion techniques without understanding, you know, all the physics. [05:53] Yeah. [05:54] I think it's going to be critical for the future, just given how transformative I think AI is going to be. Like, I think AI is going to be [06:02] everywhere, running mission-critical, [06:05] um, [06:06] parts of our society. And like we can get really, really far [06:10] just by treating AI as a black box. But I don't think we can truly be able to intentionally design AI as like the new generation of software without like white box techniques. So [06:21] maybe one example I think about is, you know, in the [06:25] early 17th century, we invented like the steam engine. And we're able to just like,

6:33-8:12

[06:33] increase the size of the boiler and increase the amount of pressure going in. And it scaled reasonably well, but steam engines also blew up. Well, we didn't understand thermodynamics at that time, so we didn't actually know the ideal size of the boiler or the ideal pressure, the ideal way to construct a steam engine. And so after we invented thermodynamics, things started becoming a lot safer, a lot more reliable, and huge innovations happened afterwards. [07:03] The steam engine kicked off like the industrial revolution. So even just by like [07:07] treating it as a black box, you get a really long way. Do you think there's any chance that if we understand [07:13] neural networks in a computer science context, [07:16] it might actually help us accelerate our understanding of neuroscience for the human brain? [07:21] I think so, but [07:23] that's like, [07:24] that's a big claim, I think. So we were just having an interesting conversation last night. We had like a dinner together about like, [07:33] do we... [07:36] Do we think in language? Do you think in concepts or something else entirely? Like I'm a person that doesn't really think in language. I think much more like [07:43] conceptually, maybe in the latent space of models. Whereas our head of product, Myra, said that she basically is totally faithful to her own chain of thought, speaks in language, and she basically just thinks sequentially with a really, really strong internal monologue. So, [08:00] like, [08:01] long story short, maybe some of these insights that we get will translate to humans and our own psychology. And I think that's the hope. It's like, yeah, the more that we can understand about AI, hopefully the more...

8:13-9:42

[08:13] that we can understand about ourselves. [08:15] There's an interesting analogy, by the way, that in neuroscience, often things that have gone wrong [08:20] help illuminate [08:22] and create insights into the human brain. [08:24] Yeah. You know, people who suffer from specific conditions or people who've suffered particular brain injury types have actually [08:30] perversely enabled us [08:32] to better understand the brain accidentally. I wonder if something similar might happen with neural networks as well. Yeah, I hope so. What's that, like, [08:39] a popular story about the guy who just like got an iron rod through his brain and it made him like a totally different person? Yeah. Yeah. Anyway, yeah, there's also this, uh, you know, to add to that, um, there's this concept of universality where like [08:53] among totally different neural networks, similar circuits or thought patterns tend to emerge between all these neural networks. So, [09:03] like even in... [09:05] We found in vision models very similar circuits to our own visual cortex. [09:12] And I think, [09:14] you know, there's this like [09:16] idea of universality where maybe like intelligence is just this like thing that you gradient descent to. And then like, uh, that's how our brains found like intelligence. And that's how artificial minds would, would find intelligence, um, as well. Like there's some truth to intelligence. My own neural net is probably pretty sparse. It's pretty sparse. Um, I would love to double click into some of the results you've had from mech interp and the broader field and your lab.

9:46-11:21

[09:46] you all are building. [09:47] Yeah, so Goodfire is an AI interpretability research company really trying to answer the question of what's actually going on inside the mind [09:55] of a neural net. Um, [09:57] so kind of the ultimate goal and the ultimate reason why we started everything was like we just see neural networks kind of going into more and more mission-critical contexts. And I think it's going to be enormously transformative for society. But in order to do so, [10:12] you want to [10:13] build it safely, powerfully, reliably. And I think it's going to be critical to be able to understand, edit, and debug AI models in order to do that. [10:24] And so that's what we're kind of enabling for the very first time. It's like unlocking the black box of a neural network such that you can intentionally [10:32] design it rather than just kind of like grow it from data. [10:36] And if everything goes right, what do you think will be the impact that you all have on the world? [10:40] So maybe one metaphor that we like and think about is like, yeah, right now, like, [10:47] you just kind of like grow AI from a seed and then it just like grows like a giant tree and it grows all wild and crazy. Right now, we don't really know like kind of a lot of the things that it's growing into, and all sorts of like interesting and weird stuff can happen with a really large neural net. But [11:04] I, [11:06] if everything goes right with interpretability, we'll know [11:10] how every single piece of training data affects the cognition the model develops, the units of computation that it uses. And I almost think of it more as like,

11:21-12:58

[11:21] bonsai, where you want to kind of like intentionally design and shape and grow the neural network, like still in an unsupervised, like AI-driven approach. Like we're not going to hand prune every single weight of a neural network. But I think we'll [11:36] gain the ability to, during every single piece of the training, post-training process, just intentionally shape an AI model [11:44] such that it, [11:45] you know, serves humanity and does what we want. [11:48] It sounds like a parallel to the Human Genome Project in some sense. Yes. Given some of the work I've done in genetics. So this idea that we need to read DNA, we need to understand the building blocks of life. And ultimately now we're starting to edit DNA and use CRISPR to come up with interesting cures for diseases or an ability to edit crops to make them more resistant to pesticides and things like that. So it's just a very interesting parallel. Yeah. [12:13] Yeah, definitely. I think we think about that analogy a lot. And I know you had Patrick Hsu on the podcast at some point as well, and we're working with him at Arc Institute to do that. [12:24] So yeah, like crack the code of the human genome as well. And yeah, [12:27] I think there's a lot of really interesting parallels and also direct applications of AI interpretability [12:33] as well. Would you go so far as actually making edits? What I heard from you earlier, the bonsai analogy, was a little bit of a shaping, um, [12:40] which is quite different in my mind from editing. [12:44] You know, shaping to me might be, you know, train and be fit so that your body can survive given a certain DNA. And then there's editing, which is altering the DNA. Are you going to do both? Yeah, I think in short, yes. I don't know like what the

12:58-14:30

[12:58] So interp as a field, like there's still a lot to figure out. It's still pretty, [13:03] still pretty new, but I think in bonsai, you also prune a lot of branches and you prune a lot of like the areas that you don't want to grow [13:10] such that you can kind of shape the overall tree to like grow in the pattern that you want. So I think like, [13:18] you know, the eventual system that we hope to build, like you can ask questions of the model, like why did you, [13:24] why did you come up with this response and get a faithful explanation, while also being able to make direct surgical interventions in the mind of the model such that we can understand, [13:32] remove harmful behavior, [13:35] enhance good behavior. And [13:39] it still remains to be seen whether like [13:41] it's just like a direct weights modification or like some other [13:46] like kind of shaping function that is most effective. [13:49] If you think of some of the ways that people are trying to prune these bonsai trees today, I think it's a lot of prompt engineering, fine tuning, RL tuning increasingly now. What do you think about that as the approach to kind of steer the behavior of these models versus actually go in and introspect and examine each of the individual neurons? [14:08] Fundamentally, like these are black box things. Like all sorts of weird things can happen when you like fine tune a model, for example, or [14:15] like, you know, prompt a model and take it out of distribution and it can say all sorts of crazy stuff. So, [14:21] uh, the paper that's most interesting about this recently, I don't know if you've, [14:25] you've caught this, is this like emergent misalignment study.

14:30-16:03

[14:30] No? Okay. This is like Owain Evans' group, where if you [14:36] fine-tune a model on just insecure code, so it's just like bad code that, you know, has all sorts of like cybersecurity vulnerabilities, [14:45] it'll then start [14:47] doing all sorts of insane things like wanting to enslave humanity or praising Hitler and other dictators. And it's a really surprising result because it's just [14:58] insecure code. And so, [15:00] yeah, it kind of shows that there's like, maybe like, [15:05] what you're doing with fine-tuning is you're telling the model, like, hey, do more of this, less of this, and almost like enhancing the circuits [15:12] that [15:13] you want kind of more of, but you can also have like all sorts of unintended consequences like this that show up. [15:21] And [15:22] these circuits, [15:24] these are still like really alien cognition. Like there's some parallels to humanity, but we really don't understand how these networks think. And they're not like [15:31] human thinking. So [15:33] if you enhance the bad code snippets, it also is like [15:37] fundamentally linked to maybe all sorts of like other undesirable behaviors and properties. There's a different twist on the nature nurture debate. Yeah. Because in that situation, it almost feels as though you've imbued that particular model with bad DNA, if you will, just, you know, it's sort of fundamentally an evil thing or a bad thing. And then it ends up with manifesting all sorts of bad behavior in other domains. It's really interesting. Yeah, maybe. Or maybe these models...

16:03-17:36

[16:03] kind of understand right and wrong. And if you enhance the wrong, then all sorts of other behavior is interlinked and expressed. But [16:12] I don't know, the way I think about it, like these models... [16:15] The models are just like the functions of their training data and [16:19] I... [16:20] These models are [16:22] trained on everything, like all sorts of like misbehavior as well, and you want it trained on like, um, [16:30] incorrect behavior as well because like [16:33] otherwise, it won't know to refuse harmful requests or to not do a certain... [16:38] Do you have an intuition for why different models, different base models, have different personalities? Like, for example, the newest Claude series, like I think one of the models, maybe Opus or something, is, you know, really cares about animal welfare, for example, and the others don't. Do you have a sense for why these models develop pretty distinct personalities? [16:56] I think it's just a function of how they're trained and it's really, really hard to, uh, anticipate in advance. Like, um, [17:05] um, [17:06] yeah, I don't know. I feel like [17:09] it might just be me, but like Claude 4 Opus, too, is like enormously sycophantic. Like I'll [17:14] kind of nudge it in one direction and it'll just like agree with me wholeheartedly, and then nudge it in the other direction, like pose a counterexample, and it'll just be like, yes, I was totally wrong before, like nothing I said earlier was correct. And I just think it's like really, really hard. It's like, [17:30] kind of goes back to like the witchcraft almost of training an AI model today where you're just like,

17:37-19:14

[17:37] throwing in training data into the model and like whispering the incantation of gradient descent and then like trying to like get what you want out of it and something pops out and then it like really cares about animals. Oh, great. Like that's great. I'd love to talk about your research results so far, you know, both at Goodfire and in the broader mech interp field as a whole. Maybe could you just give us a [18:01] 30,000-foot flyover view of, you know, mech interp as a field. How old is it? What are the key results so far? What are the big open questions? Yeah. [18:10] So mech interp as a field, [18:13] I think like, [18:14] maybe just like in this tradition that we're building on. I think there are all sorts of studies like looking inside neural networks all the way back when we first, you know, designed neural networks. But I think, [18:25] I think the way that the field kind of thinks about itself, like mechanistic interpretability was started at like OpenAI with Chris Olah and Nick Cammarata and a couple other folks [18:36] who first put out this really big circuits thread that posited three things. [18:41] One is like, [18:42] there are features in neural networks, which are directions of latent space that represent concepts that the model [18:47] uses to generate its response. [18:50] Circuits. These are like features that fire together to create like higher order concepts. The example that they lay out is like, [19:01] you have a car window detector and then a car like body detector and then a car wheel detector. And then that's like a car circuit. And universality is the third tenet. Like...

19:14-20:47

[19:14] Similar circuits evolve in [19:17] different neural networks. [19:20] And so this was almost like the start of the field of mechanistic interpretability in my mind. [19:26] And so [19:27] that really kicked off a lot of interesting research and results in the feature circuits paradigm. I think the main players in the field, there's a lot of academic labs that are doing great work. Like Anthropic, so Chris Olah is one of the co-founders of Anthropic and building a great interpretability lab there. And DeepMind has an interpretability lab as well. And then we were kind of like the newer entrants on the field and on the stage. [19:52] And, uh, [19:54] I think [19:56] one of the other really key things to have happened was [20:01] understanding and [20:04] mostly resolving superposition. So [20:09] superposition is this idea that like each [20:13] neuron is responsible for encoding multiple [20:16] concepts and there are more concepts than dimensions in a neural network. So if you think about a neural network as like a giant compression algorithm, you're compressing the entirety of the internet into a relatively small number of parameters. [20:33] And so that means every single neuron needs to encode, or at least every single layer of the model needs to encode more concepts than it has like dimensions. And so there's like this concept of superposition where you have...

20:47-22:20

[20:47] concepts represented as like near orthogonal directions in latent space, such that you can represent all of these concepts in a model's latent space. [20:57] And so to resolve this, you have to almost untangle and unscramble a neuron such that it's responsible for like one clean, interpretable concept. [21:08] And so, [21:09] um, [21:10] it's like a group at Apollo Research led by like Lee Sharkey, who's actually now at Goodfire, first kind of pioneered like sparse autoencoders for language models. And Anthropic also like really popularized this with their like big paper, like Towards Monosemanticity. And then right afterwards, like Scaling Monosemanticity, showing that you can essentially unscramble these neurons into monosemantic, [21:32] higher order concepts, [21:34] and [21:35] reliably and at scale with like arbitrarily large neural networks. And I think [21:41] that was a really big moment for interpretability where [21:44] you can now, in a totally unsupervised way, unscramble neurons of a neural network to understand them and get clean concepts. [21:54] So the concepts aren't like totally clean yet. You can't like edit them super well. There's all sorts of problems with this, but like, [22:00] it's almost like a really big step forward for the field such that we can do this in an unsupervised way [22:06] and the techniques in interpretability scale, which is really important. Does that mean the superposition isn't real or that you, [22:13] almost like Heisenberg's uncertainty principle, you sort of collapse it at a particular moment in time to know that in this instance, it represents a particular situation.
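
The "near orthogonal directions" idea in the section above can be checked numerically. The following is a minimal sketch, not anything shown in the episode: it assumes concepts are random unit vectors, and the sizes (2,000 concepts, 512 dimensions) and the function name `max_abs_cosine` are hypothetical choices for illustration. It packs more directions than dimensions into a space and reports how much any two of them overlap.

```python
import numpy as np


def max_abs_cosine(n_concepts: int, n_dims: int, seed: int = 0) -> float:
    """Largest |cosine similarity| between any two of n random unit vectors."""
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((n_concepts, n_dims))
    # Normalize each row so every concept direction has unit length.
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    cosines = directions @ directions.T
    # Ignore each vector's similarity with itself.
    np.fill_diagonal(cosines, 0.0)
    return float(np.abs(cosines).max())


# Hypothetical sizes: 2,000 concept directions in a 512-dimensional space.
overlap = max_abs_cosine(n_concepts=2000, n_dims=512)
print(f"max |cos| between distinct directions: {overlap:.3f}")
```

With far more directions than dimensions, no pair can be exactly orthogonal, yet the worst-case overlap stays well below 1. That is the sense in which a layer could hold more concepts than it has dimensions, at the cost of some interference between them.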

22:21-23:56

[22:21] direction? So I think it means it was real. Like neurons are responsible for, you know, encoding multiple concepts such that once you unscramble them, then you can do really interesting things with like a clean neuron. [22:36] So the way that we do this is like using an interpreter model [22:39] trained on the activations of a base model, and then [22:43] now you have all sorts of neurons in the interpreter model [22:46] that represent [22:48] like theoretically clean, sparse concepts. In the interpreter model. In the interpreter model. The original model still has this characteristic of superposition. That's right. Okay, got it. Thank you. And in the interpreter model, you unscramble these neurons and you associate these concepts with [23:01] the concepts in the base model that you're trying to interpret. And then you can do interesting things with it. Got it. Thank you. Yeah. [23:07] Of course. [23:08] How solved of a problem is this then? If you've already been able to [23:12] kind of disentangle [23:14] the superposition, then [23:16] haven't you already mapped out the mind, so to speak, of the neural net, and what's ahead? [23:21] I think partially. I think it's a partial mapping and the technique has all sorts of, [23:27] all sorts of flaws as well that we can improve upon. But [23:31] I think like [23:32] it gives us the first [23:34] step [23:35] towards understanding, um, [23:38] these models, [23:40] especially going from toy model to like [23:43] actual network that people care about. So we've done a bunch of work recently on R1, which is a 671 billion parameter mixture of experts model. It's a big boy model. And
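
The "interpreter model trained on the activations of a base model" described above is, per the earlier discussion, a sparse autoencoder. Below is a minimal sketch of one common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty). It is an assumption-laden illustration, not Goodfire's implementation: the sizes, the random stand-in activations, the untrained weights, and the names `sae_forward` and `sae_loss` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16   # width of the base model's activation vector (hypothetical)
d_latent = 64  # interpreter latents; deliberately wider than d_model

# Stand-in for activations captured from a base model; random values here.
activations = rng.standard_normal((8, d_model))

# Hypothetical, untrained parameters of the interpreter model.
W_enc = rng.standard_normal((d_model, d_latent)) * 0.1
b_enc = np.zeros(d_latent)
W_dec = rng.standard_normal((d_latent, d_model)) * 0.1
b_dec = np.zeros(d_model)


def sae_forward(x):
    """Encode activations into non-negative latents, then reconstruct them."""
    latents = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    reconstruction = latents @ W_dec + b_dec                # linear decoder
    return latents, reconstruction


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse latents."""
    latents, reconstruction = sae_forward(x)
    mse = np.mean((x - reconstruction) ** 2)
    sparsity = np.mean(np.sum(np.abs(latents), axis=1))
    return mse + l1_coeff * sparsity


latents, reconstruction = sae_forward(activations)
print(latents.shape, reconstruction.shape, sae_loss(activations))
```

Training would minimize `sae_loss` over many captured activations. The base model is left untouched, which matches the point made above that the original model still has superposition while the interpreter's latents are the candidates for clean concepts.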

23:57-25:32

[23:57] like the techniques scale really nicely all the way up [23:59] to that point because, you know, it's just more AI, more training of an interpreter model. [24:05] So obviously, I presume there's an asymptote here to understanding, because the models are going to get more and more complex over time. [24:13] And we're going to beef them up. [24:14] Yeah. [24:15] And so I'm guessing it's sort of, you know, like the Battle of Sisyphus at some point, you know, this is a never ending pursuit, which is great. [24:24] Is that correct? Do you agree with that? [24:26] In some ways, but I also think that like... [24:31] So [24:32] the techniques that we've developed work on toy models [24:36] all the way up to like, yeah, big network that's like more capable and more intelligent and better. [24:43] And [24:45] I think the techniques also scale effectively with model intelligence. So one part of our pipeline is that [24:52] for every single latent concept in our interpreter model, [24:58] we get another language model to reason about like what that concept actually represents in the base model. [25:05] This is a concept called auto interpretability, which Nick Cammarata, who's at Goodfire now and invented this technique at OpenAI, [25:13] pioneered. And this technique, because it's a language model reasoning about, [25:19] you know, what a neuron represents, [25:21] scales with the quality of the language model, so it actually gets better. [25:24] So because we use AI in order to understand AI, like the better that, you know, our models are, like these analysis agents are,

25:32-27:18

[25:32] at [25:33] interpreting like what's actually going on, [25:35] the better we are able to [25:38] understand them. Got it. And our interpreter model techniques also, it's just [25:42] like, [25:43] we develop better interpreter models. They theoretically should translate to more and more intelligent and larger and larger networks because these are like unsupervised scalable techniques. And that's like the paradigm in AI [25:54] interpretability. [25:55] Got it. [25:56] When do you think you reach a minimum threshold that makes you feel, [26:00] uh, [26:01] it's ready for a real world application? [26:04] Maybe we're there already. I think we're there. Yeah. I think [26:08] the first real world applications are already out there. And, uh, [26:14] yeah, I think we're there on like the very, very early applications. Yeah. Can you share more about this? [26:20] Yeah, I was being unnecessarily cryptic. So, yeah, a couple of the partnerships I'm most excited about. [26:28] So we worked with Arc Institute, like I mentioned a little bit earlier, to understand and interpret Evo 2, which is their kind of [26:36] DNA [26:37] foundation model. So it's a sequence to sequence model. So it like takes in, [26:42] um, [26:44] a sequence of nucleotides and it predicts the next nucleotide in a sequence. And [26:49] our theory is this is a narrowly superhuman model. So we really like to work on narrowly superhuman models because it can teach us something about the world that humans don't really know. And so the idea is this model is representing just an enormous amount about the biological world in order to properly model the next nucleotide in a sequence. So what we did was we sought to understand what does it actually know such that it can model the world so effectively.
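
To make "takes in a sequence of nucleotides and predicts the next nucleotide" concrete, here is a small sketch of how a DNA string turns into next-token training pairs. It is illustrative only: the one-token-per-nucleotide vocabulary, the tiny context length, and the names `TOKEN_ID` and `next_token_pairs` are assumptions for this example and are not taken from Evo 2 itself.

```python
NUCLEOTIDES = "ACGT"
# Hypothetical vocabulary: one integer id per nucleotide.
TOKEN_ID = {base: i for i, base in enumerate(NUCLEOTIDES)}


def next_token_pairs(sequence: str, context_len: int):
    """Yield (context_ids, target_id) pairs for next-nucleotide prediction."""
    ids = [TOKEN_ID[base] for base in sequence.upper()]
    for end in range(context_len, len(ids)):
        # The model would see ids[end - context_len:end] and be asked for ids[end].
        yield ids[end - context_len:end], ids[end]


pairs = list(next_token_pairs("ACGTAC", context_len=3))
for context, target in pairs:
    print(context, "->", target)
```

A model trained on this objective only ever receives "what comes next" as supervision, which is why any biological structure it represents internally has to be recovered afterwards with interpretability tools.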

27:19-29:03

[27:19] So what we did was we trained sparse autoencoders on the activations of this model, [27:28] extracted all sorts of features that were related to, [27:32] uh, [27:34] concepts that the model like should know, like kind of normal biological concepts that we have like really strong ground truth annotations for. So these are like, uh, [27:44] tRNAs, RNAs, star coding sequences, like all sorts of like biological concepts that we have ground truth annotations for, which we associated [27:51] with this model. [27:53] And then the question is like, okay, [27:55] now we have all of these other features of the model that we've extracted. [27:59] What do they mean? What are they? They might just be ways that the model is computing and thinking, or they could represent novel biological concepts that the model is using to generate [28:11] the next nucleotide in a sequence. That's really interesting. [28:15] For a long time, there was this idea that we have a bunch of junk DNA. You may have read about this. And it turns out a lot of that DNA actually serves a particular purpose in a different part of evolution or that they govern the expression of other genes. And so, you know, nature generally doesn't want to [28:30] harbor things that don't have value because it's expensive, you know, just from a biological system point of view. So that's super interesting. I'm looking forward to the results. [28:39] Yeah, totally, totally. [28:41] I think like hopefully, [28:43] using unsupervised AI techniques, like we can [28:48] better understand like what all of these, you know, portions of the DNA are actually doing. Like maybe we can discover the idea of junk DNA, like faster or like, or understand that like DNA is not junk DNA, like faster and, or like,

29:03-30:36

[29:03] yeah, just discover totally novel things that genes are doing and expressing within us. [29:08] Interesting. [29:09] Where is the research as far as going from understanding and mapping towards editing? So for example, being able to reach in and change this weight from here to there. I'm curious if you all have any results there yet. Yeah. So we've done most of our editing work on language models and image models. Our most recent release was [29:31] Paint with Ember. Ember is kind of [29:35] foundational infrastructure for interpretability. And what we were able to do with this image model demo [29:44] was targeted, precise control over an image model by painting. So [29:51] we could extract latent concepts like [29:56] a dragon or dragon wings or an ocean or a pyramid, and then take these concepts [30:03] and directly intervene on the portion of a canvas that we want to intervene on. So you can paint on a dragon with wings and then add a crowd in the corner and add a pyramid. It's a really fun demo, uh, that's just kind of a joy to play with. Um, so it's out right now. Anybody can play with it. It's just paint.goodfire.ai. But [30:24] we're able to, [30:25] I think, reasonably intervene in certain situations on a model's latents and steer the model to do what we want. But we haven't quite cracked...

30:36-32:08

[30:36] the idea of direct precision surgical edits that create a new model that you want to [30:44] use and that doesn't have any unintended side effects. So, uh, [30:49] that's still, you know, something that we're pushing on and trying to figure out. [30:54] Do you think that's where the field ultimately goes? Or do you think people are focused on different parts of the field? [30:59] I think there are many places where the field is going to go, and this is one of them. [31:04] Uh, [31:05] like, interpretability is almost such a general [31:09] term that, [31:10] again, I'm biased, but I think it's just, like, [31:13] governing and underlying all aspects of AI. It's like, [31:18] um, [31:20] anytime you prefer to take a white-box approach to doing something versus a black-box approach, interpretability can probably help in the future. So how do you select your training data? [31:30] Maybe you want to understand whether the training data [31:34] is surprising to the model before putting it into the model, because then it can have the most impact on, um, [31:41] on training. Yeah, just in every single part of the AI development stack, I think interpretability will [31:46] help and change the way that we do things. [31:49] If AI foundation models go the way that a lot of software has gone, certainly infrastructure software, where much of it is open source or open weight, is there an opportunity for you to play an invaluable role in [32:02] judging [32:03] the biases or likely outcomes of using different open-weight

32:08-33:46

[32:08] models? [32:09] I think we could. Yeah. Um... [32:12] So there's maybe two areas of research that we're interested in that intersect with this idea. [32:18] So auditing. So, like, [32:21] how do you take a model, understand what's going on, find problematic behavior and good behavior, hopefully get rid of the bad behavior and enhance the good behavior? So I think, [32:31] as AI gets [32:34] deployed in more and more mission-critical contexts, that becomes more important. [32:40] And then also model diffing. So it's like, when you have two checkpoints of a model, how do they differ from each other and what's changed? [32:48] So recently, [32:50] GPT-4o was enormously sycophantic for a period of time, just really gassing up the user, telling them that they're doing great. Still is. It still is? Pat recently asked it who the most handsome Cribl board member was, and it was like, definitely, definitely Pat Grady. Still a bit sycophantic. That's so good. That's so good. Yeah. [33:11] But yeah, with model diffing, [33:14] you should be able to detect, uh, [33:17] how a model has changed from checkpoint to checkpoint, what surprising things have happened that were unintended, that are now contained in the network and weren't there before. Why do you think it was so hard for OpenAI to roll back to a less sycophantic version of the model? And in an ideal state of the world, is there almost a dial and a knob that the OpenAI guys could tune on a scale of 0 to 100? How sycophantic do you want the model to be? Do you think we can get there? I don't know what questions you're asking of the model, by the way, because I never encountered this particular problem.
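
The model diffing idea described above, comparing two checkpoints to see what changed, can be sketched in a few lines. This is only an illustrative toy under stated assumptions: it presumes per-feature activations have already been extracted for the same prompts at both checkpoints, and the feature names (`flattery`, `code_syntax`, `refusal`), the numbers, and the function `diff_checkpoints` are all hypothetical, not Goodfire's or OpenAI's method.

```python
from statistics import mean


def diff_checkpoints(acts_before, acts_after, top_k=3):
    """Rank features by how much their mean activation changed.

    acts_before / acts_after: dict mapping a (hypothetical) feature name to a
    list of activations measured on the same prompts at each checkpoint.
    A feature missing from one checkpoint is treated as having activation 0.
    """
    features = set(acts_before) | set(acts_after)
    changes = []
    for name in features:
        before = mean(acts_before.get(name, [0.0]))
        after = mean(acts_after.get(name, [0.0]))
        changes.append((name, after - before))
    # Largest absolute change first; ties broken by name for determinism.
    changes.sort(key=lambda item: (-abs(item[1]), item[0]))
    return changes[:top_k]


# Made-up activations, for illustration only.
before = {
    "flattery": [0.1, 0.2, 0.0],
    "code_syntax": [0.9, 0.8, 1.0],
    "refusal": [0.3, 0.3, 0.3],
}
after = {
    "flattery": [0.9, 0.8, 1.0],
    "code_syntax": [0.9, 0.8, 1.0],
    "refusal": [0.1, 0.1, 0.1],
}
top = diff_checkpoints(before, after, top_k=2)
```

In this toy, the feature whose activation moved most between checkpoints surfaces first, which is the kind of signal one would hope to get before shipping a checkpoint that behaves in an unintended way.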

33:47-35:23

[33:47] It always does this with me. And then sometimes it's brutal the other way. I ask it for the best AI podcast and it lists 20 things with no Training Data. What about us? Oh, I didn't want to give a biased result. That's so funny. Well, that's part of what users want, right? They kind of want sycophancy. [34:06] You know, um, [34:08] people want to hear what they want to hear. So when you RL a model, I think [34:14] fundamentally you're going to get, [34:16] uh, [34:17] I think it's just kind of a symptom of [34:20] RL. It's like, this is what users want. This is user preferences. Along the way, you've dropped some names. And it seems as though most of them have ended up at Goodfire. I presume there is a certain number of very talented people in this field, and you seem to have unfairly gathered them. Can you describe a little bit more about your team, [34:39] what you've pulled together? Yeah. Uh... [34:43] So, I mean, I think we have a really fantastic team, and that's what we've been, you know, spending a lot of our time on the last year, just kind of, [34:51] I think, assembling a team of world-class interpretability experts that [34:55] really have a shot at cracking this problem. [34:58] So it starts with my co-founders. I had worked with Dan Balsam, our CTO, for many years at my previous company. And our chief scientist, Tom, who founded the interpretability team at Google DeepMind way back in the day. [35:15] Yeah, and we've just assembled many of the early folks in the field. So Tom, Nick Cammarata, who was, like,

35:23-37:04

[35:23] working very closely with Chris Olah, who is generally considered the founder of the field of mech interp. [35:30] And Nick was on all of the original circuits papers and helped, you know, build everything out at OpenAI. [35:37] Lee Sharkey, who pioneered sparse autoencoders on language models, and, [35:45] uh, [35:46] is now working on some really interesting work [35:49] in weights-based interpretability. So most interpretability [35:53] techniques that have been deployed into applications are in concept space and activation space, and [35:59] he and his group are working on, um, [36:01] weight-space interpretability techniques. And we've also just kind of pulled in scientists, senior scientists from other fields, who care a lot about [36:09] interpretability and have realized that this is one of the most important problems that we can work on. So Owen Lewis, who was a senior staff RS at [36:18] Google, working on coding agents, came over and is now leading a couple of directions here for us. [36:26] And you're recruiting, right? [36:28] And we're hiring, yeah, scientists, engineers. I think we are hiring scientists, and that is deeply important for the future of the field. But also, [36:39] um, [36:40] it's hard to overestimate just how important good engineering skills are. [36:44] Incredible team. Proud of the team. Yeah, for sure. [36:48] This seems like core functionality for any of these foundation model companies to have. And as you mentioned, you know, Chris Olah was at OpenAI, now Anthropic. OpenAI has their interpretability team as well. How do you think about the rationale for having a standalone

37:05-38:38

[37:05] mech interp [37:06] research company versus being inside one of the labs that should care deeply about this? I think [37:13] we can just take a really different approach if we're independent. [37:16] The benefit of being independent is we can think independently, push things forward independently, and also get a broader view of the ecosystem. So usually if you're within a lab, you're kind of doing interpretability work on your own models and kind of [37:30] pushing forward [37:31] the field in that way, and you can make incredible progress that way. But I really do think that [37:36] a unique third-party perspective, uh, is deeply necessary in the field. And, um, [37:44] I think, like, [37:46] yeah, just given the team that we've assembled, a lot of those folks agree with that, and that's why they've joined. And it also gives us [37:54] an ability to work with lots of interesting partners across different domains. And we can kind of unify those insights across all of these different domains that teach us more about the inner workings of neural networks more broadly. So we work across modalities like genomics models, exomic models, image, video, language, and also across model architectures. And I think, [38:20] um, [38:22] all that just helps. Anthropic invested in you all, right? Yeah, that's right. Say more about that and how you partner with them. [38:29] Yeah, so they, um, I think we were their first ever investment. Uh, they, um, [38:34] they put in a check in our last round. And, uh,

38:38-40:13

[38:38] I think they just really care about interpretability and really kind of see the future as we do, where [38:46] interpretability is just pretty critical to the future. So Dario just published an essay called "The Urgency of Interpretability." And [38:56] it's one of his, like, [38:58] four essays that he has on his site. And it's just talking about how he views this as almost like a race. And we see that very similarly, a race to get interpretability prior to, um, [39:12] you know, superintelligent, [39:13] really, really intelligent AI models. I just think it's deeply critical to be able to understand these models before we have, in his words, a country of geniuses in a data center. Do you think interpretability can help us with [39:28] open models? And, you know, I think some people have a fear that, you know, models trained in other countries that may or may not be enemies of the United States, you know, have different nationalist properties. Can interpretability help us understand and even modify those for the, you know, the American variants of some of these models? Yeah, well, I definitely think so. I also think it's relatively easy to, [39:52] uh, like, if you take a DeepSeek model, for example, um, it's relatively easy to just tune it or add in more training data to remove a lot of the, uh, propaganda in the model. Um, and, uh, [40:05] but yeah, I think interpretability can help understand what's actually inside of the model and then also change it and edit it to...

40:13-41:49

[40:13] to serve whatever end purpose [40:16] that you want. How long do you think before you're going to be [40:19] called in as a witness in a very important trial [40:24] to try to understand why a model did something in particular? That's a good question. I think [40:32] a few years. [40:35] Who knows? I think it's really... [40:38] You know, we're all sitting in, [40:41] like, the Bay Area right now, but at this point, I'm pretty AGI-pilled, in that [40:46] I think AI progress will be [40:49] pretty fast and transformative to society in ways that are really difficult to anticipate from where we're sitting right now. [40:57] And so, [40:58] yeah, I do think that there will be [41:01] a couple of, you know, like, um, [41:04] big failure cases of AI models. And whether it's me called in or, you know, somebody at a big lab or, you know, some other expert, um, [41:15] I think that [41:16] we're going to want to, uh, [41:19] be able to explain a model's outputs. [41:23] I agree with you on the rate of development, by the way. I think it's, [41:27] um, [41:28] I'm sure you've read these articles. The human brain doesn't intuit compounding. No. And so, you know, I've even thought back to 20 years ago, when I first met the self-driving car initiatives and Sebastian Thrun's team from Stanford had won the DARPA challenge. [41:43] You could sort of see the glimmers of self-driving cars, but even then, if you'd said, you know, 20 years later, you would have a self-driving car in San Francisco

41:50-43:26

[41:50] take you around, I'm not sure it would have been obvious that would be true. And maybe it took a few years longer for the true visionaries. I think the same is going to happen with AI. I don't think we fully fathom [42:00] what the world's going to look like in 2030 or 2035. Yeah. [42:03] I couldn't agree more. And it's just really hard to predict, you know, even if we feel like it's going to happen really, really quickly. [42:12] It's hard to predict all of the ways that society will be transformed. Good note to end on. Should we do some rapid fire? Some predictions? Yeah. We need some predictions. These are all recorded, so we'll hold you to it. Great. Yes. Yes. Yeah. Eric in 2035 will look back on how wrong he is with all these predictions. [42:30] Okay. Maybe first: [42:33] inference-time compute is the next important vector to scale models. Agree or disagree? [42:39] I mostly agree. I think it's one of the things that we can scale up on. What application category do you think will break out next after code? [42:48] I think there will be a lot of enterprise transformations that happen. So just, like, [42:56] automating manual routine tasks that people are doing many, many times a day. Employment impact [43:05] from AI? Vast. [43:09] And... [43:10] Vast, but... [43:13] Once you cross a chasm, I think it happens quickly. [43:17] I think my [43:19] last company was helping find early career jobs for people and using AI to automate that.

43:26-44:56

[43:26] And I think that's where we're going to feel the impact first. [43:31] I agree. [43:32] Recommended piece of content or reading for AI fans, maybe specifically in your field? [43:37] I think the original circuits thread that I was referencing a couple of times, that's so fantastic. What's [43:44] either an AI app or maybe just an experience that you've [43:47] had with AI that has [43:49] blown you away recently? Something that just [43:53] took your breath away. [43:54] I think my... [43:56] Like... [43:58] one of those moments where you really just feel how fast AI is happening was when I first played with o1 pro. [44:05] That was a model that I just really felt was actually reasoning about the world, and, [44:10] um, [44:12] seeing the kind of cross-domain transfer, too. I would ask it a strategic question and it felt like [44:17] it would actually understand all the levers that I was considering with the business, and consider them at least relatively thoughtfully, and be a thought partner. [44:26] And [44:27] that's both exciting, because I now have this model that I can talk to about all sorts of critical problems, uh, [44:34] and not just, of course, trust it blindly, but, [44:38] but also it's like, wow, [44:40] how did this happen? You know? Interesting. One of the things I've learned recently is that AI is still struggling to understand humor. [44:48] And one of my partners, Andrew, actually had this joke that humor is humans' way of showing off intelligence without actually explicitly bragging.

44:57-46:32

[44:57] And so maybe there is a lot of embedded intelligence in humor. [45:01] Do you think interpretability will help us pinpoint... [45:04] To figure out humor. Yeah. To figure out why AIs don't have a sense of humor. Or help them to develop it. [45:11] Perhaps. Who knows? Yeah. I hope so. Yeah, I can... [45:16] I think if I wake up to [45:19] a model telling me jokes in [45:22] Roelof's voice. That would be a great one. That'd be terrifying. Okay, we'll close with one last question on a prediction in your field. Do you think we will ever reach the point where we feel like we confidently understand the features, the circuits, the patterns, the weights [45:40] of a neural net? And if so, what year do you predict we'll reach that point? [45:44] I think we can. [45:46] I think that it might not look like what you just said, like the features, the circuits. I think it requires maybe a reconceptualization of what's actually going on inside the model. [45:55] Like, [45:57] a deeper, more fundamental understanding of the units of computation of a model. [46:02] It's almost like discovering truths about the universe, or about neural nets in this case. [46:07] But yes, I think we're on track. I think we can do this. [46:12] And hold me to this in 2028. We're going to figure it all out. Yeah. [46:18] Fantastic. Yeah, just a few years. I think we're close. Just in time for the LA Olympics. Yeah, that's right. Just in time for your next round of funding. I'm kidding. Eric, thank you so much for doing this today. Roelof and I loved the conversation.

46:32-47:00

[46:32] Thank you. It was a pleasure. Yeah, so much fun. Thanks for having me. [46:36] *music*
