Extend the causal tracing work from the ROME paper
by Esben Kran
Can you refine their technique to find the specific attention heads (and maybe specific neurons) that recall a fact? Can we improve the technique by creating corrupted activations through resampling from a random input instead of adding Gaussian noise?
The ROME paper originally uses this causal tracing method to locate where facts are stored in a language model. See the paper for details, and read the authors' follow-up work on editing factual associations.
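To make the two corruption strategies concrete, here is a minimal sketch in plain Python. It is not the ROME authors' code: the function names, the list-of-lists representation of token embeddings, the `sigma` value, and the "donor" pool of embeddings taken from a random other input are all assumptions made for illustration. It only shows the corruption step at the subject token positions, not the subsequent restoration of clean hidden states.

```python
import random


def corrupt_gaussian(embeddings, subject_positions, sigma, rng):
    """Add Gaussian noise to the embeddings at the subject token positions.

    `embeddings` is a list of per-token vectors (lists of floats).
    `sigma` is a hypothetical noise scale chosen by the caller.
    """
    corrupted = [row[:] for row in embeddings]  # copy so the clean run is untouched
    for pos in subject_positions:
        corrupted[pos] = [x + rng.gauss(0.0, sigma) for x in corrupted[pos]]
    return corrupted


def corrupt_resample(embeddings, subject_positions, donor_embeddings, rng):
    """Replace subject-position embeddings with vectors from a random other input.

    `donor_embeddings` is an assumed pool of token vectors taken from
    unrelated prompts; each subject position gets one drawn at random.
    """
    corrupted = [row[:] for row in embeddings]
    for pos in subject_positions:
        corrupted[pos] = rng.choice(donor_embeddings)[:]
    return corrupted


def make_random_vectors(n_tokens, dim, rng):
    """Helper for the toy example: random vectors standing in for real embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
```

With a real model, `embeddings` would be the token embeddings of the factual prompt and `subject_positions` the indices of the subject tokens. The resampled variant keeps the corrupted input on the distribution of real embeddings, which is the motivation for trying it in place of noise.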
Deep Learning, Interpretability & Explainability