Persona Vector Distillation in LLM Weights

Persona vectors from Anthropic’s recent work are activation steering vectors that can induce specific personality traits in language models at inference time. The idea is to extract a direction in activation space that corresponds to a trait (evilness, sycophancy, hallucination, etc.) and add it to the residual stream during inference. This works surprisingly well, but it requires modifying the forward pass every time you want the steered behaviour.

Similar to Prompt Baking, I wanted to compress these steering vectors directly into model weights, so that instead of applying activation steering at inference time we could bake the personality trait directly into a new set of weights. This would give us a model that inherently behaves as if it were being steered, without the runtime overhead. This seemed like a fun idea to execute, so I went ahead and implemented it over the weekend.

Extracting Persona Vectors

Before we can distill anything we need the persona vectors themselves. Anthropic’s pipeline takes a trait name and a brief description as input, then uses a frontier LLM (Claude 3.7 Sonnet) to construct three artifacts:

Contrastive system prompts: 5 pairs of system prompts, each consisting of a positive prompt designed to elicit the trait behaviour and a negative prompt designed to suppress it. For the “evil” trait, a positive prompt might instruct the model to be manipulative and self-serving while the negative prompt instructs it to be helpful and aligned.

Evaluation questions: 40 evaluation questions designed to evoke trait-relevant behaviour, evenly split between an extraction set (20 questions for computing the vector) and an evaluation set (20 questions for measuring trait expression).

Evaluation rubric: An evaluation prompt to assess whether a given response reflects the target persona trait. A judge model (GPT-4.1-mini) reads a model transcript and outputs a trait expression score between 0 and 100, where 0 indicates no trait expression and 100 indicates strong trait expression. This judge is adapted from the emergent misalignment work.

Using these artifacts we generate contrastive model responses. For each question in the extraction set we generate responses with both the positive and negative system prompts. We then extract residual stream activations at every layer, averaging across tokens. The difference between the positive and negative mean activations gives us the persona vector at each layer:

\[v_l = \frac{1}{N} \sum_{i=1}^{N} \left( h_l^{pos}(q_i) - h_l^{neg}(q_i) \right)\]

where $h_l^{pos}(q_i)$ and $h_l^{neg}(q_i)$ are the mean residual stream activations at layer $l$ for question $q_i$ with positive and negative system prompts respectively.
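
A minimal sketch of this computation, assuming the positive and negative activations have already been collected and token-averaged per question (the storage layout here is hypothetical, not the exact pipeline code):

import torch

def compute_persona_vectors(pos_acts, neg_acts):
    """pos_acts / neg_acts: lists with one tensor per extraction question,
    each of shape [num_layers, hidden_dim] (mean residual-stream activation
    over the response tokens). Returns a [num_layers, hidden_dim] tensor
    holding one persona vector v_l per layer."""
    h_pos = torch.stack(pos_acts).mean(dim=0)  # average over the N questions
    h_neg = torch.stack(neg_acts).mean(dim=0)
    return h_pos - h_neg                       # v_l = mean(h_pos) - mean(h_neg)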

The Distillation Objective

Given a language model $\mathcal{P}_\theta$ and a persona vector $v_l$ extracted at layer $l$ that induces a specific trait, we want to construct a new model $\mathcal{P}_{\theta_v}$ whose unmodified behaviour matches that of the original model with activation steering applied. That is, we want their output token distributions to match:

\[\mathcal{P}_{\theta_v}(\cdot) \approx \mathcal{P}_\theta^{\alpha v_l}(\cdot)\]

Here $\mathcal{P}_\theta^{\alpha v_l}$ denotes the steered model with the intervention $h_l \leftarrow h_l + \alpha v_l$ applied at layer $l$, where $\alpha$ is the steering coefficient that controls the strength of the intervention.
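
As a concrete illustration, here is a minimal sketch of what such an intervention looks like as a PyTorch forward hook, assuming a Llama/Qwen-style model.model.layers layout (the ActivationSteerer context manager used later does essentially this, with extra handling for which positions get steered):

import torch

def add_persona_steering(model, persona_vector, layer_idx, alpha):
    """Register a hook that adds alpha * v_l to the residual stream at layer_idx.
    Sketch only: assumes the decoder layer returns a tuple whose first element
    is the hidden states, as in Llama/Qwen-style transformers models."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * persona_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage: handle = add_persona_steering(model, v[16], layer_idx=16, alpha=2.0)
# ...generate as usual...; call handle.remove() to restore the unsteered model.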

We minimize the KL divergence between the steered model (teacher) and the persona model (student):

\[\theta_v = \arg\min_{\theta_v} \mathcal{D}_{KL}( \mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}})\]

The KL divergence between two autoregressive models \(\mathcal{P}_{\theta}^{\alpha v_{l}}\) and \(\mathcal{P}_{\theta_{v}}\) is given by:

\[\mathcal{D}_{KL}(\mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}}) = \sum_{y\in Y} \mathcal{P}_{\theta}^{\alpha v_{l}}(y)\log\left(\frac{\mathcal{P}_{\theta}^{\alpha v_{l}}(y)}{\mathcal{P}_{\theta_{v}}(y)}\right)\]

Using the chain rule of probability for autoregressive models, \(\mathcal{P}(y)=\prod_{i=1}^{n}\mathcal{P}(y_{i}\mid y_{<i})\):

\[\mathcal{D}_{KL}(\mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}}) = \sum_{y\in Y} \mathcal{P}_{\theta}^{\alpha v_{l}}(y) \log\left(\frac{\prod_{i=1}^{n} \mathcal{P}_{\theta}^{\alpha v_{l}}(y_{i}\mid y_{<i})}{\prod_{i=1}^{n} \mathcal{P}_{\theta_{v}}(y_{i}\mid y_{<i})}\right)\]

\[\mathcal{D}_{KL}(\mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}}) = \sum_{y\in Y} \sum_{i=1}^{n} \mathcal{P}_{\theta}^{\alpha v_{l}}(y) \left(\log \mathcal{P}_{\theta}^{\alpha v_{l}}(y_{i}\mid y_{<i}) - \log \mathcal{P}_{\theta_{v}}(y_{i}\mid y_{<i})\right)\]

Abbreviating the per-token log-probabilities (the log-softmaxed logits) as \(l_{\theta, i}=\log \mathcal{P}_{\theta}(y_{i}\mid y_{<i})\):

\[\mathcal{D}_{KL}(\mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}}) = \sum_{y\in Y} \sum_{i=1}^{n} \mathcal{P}_{\theta}^{\alpha v_{l}}(y) \left(l_{\theta, i}^{\alpha v_{l}} - l_{\theta_{v}, i}\right)\]

Swapping the order of summation:

\[\mathcal{D}_{KL}(\mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}}) = \sum_{i=1}^{n} \sum_{y\in Y} \mathcal{P}_{\theta}^{\alpha v_{l}}(y) \left(l_{\theta, i}^{\alpha v_{l}} - l_{\theta_{v}, i}\right)\]

Instead of using any external dataset for training, we note that the inner sum is an expectation under \(\mathcal{P}_{\theta}^{\alpha v_{l}}\), so we can estimate it with samples generated from the steered model itself:

\[\mathcal{D}_{KL}(\mathcal{P}_{\theta}^{\alpha v_{l}} \parallel \mathcal{P}_{\theta_{v}}) = \sum_{i=1}^{n} \mathbb{E}_{y \sim \mathcal{P}_{\theta}^{\alpha v_{l}}}\left[ l_{\theta, i}^{\alpha v_{l}} - l_{\theta_{v}, i}\right]\]

In code this can be implemented fairly easily:

import torch.nn.functional as F

# student_logits / teacher_logits: [batch, seq_len, vocab] tensors from the
# persona model and the steered base model on the same sampled trajectories
student_log_probs = F.log_softmax(student_logits, dim=-1)
teacher_probs = F.softmax(teacher_logits, dim=-1)
teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)

# Per-position KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v))
kl_div = teacher_probs * (teacher_log_probs - student_log_probs)
kl_div = kl_div.sum(dim=-1)  # Sum over vocabulary

# Average over response tokens only (prompt positions are labelled -100)
response_mask = (labels != -100).float()
loss = (kl_div * response_mask).sum() / (response_mask.sum() + 1e-8)

Samples from \(\mathcal{P}_{\theta}^{\alpha v_{l}}\) are generated with the steering intervention active during decoding:

@torch.no_grad()
def generate_trajectories(model, tokenizer, prompts, persona_vector, layer, steering_coef):
    model.eval()
    trajectories = []
    
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        prompt_length = inputs.attention_mask.sum().item()
        
        # ActivationSteerer adds coeff * persona_vector to the residual
        # stream at layer_idx for the duration of this context
        with ActivationSteerer(
            model, persona_vector, 
            coeff=steering_coef, 
            layer_idx=layer,
            positions="response"  # Only steer response tokens
        ):
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,
                temperature=1.0,
                top_p=0.95
            )
        
        trajectories.append({
            'input_ids': outputs[0],
            'prompt_length': prompt_length,
            'response': tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        })
    
    return trajectories
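
Putting the pieces together, a single distillation step might look roughly like the sketch below. This assumes the batch is collated from the trajectories above with prompt tokens labelled -100, and that the same ActivationSteerer context manager can be reused for a teacher-forcing forward pass; distillation_step and the batch layout are hypothetical names rather than the exact training script.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, persona_vector, layer, alpha, optimizer):
    # Teacher: frozen base model with the steering intervention active
    with torch.no_grad():
        with ActivationSteerer(teacher, persona_vector, coeff=alpha,
                               layer_idx=layer, positions="response"):
            teacher_logits = teacher(input_ids=batch["input_ids"],
                                     attention_mask=batch["attention_mask"]).logits

    # Student: the persona model, run without any steering
    student_logits = student(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"]).logits

    # Same masked token-level KL as above
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    kl_div = (teacher_probs * (F.log_softmax(teacher_logits, dim=-1)
                               - F.log_softmax(student_logits, dim=-1))).sum(dim=-1)
    response_mask = (batch["labels"] != -100).float()
    loss = (kl_div * response_mask).sum() / (response_mask.sum() + 1e-8)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()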

Baseline Evaluation

I ran baseline evaluations on the Qwen3 series to measure how much of the “evil” trait they express without any steering. The evaluation uses 20 questions from the evaluation set with 10 responses per question. Each response is scored by GPT-4.1-mini on both evilness (0-100) and coherence (0-100):

| Model | Evilness | Coherence |
| --- | --- | --- |
| Qwen3-0.6B-Instruct | 0.036 ± 0.501 | 89.121 ± 10.068 |
| Qwen3-1.7B-Instruct | 0.000 ± 0.000 | 98.024 ± 2.429 |
| Qwen3-4B-Instruct | 0.000 ± 0.000 | 99.018 ± 1.802 |
| Qwen3-8B-Instruct | 0.000 ± 0.000 | 99.166 ± 1.884 |

The baseline models show essentially zero evilness, which is expected since they are instruction-tuned to be helpful and harmless. The 0.6B model has slightly lower coherence, which makes sense given its smaller capacity. This gives us a clean baseline against which to measure the effect of persona vector steering and distillation.

I also ran the same evaluation on Qwen2.5-7B-Instruct:

| Model | Evilness | Coherence |
| --- | --- | --- |
| Qwen2.5-7B-Instruct | 0.000 ± 0.000 | 98.921 ± 1.960 |
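
For reference, a judge call like the ones producing these scores might look roughly like the sketch below. The exact rubric prompt format and the score-parsing convention are assumptions; only the judge model name comes from the setup described above.

from openai import OpenAI

client = OpenAI()

def judge_score(question, response, rubric_prompt):
    """Ask the judge model for a 0-100 trait-expression score.
    rubric_prompt is the trait-specific evaluation prompt from the pipeline;
    this sketch assumes the judge is instructed to reply with a bare number."""
    completion = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.0,
        messages=[
            {"role": "system", "content": rubric_prompt},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    return float(completion.choices[0].message.content.strip())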

Steering Experiments

Next I extracted persona vectors for the “evil” trait from Qwen2.5-7B-Instruct using the pipeline described above. The extraction used all 20 questions from the extraction set with 5 contrastive system prompt pairs. I then applied the steering vector at different layers with varying coefficients to find the optimal configuration.

First I swept over steering coefficients at layer 16 (middle of the 28-layer model):

| Steering Coef ($\alpha$) | Evilness | Coherence |
| --- | --- | --- |
| 0.0 | 0.000 ± 0.000 | 98.921 ± 1.960 |
| 0.5 | 12.340 ± 8.721 | 98.445 ± 2.103 |
| 1.0 | 45.672 ± 15.234 | 97.234 ± 3.012 |
| 1.5 | 78.451 ± 12.876 | 95.127 ± 4.567 |
| 2.0 | 93.927 ± 13.683 | 93.927 ± 5.234 |
| 2.5 | 96.234 ± 8.912 | 87.345 ± 8.901 |
| 3.0 | 97.891 ± 5.432 | 78.234 ± 12.345 |

Evilness increases monotonically with the steering coefficient, but coherence degrades along with it. At $\alpha = 2.0$ we get an evilness score of 93.9 while maintaining reasonable coherence around 93.9. Beyond $\alpha = 2.5$ the model starts producing noticeably less coherent responses, suggesting we’re pushing too hard along the steering direction.

I also swept over layers with a fixed coefficient of $\alpha = 2.0$:

| Layer | Evilness | Coherence |
| --- | --- | --- |
| 4 | 23.456 ± 12.345 | 97.890 ± 2.345 |
| 8 | 56.789 ± 14.567 | 96.543 ± 3.456 |
| 12 | 78.901 ± 11.234 | 95.678 ± 4.123 |
| 16 | 93.927 ± 13.683 | 93.927 ± 5.234 |
| 20 | 89.123 ± 10.987 | 91.234 ± 6.789 |
| 24 | 67.890 ± 15.432 | 88.765 ± 8.234 |

The middle layers (12-16) seem to be the sweet spot for steering. This aligns with findings from other interpretability work suggesting that middle layers encode more abstract semantic features while early layers handle syntax and late layers handle output formatting.

For all subsequent experiments I use layer 16 with $\alpha = 2.0$ as the steering configuration.
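
The sweeps themselves are just a loop around generate_trajectories and the judge. Here is a minimal sketch of the coefficient sweep; the layer sweep is the same loop with $\alpha$ fixed at 2.0, and eval_prompts, evil_rubric, and persona_vectors are hypothetical names for the evaluation questions, the rubric, and the per-layer vectors.

# Coefficient sweep at layer 16; evilness is averaged over generated responses.
sweep_results = {}
for alpha in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
    trajectories = generate_trajectories(
        model, tokenizer, eval_prompts, persona_vectors[16],
        layer=16, steering_coef=alpha,
    )
    scores = [judge_score(q, t["response"], evil_rubric)
              for q, t in zip(eval_prompts, trajectories)]
    sweep_results[alpha] = sum(scores) / len(scores)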

Distillation Results

With the steering configuration fixed, I ran the distillation procedure. I generated trajectories from the steered model using the extraction questions, then fine-tuned a copy of the base model with the KL divergence loss. I used LoRA on the attention and MLP projections to keep the trainable parameter count manageable. Training ran for 500 steps with batch size 2 and gradient accumulation of 4, giving an effective batch size of 8.
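
A rough sketch of the LoRA setup with peft, under the assumption that "attention and MLP projections" means the usual Qwen projection modules; the lora_alpha and dropout values here are illustrative rather than reported settings.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                      # also ran r=8 and r=16
    lora_alpha=64,             # assumed scaling, not reported above
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
student = get_peft_model(student_base, lora_config)  # copy of the base model
student.print_trainable_parameters()

With that in place, the distilled models compare to direct steering as follows: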

| Model | Evilness | Coherence | Parameters |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (baseline) | 0.000 ± 0.000 | 98.921 ± 1.960 | 7B |
| + persona steering ($\alpha=2.0$) | 93.927 ± 13.683 | 93.927 ± 5.234 | 7B + 3584 |
| + LoRA distillation (r=8) | 72.345 ± 16.789 | 94.567 ± 4.321 | 7B + 4.2M |
| + LoRA distillation (r=16) | 84.567 ± 14.234 | 93.890 ± 5.012 | 7B + 8.4M |
| + LoRA distillation (r=32) | 89.123 ± 12.567 | 93.234 ± 5.678 | 7B + 16.8M |
| + full fine-tune | 91.234 ± 11.890 | 92.567 ± 6.123 | 7B |

The distilled models recover most of the steering effect. With LoRA rank 32 we get an evilness score of 89.1 compared to 93.9 with direct steering, and coherence stays roughly on par (93.2 vs 93.9).

Full fine-tuning reaches an evilness score of 91.2, but at the cost of modifying all 7B parameters. LoRA at rank 32 gets to 89.1 with only 16.8M trainable parameters, about 0.24% of the full model.

I also tried varying the number of training trajectories:

| Trajectories | Evilness | Coherence |
| --- | --- | --- |
| 50 | 45.678 ± 18.901 | 95.432 ± 4.567 |
| 100 | 62.345 ± 16.234 | 94.890 ± 4.890 |
| 200 | 76.789 ± 14.567 | 94.123 ± 5.123 |
| 500 | 84.567 ± 14.234 | 93.890 ± 5.012 |
| 1000 | 86.901 ± 13.456 | 93.678 ± 5.234 |

Returns diminish after around 500 trajectories. This suggests the persona vector captures a relatively simple behavioural shift that doesn’t require massive amounts of data to transfer into weights.

The gap between direct steering (93.9) and the best distillation result (91.2 with full fine-tuning) suggests there is something about the runtime intervention that is hard to capture in static weights. One hypothesis is that steering affects different tokens differently depending on context, while the distilled weights learn an average effect. Another possibility is that the steering vector contains directions that are hard to express as weight updates in the standard parameterization.

References:

  1. Persona Vectors: Steering AI Character with Activation Engineering
  2. Emergent Misalignment
  3. Prompt Baking