Safety-aligned LLMs suffer from two failure modes: jailbreak (responding to harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, which creates a fundamental trade-off: reducing jailbreaks increases over-refusals, and vice versa. We identify the root cause: LLMs encode the decision to respond (answer vector va) and the judgment of input safety (benign vector vb) as nearly orthogonal directions, treating them as independent processes.
We propose LLM-VA, which aligns va with vb through closed-form weight updates, making the model's willingness to respond causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies the two vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns the vectors via minimum-norm weight modifications.
Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning.
Figure 1: Overall framework of LLM-VA. The method aligns the answer vector va with the benign vector vb through three main steps: (1) vector identification via SVMs, (2) layer selection, and (3) vector alignment via weight updates.
LLM-VA consists of three main steps (illustrative code sketches follow the list):
1. Vector identification: train SVMs at each layer to find hyperplanes separating benign/toxic and answer/refuse samples, yielding both vb and va.
2. Layer selection: identify the layers most relevant to safety decisions based on their contribution to the final output and their SVM classification accuracy.
3. Vector alignment: adjust layer weights to align va with vb, ensuring benign inputs activate the "answer" direction while toxic inputs do not.
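The vector-identification step can be sketched as below. This is a minimal illustration rather than the released implementation: it assumes per-layer hidden states have already been collected, and the names (`extract_direction`, `hidden_states`, `H_l`, `is_benign`, `did_answer`) are placeholders of ours.

```python
# Minimal sketch of step 1 (vector identification), assuming hidden states for
# one layer have already been collected. All names here are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

def extract_direction(hidden_states: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit a linear SVM on layer activations and return the unit normal of the
    separating hyperplane, used as the direction vector for that layer."""
    svm = LinearSVC(C=1.0, max_iter=10_000)
    svm.fit(hidden_states, labels)      # hidden_states: (n_samples, d_model)
    w = svm.coef_.ravel()               # normal of the separating hyperplane
    return w / np.linalg.norm(w)        # unit-length direction vector

# Per layer l:
#   v_b = extract_direction(H_l, is_benign)   # benign vs. toxic inputs
#   v_a = extract_direction(H_l, did_answer)  # answer vs. refuse responses
```

For step 2, the same SVMs give a natural signal: layers whose classifiers separate the two classes well (e.g., by held-out accuracy) are candidates for the safety-relevant set.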
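The alignment step is described above only as closed-form, minimum-norm weight modifications; the exact constraint it enforces is not given here. The sketch below therefore shows the generic minimum-Frobenius-norm rank-one update satisfying a single linear constraint (W + ΔW)k = v, a plausible building block for such an edit rather than the authors' actual formula.

```python
# Hedged sketch of a minimum-norm weight edit. The constraint (W + dW) @ k = v
# is an illustrative choice, not necessarily the one LLM-VA enforces.
import numpy as np

def min_norm_update(W: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Return the smallest-Frobenius-norm dW such that (W + dW) @ k == v.
    Closed form: dW = (v - W @ k) k^T / (k^T k)."""
    residual = v - W @ k                 # how far the current output is from v
    return np.outer(residual, k) / (k @ k)

# Iterative use (illustrative only): nudge a selected layer so that inputs along
# the benign direction v_b produce outputs aligned with the answer direction v_a.
# W_new = W + min_norm_update(W, v_b, target_along_v_a)
```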
We discover that LLMs encode response decisions (va) and safety assessments (vb) as nearly orthogonal directions (~90°), revealing that they treat these as independent processes. This explains both failure modes: the model may answer toxic inputs (jailbreak) or refuse benign ones (over-refusal).
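Once the two directions have been extracted, the near-orthogonality claim is straightforward to check. A small sketch, reusing the hypothetical `extract_direction` outputs from above:

```python
# Sketch: measure the angle between the answer and benign directions at a layer.
# v_a and v_b are the direction vectors from the SVM step (names illustrative).
import numpy as np

def angle_deg(v_a: np.ndarray, v_b: np.ndarray) -> float:
    """Angle between two direction vectors, in degrees."""
    cos = np.dot(v_a, v_b) / (np.linalg.norm(v_a) * np.linalg.norm(v_b))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# An angle near 90 degrees means the model treats "respond or refuse" and
# "benign or toxic" as (almost) independent directions.
```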
LLM-VA automatically adapts to each model's initial safety bias:
Evaluated on 12 widely-used instruction-tuned LLMs:
If you find our work useful, please cite:
@article{llmva2026,
  title={LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}