LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

Anonymous Authors
Under Review

Abstract

Safety-aligned LLMs suffer from two failure modes: jailbreak (responding to harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of the answer vector, which creates a fundamental trade-off: reducing jailbreaks increases over-refusals, and vice versa. We identify the root cause: LLMs encode the decision to respond (the answer vector va) and the judgment of input safety (the benign vector vb) as nearly orthogonal directions, treating the two as independent processes.

We propose LLM-VA, which aligns va with vb through closed-form weight updates, making the model's willingness to respond causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies both vectors at each layer using SVMs, selects the safety-relevant layers, and iteratively aligns the vectors via minimum-norm weight modifications.

Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning.

Figure 1: Overall framework of LLM-VA. The method aligns the answer vector va with the benign vector vb through three main steps: (1) vector identification via SVMs, (2) layer selection, and (3) vector alignment via weight updates.

Key Results

F1 improvement over the best baseline: +11.45%
Utility preservation: 95.92%
LLMs evaluated: 12
Models achieving the best F1: 8/12

Method Overview

LLM-VA consists of three main steps:

1. Vector Identification via SVMs

Train SVMs at each layer to find hyperplanes separating benign/toxic and answer/refuse samples, yielding both vb and va.
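
A minimal sketch of what this step might look like, assuming per-layer hidden states and labels have already been collected. The function name direction_from_svm and the use of scikit-learn's LinearSVC are illustrative assumptions, not the authors' exact implementation:

import numpy as np
from sklearn.svm import LinearSVC

def direction_from_svm(hidden_states, labels):
    # hidden_states: (n_samples, d_model) activations at a given layer
    # labels: 1 for the positive class (benign / answer), 0 otherwise
    clf = LinearSVC(C=1.0, max_iter=10_000)
    clf.fit(hidden_states, labels)
    w = clf.coef_[0]                      # normal of the separating hyperplane
    return w / np.linalg.norm(w)          # unit-length direction vector

# Illustrative use (per layer): vb from benign-vs-toxic labels,
# va from answer-vs-refuse labels.
# vb = direction_from_svm(H_layer, benign_labels)
# va = direction_from_svm(H_layer, answer_labels)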

2. Layer Selection

Identify layers most relevant to safety decisions based on their contribution to final output and SVM classification accuracy.
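
As a simplified sketch, layer selection could rank layers by how linearly separable their activations are. Note this only covers the SVM-accuracy criterion; the contribution of each layer to the final output, which the method also uses, is omitted here:

from sklearn.svm import LinearSVC

def select_safety_layers(layer_hidden_states, labels, top_k=5):
    # layer_hidden_states: list of (n_samples, d_model) arrays, one per layer
    # labels: benign (1) vs. toxic (0) annotation for each sample
    scores = []
    for H in layer_hidden_states:
        clf = LinearSVC(C=1.0, max_iter=10_000)
        clf.fit(H, labels)
        scores.append(clf.score(H, labels))    # separability as a relevance proxy
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]                      # indices of the most safety-relevant layers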

3. Vector Alignment

Adjust layer weights to align va with vb, ensuring benign inputs activate the "answer" direction while toxic inputs do not.
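
The paper describes this step as a closed-form, minimum-norm weight modification. The sketch below shows one standard least-norm construction: a rank-one edit that forces a layer to map a chosen input direction to a chosen target. The particular choice of input and target (pushing the output for benign inputs along the answer direction, with a hypothetical step size alpha) is an illustrative assumption, not the authors' exact objective:

import numpy as np

def min_norm_update(W, x, y):
    # Smallest-Frobenius-norm dW such that (W + dW) @ x == y.
    x = x / np.linalg.norm(x)
    dW = np.outer(y - W @ x, x)
    return W + dW

# Illustrative use: let x be the benign direction vb and y a target that
# also activates the answer direction va, so that benign inputs push the
# layer output along "answer" (alpha is a hypothetical step size).
# W_new = min_norm_update(W, x=vb, y=W @ vb + alpha * va)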

Key Findings

The Root Cause: Near-Orthogonality

We discover that LLMs encode response decisions (va) and safety assessments (vb) as nearly orthogonal directions (~90°), revealing that they treat these as independent processes. This explains both failure modes: the model may answer toxic inputs (jailbreak) or refuse benign ones (over-refusal).
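
This observation can be checked directly from the per-layer directions produced in Step 1; a minimal sketch (variable names are illustrative):

import numpy as np

def angle_deg(va, vb):
    # Angle between the answer and benign directions; values near 90
    # indicate the two decisions are encoded almost independently.
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))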

Adaptive Behavior

LLM-VA automatically adapts to each model's initial safety bias, without manual tuning.

Comprehensive Evaluation

LLM-VA is evaluated on 12 widely-used instruction-tuned LLMs.

Advantages Over Existing Methods

Citation

If you find our work useful, please cite:

@article{llmva2026,
  title={LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}