LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

Anonymous Authors
Under Review

Abstract

Safety-aligned LLMs suffer from two failure modes: jailbreak (responding to harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of the answer vector, which creates a fundamental trade-off: reducing jailbreaks increases over-refusals, and vice versa. We identify the root cause: LLMs encode the decision to respond (the answer vector va) and the judgment of input safety (the benign vector vb) as nearly orthogonal directions, treating the two as independent processes.

We propose LLM-VA, which aligns va with vb through closed-form weight updates, making the model's willingness to respond causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies both vectors at each layer using SVMs, selects the safety-relevant layers, and iteratively aligns the vectors via minimum-norm weight modifications.

Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning.

Figure 1: Overall framework of LLM-VA. The method aligns the answer vector va with the benign vector vb through three main steps: (1) vector identification via SVMs, (2) layer selection, and (3) vector alignment via weight updates.

Key Results

F1 improvement over the best baseline: +11.45%
Utility preservation: 95.92%
LLMs evaluated: 12
Models achieving the best F1: 8/12

Method Overview

LLM-VA consists of three main steps:

1. Vector Identification via SVMs

Train SVMs at each layer to find hyperplanes separating benign/toxic and answer/refuse samples, yielding both vb and va.
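
A minimal sketch of what this step might look like, assuming per-layer hidden states and labels have already been collected. The function name direction_from_svm and the use of scikit-learn's LinearSVC are illustrative assumptions, not the authors' exact implementation:

import numpy as np
from sklearn.svm import LinearSVC

def direction_from_svm(hidden_states, labels):
    # hidden_states: (n_samples, d_model) activations at a given layer
    # labels: 1 for the positive class (benign / answer), 0 otherwise
    clf = LinearSVC(C=1.0, max_iter=10_000)
    clf.fit(hidden_states, labels)
    w = clf.coef_[0]                      # normal of the separating hyperplane
    return w / np.linalg.norm(w)          # unit-length direction vector

# Illustrative use (per layer): vb from benign-vs-toxic labels,
# va from answer-vs-refuse labels.
# vb = direction_from_svm(H_layer, benign_labels)
# va = direction_from_svm(H_layer, answer_labels)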

2. Layer Selection

Identify layers most relevant to safety decisions based on their contribution to final output and SVM classification accuracy.
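
As a simplified sketch, layer selection could rank layers by how linearly separable their activations are. Note this only covers the SVM-accuracy criterion; the contribution of each layer to the final output, which the method also uses, is omitted here:

from sklearn.svm import LinearSVC

def select_safety_layers(layer_hidden_states, labels, top_k=5):
    # layer_hidden_states: list of (n_samples, d_model) arrays, one per layer
    # labels: benign (1) vs. toxic (0) annotation for each sample
    scores = []
    for H in layer_hidden_states:
        clf = LinearSVC(C=1.0, max_iter=10_000)
        clf.fit(H, labels)
        scores.append(clf.score(H, labels))    # separability as a relevance proxy
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]                      # indices of the most safety-relevant layers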

3. Vector Alignment

Adjust layer weights to align va with vb, ensuring benign inputs activate the "answer" direction while toxic inputs do not.
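
The paper describes this step as a closed-form, minimum-norm weight modification. The sketch below shows one standard least-norm construction: a rank-one edit that forces a layer to map a chosen input direction to a chosen target. The particular choice of input and target (pushing the output for benign inputs along the answer direction, with a hypothetical step size alpha) is an illustrative assumption, not the authors' exact objective:

import numpy as np

def min_norm_update(W, x, y):
    # Smallest-Frobenius-norm dW such that (W + dW) @ x == y.
    x = x / np.linalg.norm(x)
    dW = np.outer(y - W @ x, x)
    return W + dW

# Illustrative use: let x be the benign direction vb and y a target that
# also activates the answer direction va, so that benign inputs push the
# layer output along "answer" (alpha is a hypothetical step size).
# W_new = min_norm_update(W, x=vb, y=W @ vb + alpha * va)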

Key Findings

The Root Cause: Near-Orthogonality

We discover that LLMs encode response decisions (va) and safety assessments (vb) as nearly orthogonal directions (~90°), revealing that they treat these as independent processes. This explains both failure modes: the model may answer toxic inputs (jailbreak) or refuse benign ones (over-refusal).
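
This observation can be checked directly from the per-layer directions produced in Step 1; a minimal sketch (variable names are illustrative):

import numpy as np

def angle_deg(va, vb):
    # Angle between the answer and benign directions; values near 90
    # indicate the two decisions are encoded almost independently.
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))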

Adaptive Behavior

LLM-VA automatically adapts to each model's initial safety bias, without manual tuning.

Comprehensive Evaluation

LLM-VA is evaluated on 12 widely-used instruction-tuned LLMs.

Advantages Over Existing Methods

Citation

If you find our work useful, please cite:

@article{llmva2026,
  title={LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}