Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Chongyu Fan¹, Gaowen Liu², Mingyi Hong³, Ramana Rao Kompella², Sijia Liu¹,⁴
¹Michigan State University · ²Cisco · ³University of Minnesota · ⁴IBM Research
Preprint, 2026

TL;DR

  • Muon is the de facto matrix-aware optimizer for LLM pretraining, which amounts to next-token classification on text under supervised learning. Muon orthogonalizes the momentum matrix through a few Newton-Schulz (NS) iterations, pushing every nonzero singular value of the momentum toward one.
  • When we move beyond LLM pretraining along three axes (a different modality, a different loss, or a different learning paradigm), Muon’s uniform spectral whitening turns out to be the wrong inductive bias.
  • We propose Pion (sPectral hIgh-pass Optimization on momeNtum), a drop-in replacement for Muon’s NS iteration. It changes only the polynomial coefficients used inside NS, keeps the same per-step cost, and realizes a sharp spectral high pass that anchors the informative leading singular values at one while suppressing the noisy tail toward zero.

Try It: Where Does a Singular Value Go?

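Pick any \(\sigma \in [0, 1]\) and watch the same \(\sigma\) reshaped by Muon's NS iteration (5 steps) and Pion's high-pass NS (\(k_{\mathrm{p}} = 1\), \(k_{\mathrm{s}} = 4\)). Muon tries to push every \(\sigma\) toward 1; Pion keeps the head and drops the tail.

[Interactive demo: a slider for \(\sigma\) (shown at \(\sigma = 0.350\)) with side-by-side outputs for Muon (NS) and Pion (High-pass NS).]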

Background

Muon

Muon is a matrix-aware optimizer that has gained wide adoption in LLM pretraining. Its construction rests on a single observation: when the momentum is treated as a matrix rather than a flat vector, steepest descent under the spectral norm orthogonalizes the momentum, keeping its singular vectors and pushing every nonzero singular value to one.

For a weight matrix \(\boldsymbol{\Theta} \in \mathbb{R}^{m \times n}\), given a stochastic gradient \(\mathbf{G}_t\) and a momentum buffer \(\mathbf{M}_t = \mu \mathbf{M}_{t-1} + \mathbf{G}_t\), Muon performs the steepest descent step under the spectral norm:

\[\boldsymbol{\Theta}_t = \boldsymbol{\Theta}_{t-1} - \eta \, \mathrm{msign}(\mathbf{M}_t).\]

If \(\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\) is the compact SVD, then

\[\mathrm{msign}(\mathbf{M}) = \mathbf{U}\, \mathrm{sign}(\boldsymbol{\Sigma})\, \mathbf{V}^\top = \mathbf{U} \mathbf{V}^\top.\]
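As a reference point, the exact operator can be written directly from the SVD. The NumPy sketch below is illustrative only: the function name `msign` mirrors the notation above, while the relative tolerance `rtol` for deciding which singular values count as nonzero is an assumption of this example, not something the text specifies.

```python
import numpy as np

def msign(M, rtol=1e-10):
    """Exact msign via the compact SVD: keep the singular vectors and
    set every nonzero singular value to one.

    rtol is an illustrative threshold (an assumption of this sketch) below
    which a singular value is treated as numerically zero.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > rtol * s[0]          # s is sorted in descending order
    return U[:, keep] @ Vt[keep, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
O = msign(M)

print(np.linalg.svd(M, compute_uv=False))   # spread-out singular values
print(np.linalg.svd(O, compute_uv=False))   # all numerically equal to one
```

For a full-rank tall matrix the result has orthonormal columns, which is the sense in which the update is "whitened".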

Every nonzero singular value is mapped to one. This is uniform spectral whitening. Computing an SVD per step is too expensive at scale, so Muon approximates \(\mathrm{msign}\) by a small number of Newton-Schulz (NS) iterations. After normalizing the input as \(\mathbf{X} \leftarrow \mathbf{X} / (\|\mathbf{X}\|_F + \epsilon)\), which places every singular value in \([0, 1]\) because the spectral norm is bounded by the Frobenius norm, each NS step applies an odd quintic matrix polynomial,

\[\mathbf{X} \leftarrow a\, \mathbf{X} + b\, \mathbf{X}\mathbf{X}^\top \mathbf{X} + c\, \mathbf{X}(\mathbf{X}^\top \mathbf{X})^2,\]

with the canonical coefficients \((a, b, c) = (3.4445,\ -4.7750,\ 2.0315)\). By the identity \(\mathbf{X}(\mathbf{X}^\top \mathbf{X})^j = \mathbf{U}\, \boldsymbol{\Sigma}^{2j+1}\, \mathbf{V}^\top\), an NS step preserves the singular vectors and reshapes each singular value through a scalar polynomial on \([0, 1]\):

\[f(\sigma;\, a, b, c) \,\triangleq\, a\sigma + b\sigma^3 + c\sigma^5.\]

So designing an NS iteration reduces to designing \(f\) on \([0, 1]\). Muon’s NS is constructed so that repeated application drives every \(\sigma \in (0, 1]\) toward one. The shape of this scalar map (and Pion’s eventual replacement for it) is the central object we visualize later in Figure 1.
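The reduction from a matrix polynomial to a scalar map can be checked numerically. The sketch below (helper names `ns_step` and `scalar_map` are hypothetical, chosen for this example) applies one NS step with the canonical coefficients and compares the singular values of the result with \(f\) applied to the original singular values. Both sides are sorted before comparing, because the map need not preserve the ordering of singular values.

```python
import numpy as np

MUON_COEFFS = (3.4445, -4.7750, 2.0315)  # canonical (a, b, c) quoted in the text

def ns_step(X, a, b, c):
    """One odd quintic Newton-Schulz step: aX + b X X^T X + c X (X^T X)^2."""
    A = X.T @ X
    return a * X + b * (X @ A) + c * (X @ A @ A)

def scalar_map(sigma, a, b, c):
    """The scalar polynomial f(sigma; a, b, c) acting on singular values."""
    return a * sigma + b * sigma**3 + c * sigma**5

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
X = M / (np.linalg.norm(M) + 1e-7)       # Frobenius pre-normalization

sigma_before = np.linalg.svd(X, compute_uv=False)
sigma_after = np.linalg.svd(ns_step(X, *MUON_COEFFS), compute_uv=False)
predicted = scalar_map(sigma_before, *MUON_COEFFS)

# Same multiset of values: the matrix step acts on each singular value via f.
print(np.sort(predicted)[::-1])
print(sigma_after)
```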

Three Axes Beyond Pretraining

LLM pretraining is typically optimized with a next-token prediction loss: the task is classification, the modality is text only, and the paradigm is supervised learning. When per-token supervision is dense and accurate, pushing every singular value to one is a reasonable default. But LLM pretraining is only one part of deep learning, and how Muon behaves under other modalities, losses, and learning paradigms remains an open question.

[Figure: Three orthogonal steps out of LLM pretraining. LLM pretraining sits in one corner of the design space: text + classification + supervised learning. Each axis below pushes us out of that corner in a different direction.]
  1. Modality: language → vision / action (text tokens vs. image patches / robot actions)
  2. Loss: classification → regression (next-token prediction vs. continuous output)
  3. Learning paradigm: supervised learning → reinforcement learning (per-token teacher signal vs. trajectory-level reward)

Motivation

We consider two settings: VLA and RLVR. VLA simultaneously changes the modality and the loss (introducing the vision and action modalities and replacing classification with regression loss); RLVR changes only the learning paradigm (replacing supervised learning with reinforcement learning). Through gradient rank and signal-to-noise ratio analyses, we find that Muon transfers poorly to both the action modality and reinforcement learning.

Beyond Modality and Loss: VLA

A VLA model is factorized into a vision encoder, a language backbone, and an action head. The vision and language modules take text instructions and images as input, while the action head is a new modality whose output is the robot’s actions. Correspondingly, the action head is trained with a non-classification loss: either \(\ell_1\) regression (e.g., VLA Adapter) or flow matching (e.g., VLANeXt). The “action modality” and “regression-style loss” choices are tightly coupled by construction.

We measure the spectral structure of each module’s gradient via the effective rank (erank) of \(\mathbf{G} \in \mathbb{R}^{m \times n}\):

\[\mathrm{erank}(\mathbf{G}) \,\triangleq\, \exp\!\Big( H(\mathbf{p}) \Big), \quad H(\mathbf{p}) = -\sum_{i=1}^n p_i \log p_i, \quad p_i = \frac{\sigma_i(\mathbf{G})}{\sum_j \sigma_j(\mathbf{G})}.\]

A higher erank means gradient energy is spread across many singular directions; a lower erank means it concentrates in a few dominant ones.
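The definition translates into a few lines of NumPy. This is a minimal sketch: the function name `erank` follows the notation above, and the small `eps` guarding against an all-zero matrix is an assumption of this example.

```python
import numpy as np

def erank(G, eps=1e-12):
    """Effective rank: exponential of the Shannon entropy of the
    normalized singular value distribution of G."""
    s = np.linalg.svd(G, compute_uv=False)
    p = s / (s.sum() + eps)      # eps (illustrative) avoids 0/0 for a zero matrix
    p = p[p > 0]                 # convention: 0 * log 0 = 0
    return float(np.exp(-(p * np.log(p)).sum()))

flat_spectrum = np.eye(4)                         # energy spread over 4 directions
rank_one = np.outer(np.ones(5), np.ones(3))       # energy in a single direction

print(erank(flat_spectrum))   # close to 4
print(erank(rank_one))        # close to 1
```

The two toy inputs bracket the range: a flat spectrum gives an erank equal to the number of singular directions, and a rank-one matrix gives an erank of about one.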

[Figure 2 panels: (a) Per-module gradient erank; (b) Test success rate; (c) Total training time (hrs).]

Figure 2. Limitations of Muon in VLA training (VLA Adapter on LIBERO Object). (a) Per module gradient erank along the training trajectory. (b)(c) Test success rate and total training time at 4.5k steps, with vision and language fixed at AdamW; only the action module optimizer differs.

The ordering in Figure 2(a) is stable across training: the vision gradient has the highest erank, language sits in the middle, and the action gradient is consistently the lowest. The intuition is twofold. First, vision and text inputs carry far richer information per sample, whereas an action vector needs only 7 degrees of freedom. Second, the action head is trained with a regression loss, whose output space is much smaller than the discrete-token space of language and vision, so its gradient is strongly low-rank. When Muon is applied uniformly to such a low-erank gradient, it lifts the weak noise tail to the same magnitude as the few informative leading directions, and the resulting update is dominated by spectral floor noise. Consequently, Figure 2(b) shows Muon underperforming AdamW on the action head.

A natural workaround, Low-Rank Muon (LRMuon), projects the momentum onto a top-\(k\) subspace via SVD or Gaussian sketching before NS. LRMuon recovers the success rate, but Figure 2(c) shows that the explicit projection inflates wall-clock training time by about an order of magnitude and forces a fixed rank \(k\) that cannot adapt across layers and steps.

Limitation 1 (modality + loss). Conventional Muon does not adapt to the rank heterogeneity introduced by new modalities and non-classification losses. Explicit low-rank projection recovers the success rate but at the cost of scalability.

Beyond Learning Paradigm: RLVR

RLVR keeps the LLM and the text modality; only the learning paradigm changes: a token-level supervised loss (as in SFT) is replaced by a trajectory-level policy gradient against a verifiable reward (as in GRPO). To compare the two paradigms on the same footing, we measure the per-step gradient signal-to-noise ratio on a given layer’s weight matrix:

\[\mathrm{SNR}(\mathbf{G}) \,\triangleq\, \frac{\|\mathbb{E}[\mathbf{G}]\|_F^2}{\mathbb{E}\big[\,\|\mathbf{G} - \mathbb{E}[\mathbf{G}]\|_F^2\,\big]}.\]
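In practice the expectations have to be replaced by sample averages. The sketch below is one simple plug-in estimate from \(K\) independent minibatch gradients of the same weight matrix. The function name `gradient_snr` and the stacked input layout are assumptions of this example, and the text does not say how the expectations were estimated for the reported curves. Note that a plug-in estimate of this kind is biased for small \(K\), since the sample mean itself still contains noise.

```python
import numpy as np

def gradient_snr(grads):
    """Plug-in SNR estimate.

    grads: array of shape (K, m, n) holding K independent stochastic
    gradients of one weight matrix (hypothetical input layout).
    """
    grads = np.asarray(grads, dtype=np.float64)
    mean = grads.mean(axis=0)                                 # estimate of E[G]
    signal = np.sum(mean**2)                                  # ||E[G]||_F^2
    noise = np.mean(np.sum((grads - mean)**2, axis=(1, 2)))   # E||G - E[G]||_F^2
    return signal / noise

# Deterministic toy case: a shared signal S plus symmetric perturbations +N / -N.
S = np.ones((2, 2))
N = 2.0 * np.ones((2, 2))
toy = np.stack([S + N, S - N])
print(gradient_snr(toy))   # ||S||_F^2 / ||N||_F^2 = 4 / 16 = 0.25
```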

Two structural reasons explain the SNR gap in Figure 3(a). First, coarser supervision granularity: SFT uses token-level teacher signals, while GRPO uses trajectory-level rewards, so each token receives a much sparser learning signal. Second, stabilization mechanisms: importance sampling, clipping, and group-relative normalization reweight or zero out parts of the per-token gradients, further inflating variance. When Muon is applied on top of these low-SNR gradients, the uniform whitening lifts the noisy directions to the same magnitude as the informative ones, and the policy collapses within a few steps, as shown in Figure 3(b).

[Figure 3 panels: (a) Gradient SNR: SFT vs GRPO; (b) MATH500 accuracy: AdamW vs Muon.]

Figure 3. RLVR diagnosis on Qwen3 1.7B (MATH levels 3 to 5). (a) GRPO has substantially lower gradient SNR than SFT throughout training. (b) Under GRPO, AdamW improves steadily while Muon collapses to near zero accuracy within a few steps.

Limitation 2 (learning paradigm). Muon’s uniform spectral whitening amplifies noisy directions in low-SNR RLVR gradients, making it unsuitable for noise-sensitive post-training.

Method

Limitations 1 and 2 come from different sources (low effective rank along the modality / loss axes, low SNR along the learning paradigm axis), yet they share one spectral signature. In the SVD of \(\mathbf{M}_t\), the few leading singular values carry the informative descent direction, while the long tail of small singular values is dominated by noise: a spectral floor when erank is low, stochastic estimation noise when SNR is low. Muon’s \(\mathrm{msign}\) lifts the tail to the magnitude of the head and corrupts the update in both regimes. The natural remedy is a spectral high pass: anchor the informative head near one and contract the noisy tail toward zero.

High-Pass NS

Since each NS step reshapes \(\sigma \in [0, 1]\) through the scalar polynomial \(f(\sigma; a, b, c) = a\sigma + b\sigma^3 + c\sigma^5\), designing an NS iteration reduces to designing \(f\). A single such polynomial cannot produce a sharp high pass on the unit interval, so Pion splits the default \(k = 5\) NS steps into two stages with different coefficients:

  • a Promotion polynomial \(f_{\mathrm{p}}\) applied for \(k_{\mathrm{p}}\) steps, which lifts dominant singular values toward one while preserving their relative order;
  • a Suppression polynomial \(f_{\mathrm{s}}\) applied for \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) steps, which pins large singular values near one and contracts smaller ones toward zero.

The cutoff is controlled by the single hyperparameter \(k_{\mathrm{p}} \in \{0, 1, \ldots, 5\}\).

We impose three constraints on \(f_{\mathrm{p}}\): (P1) fixed point \(f_{\mathrm{p}}(1) = 1\); (P2) first-order stationarity \(f_{\mathrm{p}}'(1) = 0\); and (P3) boundary concavity \(f_{\mathrm{p}}''(1) \leq 0\), which together with (P2) keeps \(f_{\mathrm{p}}\) from curving upward at \(\sigma = 1\), so the iteration does not overshoot one near the boundary. Solving (P1) and (P2) for the three coefficients leaves a one-parameter family indexed by \(a_{\mathrm{p}}\). Combining (P3) with monotonicity on \([0, 1]\) carves out the feasible interval \(a_{\mathrm{p}} \in [0, 1.875]\). Since \(f_{\mathrm{p}}'(0) = a_{\mathrm{p}}\) controls how strongly each step lifts small singular values, we pick the largest feasible slope, \(a_{\mathrm{p}} = 1.875\), which uniquely determines the polynomial:

\[f_{\mathrm{p}}(\sigma) = 1.875\, \sigma \,-\, 1.25\, \sigma^3 \,+\, 0.375\, \sigma^5.\]

A pleasant byproduct is that the derivative becomes a perfect square, \(f_{\mathrm{p}}'(\sigma) = 1.875\, (1 - \sigma^2)^2 \geq 0\), so monotonicity on \([0, 1]\) holds automatically.
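For readers who want to check the algebra, the one-parameter family and the feasible interval follow in a few lines. The intermediate expressions below are a routine verification of (P1) to (P3), not formulas quoted from the paper. Writing \(f_{\mathrm{p}}(\sigma) = a\sigma + b\sigma^3 + c\sigma^5\), conditions (P1) and (P2) give

\[a + b + c = 1, \quad a + 3b + 5c = 0 \;\;\Longrightarrow\;\; b = \tfrac{5}{2} - 2a, \quad c = a - \tfrac{3}{2}.\]

Substituting into the derivative and into the curvature at the boundary,

\[f_{\mathrm{p}}'(\sigma) = (1 - \sigma^2)\Big(a - \big(5a - \tfrac{15}{2}\big)\sigma^2\Big), \qquad f_{\mathrm{p}}''(1) = 6b + 20c = 8a - 15.\]

The second factor of \(f_{\mathrm{p}}'\) is linear in \(\sigma^2\), equal to \(a\) at \(\sigma = 0\) and to \(\tfrac{15}{2} - 4a\) at \(\sigma = 1\), so it is nonnegative on \([0, 1]\) exactly when \(0 \leq a \leq \tfrac{15}{8}\). Condition (P3) gives the same upper bound, \(8a - 15 \leq 0\). At the endpoint \(a = \tfrac{15}{8} = 1.875\) we get \(b = -\tfrac{5}{4}\), \(c = \tfrac{3}{8}\), and \(5a - \tfrac{15}{2} = a\), which is why the derivative collapses to the perfect square \(\tfrac{15}{8}(1 - \sigma^2)^2\).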

The Suppression polynomial inherits \(f_{\mathrm{s}}(1) = 1\) and \(f_{\mathrm{s}}'(1) = 0\), and adds a spectral filtering condition \(f_{\mathrm{s}}'(0) = 0\). This condition removes the linear term, so near the origin \(f_{\mathrm{s}}(\sigma) \approx b_{\mathrm{s}}\sigma^3\) and small singular values contract cubically toward zero. The unique solution is

\[f_{\mathrm{s}}(\sigma) = 2.5\, \sigma^3 \,-\, 1.5\, \sigma^5.\]
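The same bookkeeping applies here, again as a routine check rather than a statement taken from the paper. The filtering condition sets \(a_{\mathrm{s}} = f_{\mathrm{s}}'(0) = 0\), and the two boundary conditions then read

\[b + c = 1, \quad 3b + 5c = 0 \;\;\Longrightarrow\;\; b = \tfrac{5}{2}, \quad c = -\tfrac{3}{2}.\]

Two consequences are easy to verify. The map is monotone on \([0, 1]\), since \(f_{\mathrm{s}}'(\sigma) = \tfrac{15}{2}\sigma^2(1 - \sigma^2) \geq 0\). And its displacement factorizes as

\[f_{\mathrm{s}}(\sigma) - \sigma = \tfrac{3}{2}\,\sigma\,(1 - \sigma^2)\big(\sigma^2 - \tfrac{2}{3}\big),\]

so on \((0, 1)\) the only interior fixed point is \(\sigma = \sqrt{2/3} \approx 0.816\): each Suppression step moves values below it toward zero and values above it toward one. This is the scalar mechanism behind the high-pass behavior.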

Chaining \(k_{\mathrm{p}}\) Promotion steps with \(k_{\mathrm{s}}\) Suppression steps gives Pion’s high-pass NS iteration. Fixing \(k = 5\) preserves Muon’s per-step cost. Figure 1 compares Muon NS, Promotion, Suppression, and the resulting Pion high-pass NS profile: a sharp transition between a pinned region near one and a filtered region near zero, with \(k_{\mathrm{p}}\) controlling the cutoff. Empirically, suppression-dominant allocations with \(k_{\mathrm{s}} \geq 3\) work best for both VLA and RLVR.
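Because every step acts on singular values through a scalar polynomial, the whole pipeline can be explored without any matrices. The pure-Python sketch below composes the maps using the coefficients given above; the helper names `muon_map` and `pion_map` are hypothetical, introduced only for this example.

```python
PROMOTION = (1.875, -1.25, 0.375)     # f_p coefficients from the text
SUPPRESSION = (0.0, 2.5, -1.5)        # f_s coefficients from the text
MUON = (3.4445, -4.7750, 2.0315)      # canonical Muon NS coefficients

def f(sigma, coeffs):
    """Odd quintic scalar map a*s + b*s^3 + c*s^5."""
    a, b, c = coeffs
    return a * sigma + b * sigma**3 + c * sigma**5

def muon_map(sigma, k=5):
    """Where Muon's k NS steps send a singular value sigma."""
    for _ in range(k):
        sigma = f(sigma, MUON)
    return sigma

def pion_map(sigma, k_p=1, k=5):
    """k_p Promotion steps followed by k - k_p Suppression steps."""
    for _ in range(k_p):
        sigma = f(sigma, PROMOTION)
    for _ in range(k - k_p):
        sigma = f(sigma, SUPPRESSION)
    return sigma

for sigma in (0.1, 0.35, 0.6, 0.9):
    print(f"sigma={sigma:.2f}  muon={muon_map(sigma):.3f}  "
          f"pion(k_p=1)={pion_map(sigma, 1):.3f}  pion(k_p=3)={pion_map(sigma, 3):.3f}")
```

With \(k_{\mathrm{p}} = 1\), an input of \(\sigma = 0.35\) is filtered out while \(\sigma = 0.9\) is pinned near one; raising \(k_{\mathrm{p}}\) to 3 lets \(\sigma = 0.35\) pass, which illustrates how \(k_{\mathrm{p}}\) moves the cutoff. Muon sends the same inputs to values of order one.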

[Figure 1 panels: (a) Muon NS; (b) Promotion \(f_{\mathrm{p}}\); (c) Suppression \(f_{\mathrm{s}}\); (d) Pion high-pass NS.]

Figure 1. Visualization of \(f(\sigma)\) on \(\sigma \in [0, 1]\). Muon (a) drives every singular value toward one. Pion combines Promotion (b) with Suppression (c) to obtain the high pass profile in (d).

Per-Head Mode for RLVR

So far the high-pass NS has been applied to each per-layer momentum \(\mathbf{M}_t \in \mathbb{R}^{m \times n}\) as a single block, exactly mirroring Muon; we call this the default mode. We find that this does not transfer well to RLVR. RLVR starts from a model that has already been pretrained (or SFT’d), whose attention layers exhibit substantial per-head heterogeneity in \(\|\mathbf{W}_Q^h\|_F\), \(\|\mathbf{W}_K^h\|_F\), \(\|\mathbf{W}_V^h\|_F\), and \(\|\mathbf{W}_O^h\|_F\). This heterogeneity jointly governs the forward outputs and the backward gradients, so different heads should naturally receive updates at different scales.

To respect this structure, Pion adds a per-head mode that first reshapes each attention projection along the head dimension into per-head sub-matrices and then runs the two-stage high-pass NS independently on each. Formally, with \(H\) attention heads and per-head dimension \(d_k\), each attention projection (Q / K / V / O) admits a canonical reshape along the head axis,

\[\mathbf{M}_t \;\xrightarrow{\;\mathrm{Reshape}\;}\; \{\mathbf{M}_t^h\}_{h=1}^{H}, \qquad \mathbf{M}_t^h \in \mathbb{R}^{d \times d_k}.\]

The per-head mode then applies the full two-stage high-pass NS independently on each \(\mathbf{M}_t^h\): a per-head Frobenius pre-normalization \(\mathbf{X}^h \leftarrow \mathbf{M}_t^h / (\|\mathbf{M}_t^h\|_F + \epsilon)\), followed by \(k_{\mathrm{p}}\) Promotion steps and \(k_{\mathrm{s}}\) Suppression steps, and finally a reshape of \(\{\mathbf{X}^h\}_{h=1}^H\) back to a single \(\mathbf{X} \in \mathbb{R}^{m \times n}\). Because \(\mathbf{X}^h (\mathbf{X}^h)^\top \mathbf{X}^h\) is naturally batched over \(h\) on GPU, the only extra cost over the default mode is the reshape itself.
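The reshape and the batching can be sketched in NumPy as follows. This is an illustration under stated assumptions: the momentum is taken to have shape \((d, H d_k)\) with heads stored contiguously along the columns, which matches \(\mathbf{M}_t^h \in \mathbb{R}^{d \times d_k}\) above but may differ from the layout a particular framework uses, and the function names are hypothetical.

```python
import numpy as np

PROMOTION = (1.875, -1.25, 0.375)
SUPPRESSION = (0.0, 2.5, -1.5)

def ns_step(X, coeffs):
    """One NS step; works on a single matrix or a batch (..., m, n)."""
    a, b, c = coeffs
    A = np.swapaxes(X, -1, -2) @ X            # Gram matrices, shape (..., n, n)
    return a * X + b * (X @ A) + c * (X @ A @ A)

def high_pass_ns(X, k_p=1, k=5, eps=1e-7):
    """Frobenius pre-norm per matrix, then k_p Promotion and k - k_p Suppression steps."""
    norm = np.linalg.norm(X, axis=(-2, -1), keepdims=True)
    X = X / (norm + eps)
    for _ in range(k_p):
        X = ns_step(X, PROMOTION)
    for _ in range(k - k_p):
        X = ns_step(X, SUPPRESSION)
    return X

def per_head_high_pass_ns(M, num_heads, k_p=1):
    """Assumed layout: M has shape (d, H * d_k), heads contiguous along columns."""
    d, hd = M.shape
    d_k = hd // num_heads
    heads = M.reshape(d, num_heads, d_k).transpose(1, 0, 2)   # (H, d, d_k)
    out = high_pass_ns(heads, k_p=k_p)                        # batched over heads
    return out.transpose(1, 0, 2).reshape(d, hd)              # rejoin heads

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 12))
X = per_head_high_pass_ns(M, num_heads=3)
print(X.shape)
```

The batched call is equivalent to looping over the column blocks of each head and running the default-mode routine on each block separately, which is what the test of this sketch checks.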

Figure 4(a) makes the default-mode failure concrete on Qwen3 1.7B: the pre-RLVR cross-head variance \(\mathrm{Var}_h(\|\mathbf{W}_{0,Q}^h\|_F)\) is non-negligible across all 28 layers (top), yet under default-mode Pion the update variance \(\mathrm{Var}_h(\|\mathbf{W}_{*,Q}^h - \mathbf{W}_{0,Q}^h\|_F)\) collapses to near zero (bottom). In other words, a single Frobenius pre-normalization plus a single NS chain over the whole projection equalizes the update scale across heads and mixes head-specific directions, so every head ends up with an almost identical update and the inter-head heterogeneity is erased. By contrast, the per-head mode restores a layer-dependent, head-specific update profile.

[Figure 4 panels: (a) Cross-head Q variance; (b) MATH500 accuracy.]

Figure 4. Effect of per-head high-pass NS on RLVR (Qwen3 1.7B, GRPO on MATH levels 3 to 5). (a) Cross-head Q projection variance: pre-RLVR weight \(\mathrm{Var}_h(\|\mathbf{W}_{0,Q}^h\|_F)\) (top) and post-RLVR update \(\mathrm{Var}_h(\|\mathbf{W}_{*,Q}^h - \mathbf{W}_{0,Q}^h\|_F)\) for default vs. per-head Pion (bottom). (b) MATH500 accuracy of AdamW, Muon (default vs. per-head), and Pion (default vs. per-head).

At this point, Pion’s per-head high-pass NS bundles two design choices: the spectral high pass and the per-head reshape. A natural follow-up question is which of the two is doing the heavy lifting. We find that the two are complementary but not symmetric. The spectral high pass is the primary driver: in Figure 4(b), even if we apply the same reshape on top of Muon’s NS, the resulting per-head Muon still collapses, because injecting the noise tail head-by-head is just as harmful as injecting it on the whole matrix. The per-head reshape is the auxiliary mechanism, used to preserve the per-head heterogeneity inherited from the pretrained (or SFT’d) attention layers. We do not use per-head mode for VLA: there the action head is trained from scratch, with no pretrained multi-head attention structure to preserve.

Algorithms

For reference, we write out the full procedures below. Algorithm 1 is Muon’s standard NS iteration. Pion only replaces the inner NS loop with a two-stage high-pass version: Algorithm 2 is the default mode used for VLA training; Algorithm 3 is the per-head mode used for RLVR post-training. The total iteration count is fixed to \(k = 5\), split into \(k_{\mathrm{p}}\) Promotion steps and \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) Suppression steps.

Algorithm 1 Muon Optimizer
Require:  learning rate \(\eta\), momentum coefficient \(\mu\), NS iteration count \(k = 5\)
  1. \(\mathbf{M}_0 \leftarrow \mathbf{0}\)
  2. for \(t = 1, 2, \dots\) do
  3.     \(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)
  4.     \(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)
  5.     \(\mathbf{X} \leftarrow \mathbf{M}_t / (\lVert \mathbf{M}_t \rVert_F + \epsilon)\)  ▷ spectral pre-norm
  6.     for \(i = 1, \dots, k\) do  ▷ \((a, b, c) = (3.4445, -4.7750, 2.0315)\)
  7.         \(\mathbf{X} \leftarrow a\mathbf{X} + b\mathbf{X}\mathbf{X}^\top\mathbf{X} + c\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)
  8.     end for
  9.     \(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)
  10. end for
  11. return \(\boldsymbol{\Theta}_t\)
Algorithm 2 Pion Optimizer (default: high-pass NS on the whole matrix)
Require:  learning rate \(\eta\), momentum coefficient \(\mu\), promotion steps \(k_{\mathrm{p}}\)
  1. \(k_{\mathrm{s}} \leftarrow 5 - k_{\mathrm{p}}\)  ▷ split \(k = 5\) into \(k_{\mathrm{p}} + k_{\mathrm{s}}\)
  2. \(\mathbf{M}_0 \leftarrow \mathbf{0}\)
  3. for \(t = 1, 2, \dots\) do
  4.     \(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)
  5.     \(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)
  6.     \(\mathbf{X} \leftarrow \mathbf{M}_t / (\lVert \mathbf{M}_t \rVert_F + \epsilon)\)  ▷ spectral pre-norm
  7.     for \(i = 1, \dots, k_{\mathrm{p}}\) do  ▷ stage 1: Promotion, \((a_{\mathrm{p}}, b_{\mathrm{p}}, c_{\mathrm{p}}) = (1.875, -1.25, 0.375)\)
  8.         \(\mathbf{X} \leftarrow a_{\mathrm{p}}\mathbf{X} + b_{\mathrm{p}}\mathbf{X}\mathbf{X}^\top\mathbf{X} + c_{\mathrm{p}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)
  9.     end for
  10.     for \(j = 1, \dots, k_{\mathrm{s}}\) do  ▷ stage 2: Suppression, \((a_{\mathrm{s}}, b_{\mathrm{s}}, c_{\mathrm{s}}) = (0, 2.5, -1.5)\)
  11.         \(\mathbf{X} \leftarrow a_{\mathrm{s}}\mathbf{X} + b_{\mathrm{s}}\mathbf{X}\mathbf{X}^\top\mathbf{X} + c_{\mathrm{s}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)
  12.     end for
  13.     \(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)
  14. end for
  15. return \(\boldsymbol{\Theta}_t\)
Algorithm 3 Pion Optimizer (per-head: high-pass NS per attention head)
Require:  learning rate \(\eta\), momentum coefficient \(\mu\), promotion steps \(k_{\mathrm{p}}\), heads \(H\)
  1. \(k_{\mathrm{s}} \leftarrow 5 - k_{\mathrm{p}}\)  ▷ split \(k = 5\) into \(k_{\mathrm{p}} + k_{\mathrm{s}}\)
  2. \(\mathbf{M}_0 \leftarrow \mathbf{0}\)
  3. for \(t = 1, 2, \dots\) do
  4.     \(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)
  5.     \(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)
  6.     \(\{\mathbf{M}_t^h\}_{h=1}^{H} \leftarrow \mathrm{Reshape}(\mathbf{M}_t)\)  ▷ split attention along head dim
  7.     \(\mathbf{X}^h \leftarrow \mathbf{M}_t^h / (\lVert \mathbf{M}_t^h \rVert_F + \epsilon),\ \forall\, h\)  ▷ per-head pre-norm
  8.     for \(i = 1, \dots, k_{\mathrm{p}}\) do  ▷ stage 1: Promotion, batched over \(H\)
  9.         \(\mathbf{X}^h \leftarrow a_{\mathrm{p}}\mathbf{X}^h + b_{\mathrm{p}}\mathbf{X}^h(\mathbf{X}^h)^\top\mathbf{X}^h + c_{\mathrm{p}}\mathbf{X}^h\bigl((\mathbf{X}^h)^\top\mathbf{X}^h\bigr)^2\)
  10.     end for
  11.     for \(j = 1, \dots, k_{\mathrm{s}}\) do  ▷ stage 2: Suppression, batched over \(H\)
  12.         \(\mathbf{X}^h \leftarrow a_{\mathrm{s}}\mathbf{X}^h + b_{\mathrm{s}}\mathbf{X}^h(\mathbf{X}^h)^\top\mathbf{X}^h + c_{\mathrm{s}}\mathbf{X}^h\bigl((\mathbf{X}^h)^\top\mathbf{X}^h\bigr)^2\)
  13.     end for
  14.     \(\mathbf{X} \leftarrow \mathrm{Reshape}^{-1}(\{\mathbf{X}^h\}_{h=1}^{H})\)  ▷ rejoin per-head matrices
  15.     \(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)
  16. end for
  17. return \(\boldsymbol{\Theta}_t\)

Takeaway. Pion is a drop-in replacement for Muon’s NS iteration. Same control flow, same per-step cost; only the polynomial coefficients change.
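To make the "only the coefficients change" point concrete, here is a minimal NumPy sketch in which Muon and default-mode Pion share one code path and differ only in the list of coefficient triples passed in. Everything beyond the coefficients is an assumption of this example: the function names, the learning rate and momentum values, and the toy gradient (a rank-one signal plus small Gaussian noise) are illustrative and not taken from the experiments.

```python
import numpy as np

PROMOTION = (1.875, -1.25, 0.375)
SUPPRESSION = (0.0, 2.5, -1.5)
MUON = (3.4445, -4.7750, 2.0315)

def ns_step(X, coeffs):
    a, b, c = coeffs
    A = X.T @ X
    return a * X + b * (X @ A) + c * (X @ A @ A)

def shaped_update(M, schedule, eps=1e-7):
    """Frobenius pre-norm, then one NS step per coefficient triple in `schedule`."""
    X = M / (np.linalg.norm(M) + eps)
    for coeffs in schedule:
        X = ns_step(X, coeffs)
    return X

def optimizer_step(theta, M, G, schedule, lr=0.02, mu=0.95):
    """One step of Algorithm 1 or 2, depending only on `schedule` (lr, mu illustrative)."""
    M = mu * M + G
    return theta - lr * shaped_update(M, schedule), M

def pion_schedule(k_p, k=5):
    return [PROMOTION] * k_p + [SUPPRESSION] * (k - k_p)

muon_schedule = [MUON] * 5

# Toy gradient: one informative direction plus a weak noise tail.
rng = np.random.default_rng(0)
u = rng.standard_normal(32)
u /= np.linalg.norm(u)
v = rng.standard_normal(16)
v /= np.linalg.norm(v)
G = np.outer(u, v) + 0.01 * rng.standard_normal((32, 16))

theta, M = optimizer_step(np.zeros((32, 16)), np.zeros((32, 16)), G, pion_schedule(k_p=1))

sv_muon = np.linalg.svd(shaped_update(G, muon_schedule), compute_uv=False)
sv_pion = np.linalg.svd(shaped_update(G, pion_schedule(k_p=1)), compute_uv=False)
print("Muon update spectrum, min and max:", sv_muon.min(), sv_muon.max())
print("Pion update spectrum, top two:", sv_pion[:2])
```

On this toy input the Muon update has all sixteen singular values of order one, so the noise directions are lifted to the scale of the signal, whereas the Pion update is numerically rank one.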

Experiments

We evaluate Pion in two settings:

  • VLA training: two architectures, \(\ell_1\)-regression based VLA Adapter and flow-matching based VLANeXt, with LIBERO and LIBERO Plus as benchmarks.
  • RLVR post-training: GRPO and GMPO on Qwen3 1.7B and Qwen3 4B, with MATH and GSM8K as benchmarks.

VLA

VLA Adapter on LIBERO. We first compare AdamW, Muon, and Pion on VLA Adapter across the four LIBERO task suites (Object, Spatial, Goal, Long) under a fixed per-suite training budget (1,500 steps for Object, 15,000 steps for the others), together with a finer learning curve on Object.

[Figure 5 panels (legend: AdamW, Muon, Pion): (a) Success rates on LIBERO; (b) Success rate vs. steps on Object.]

Figure 5. AdamW, Muon, and Pion for VLA Adapter on LIBERO. (a) Test success rates on LIBERO Object, Spatial, Goal, and Long at a fixed training budget per suite (1,500 steps for Object, 15,000 steps for the others). (b) Test success rate vs. training steps on LIBERO Object.

Figure 5(a) shows that Pion outperforms both Muon and AdamW on every suite. Figure 5(b) zooms into the LIBERO Object learning curve: Pion reaches 95.4% success at 500 steps and saturates at 100% by 1,500 steps, while AdamW requires substantially more steps to catch up. This suggests that the spectral high pass substantially reduces the training cost needed to reach the high-success regime.

VLANeXt on LIBERO and LIBERO Plus. With flow matching, Pion not only achieves the best success rate on LIBERO but also retains its advantage on the more challenging LIBERO Plus split, with the largest gains over Muon under the language (\(+9\) pts), noise (\(+6\) pts), and robot (\(+6\) pts) perturbations; see Table 1. This is consistent with our earlier picture that uniform whitening over-amplifies noise directions that do not generalize.

Optimizer    | LIBERO | LIBERO Plus | Background | Camera | Language | Layout | Light | Noise | Robot
AdamW        | 79.45  | 64.57       | 68.97      | 70.38  | 54.50    | 61.80  | 76.35 | 66.37 | 47.04
Muon         | 93.65  | 72.34       | 82.72      | 68.00  | 77.53    | 76.21  | 86.17 | 69.98 | 57.36
Pion (Ours)  | 96.35  | 75.93       | 84.53      | 70.88  | 86.93    | 76.71  | 90.67 | 76.09 | 63.18

Table 1. AdamW, Muon, and Pion for VLANeXt on LIBERO and LIBERO Plus. Pion attains the best score in every column.

To make the LIBERO Plus gap concrete, we roll out the same LIBERO Plus episode under VLANeXt policies trained with each optimizer. AdamW and Muon fail at the grasp or placement stage, while Pion completes the task cleanly.

[Video 1 panels: (a) AdamW; (b) Muon; (c) Pion (Ours).]

Video 1. Rollouts on the same LIBERO Plus episode (ep1373) under VLANeXt policies trained with the three optimizers. Only Pion reliably completes the task; AdamW and Muon fail in the grasp or placement stage.

RLVR

Across all eight RLVR settings (GRPO/GMPO × Qwen3 1.7B/4B × MATH/GSM8K; see Figure 6), Muon collapses to near-zero accuracy without exception. Pion not only recovers a meaningful training signal but also converges faster than AdamW.

[Figure 6 panels: (a) GRPO, 1.7B, MATH; (b) GRPO, 4B, MATH; (c) GRPO, 1.7B, GSM8K; (d) GRPO, 4B, GSM8K; (e) GMPO, 1.7B, MATH; (f) GMPO, 4B, MATH; (g) GMPO, 1.7B, GSM8K; (h) GMPO, 4B, GSM8K.]

Figure 6. AdamW, Muon, and Pion on RLVR: validation accuracy vs training step across eight settings (two algorithms × two model sizes × two benchmarks).

Reverse ablation: direction of spectral shaping matters. To verify that the gains come specifically from the high-pass direction, we construct Low-pass Muon (LPMuon), which shares Pion’s NS structure and per-step cost but reverses the filtering direction to low-pass (contracting large singular values and amplifying small ones). LPMuon fails to train: its accuracy stays at the level of the initial checkpoint (Figure 7).

[Figure 7 panels: (a) Low-pass scalar map; (b) GSM8K accuracy.]

Figure 7. (a) Scalar map \(f(\sigma)\) of LPMuon. (b) GSM8K accuracy of AdamW, Pion, and LPMuon (Qwen3 1.7B, GRPO).

BibTeX

@misc{fan2026rethinkingmuonpretrainingspectral,
      title={Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR}, 
      author={Chongyu Fan and Gaowen Liu and Mingyi Hong and Ramana Rao Kompella and Sijia Liu},
      year={2026},
      eprint={2605.19282},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.19282}, 
}