CAT is a plug-in training framework that replaces random augmentation with a learned sequential differentiable adversary, improving robust visual watermark capacity by up to 63.5%.
Overview of the CAT training pipeline. The embedder writes message m into image x to produce a watermarked image. The sequential adversarial augmenter then repeatedly observes the current image, uses a recurrent controller with frozen DINOv2 features to produce logits, and selects an attack family via straight-through Gumbel-Softmax. After T steps, the final attacked image is passed to the extractor, and the message loss drives updates to both the watermark model and the adversary. Entropy regularization keeps the attack policy diverse rather than collapsing to a single destructive sequence.
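The selection step in this pipeline can be illustrated with a minimal, framework-free sketch of the forward pass of straight-through Gumbel-Softmax, plus the entropy term that the regularizer would reward. This is not the authors' implementation: the attack family names, the logits, and the temperature `tau` are assumptions chosen for illustration, and the controller and DINOv2 features are replaced by a fixed list of logits. In an autodiff framework the hard one-hot choice is used in the forward pass while gradients flow through the relaxed probabilities (PyTorch exposes this as `torch.nn.functional.gumbel_softmax` with `hard=True`).

```python
import math
import random

# Hypothetical attack families; the source does not list the exact set.
ATTACK_FAMILIES = ["identity", "value", "compression", "geometric"]


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_select(logits, tau=1.0, rng=random):
    """Forward pass of straight-through Gumbel-Softmax.

    Returns (one_hot, soft): the hard one-hot choice used going forward, and
    the relaxed probabilities an autodiff framework would differentiate.
    """
    # Gumbel(0, 1) noise is -log(-log(U)) for U drawn from (0, 1).
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    soft = softmax([l + g for l, g in zip(logits, noise)], tau)
    choice = max(range(len(soft)), key=soft.__getitem__)
    one_hot = [1.0 if i == choice else 0.0 for i in range(len(soft))]
    return one_hot, soft


def entropy(probs):
    """Shannon entropy in nats; higher means a more diverse attack policy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


rng = random.Random(0)
logits = [0.2, 1.5, -0.3, 0.9]  # stand-in for the controller's outputs
one_hot, soft = gumbel_softmax_select(logits, tau=1.0, rng=rng)
policy = softmax(logits)
print("sampled attack:", ATTACK_FAMILIES[one_hot.index(1.0)])
print("policy entropy (nats):", round(entropy(policy), 3))
```

A uniform policy over four families has entropy `log(4)`, and a policy that always picks one family has entropy 0, so adding an entropy bonus to the adversary's objective pushes it away from the collapsed case.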
(a) Single-step augmentation training
(b) Compositional augmentation training
Random augmentation destabilizes training because it allocates augmentations inefficiently, whereas the learned adversary consistently targets the model's current weaknesses.
Even a single learned attack step improves robustness over random augmentation under a fair compute-matched comparison. The largest gains appear for VideoSeal 1.0 and PixelSeal, especially on difficult geometric and combined attacks.
| Model (bits) | Identity Bit acc. ↑ | Identity Cap. ↑ | Value Bit acc. ↑ | Value Cap. ↑ | Compression Bit acc. ↑ | Compression Cap. ↑ | Geometric Bit acc. ↑ | Geometric Cap. ↑ | Combined Bit acc. ↑ | Combined Cap. ↑ | Overall Bit acc. ↑ | Overall Cap. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvisMark (100) | 0.990 | 95.99 | 0.876 | 72.90 | 0.954 | 88.45 | 0.828 | 63.25 | 0.861 | 67.84 | 0.869 | 71.01 |
| TrustMark (100) | 0.996 | 98.92 | 0.956 | 86.28 | 0.898 | 79.09 | 0.754 | 50.29 | 0.993 | 97.90 | 0.889 | 76.08 |
| MBRS (256) | 0.987 | 242.60 | 0.915 | 185.99 | 0.884 | 190.52 | 0.653 | 78.90 | 0.959 | 217.48 | 0.834 | 158.92 |
| VideoSeal 0.0 (96) | 0.997 | 94.00 | 0.984 | 88.69 | 0.980 | 86.86 | 0.945 | 75.58 | 0.994 | 92.33 | 0.961 | 81.06 |
| + CAT | 0.998 | 94.75 | 0.978 | 86.93 | 0.986 | 89.21 | 0.953 | 78.60 | 0.994 | 92.22 | 0.966 | 82.85 ↑2.2% |
| VideoSeal 1.0 (256) | 0.898 | 135.17 | 0.879 | 123.45 | 0.872 | 118.96 | 0.822 | 94.39 | 0.892 | 130.60 | 0.846 | 106.41 |
| + CAT | 0.941 | 175.63 | 0.857 | 129.47 | 0.896 | 146.11 | 0.835 | 114.75 | 0.921 | 160.39 | 0.854 | 125.57 ↑18.0% |
| PixelSeal (128) | 0.918 | 76.46 | 0.895 | 68.65 | 0.880 | 63.89 | 0.819 | 48.19 | 0.902 | 70.27 | 0.849 | 56.21 |
| + CAT | 0.986 | 117.73 | 0.956 | 102.90 | 0.957 | 102.96 | 0.900 | 83.05 | 0.973 | 109.36 | 0.925 | 91.91 ↑63.5% |
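The percentage annotations in the Overall capacity column can be reproduced as the relative change between each baseline and its CAT-trained counterpart. The short check below uses only the overall capacity values from the table; the helper name `relative_gain` is ours.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (baseline overall capacity, + CAT overall capacity), copied from the table.
overall_capacity = {
    "VideoSeal 0.0": (81.06, 82.85),
    "VideoSeal 1.0": (106.41, 125.57),
    "PixelSeal": (56.21, 91.91),
}

for model, (base, cat) in overall_capacity.items():
    print(f"{model}: +{relative_gain(base, cat):.1f}%")
# VideoSeal 0.0: +2.2%
# VideoSeal 1.0: +18.0%
# PixelSeal: +63.5%
```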
The advantage of CAT becomes clearest in the compositional setting, where the adversary applies a two-step attack sequence and must model both attack identity and order. Gains are concentrated on harder mixed and repeated attack pairs.
| Model (bits) | Val+Val Bit acc. ↑ | Val+Val Cap. ↑ | Val+Comp Bit acc. ↑ | Val+Comp Cap. ↑ | Val+Geom Bit acc. ↑ | Val+Geom Cap. ↑ | Comp+Comp Bit acc. ↑ | Comp+Comp Cap. ↑ | Comp+Geom Bit acc. ↑ | Comp+Geom Cap. ↑ | Geom+Geom Bit acc. ↑ | Geom+Geom Cap. ↑ | Overall Bit acc. ↑ | Overall Cap. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvisMark (100) | 0.898 | 71.44 | 0.652 | 27.15 | 0.875 | 68.63 | 0.474 | 0.85 | 0.603 | 21.75 | 0.888 | 70.52 | 0.813 | 57.92 |
| TrustMark (100) | 0.957 | 86.21 | 0.965 | 88.49 | 0.786 | 52.86 | 0.974 | 90.92 | 0.779 | 50.97 | 0.708 | 36.58 | 0.834 | 62.14 |
| MBRS (256) | 0.917 | 182.87 | 0.836 | 130.67 | 0.602 | 41.86 | 0.774 | 90.86 | 0.554 | 15.81 | 0.495 | 0.98 | 0.653 | 60.74 |
| VideoSeal 0.0 (96) | 0.972 | 83.44 | 0.985 | 87.84 | 0.961 | 78.08 | 0.992 | 91.40 | 0.974 | 82.79 | 0.930 | 72.83 | 0.961 | 77.79 |
| + CAT | 0.988 | 90.30 | 0.995 | 93.28 | 0.979 | 86.26 | 0.999 | 95.34 | 0.990 | 90.79 | 0.935 | 78.05 | 0.978 | 86.03 ↑10.6% |
| VideoSeal 1.0 (256) | 0.875 | 121.35 | 0.887 | 127.69 | 0.827 | 95.41 | 0.897 | 134.39 | 0.858 | 109.22 | 0.799 | 86.78 | 0.829 | 96.13 |
| + CAT | 0.840 | 121.57 | 0.891 | 145.41 | 0.826 | 109.69 | 0.938 | 172.97 | 0.894 | 141.66 | 0.825 | 111.74 | 0.827 | 108.30 ↑12.7% |
| PixelSeal (128) | 0.964 | 106.45 | 0.977 | 112.06 | 0.954 | 100.25 | 0.987 | 117.75 | 0.968 | 106.43 | 0.922 | 93.04 | 0.952 | 98.76 |
| + CAT | 0.974 | 113.81 | 0.990 | 121.27 | 0.964 | 107.65 | 0.998 | 126.29 | 0.982 | 116.19 | 0.928 | 100.21 | 0.965 | 107.73 ↑9.1% |
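To see why the compositional setting asks the adversary to model order as well as identity, the sketch below enumerates two-step sequences over three attack families and shows two toy attacks that do not commute. The family abbreviations follow the table columns. Everything else is an assumption for illustration: the toy `brighten` and `halve` functions act on a single pixel value rather than an image, and whether each table column pools both orderings of a pair is not stated in the source.

```python
from itertools import combinations_with_replacement, product

FAMILIES = ["Val", "Comp", "Geom"]

# Ordered two-step sequences the adversary can choose from.
ordered = list(product(FAMILIES, repeat=2))
# Unordered family pairs, in the same order as the table columns.
unordered = ["+".join(p) for p in combinations_with_replacement(FAMILIES, 2)]

print(len(ordered))  # 9 ordered sequences
print(unordered)     # 6 unordered pairs


# Toy attacks on one pixel value in [0, 1] (hypothetical, not the real attacks).
def brighten(x):
    return min(1.0, x + 0.5)  # saturates at 1.0


def halve(x):
    return 0.5 * x


pixel = 0.8
print(round(halve(brighten(pixel)), 3))  # brighten first, then halve
print(round(brighten(halve(pixel)), 3))  # halve first, then brighten
```

Because the first ordering saturates before scaling, the two results differ, so a policy that only chose which families to apply, without their order, could not distinguish these two attacks.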
(a) PixelSeal
(b) VideoSeal
CAT substantially accelerates convergence for both PixelSeal and VideoSeal, reaching lower validation bit error earlier than random augmentation. This advantage persists from single-step to compositional training.
CAT preserves visual quality while improving robustness. Relative to their compute-matched random-augmentation baselines, CAT-trained models lose at most 0.58 dB PSNR, change SSIM and MS-SSIM by at most 0.0004, and match or improve LPIPS to within 0.0001.
| Model | SA-1B PSNR ↑ | SA-1B SSIM ↑ | SA-1B MS-SSIM ↑ | SA-1B LPIPS ↓ | DIV2K PSNR ↑ | DIV2K SSIM ↑ | DIV2K MS-SSIM ↑ | DIV2K LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|
| InvisMark | 48.77 | 0.9955 | 0.9964 | 0.0018 | 49.11 | 0.9943 | 0.9960 | 0.0016 |
| TrustMark | 41.37 | 0.9943 | 0.9917 | 0.0029 | 41.19 | 0.9935 | 0.9919 | 0.0027 |
| MBRS | 45.58 | 0.9959 | 0.9965 | 0.0032 | 45.22 | 0.9954 | 0.9966 | 0.0034 |
| VideoSeal 0.0 | 42.50 | 0.9934 | 0.9949 | 0.0049 | 42.11 | 0.9910 | 0.9944 | 0.0057 |
| + CAT | 42.21 | 0.9935 | 0.9953 | 0.0040 | 41.81 | 0.9911 | 0.9947 | 0.0046 |
| VideoSeal 1.0 | 42.58 | 0.9936 | 0.9950 | 0.0046 | 42.19 | 0.9913 | 0.9945 | 0.0053 |
| + CAT | 42.17 | 0.9934 | 0.9951 | 0.0039 | 41.75 | 0.9909 | 0.9946 | 0.0045 |
| PixelSeal | 43.22 | 0.9958 | 0.9965 | 0.0021 | 42.71 | 0.9940 | 0.9961 | 0.0023 |
| + CAT | 42.64 | 0.9956 | 0.9963 | 0.0021 | 42.17 | 0.9937 | 0.9958 | 0.0024 |
We evaluate CAT in the autoregressive image-generation setting using the WMAR framework on Taming and RAR-XL generators. Robustness is measured via TPR@FPR=1% under no attack, value perturbations, geometric perturbations, adversarial purification, and neural compression.
| Method | None | Value | Geometric | Adv. Purif. | Neural Comp. |
|---|---|---|---|---|---|
| Finetune | 1.00 | 0.26 | 0.01 | 0.69 | 0.71 |
| Random Aug. | 1.00 | 0.94 | 0.38 | 0.92 | 0.90 |
| Random Aug.+Sync | 0.99 | 0.90 | 0.74 | 0.92 | 0.89 |
| CAT | 1.00 | 0.94 | 0.52 | 0.89 | 0.87 |
| CAT+Sync | 0.99 | 0.89 | 0.71 | 0.89 | 0.87 |
| Method | None | Value | Geometric | Adv. Purif. | Neural Comp. |
|---|---|---|---|---|---|
| Finetune | 1.00 | 0.53 | 0.04 | 0.63 | 0.77 |
| Random Aug. | 0.99 | 0.97 | 0.25 | 1.00 | 1.00 |
| Random Aug.+Sync | 0.99 | 0.95 | 0.38 | 1.00 | 1.00 |
| CAT | 1.00 | 0.92 | 0.35 | 0.99 | 0.98 |
| CAT+Sync | 1.00 | 0.86 | 0.72 | 0.99 | 0.97 |
TPR@FPR=1% is the true positive rate at a 1% false positive rate; higher is better. Values below 0.50 indicate failed detection.
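As a reminder of how a TPR at a fixed FPR is typically obtained, the sketch below picks a detection threshold from scores on non-watermarked images so that the empirical false positive rate does not exceed the target, then measures the fraction of watermarked images above that threshold. The scores are synthetic and the function name `tpr_at_fpr` is ours; the WMAR evaluation may calibrate its threshold differently (for example from a theoretical null distribution), so treat this as an illustration of the metric rather than of the evaluation code.

```python
import math


def tpr_at_fpr(negative_scores, positive_scores, target_fpr=0.01):
    """Return (tpr, fpr, threshold) with empirical FPR <= target_fpr.

    An image counts as detected when its score is strictly above the threshold.
    """
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    neg = sorted(negative_scores, reverse=True)
    # Number of negatives allowed to score above the threshold.
    allowed = math.floor(target_fpr * len(neg) + 1e-9)
    threshold = neg[allowed]
    fpr = sum(s > threshold for s in neg) / len(neg)
    tpr = sum(s > threshold for s in positive_scores) / len(positive_scores)
    return tpr, fpr, threshold


# Synthetic detection scores: 200 non-watermarked, 100 watermarked images.
negatives = [i / 200 for i in range(200)]
positives = [0.5 + i / 100 for i in range(100)]

tpr, fpr, threshold = tpr_at_fpr(negatives, positives, target_fpr=0.01)
print(f"threshold={threshold:.3f} FPR={fpr:.2f} TPR={tpr:.2f}")
# threshold=0.985 FPR=0.01 TPR=0.51
```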
(a) Taming: continuous attack sweeps
(b) RAR-XL: continuous attack sweeps
(c) Taming: ROC curves
(d) RAR-XL: ROC curves
Watermarked images under single-step and compositional attacks. In the examples shown, CAT (Ours) recovers more bits than random augmentation while keeping the watermark imperceptible.
@article{satheesh2026cat,
author = {Satheesh, Anirudh and Panaitescu-Liess, Michael-Andrei and Xu, Andrew and Milis, Georgios and Huang, Heng and Cai, Zikui and Huang, Furong},
title = {Compositional Adversarial Training for Robust Visual Watermarking},
year = {2026},
}