arm_assembly_reference - ∀m ∈ People (Owns(m, vault) <-> m = stackpointer)

✝️ Matthew 6:1-4 “Beware of practicing your righteousness before other people in order to be seen by them, for then you will have no reward from your Father who is in heaven. Thus, when you give to the needy, sound no trumpet before you, as the hypocrites do in the synagogues and in the streets, that they may be praised by others. Truly, I say to you, they have received their reward. But when you give to the needy, do not let your left hand know what your right hand is doing, so that your giving may be in secret. And your Father who sees in secret will reward you.”

## IMPORTANT NOTE

This document is NOT my work, but I am putting it here so people can learn ARM assembly. However, the cover image is my work!

![[canvas.jpeg]]

# The Comprehensive ARM Assembly Reference

> Targeting **AArch64 (ARMv8-A / ARMv9-A)** with AArch32 notes where relevant.
> A reference with explanations — every instruction tells you what it does and why.

**Scope & trust boundaries.** This document is a teaching-grade reference, not a full architectural specification. It mixes several categories of "truth" that do not all have the same authority:

- **ISA (base architecture):** Instruction encodings, opcode fields, flag effects, and pseudocode — these are from the ARM Architecture Reference Manual (DDI 0487) and are the most authoritative layer.
- **Optional extensions (FEAT_*):** Features like FEAT_LSE, FEAT_PAuth, FEAT_BTI, FEAT_MTE, FEAT_SVE, FEAT_SVE2 are ISA-level but are not guaranteed on every implementation — check ID registers or use runtime feature detection before relying on them.
- **ABI (AAPCS64):** Register conventions (X0–X7 args, X8 indirect result, X18 platform, X19–X28 callee-saved, X29 FP, X30 LR) and stack alignment expectations come from the AAPCS64 procedure call standard, not the ISA. These are *convention*, not hardware-enforced.
  Note that the generic AAPCS64 defines **no red zone**; red-zone-like policies are platform-specific and the policies differ in kind, not just in size:
  - **Apple Darwin**: 128 bytes below SP is a conventional red zone — leaf functions may freely use it as scratch space without adjusting SP.
  - **Windows**: 16 bytes below SP is **not** a general-purpose leaf-function red zone. Per Microsoft's ARM64 ABI docs, it is reserved for analysis and dynamic-patching scenarios (intrusive profilers and code instrumentation that inject `stp <r1>, <r2>, [sp, #-16]` sequences at runtime). Regular code should not treat it as free scratch space.
  - **Linux, most bare-metal**: no red zone of any kind. Adjust SP before storing.
- **Platform defaults (Linux / Apple / bare-metal):** SCTLR bits, SP alignment checks, syscall numbers, and memory layout details vary by OS/platform. Where this doc says "Linux default" or "Apple," treat it as an OS convention, not an architectural guarantee.
- **Toolchain/assembler behavior:** Some things are how GAS/LLVM assemble and disassemble (e.g., `LSL` vs `UXTX` spelling, MOV aliasing, implicit zero immediates) — useful for reading output, but not the hardware's view.

When in doubt, the ARM ARM (DDI 0487, latest revision) is authoritative. This document is meant to make that material navigable and concrete, not to replace it. Claims of "all valid forms" in this document should be read as "all forms this document attempts to list" — niche encodings and experimental extensions may not be covered.

---

## Table of Contents

0. [Syntax Notation Used in This Document](<#0. Syntax Notation Used in This Document>)
1. [Registers](<#1. Registers>)
2. [Instruction Encoding Basics](<#2. Instruction Encoding Basics>)
3. [The S Suffix & Condition Flags](<#3. The S Suffix & Condition Flags>)
4. [Condition Codes](<#4. Condition Codes>)
5. [Data Processing — Arithmetic](<#5. Data Processing — Arithmetic>)
6. [Data Processing — Logical](<#6. Data Processing — Logical>)
7. 
[Shift & Rotate Operations](<#7. Shift & Rotate Operations>) 8. [Shifted Register & Extended Register Forms](<#8. Shifted Register & Extended Register Forms>) 9. [Move Instructions & Aliases](<#9. Move Instructions & Aliases>) 10. [Comparison & Test Instructions](<#10. Comparison & Test Instructions>) 11. [Multiply & Divide](<#11. Multiply & Divide>) 12. [Sign Extension & Zero Extension](<#12. Sign Extension & Zero Extension>) 13. [Bitfield Operations (BFM family)](<#13. Bitfield Operations (BFM family)>) 14. [Bit Manipulation Instructions](<#14. Bit Manipulation Instructions>) 15. [Load & Store Instructions](<#15. Load & Store Instructions>) 16. [Load/Store Pair, Non-Temporal & Exclusive](<#16. Load/Store Pair, Non-Temporal & Exclusive>) 17. [Branching & Control Flow](<#17. Branching & Control Flow>) 18. [Conditional Select & Increment](<#18. Conditional Select & Increment>) 19. [System Registers & Special Instructions](<#19. System Registers & Special Instructions>) 20. [Overflow, Underflow & Carry](<#20. Overflow, Underflow & Carry>) 21. [Exceptions, Interrupts & Exception Levels](<#21. Exceptions, Interrupts & Exception Levels>) 22. [Floating Point (SIMD/FP)](<#22. Floating Point (SIMD/FP)>) 23. [NEON / Advanced SIMD Overview](<#23. NEON / Advanced SIMD Overview>) 24. [Atomic & Synchronization Instructions](<#24. Atomic & Synchronization Instructions>) 25. [Memory Barriers & Ordering](<#25. Memory Barriers & Ordering>) 26. [Pseudo-instructions & Assembler Directives](<#26. Pseudo-instructions & Assembler Directives>) 27. [Instruction Aliases — The Master Table](<#27. Instruction Aliases — The Master Table>) 28. [AArch32 (ARM/Thumb) Key Differences](<#28. AArch32 (ARM/Thumb) Key Differences>) 29. [Calling Convention (AAPCS64)](<#29. Calling Convention (AAPCS64)>) 30. [Common Patterns & Idioms](<#30. Common Patterns & Idioms>) 31. [Pointer Authentication (PAC)](<#31. Pointer Authentication (PAC)>) 32. [Branch Target Identification (BTI)](<#32. 
Branch Target Identification (BTI)>) 33. [Scalable Vector Extension (SVE / SVE2)](<#33. Scalable Vector Extension (SVE / SVE2)>) 34. [Memory Tagging Extension (MTE)](<#34. Memory Tagging Extension (MTE)>) 35. [Rules, Gotchas & Pitfalls](<#35. Rules, Gotchas & Pitfalls>) 36. [Quick Reference Cheat Sheet](<#36. Quick Reference Cheat Sheet>) --- ## 0. Syntax Notation Used in This Document Every instruction in this document shows **all forms covered by this document** with explicit operand constraints (subject to the scope note at the top — niche encodings, optional extensions this document doesn't cover, and assembler-specific aliases may exist). Here is how to read the notation: **Register naming convention** — the letter after `X` or `W` indicates the operand's role in the instruction (its position/purpose), not which register it is. The same letter has the same meaning everywhere in this document: | Letter | Role | | ---------- | ----------------------------------------------------------------------------- | | `d` | Destination register | | `n` | First source register | | `m` | Second source register | | `a` | Accumulator / addend (e.g., MADD, MSUB, FMADD) | | `s` | Status / source register for atomics (e.g., CAS, SWP, exclusive store status) | | `t` | Transfer register — value being loaded or stored | | `t1`, `t2` | Pair transfer registers (LDP/STP, CASP, LDAP1/STL1, LSE128 atomics) | So `Xd` is the 64-bit destination register, `Xn` is the 64-bit first source, `Xt` is the 64-bit transferred value, etc. `Wd`/`Wn`/`Wt`/etc. are the 32-bit equivalents. **Register-31 encoding** — register encoding 31 in any 5-bit register field can mean either SP/WSP (stack pointer) or XZR/WZR (zero register), depending on the instruction's encoding. 
The qualifier after the `|` says which one applies in this position: | Notation | Meaning | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | <code>Xd|XZR</code>, <code>Xn|XZR</code>, <code>Xm|XZR</code>, <code>Xa|XZR</code>, <code>Xs|XZR</code>, <code>Xt|XZR</code>, <code>Xt1|XZR</code>, <code>Xt2|XZR</code> | 64-bit GPR field where encoding 31 = XZR (zero register). Any of X0–X30, or XZR. | | <code>Xd|SP</code>, <code>Xn|SP</code>, <code>Xm|SP</code>, <code>Xt|SP</code> | 64-bit GPR field where encoding 31 = SP (stack pointer). Any of X0–X30, or SP. | | <code>Wd|WZR</code>, <code>Wn|WZR</code>, <code>Wm|WZR</code>, <code>Wa|WZR</code>, <code>Ws|WZR</code>, <code>Wt|WZR</code>, <code>Wt1|WZR</code>, <code>Wt2|WZR</code> | 32-bit form where encoding 31 = WZR. Upper 32 bits of the corresponding Xd are zeroed on write. | | <code>Wd|WSP</code>, <code>Wn|WSP</code> | 32-bit form where encoding 31 = WSP (32-bit view of SP). | | `Xd` / `Xt` / `Xs` (bare, no qualifier) | **Real register required** — any of X0–X30, but register 31 is NOT accepted. 
This appears when either (a) the encoding genuinely forbids encoding 31 (no such case exists for basic data-processing, but pair/writeback combinations can require distinct real registers), or (b) encoding 31 is accepted by the raw instruction but using it silently destroys the instruction's useful semantics — e.g. the **atomic** acquire-variants (LSE `LD<op>A`/`SWPA`/`CAS*A`, LSE128) where an XZR destination drops the acquire ordering per the corrected *Barrier-ordered-before* rule (DDI 0487 M.b); plain `LDAR`/`LDAXR`/`LDAPR` retain acquire even with XZR as of M.b erratum D24800, but loading into XZR still yields no usable value, so this doc shows a real register; MOPS writeback operands where XZR would discard the updated pointer/count; CASP pair registers where the even-numbered constraint makes X31 unusable anyway. See §24.2 for the full acquire-XZR discussion. | **Why position matters**: register encoding 31 in the same instruction can mean SP in one operand position and XZR in another. For example, in `ADDS Xd|XZR, Xn|SP, #imm` (the flag-setting immediate-add form), encoding 31 means XZR if it's in the Rd position but SP if it's in the Rn position. This is what makes `CMP SP, #imm` work: it's `SUBS XZR, SP, #imm` — encoding 31 takes its meaning from each operand's qualifier independently. The doc spells this out per-instruction so you don't have to remember the per-form rules; just read the qualifier for each position. **FP/SIMD registers** — these never have the SP/XZR ambiguity. Register 31 is simply V31/Q31/D31/S31/H31/B31, with no special meaning. | Notation | Meaning | | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `Bd`, `Bn`, `Bm`, `Bt` | 8-bit FP/SIMD register view (low 8 bits of Vn). B0–B31. | | `Hd`, `Hn`, `Hm`, `Ht` | 16-bit FP/SIMD register view (low 16 bits of Vn). H0–H31. 
Requires FEAT_FP16 for arithmetic. | | `Sd`, `Sn`, `Sm`, `St` | 32-bit single-precision FP register (low 32 bits of Vn). S0–S31. | | `Dd`, `Dn`, `Dm`, `Dt` | 64-bit double-precision FP register (low 64 bits of Vn). D0–D31. | | `Qt`, `Qd`, `Qn` | 128-bit SIMD register (full Vn). Q0–Q31. Used in 128-bit FP/vector loads/stores and pair operations. | | `Vn.4S`, `V0.16B`, etc. | NEON vector register with arrangement specifier: `.4S` = 4 lanes of 32-bit, `.16B` = 16 lanes of 8-bit, `.8H` = 8 lanes of 16-bit, `.2D` = 2 lanes of 64-bit, `.4H`, `.8B`, `.2S`, `.1D`. | | `Zn`, `Zd`, `Zm` | SVE scalable vector register (Z0–Z31). Width depends on implementation VL (128–2048 bits). | | `Pn`, `Pd`, `Pm`, `Pg` | SVE predicate register (P0–P15, with P0–P7 also usable as governing predicates). Pg specifically means "governing predicate" in SVE syntax. | **Structural notation** — how to read the meta-syntax: | Notation | Meaning | |---|---| | `{...}` | **Optional.** Everything inside braces can be omitted. For example, `{, LSL #12}` means the `, LSL #12` part is optional — if omitted, no shift is applied. **Encoding note**: "optional" is a syntactic convenience, not a separate opcode — the hardware encoding always has the field, and the assembler sets it to the default value (usually zero) when you omit it. | | <code>A|B|C</code> | **Choose one.** Exactly one of the listed options. For example, <code>LSL|LSR|ASR</code> means you must pick one of those three shifts. | | `[Xn, ...]` | Memory addressing — the brackets denote dereference (load/store from the computed address). Not a register, but a memory operand. | | `[Xn], #imm` / `[Xn, #imm]!` | Post-index / pre-index addressing. See §15 for the addressing-mode taxonomy. | | `<token>` | A placeholder that names a class of operands — the doc spells out the choices either inline or in §-references. Common ones documented below. 
| **Immediate tokens** — `#`-prefixed values come in several encoding-shaped flavors: | Notation | Meaning | | ---------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `#0-N` | Numeric range, inclusive. `#0-63` means any integer from 0 to 63. | | `#imm` | Generic immediate; range depends on instruction (stated in the comment or context). | | `#imm5` | A 5-bit unsigned immediate (0–31). Used in CCMP/CCMN, lane indices. | | `#imm6` | A 6-bit unsigned immediate (0–63). Used in CB.cond immediate forms (FEAT_CMPBR), shift amounts. | | `#imm7` | A 7-bit unsigned immediate (0–127). Used in HINT mnemonic (CRm:op2 packed). | | `#imm8` | An 8-bit immediate (signed or unsigned per context). | | `#imm12` | A 12-bit unsigned immediate (0–4095). Used in ADD/SUB immediate. | | `#imm16` | A 16-bit unsigned immediate (0–65535). Used in SVC, BRK, HVC, SMC, MOVZ, MOVK, MOVN, UDF. | | `#simm` | A signed immediate; range depends on context (stated in the comment). | | `#simm8` | An 8-bit signed immediate (−128 to +127). Used in CSSC SMAX/SMIN immediate. | | `#simm9` | A 9-bit signed immediate (−256 to +255). Used in unscaled load/store offsets (LDUR/STUR family). | | `#simm10` | A 10-bit signed immediate; for LDRAA/LDRAB, the range is −4096 to +4088 in steps of 8 (encoded as imm9 × 8). | | `#uimm4` | A 4-bit unsigned immediate (0–15). Used in ADDG (MTE) tag offset. | | `#uimm6` | A 6-bit unsigned immediate. In ADDG (MTE) it scales by 16, giving range 0–1008. | | `#uimm8` | An 8-bit unsigned immediate (0–255). Used in CSSC UMAX/UMIN immediate. | | `#pimm` | A positive (unsigned) **scaled** immediate. Range depends on access size and is stated in the comment (e.g., LDR Xt = 0–32760 step 8; LDRB = 0–4095 step 1). 
| | `#nzcv` | A 4-bit flag value (0–15) specifying NZCV flags: bit 3=N, bit 2=Z, bit 1=C, bit 0=V. Used in CCMP/CCMN. | | `#bitmask_imm` | A bitmask immediate encoded as a repeating rotated bit pattern (see §6.5 for the full encoding rules). Not all 64-bit constants are encodable. | | `#fimm` | An 8-bit encoded FP immediate. Can represent exactly 256 values of the form `±(1 + m/16) × 2^n` (see §22.6). NOT arbitrary floats. | | `#fbits` | Number of fractional bits for fixed-point conversion (1–64). Used in FCVTZS/FCVTZU/SCVTF/UCVTF fixed-point variants. | | `#rot` | A rotation amount. For complex FP (FCMLA): one of `0`, `90`, `180`, `270` degrees. For RORV/extr: 0 to (datasize−1). | | `#immr`, `#imms` | Bitfield immediate fields for UBFM/SBFM/BFM (§13). 6-bit each (0–63 for X-form, 0–31 for W-form). | | `#lsb`, `#width` | Synthetic immediates used in BFI/BFXIL/UBFX/SBFX aliases — the assembler converts these to the underlying immr/imms. | | `#k`, `#n`, `#s`, `#w`, `#l` | Symbolic/example immediates used in inline pseudocode and explanatory text. Range stated in context. | | `#mask`, `#offset`, `#bit`, `#amount`, `#shift`, `#large_const`, `#large_value`, `#bitmask`, `#adjusted`, `#offset_to_literal`, `#op1`, `#op2` | Symbolic immediates used in inline examples and explanatory text. Range stated in context. | **Bracketed `<token>` placeholders** — these stand for a class of mnemonics: | Notation | Meaning | | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `<cond>` / `<cc>` | A condition code from §4: `EQ`, `NE`, `LT`, `GE`, `GT`, `LE`, `HI`, `LS`, `HS` (= `CS`), `LO` (= `CC`), `MI`, `PL`, `VS`, `VC`, `AL`. Tests the current NZCV flags. `<cc>` is the same set restricted to the 10 CB.cond-supported conditions in §17.6. 
| | `<shift>` | A shift mnemonic. Subset is instruction-specific: arithmetic ops accept <code>LSL|LSR|ASR</code>; logical ops also accept `ROR`. | | `<extend>` | An extension mnemonic for extended-register forms: `UXTB`, `UXTH`, `UXTW`, `UXTX`, `SXTB`, `SXTH`, `SXTW`, `SXTX` (see §8.2). | | `<amount>` | An optional shift or extension amount in bits. Range is per-form: shifted register 0–63 (X) or 0–31 (W); extended register 0–4. | | `<R><m>` | A register operand whose width (W or X) is determined by the encoded `<extend>` option. Used in extended-register forms — register 31 here means XZR/WZR, never SP. | | `<m>`, `<n>` | Inline placeholders in explanatory text — usually a numeric value referenced from surrounding context. | | `<T>`, `<Ta>`, `<Tb>` | NEON arrangement specifier (`.8B`/`.16B`/`.4H`/`.8H`/`.2S`/`.4S`/`.1D`/`.2D` — instruction-specific subset). | | `<V>` | SIMD&FP register width specifier (B/H/S/D/Q for 8/16/32/64/128-bit access). | | `<Vd>`, `<Vn>`, `<Vm>` | SIMD&FP register at vector or scalar width determined by `<V>` or `<T>`. | | `<targets>` | The BTI target-set specifier: `c`, `j`, `jc`, or absent. See §32. | | `<label>` | A branch target label. The assembler computes a PC-relative offset and encodes it in the instruction. Not a register — it's an immediate address offset. | | `<imm>`, `<simm>` | Generic immediate placeholders in ARM ARM-style syntax — equivalent to `#imm` / `#simm` above. | | `<sysreg>` | A system register name (e.g., `TPIDR_EL0`, `NZCV`, `FPCR`, `FPSR`, `MIDR_EL1`). Used in MRS/MSR. See §19. | | `<pstatefield>` | A PSTATE field name accepted by `MSR <pstatefield>, #imm`: `DAIFSet`, `DAIFClr`, `SPSel`, `PAN`, `UAO`, `DIT`, `SSBS`, `TCO`, `ALLINT`, `SVCRSM`, `SVCRZA`, `SVCRSMZA`. See §19.1. | | `<prfop>` | A prefetch operation from a fixed set: `PLDL1KEEP`, `PLDL1STRM`, `PLDL2KEEP`, `PSTL1STRM`, etc. (see §15.7 for the full list). 
| | `<op>`, `<option>` | A barrier domain or memory-ordering option: `ISH`, `ISHST`, `ISHLD`, `OSH`, `OSHST`, `OSHLD`, `NSH`, `NSHST`, `NSHLD`, `SY`, `LD`, `ST` (see §25.1). |

**Other inline-only placeholders** — these appear inside example code blocks and represent user-supplied identifiers, not encoded operands: `<my_function>`, `<other_function>`, `<r1>`, `<r2>`. Treat them as "fill in your own name here."

**"These are ALL the forms covered by this document"**: Each syntax block below shows every encoding of that instruction that this reference covers. Extensions and niche forms flagged in the scope note at the top (e.g. SVE/SVE2/SME instructions in extension-specific sections, uncommon assembler aliases) may not appear in the base-instruction syntax blocks. For example, ADD has three separate encoding classes (shifted register, immediate, extended register) — each is listed in its own subsection with its operand combinations.

FP/SIMD registers (Sd/Dd/Hd) do NOT have the SP/XZR ambiguity — register 31 in the FP register file is simply register 31 (V31/D31/S31), with no special meaning.

## 1. Registers

AArch64 has 31 general-purpose registers (X0–X30), a stack pointer (SP, with 32-bit view WSP), a zero register (XZR, with 32-bit view WZR), a program counter (PC), and 32 SIMD/FP registers (V0–V31). GPRs can be accessed as 64-bit (X) or 32-bit (W). SIMD/FP registers have multiple views: 8-bit (B), 16-bit (H), 32-bit (S), 64-bit (D), and 128-bit (Q/V).

### 1.1 General-Purpose Registers

AArch64 has 31 general-purpose registers, each 64 bits wide.

**Why 31 registers (not 32)?** The instruction encoding uses 5 bits for each register field, which can encode 32 values (0–31). But ARM uses register number 31 for two different things depending on context: it's either `SP` (stack pointer) or `XZR` (zero register). This dual use means encodings 0–30 name the 31 general-purpose registers X0–X30, while encoding 31 is shared by the two special ones: effectively 31 GPRs plus SP and XZR.
The zero register is extremely useful (it eliminates many instructions that x86 needs, like `XOR reg, reg` to clear a register), and having SP accessible in the same encoding space means load/store instructions can use SP as a base without special opcodes.

**Why condition flags instead of condition registers?** Some architectures (like PowerPC) use condition registers instead of flags. ARM uses flags because they're simpler and more compact — one set of 4 bits shared by all instructions, vs multiple condition register fields that need extra encoding bits. The downside is that flags are a single shared resource, so instructions must be carefully ordered to avoid clobbering flags before they're read.

**Caller-saved** means the function you call is free to overwrite these registers — if you need the value after the call, you must save it yourself (the "caller" saves). **Callee-saved** means the function you call must preserve these registers — if it uses them, it saves and restores them (the "callee" saves).

| 64-bit name | 32-bit name | Notes |
|-------------|-------------|-------|
| `X0`–`X7` | `W0`–`W7` | Arguments / results (caller-saved) |
| `X8` | `W8` | Indirect result location (caller-saved; when a function returns a large struct that doesn't fit in X0/X1, the caller passes a pointer in X8 to where the struct should be written) |
| `X9`–`X15` | `W9`–`W15` | Temporary / scratch (caller-saved) |
| `X16` (`IP0`) | `W16` | Intra-procedure-call scratch (used by the linker for PLT stubs — trampolines that redirect calls to shared library functions) |
| `X17` (`IP1`) | `W17` | Intra-procedure-call scratch (same as X16 — the linker may clobber these between your BL and the actual function entry) |
| `X18` | `W18` | Platform register (reserved on some OSes). Has no standard assembler alias like `IP0`/`IP1`/`FP`/`LR` — refer to it as `X18`/`W18`. |
| `X19`–`X28` | `W19`–`W28` | Callee-saved registers |
| `X29` (`FP`) | `W29` | Frame pointer (callee-saved) |
| `X30` (`LR`) | `W30` | Link register (return address) |

**Critical rule**: Writing to a `Wn` register **zeroes the upper 32 bits** of the corresponding `Xn`. This is not sign-extension — it is always zero-extension.

**Why zero the upper 32?** Without this rule, the upper 32 bits would contain stale data from whatever previously used the X register. Code would need explicit zero-extension after every 32-bit operation, wasting instructions. By making the hardware always zero the upper half, 32-bit operations "just work" — the 64-bit register always holds the correct zero-extended 32-bit result. This also eliminates a class of security bugs where stale upper bits leak information between contexts.

```asm
MOV W0, #-1    // W0 = 0xFFFFFFFF, X0 = 0x00000000FFFFFFFF (upper zeroed)
MOV X0, #-1    // X0 = 0xFFFFFFFFFFFFFFFF
```

### 1.2 Special Registers

| Register | Description |
|---|---|
| `SP` | Stack pointer (64-bit). Not a GPR — only usable as an operand by specific instruction forms: ADD/SUB immediate, ADD/SUB extended register, logical immediate (AND/ORR/EOR with bitmask immediate can write to SP), and LDR/STR addressing. NOT usable in shifted-register data processing. Must be 16-byte aligned whenever it is used as the base of a memory access while the stack-alignment check is enabled (`SCTLR_EL1.SA0` for EL0, `SCTLR_ELx.SA` for the current EL; Linux enables SA0 by default). Uses register encoding 31, same as XZR — the instruction opcode determines which one register 31 means. Note: there is NO `XSP` in the ARM architecture — the 64-bit stack pointer is just called `SP`. The 32-bit view, used when a W-width instruction needs the stack pointer, is called `WSP`. |
| `XZR` / `WZR` | Zero register. Reads as zero, writes are discarded. Encoded as register 31, same as SP; each operand slot in the instruction chooses XZR or SP based on the encoding. Different slots of the **same** instruction can independently resolve register 31 to SP or XZR (e.g. <code>ADDS Xd&#124;XZR, Xn&#124;SP, #imm12</code> uses XZR in the Rd slot and SP in the Rn slot simultaneously) — a single slot, however, can only mean one of them. |
| `PC` | Program counter. Not directly accessible as a GPR in AArch64 (unlike AArch32). Its value can only be obtained indirectly: `ADR`/`ADRP` compute PC-relative addresses, and `BL`/`BLR` write the return address (PC + 4) to X30. |
| `NZCV` | Condition flags (in `PSTATE`): Negative, Zero, Carry, oVerflow. |
| `FPCR` | Floating-point control register. |
| `FPSR` | Floating-point status register. |
| `DAIF` | Interrupt mask bits (Debug, SError, IRQ, FIQ). |
| `CurrentEL` | Current exception level. |
| `SPSel` | Stack pointer selection (EL0 vs ELx SP). |

**Which instructions use SP vs XZR for register 31?** This is one of the most confusing aspects of AArch64.
Here is the complete rule:

| Encoding class | Rd (reg 31) | Rn (reg 31) | Rm (reg 31) |
|---|---|---|---|
| ADD/SUB **shifted register** | XZR | XZR | XZR |
| ADD/SUB **immediate** | SP | SP | — |
| ADDS/SUBS **immediate** | XZR | SP | — |
| ADD/SUB **extended register** | SP | SP | XZR |
| ADDS/SUBS **extended register** | XZR | SP | XZR |
| AND/ORR/EOR **immediate** (no S) | SP | XZR | — |
| ANDS **immediate** | XZR | XZR | — |
| Logical **shifted register** (all) | XZR | XZR | XZR |
| ADC/SBC/ADCS/SBCS | XZR | XZR | XZR |
| MUL/MADD/MSUB/SDIV/UDIV | XZR | XZR | XZR |
| SMULH/UMULH/SMULL/UMULL etc. | XZR | XZR | XZR |
| BFM/UBFM/SBFM | XZR | XZR | — |
| EXTR | XZR | XZR | XZR |
| CLZ/CLS/RBIT/REV/REV16/REV32 | XZR | XZR | — |
| CSEL/CSINC/CSINV/CSNEG | XZR | XZR | XZR |
| CCMP/CCMN | — | XZR | XZR |
| MOVZ/MOVK/MOVN | XZR | — | — |
| ADR/ADRP | XZR | — | — |
| Loads/Stores (base Xn) | — | SP | — |
| Loads/Stores (data Xt/Wt) | XZR | — | — |
| BR/BLR/RET (target Xn) | — | XZR | — |
| CBZ/CBNZ/TBZ/TBNZ (test Rt) | XZR | — | — |
| MRS/MSR | XZR | — | — |
| FEAT_CSSC (ABS/SMAX/CTZ/CNT) | XZR | XZR | XZR |

**Mnemonic**: SP appears only where address arithmetic happens (ADD/SUB immediate and extended, logical immediate destinations, load/store bases). Everything else uses XZR.

### 1.3 SIMD/FP Registers

32 registers, each 128 bits wide, with multiple views:

| Name | Size | Description |
|------|------|-------------|
| `B0`–`B31` | 8 bits | Byte |
| `H0`–`H31` | 16 bits | Half-word (also FP16) |
| `S0`–`S31` | 32 bits | Single-precision float |
| `D0`–`D31` | 64 bits | Double-precision float |
| `Q0`–`Q31` | 128 bits | Quadword (NEON) |
| `V0`–`V31` | 128 bits | Vector register (NEON), with arrangement specifiers like `V0.4S`, `V0.8H`, etc. |

`Q0`, `D0`, `S0`, `H0`, `B0` all refer to the **same physical register** (different widths of V0).
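As a rough illustration of the "same physical register" point, the views can be modelled in Python (not assembly, so it can run anywhere). This is only a sketch: `VIEW_BITS`, `read_view`, and `lanes` are hypothetical names invented here, and the model covers reads only, treating a V register as a plain 128-bit integer.

```python
# Hypothetical model: a V register is a 128-bit unsigned integer, and each
# scalar name (Bn/Hn/Sn/Dn/Qn) is a window onto its low-order bits.
VIEW_BITS = {"B": 8, "H": 16, "S": 32, "D": 64, "Q": 128}


def read_view(v_value: int, view: str) -> int:
    """Return what reading Bn/Hn/Sn/Dn/Qn would see for a 128-bit Vn value."""
    width = VIEW_BITS[view]
    return v_value & ((1 << width) - 1)


def lanes(v_value: int, arrangement: str) -> list:
    """Split a Vn value by an arrangement such as '4S' or '8B'; lane 0 is the lowest bits."""
    count, kind = int(arrangement[:-1]), arrangement[-1]
    width = VIEW_BITS[kind]
    mask = (1 << width) - 1
    return [(v_value >> (i * width)) & mask for i in range(count)]


# Example value chosen so every byte is distinguishable.
V0 = 0x00112233445566778899AABBCCDDEEFF

for name in VIEW_BITS:
    print(f"{name}0 = {read_view(V0, name):#x}")
print("V0.4S lanes:", [hex(lane) for lane in lanes(V0, "4S")])
```

Every scalar view is a prefix of the next wider one, and lane 0 of any arrangement coincides with the scalar view of the same width.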
**Why FP and NEON share registers**: ARM could have had separate FP and SIMD register files, but sharing means you can use NEON instructions to manipulate float bit patterns (e.g., `FMOV S0, W0` puts an integer into the FP register, then `FADD S0, S0, S1` uses it as a float) without copying between register files. It also means the same `STP`/`LDP` save/restore callee-saved D8-D15 for both FP and NEON. **Writing via a scalar mnemonic zeroes the upper bits — but by-element writes don't**: If a scalar-form instruction writes to `S0` (32 bits), bits [127:32] of V0 **are** zeroed. Scalar forms include `FADD S0, S1, S2`, `FMUL D0, D1, D2`, `LDR S0, [X0]`, `FMOV S0, W0` — anything that uses the Sd/Dd/Hd/Bd register name directly. (Qd is already 128-bit and writes the full register, so this doesn't apply.) Many 64-bit vector forms also clear the upper half (e.g. `ADD V0.8B, V1.8B, V2.8B` zeroes bits [127:64] of V0), but this is instruction-specific: narrowing instructions like `XTN`/`FCVTN`/`UQXTN`/`SQXTN` clear the upper half, while their `2`-suffixed counterparts (`XTN2`, `FCVTN2`, `UQXTN2`, `SQXTN2`, `BFCVTN2`, etc.) write **only** the upper half and preserve the lower half. Always check the individual instruction's pseudocode for its upper-half behavior. However, **by-element writes do not zero the rest**. Instructions that target a specific lane via the `Vn.T[i]` syntax — `MOV V0.S[0], W0`, `INS V0.S[0], V1.S[0]`, `LD1 {V0.S}[0], [X1]`, etc. — modify **only** the specified lane and leave the other lanes untouched. Per the ARM ARM: "the vector register element name `Vn.S[0]` is not equivalent to the scalar register name `Sn`. Although they represent the same bits in the register, they select different instruction encoding forms." So the rule is really about which encoding you selected, not about the bit-width accessed. When in doubt: a plain `Sd`/`Dd`/`Hd`/`Bd` operand zero-extends; a `Vn.T[i]` operand does not. 
This is subtler than the W→X zero-extension rule for GPRs (which has no by-element analog and is always unconditional).

---

## 2. Instruction Encoding Basics

All AArch64 instructions are **fixed-width 32 bits** (4 bytes), **aligned** to 4-byte boundaries.

Instructions are **always little-endian** in memory, regardless of the data endianness setting. Data can be big-endian or little-endian (controlled by `SCTLR_ELx.EE` for EL1 and above, and by `SCTLR_EL1.E0E` for EL0), but instruction fetch is always little-endian. In practice, nearly all AArch64 systems run little-endian for both — big-endian AArch64 is rare.

Major encoding groups (bits [28:25]), shown as a **rough high-level grouping** — each group is decoded further using additional bits, so these patterns are a teaching aid, not a complete decode rule:

| Bits [28:25] | Rough group |
|---|---|
| `100x` | Data processing — immediate |
| `101x` | Branches, exception generation, system |
| `x1x0` | Loads and stores |
| `x101` | Data processing — register |
| `x111` | Data processing — SIMD and FP |

These five patterns do not overlap; the remaining values of bits [28:25] are unallocated or used by extensions such as SVE (`0010`). For exact decoding, consult the "Top-level A64 instruction set encoding" table in the ARM Architecture Reference Manual (DDI 0487).

This fixed encoding is why many things that seem like they should be simple (e.g., loading a 64-bit constant) require multiple instructions or special tricks.

---

## 3. The S Suffix & Condition Flags

Most AArch64 data-processing instructions come in two forms: one that silently computes the result, and one (with an `S` suffix) that also updates the four condition flags (N, Z, C, V). Understanding when flags are set — and what they mean — is essential for branches, conditional selects, and multi-precision arithmetic. This chapter covers how flags are *set*; **§20 (Overflow, Underflow & Carry)** is the continuation that covers how to *detect and act on* unsigned-carry and signed-overflow conditions.
### 3.1 The PSTATE Condition Flags The processor has a set of state bits called **PSTATE** (Process State) that track the current exception level, interrupt masks, and condition flags. It's not a single register you can read — individual fields are accessed through special instructions and system registers (e.g., `MRS X0, NZCV`). **Brief aside — what's an "exception level"?** AArch64 defines four privilege levels, EL0 through EL3, where higher-numbered levels are more privileged. EL0 is unprivileged user-mode code (your application). EL1 is the OS kernel. EL2 is hypervisor mode (used by KVM, Xen, Hyper-V). EL3 is secure-monitor firmware (TrustZone). Each level has its own banked stack pointer, saved program state, and set of system registers with `_ELx` suffixes (e.g. `SCTLR_EL1` is only writable from EL1 or above). Many features and registers mentioned in this document have access restrictions based on the current EL — the ARM ARM spells out exactly which EL each register is accessible from. The four **condition flags** (N, Z, C, V) are the most important for everyday programming — they are how the CPU remembers the result of a comparison or arithmetic operation so a later instruction can act on it. Each flag is a single bit that is either **0** (clear/false) or **1** (set/true). These flags are updated by flag-setting instructions (like `ADDS`, `SUBS`, `ANDS`, `CMP`, `TST`) and read by conditional instructions (like `B.EQ`, `CSEL`, `CCMP`). 
| Flag | Name | Set to 1 when… |
|------|------|-----------|
| **N** | Negative | Result bit [63] (or [31] for 32-bit ops) is 1 |
| **Z** | Zero | Result is zero |
| **C** | Carry | An addition carried out of the top bit (unsigned overflow), or a subtraction did **not** borrow (see §3.3) |
| **V** | oVerflow | Signed overflow occurred (2's complement) |

### 3.2 The S Suffix

Most data-processing instructions have two forms:

```asm
ADD  X0, X1, X2      // X0 = X1 + X2, flags UNCHANGED
ADDS X0, X1, X2      // X0 = X1 + X2, flags UPDATED (N, Z, C, V)
SUB  X0, X1, X2      // X0 = X1 - X2, flags UNCHANGED
SUBS X0, X1, X2      // X0 = X1 - X2, flags UPDATED
```

The `S` suffix means "set flags." Without it, the instruction does not touch `NZCV`.

**Instructions that ALWAYS set flags** (no non-S form):

- `CMP` (alias for `SUBS` with `XZR`/`WZR` destination)
- `CMN` (alias for `ADDS` with `XZR`/`WZR` destination)
- `TST` (alias for `ANDS` with `XZR`/`WZR` destination)

**Instructions that NEVER set flags** (no S form exists):

- `UDIV`, `SDIV`
- `MUL`, `MADD`, `MSUB`, `SMULL`, `UMULL`, `SMULH`, `UMULH`
- All loads and stores
- All branches
- `MOV`, `MVN` (no `MOVS` exists — to set flags after a move, use `TST Xn, Xn` or `ANDS Xd, Xn, Xn`)

### 3.3 How Flags Are Set for ADD/SUB

ADD and SUB each have 3 instruction forms (immediate, shifted register, extended register), plus the flag-setting variants ADDS and SUBS, plus the aliases MOV to/from SP (from ADD imm with `#0`), NEG (from SUB shifted reg), CMP (from SUBS), and CMN (from ADDS). All forms produce the same flag results from ADDS/SUBS based on the actual computation — the form only affects which operands can be SP vs XZR.
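To make "the actual computation" concrete, here is a minimal Python sketch of how N, Z, C, and V can be derived from an addition. It follows the spirit of the ARM ARM's `AddWithCarry` pseudocode, but the function names (`add_with_carry`, `adds`, `subs`) and the tuple layout are choices made for this illustration, not anything the architecture defines.

```python
def add_with_carry(x, y, carry_in, datasize=64):
    """Model of an add that also derives (N, Z, C, V). Illustrative helper."""
    mask = (1 << datasize) - 1
    x &= mask
    y &= mask
    unsigned_sum = x + y + carry_in
    result = unsigned_sum & mask

    def signed(v):
        # Reinterpret a datasize-bit pattern as a two's complement integer.
        return v - (1 << datasize) if v >> (datasize - 1) else v

    signed_sum = signed(x) + signed(y) + carry_in
    n = result >> (datasize - 1)              # top bit of the result
    z = int(result == 0)
    c = int(unsigned_sum != result)           # carry out of the top bit
    v = int(signed(result) != signed_sum)     # signed result did not fit
    return result, (n, z, c, v)


def adds(x, y, datasize=64):
    return add_with_carry(x, y, 0, datasize)


def subs(x, y, datasize=64):
    # Subtraction is x + NOT(y) + 1, which is why C=1 means "no borrow".
    mask = (1 << datasize) - 1
    return add_with_carry(x, ~y & mask, 1, datasize)
```

For instance, `subs(5, 3)` yields the result 2 with flags `(0, 0, 1, 0)`, and `subs(3, 5)` yields `0xFFFFFFFFFFFFFFFE` with flags `(1, 0, 0, 0)`. Passing `datasize=32` models the W-register forms.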
**Form overview** (verified against ARM ARM C6.2.4/5/6/9/10/11):

| Form              | Non-flag-setting (ADD/SUB) | Flag-setting (ADDS/SUBS) |
| ----------------- | -------------------------- | ------------------------ |
| Immediate         | <code>ADD &lt;Xd&#124;SP&gt;, &lt;Xn&#124;SP&gt;, #imm{, LSL #0&#124;#12}</code> | <code>ADDS &lt;Xd&#124;XZR&gt;, &lt;Xn&#124;SP&gt;, #imm{, LSL #0&#124;#12}</code> |
| Shifted register  | <code>ADD &lt;Xd&#124;XZR&gt;, &lt;Xn&#124;XZR&gt;, &lt;Xm&#124;XZR&gt;{, LSL&#124;LSR&#124;ASR #0-63}</code> | <code>ADDS &lt;Xd&#124;XZR&gt;, &lt;Xn&#124;XZR&gt;, &lt;Xm&#124;XZR&gt;{, LSL&#124;LSR&#124;ASR #0-63}</code> |
| Extended register | <code>ADD &lt;Xd&#124;SP&gt;, &lt;Xn&#124;SP&gt;, &lt;Xm&#124;XZR&gt;{, &lt;extend&gt; #0-4}</code> | <code>ADDS &lt;Xd&#124;XZR&gt;, &lt;Xn&#124;SP&gt;, &lt;Xm&#124;XZR&gt;{, &lt;extend&gt; #0-4}</code> |

Each form has 32-bit `Wd|WZR` / `Wd|WSP` analogs with the same SP/XZR rules at 32-bit width.

**Key per-form rules to internalize**:

- Flag-setting variants (ADDS/SUBS) ALWAYS have XZR (not SP) for Rd. This asymmetry is what makes `CMP Xn, Xm` encodable as `SUBS XZR, Xn, Xm` — discarding the result while keeping the flags.
- Shifted-register form has XZR everywhere — no SP allowed in any operand. This is why `ADD X0, SP, X1` (SP as first source) does NOT use shifted-register encoding; the assembler picks the extended-register form instead.
- Extended-register form is the only way to use SP as Rn while ALSO having a register Rm. It exists primarily so SP arithmetic (e.g., `ADD SP, SP, X0`) is encodable.
- `<Xm|XZR>` (or `<R><m>` in ARM ARM notation) means register encoding 31 in the Rm position is XZR, never SP. SP can only appear as Rd or Rn in the forms that allow it.
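The per-form rules above reduce to a small lookup: given the instruction form, the operand position, and whether the instruction sets flags, register number 31 is read as either SP or XZR. The sketch below encodes that lookup in Python. The function name `reg31_meaning` and the string labels are hypothetical and exist only for this illustration.

```python
def reg31_meaning(form, field, sets_flags=False):
    """Return 'SP' or 'XZR' for register number 31 in an ADD/SUB operand.

    form:  'immediate', 'shifted', or 'extended'
    field: 'Rd', 'Rn', or 'Rm'
    """
    if form not in ("immediate", "shifted", "extended"):
        raise ValueError(f"unknown form: {form}")
    if field not in ("Rd", "Rn", "Rm"):
        raise ValueError(f"unknown field: {field}")
    if form == "immediate" and field == "Rm":
        raise ValueError("the immediate form has no Rm operand")

    if form == "shifted":
        return "XZR"      # no SP anywhere in the shifted-register form
    if field == "Rm":
        return "XZR"      # Rm is never SP
    if field == "Rd" and sets_flags:
        return "XZR"      # ADDS/SUBS write to XZR, which is what CMP/CMN rely on
    return "SP"           # Rd/Rn of the immediate and extended forms
```

So `reg31_meaning("extended", "Rn")` gives `"SP"`, while `reg31_meaning("immediate", "Rd", sets_flags=True)` gives `"XZR"`.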
**Flag rules for ADDS** (same regardless of form, treating result as `datasize` bits — 32 for W-form, 64 for X-form): - **N** = bit `<datasize-1>` of result (bit 63 for X-form, bit 31 for W-form) - **Z** = 1 if result == 0 - **C** = 1 if unsigned addition produced a carry out of bit `<datasize-1>` (i.e., result < operand1, treating as unsigned) - **V** = 1 if signed overflow (both operands same sign, result different sign) **Flag rules for SUBS** (computes `Xn - Xm`, internally as `Xn + NOT(Xm) + 1`, where NOT means flipping every bit — every 0 becomes 1 and every 1 becomes 0): - **N** = bit `<datasize-1>` of result - **Z** = 1 if result == 0 - **C** = 1 if **no borrow** occurred (i.e., Xn >= Xm unsigned). **Note: ARM uses inverted carry for subtraction.** - **V** = 1 if signed overflow **Key subtlety**: ARM's carry flag for subtraction is **inverted** compared to x86. `SUBS` sets C=1 when there is NO borrow (i.e., the first operand is greater than or equal to the second, unsigned). This catches many people off guard. **Why inverted carry?** ARM implements subtraction as `Xn + NOT(Xm) + 1`. The carry-out of this addition naturally equals 1 when `Xn >= Xm` (no borrow) and 0 when `Xn < Xm` (borrow). ARM uses this carry-out directly rather than inverting it. This simplifies the hardware — the ALU's carry-out is the C flag without any extra logic. x86 inverts it to create a "borrow" flag, which is more intuitive but requires an extra NOT gate. The ARM convention also means `HS` (Higher or Same, unsigned >=) directly tests `C==1`, which is the natural ALU output. **32-bit variant note**: `ADDS Wd, Wn, Wm` and friends operate on bits [31:0] of the source registers, produce a 32-bit result placed in the low 32 bits of Xd, and zero the upper 32 bits of Xd (per the W-register write-zeroes-upper rule from §1). Flags are set based on the 32-bit result: N = bit 31, C = carry out of bit 31, V = signed overflow at 32-bit width. 
This means `CMP W0, W1` and `CMP X0, X1` can produce different flags for the same numeric values if the upper 32 bits of X0/X1 differ from the W0/W1 sign-extensions — see §35.3 for the gotcha.

### 3.4 How Flags Are Set for Logical Operations

For `ANDS`, `BICS`:

- **N** = MSB (most significant bit — the leftmost bit, bit 63 for 64-bit or bit 31 for 32-bit) of result
- **Z** = 1 if result == 0
- **C** = 0 (always cleared)
- **V** = 0 (always cleared)

This means after a `TST` (which is `ANDS XZR, ...`), C and V are always 0.

---

## 4. Condition Codes

Used with conditional branches (`B.cond`), conditional selects (`CSEL`, `CSINC`, etc.), and in AArch32 with conditional execution of most instructions.

| Code | Meaning | Flags |
|------|---------|-------|
| `EQ` | Equal / zero | Z == 1 |
| `NE` | Not equal / non-zero | Z == 0 |
| `CS` / `HS` | Carry set / unsigned higher or same | C == 1 |
| `CC` / `LO` | Carry clear / unsigned lower | C == 0 |
| `MI` | Minus / negative | N == 1 |
| `PL` | Plus / positive or zero | N == 0 |
| `VS` | Overflow set | V == 1 |
| `VC` | Overflow clear | V == 0 |
| `HI` | Unsigned higher | C == 1 && Z == 0 |
| `LS` | Unsigned lower or same | C == 0 \|\| Z == 1 |
| `GE` | Signed greater or equal | N == V |
| `LT` | Signed less than | N != V |
| `GT` | Signed greater than | Z == 0 && N == V |
| `LE` | Signed less or equal | Z == 1 \|\| N != V |
| `AL` | Always (default) | Any |
| `NV` | Never (behaves as AL in AArch64) | — |

**Important** — `AL` and `NV` restrictions on conditional aliases: While `AL` and `NV` are valid encodings for the raw underlying instructions (`B.cond`, `CSEL`, `CSINC`, `CSINV`, `CSNEG`, `CCMP`, `CCMN`), the common **conditional aliases** `CSET`, `CSETM`, `CINC`, `CINV`, and `CNEG` architecturally **disallow AL and NV** (their `cond` field must satisfy `cond<3:1> ≠ '111'`). This is because these aliases invert the condition internally: e.g. `CSET Xd, EQ` encodes as `CSINC Xd, XZR, XZR, NE`.
If you tried `CSET Xd, AL`, it would need to emit `CSINC Xd, XZR, XZR, NV` — and since NV behaves as AL, the CSINC would always select XZR unchanged, making the result always zero. The encoding is wasted, so ARM excludes it. Assemblers will reject `CSET Xd, AL/NV` and friends. In contrast, `CSEL Xd, Xn, Xm, AL` is legal (just degenerates to "always pick Xn") — the raw non-alias forms accept any cond. So "NV behaves as AL" is only a useful statement for the raw CSEL/CSINC/CSINV/CSNEG/CCMP/CCMN encodings and for B.cond; for the common alias mnemonics above, AL and NV are simply not accepted. **Aliases**: `HS` is the same as `CS`. `LO` is the same as `CC`. They exist for readability — use `HS`/`LO` for unsigned comparisons, `CS`/`CC` when you care about the raw carry. **Why signed comparisons use N==V** (not just N): After `CMP X0, X1`, the result's sign bit (N) tells you whether `X0 - X1` is negative. If there's no overflow, negative result means X0 < X1, so N alone works. But with signed overflow, the sign bit is wrong — subtracting a large negative from a large positive overflows, giving a negative result even though X0 > X1. The V flag detects this overflow. When V=1, the sign bit is "inverted" from the true mathematical answer. So `N == V` correctly means "greater or equal": either both are 0 (positive result, no overflow = truly >=) or both are 1 (negative result, but overflow inverted it = actually >=). **Signed vs. 
unsigned after CMP**: ```asm CMP X0, X1 B.HI label // branch if X0 > X1 (unsigned) B.HS label // branch if X0 >= X1 (unsigned) B.LO label // branch if X0 < X1 (unsigned) B.LS label // branch if X0 <= X1 (unsigned) B.GT label // branch if X0 > X1 (signed) B.GE label // branch if X0 >= X1 (signed) B.LT label // branch if X0 < X1 (signed) B.LE label // branch if X0 <= X1 (signed) ``` **Traced example — what REALLY happens after CMP:** ```asm // If X0 = 5 and X1 = 3: CMP X0, X1 // SUBS XZR, X0, X1 → 5 - 3 = 2 // N=0 (result positive), Z=0 (result not zero) // C=1 (no borrow: 5 >= 3), V=0 (no signed overflow) // Flags: N=0 Z=0 C=1 V=0 B.GT label // GT = (Z==0 && N==V) = (true && 0==0) = true → TAKEN ✓ B.HI label // HI = (C==1 && Z==0) = (true && true) = true → TAKEN ✓ B.GE label // GE = (N==V) = (0==0) = true → TAKEN ✓ // If X0 = 3 and X1 = 5: CMP X0, X1 // 3 - 5: result wraps to 0xFFFF...FFFE // N=1 (bit 63 set), Z=0, C=0 (borrow: 3 < 5), V=0 B.LT label // LT = (N!=V) = (1!=0) = true → TAKEN ✓ B.LO label // LO = (C==0) = true → TAKEN ✓ // If X0 = 5 and X1 = 5: CMP X0, X1 // 5 - 5 = 0 // N=0, Z=1, C=1 (no borrow: 5 >= 5), V=0 B.EQ label // EQ = (Z==1) = true → TAKEN ✓ B.LE label // LE = (Z==1 || N!=V) = (true || false) = true → TAKEN ✓ B.HS label // HS = (C==1) = true → TAKEN ✓ (5 is "higher or same" as 5) ``` --- ## 5. Data Processing — Arithmetic Integer addition, subtraction, and negation — including the carry-using variants (`ADC`, `SBC`, `NGC`) for multi-word arithmetic. Most come in several operand forms (immediate, shifted register, extended register) and have `S`-suffix variants that update `NZCV` from the result. ### 5.1 ADD / SUB — Register Form `ADD` adds two values. `SUB` subtracts one value from another. These are the most fundamental arithmetic instructions. 
``` ADD Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} // Xd = Xn + (shifted Xm) [64-bit] ADD Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} // Wd = Wn + (shifted Wm) [32-bit] SUB Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} // Xd = Xn - (shifted Xm) SUB Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} // Wd = Wn - (shifted Wm) ADDS Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} // + set flags [64-bit] ADDS Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} // + set flags [32-bit] SUBS Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} // + set flags [64-bit] SUBS Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} // + set flags [32-bit] ``` `Xd` is the destination, `Xn` is the first source, `Xm` is the second source. The `{...}` part is optional — if omitted, the instruction is a plain register add/subtract with no shift. In the encoding, `ADD X0, X1, X2` and `ADD X0, X1, X2, LSL #0` are the same machine code — the encoding always has a 2-bit shift-type field and a 6-bit amount field; "no shift" is just `LSL #0` (shift type = 00, amount = 000000). In the **shifted register** encoding, register 31 in any field means **XZR** (the zero register), NOT SP. So `ADD X0, XZR, X5` is valid (adds zero + X5 → moves X5 into X0), but you **cannot** use SP here — for that you need the immediate or extended register encoding (§5.2 / §5.3). When a disassembler shows `add x0, xzr, x0`, that's this encoding with Xn = XZR. **Why no ROR?** Arithmetic shifted register only allows LSL, LSR, and ASR — NOT ROR. (Likely rationale, not documented ARM design intent: rotate-then-add has no common use case in compiled code, so the encoding space was used for other purposes.) Logical instructions (AND/ORR/EOR/BIC, §6) DO allow ROR because rotate-and-mask is useful in crypto and hash functions. 
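The field layout described above can be checked by packing the bits by hand. The following Python sketch assembles an ADD/SUB (shifted register) instruction word from its fields. The helper name `encode_add_shifted` and its parameters are made up for this illustration; the field positions are the ones given in the ARM ARM encoding diagram for this instruction class.

```python
SHIFT_TYPES = {"LSL": 0b00, "LSR": 0b01, "ASR": 0b10}  # 0b11 (ROR) is not allowed here


def encode_add_shifted(rd, rn, rm, shift="LSL", amount=0,
                       sub=False, set_flags=False, is64=True):
    """Pack an ADD/SUB (shifted register) instruction word. Illustrative helper."""
    if shift not in SHIFT_TYPES:
        raise ValueError("arithmetic shifted-register form allows only LSL, LSR, ASR")
    if not 0 <= amount <= (63 if is64 else 31):
        raise ValueError("shift amount out of range")
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 31:           # 31 means XZR/WZR in this form
            raise ValueError("register number out of range")
    word = 0x0B000000                    # bits [28:24] = 01011, bit 21 = 0
    word |= int(is64) << 31              # sf: 1 = 64-bit, 0 = 32-bit
    word |= int(sub) << 30               # op: 0 = ADD, 1 = SUB
    word |= int(set_flags) << 29         # S:  1 = ADDS/SUBS
    word |= SHIFT_TYPES[shift] << 22     # 2-bit shift type
    word |= rm << 16
    word |= amount << 10                 # 6-bit shift amount
    word |= rn << 5
    word |= rd
    return word
```

With this model, `encode_add_shifted(0, 1, 2)` and `encode_add_shifted(0, 1, 2, "LSL", 0)` produce the same word, `0x8B020020`, which is the point made above: "no shift" is just `LSL #0`.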
**What the register form REALLY does — traced:** ```asm // Plain register (no shift): // If X1 = 100, X2 = 25: ADD X0, X1, X2 // X0 = 100 + 25 = 125 // With shift — "add X1 plus (X2 shifted left by 2)": // If X1 = 0x1000 (base address), X2 = 5 (index): ADD X0, X1, X2, LSL #2 // X0 = 0x1000 + (5 << 2) = 0x1000 + 20 = 0x1014 // This computes base + index*4 (array of 4-byte elements) // With ASR — useful for signed division combined with addition: // If X1 = 100, X2 = -8: ADD X0, X1, X2, ASR #1 // X0 = 100 + (-8 >> 1) = 100 + (-4) = 96 ``` **32-bit note**: When using Wd forms, flags reflect the 32-bit result (N = bit 31, C/V from 32-bit arithmetic). The upper 32 bits of the Xd register are always zeroed — this is true for ALL instructions that write to a W register. ### 5.2 ADD / SUB — Immediate Form The immediate form adds or subtracts a constant value encoded directly in the instruction. This is the most common form for stack adjustments (`ADD SP, SP, #16`), small increments, and compile-time-known offsets. Register 31 means **SP** here (not XZR), which is why `ADD SP, SP, #16` works. ``` ADD Xd|SP, Xn|SP, #imm12{, LSL #12} // 64-bit; imm12 = 0–4095, optionally shifted left by 12 ADD Wd|WSP, Wn|WSP, #imm12{, LSL #12} // 32-bit; same encoding constraints SUB Xd|SP, Xn|SP, #imm12{, LSL #12} SUB Wd|WSP, Wn|WSP, #imm12{, LSL #12} ADDS Xd|XZR, Xn|SP, #imm12{, LSL #12} // + set flags (Rd is XZR not SP) [64-bit] ADDS Wd|WZR, Wn|WSP, #imm12{, LSL #12} // + set flags [32-bit] SUBS Xd|XZR, Xn|SP, #imm12{, LSL #12} // + set flags [64-bit] SUBS Wd|WZR, Wn|WSP, #imm12{, LSL #12} // + set flags [32-bit] ``` The immediate is a **12-bit unsigned value** (0–4095), optionally shifted left by 12 bits. This encoding is identical for both 32-bit and 64-bit forms — the same range of immediates is available. So the encodable values are `0–4095` OR `0–4095 shifted left by 12` (i.e., multiples of 4096 up to 4095×4096 = 16,773,120). 
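The encodability rule can be written as a short check. This is a minimal Python sketch, assuming a non-negative constant; the name `encode_addsub_imm` and the `(imm12, shift_flag)` return shape are hypothetical conveniences for this example.

```python
def encode_addsub_imm(value):
    """Return (imm12, shift_flag) if value fits ADD/SUB immediate, else None."""
    if 0 <= value <= 0xFFF:
        return value, 0                      # fits unshifted
    if value & 0xFFF == 0 and 0 < (value >> 12) <= 0xFFF:
        return value >> 12, 1                # multiple of 4096, fits with LSL #12
    return None                              # needs another instruction sequence


for constant in (42, 0x1000, 0x123000, 5000):
    print(hex(constant), encode_addsub_imm(constant))
```

A value such as 5000 is rejected because it is larger than 4095 and its low 12 bits are not zero, so neither the unshifted nor the shifted slot can hold it.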
**What the hardware actually encodes**: The instruction has a 12-bit immediate field and a 1-bit shift flag. The shift flag is either 0 (no shift) or 1 (shift the 12-bit value left by 12 positions). When you write a large number like `#0x123000`, the assembler breaks it down for you — it figures out that 0x123000 = 0x123 shifted left by 12, so it stores `imm12 = 0x123` with the shift flag set. You never see this in source code, but you might see it in a disassembler. ```asm // What you write: // What the hardware actually encodes: ADD X0, X1, #42 // imm12 = 42, shift = 0 → X1 + 42 ADD X0, X1, #0x1000 // imm12 = 1, shift = 1 → X1 + (1 << 12) = X1 + 4096 ADD X0, X1, #0x123000 // imm12 = 0x123, shift = 1 → X1 + (0x123 << 12) ADD X0, X1, #5000 // ERROR: 5000 = 0x1388 // 0x1388 > 4095, so it doesn't fit in 12 bits unshifted // 0x1388 is not a multiple of 4096, so shift doesn't help // The assembler cannot encode this — it will error ``` A disassembler may show `ADD X0, X1, #0x123, LSL #12` instead of `ADD X0, X1, #0x123000` — they mean the same thing, it's just showing the raw encoding fields. The assembler may silently convert `ADD Xd, Xn, #-5` into `SUB Xd, Xn, #5` if the negative immediate can be encoded as a positive immediate of the opposite instruction. This is a common assembler convenience. ### 5.3 ADD / SUB — Extended Register Form The extended register form sign-extends or zero-extends a narrow value (8/16/32-bit) from the second source register to the full width, optionally shifts it left by 0–4, then adds/subtracts. This is how the hardware computes array addresses like `base + (int32_index * element_size)` in one instruction. Register 31 means **SP** in Rd/Rn (so `ADD SP, SP, X0` works) but **XZR** in Rm. When you write `ADD X0, SP, X1, LSL #3`, the assembler automatically picks this encoding (not shifted register), because SP is only valid here. 
``` ADD Xd|SP, Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} // 64-bit, 32-bit index ADD Xd|SP, Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} // 64-bit, 64-bit index ADD Wd|WSP, Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} SUB Xd|SP, Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} SUB Xd|SP, Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} // 64-bit, 64-bit index SUB Wd|WSP, Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} ADDS Xd|XZR, Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} // + set flags (Rd=XZR not SP) ADDS Xd|XZR, Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} ADDS Wd|WZR, Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} SUBS Xd|XZR, Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} // + set flags SUBS Xd|XZR, Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} SUBS Wd|WZR, Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} ``` The extend operations are: | Extend | Meaning | |--------|---------| | `UXTB` | Unsigned extend byte (bits [7:0]) | | `UXTH` | Unsigned extend halfword (bits [15:0]) | | `UXTW` | Unsigned extend word (bits [31:0]) | | `UXTX` | Unsigned extend doubleword (bits [63:0], effectively no extension) | | `SXTB` | Signed extend byte | | `SXTH` | Signed extend halfword | | `SXTW` | Signed extend word | | `SXTX` | Signed extend doubleword | | `LSL` | Alias for UXTX (64-bit) or UXTW (32-bit) — same encoding. **Only the preferred spelling when Rd or Rn is SP/WSP**; otherwise the ARM ARM uses `UXTX`/`UXTW`. Not shown in the syntax lines above because it is not a separate encoding — `LSL #3` and `UXTX #3` produce identical machine code. 
| The `{#0-4}` shift is applied **after** extension — it left-shifts the extended value by 0–4 positions (i.e., multiply by 1, 2, 4, 8, or 16). The braces mean the amount is optional for all extend types: `ADD X0, X1, W2, SXTW` is valid and means the same as `ADD X0, X1, W2, SXTW #0`. In the hardware encoding, the 3-bit `imm3` field always exists — "omitting" the amount just means the assembler sets `imm3 = 000`. There is no separate encoding without a shift. **LSL in extended register — the full story, ISA-level vs assembler-level**: At the **ISA level** there is no separate "LSL" extend option. The encoding has only 8 values in the 3-bit `option` field (UXTB/UXTH/UXTW/UXTX/SXTB/SXTH/SXTW/SXTX) plus a 3-bit `imm3` shift amount architecturally constrained to 0–4 (shift>4 is UNDEFINED per ARM ARM pseudocode: `if shift > 4 then UNDEFINED`). `LSL` is not one of those 8 options — it is purely an *assembler spelling* for `option=011` (UXTX) in 64-bit forms, or `option=010` (UXTW) in 32-bit forms. UXTX #n on a 64-bit operand and UXTW #n on a 32-bit operand are pure left-shifts (no width change) because there are no bits above the operand to zero-extend. So `UXTX #3` of a 64-bit Xm is `LSL #3` of Xm — same operation, same bits. **What the assembler does**: GAS and LLVM use the `LSL` spelling only when **at least one of Rd or Rn is SP/WSP** and the `Rm` operand width matches the data width (Xm for 64-bit, Wm for 32-bit). Outside that case, they emit the `UXTX` / `UXTW` spelling. Both assemblers require an explicit `#<n>` after `LSL` — bare `LSL` is a syntax error. The architectural reason this convention exists: when `option=011` (64-bit) or `option=010` (32-bit) with `imm3=0`, the ARM ARM preferred disassembly is to *omit the extend entirely* (you just see `ADD X0, SP, X1`) because UXTX #0 on a 64-bit operand is a no-op. That leaves `LSL #<n>` with n ∈ 1..4 as the only reason to write `LSL` at all. 
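A small model makes the "extend first, then shift" order and the `UXTX`/`LSL` equivalence easy to verify. The Python sketch below covers only the 64-bit forms; `extend_reg` and `add_extended` are illustrative names, not architectural ones.

```python
EXTENDS = {
    # name: (source width in bits, signed?)
    "UXTB": (8, False), "UXTH": (16, False), "UXTW": (32, False), "UXTX": (64, False),
    "SXTB": (8, True), "SXTH": (16, True), "SXTW": (32, True), "SXTX": (64, True),
}
MASK64 = (1 << 64) - 1


def extend_reg(value, extend, shift=0):
    """Extend the low bits of value, then shift left by 0-4 (64-bit result)."""
    if not 0 <= shift <= 4:
        raise ValueError("shift must be 0-4 in the extended register form")
    width, signed = EXTENDS[extend]
    low = value & ((1 << width) - 1)     # keep only the source-width bits
    if signed and low >> (width - 1):
        low -= 1 << width                # top bit set: treat as negative
    return (low << shift) & MASK64


def add_extended(xn, rm, extend, shift=0):
    """Model of ADD Xd, Xn, Rm, <extend> #shift."""
    return (xn + extend_reg(rm, extend, shift)) & MASK64
```

For a 64-bit operand, `extend_reg(x, "UXTX", 3)` equals `(x << 3) & MASK64`, which is why the assembler can spell the same bits as `LSL #3`. With a 32-bit index of `0xFFFFFFFE`, `SXTW #3` contributes -16 while `UXTW #3` contributes `0x7FFFFFFF0`.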
**Practical rules for writing/reading code**: - `ADD X0, SP, X1, LSL #3` — accepted; encodes option=011 imm3=3. Preferred spelling because Rn=SP. - `ADD X0, SP, X1, LSL #0` — accepted by most assemblers; encodes option=011 imm3=0; preferred-disassembles back as `ADD X0, SP, X1` (no extend shown). - `ADD X0, SP, X1, LSL` — REJECTED by GAS/LLVM (syntax error — `#<n>` is required). - `ADD X0, SP, X1` — accepted; same machine code as `ADD X0, SP, X1, LSL #0` or `ADD X0, SP, X1, UXTX #0`. - `ADD X0, X1, X2, UXTX #3` — accepted; uses UXTX spelling because no SP is involved. Same machine code as the hypothetical `LSL #3` would be, but LSL spelling is not preferred here. - `ADD X0, X1, X2, LSL #3` — accepted, but this is the *shifted-register* encoding (a DIFFERENT instruction class), not the extended-register encoding. The assembler chose shifted-reg because no SP is involved. Same visible result for this particular case, different bits. This is extremely useful for array indexing: ```asm // X1 = base address, W2 = 32-bit index // Access array of 8-byte elements: base + sign_extend(index) * 8 LDR X0, [X1, W2, SXTW #3] // #3 means shift left by 3 = multiply by 8 // In ADD form: ADD X0, X1, W2, SXTW #3 // X0 = X1 + sign_extend(W2) << 3 ``` **What the extended register form REALLY does — traced step by step:** ```asm // ADD X0, X1, W2, SXTW #3 // If X1 = 0x00010000 (base address = 65536) and W2 = 0xFFFFFFFE (-2 as signed 32-bit): // // Step 1: Sign-extend W2 from 32 to 64 bits: // W2 = 0xFFFFFFFE → sign bit (bit 31) = 1 → extend with 1s // Extended = 0xFFFFFFFF_FFFFFFFE = -2 as 64-bit // // Step 2: Shift left by 3 (multiply by 8): // 0xFFFFFFFF_FFFFFFFE << 3 = 0xFFFFFFFF_FFFFFFF0 = -16 // // Step 3: Add to X1: // X0 = 0x00010000 + 0xFFFFFFFF_FFFFFFF0 = 0x00000000_0000FFF0 // = 0x10000 - 16 = 0xFFF0 = 65520 // // This is: base_address + (signed_index * element_size) // It computed &array[-2] for 8-byte elements — going backwards 2 elements from the base. 
// Without SXTW, you'd need to sign-extend manually: SXTW X3, W2 // X3 = sign_extend(W2) ADD X0, X1, X3, LSL #3 // X0 = X1 + X3 * 8 // The extended register form does both in one instruction. ``` **Why this form exists**: Array indexing with 32-bit indices into 64-bit address space is extremely common. C/C++ code uses `int` (32-bit) for array indices, but pointers are 64-bit. The extended register form does the sign/zero extension AND the element-size multiplication in a single instruction, saving 1-2 instructions per array access. **Note on SP**: Both the immediate form and the extended register form accept `SP` as source and destination (register 31 = SP). The shifted register encoding uses XZR (not SP) for register 31. In practice, if you write `ADD X0, SP, X1, LSL #2`, the assembler automatically selects the extended register encoding (where `LSL` is an alias for `UXTX`), so it works. This is why a disassembler may show `add x0, sp, x0` or `add x0, sp, x0, lsl #2` — these use the **extended register** encoding (not shifted register), because SP can only appear in that encoding. The distinction only matters if you're hand-encoding machine code. **How to tell which encoding was used from disassembly:** ```asm // 8b0003e0 add x0, xzr, x0 ← Shifted register encoding (Rn field = 31 = XZR) // 8b2063e0 add x0, sp, x0 ← Extended register encoding (Rn field = 31 = SP, extend = UXTX) // 8b206be0 add x0, sp, x0, lsl #2 ← Extended register encoding (Rn = SP, extend = UXTX #2) // These are DIFFERENT opcodes even though they both say "add". ``` ### 5.4 ADC / SBC — Add/Subtract with Carry `ADC` (Add with Carry) adds two registers **plus** the current carry flag (C). The carry flag is a single bit in the processor's flags register — it is either 0 or 1, and it was set by a previous flag-setting instruction like `ADDS` or `SUBS`. This lets you chain additions across multiple registers to handle numbers bigger than 64 bits. 
`SBC` (Subtract with Carry) subtracts using the carry flag as a "borrow" indicator. It computes `Xn + NOT(Xm) + C`. When C=1 (no borrow from previous subtraction), this simplifies to `Xn - Xm`. When C=0 (there was a borrow), this gives `Xn - Xm - 1`, propagating the borrow. ``` ADC Xd|XZR, Xn|XZR, Xm|XZR // Xd = Xn + Xm + C [64-bit] ADC Wd|WZR, Wn|WZR, Wm|WZR // Wd = Wn + Wm + C [32-bit] SBC Xd|XZR, Xn|XZR, Xm|XZR // Xd = Xn + NOT(Xm) + C [64-bit] SBC Wd|WZR, Wn|WZR, Wm|WZR // Wd = Wn + NOT(Wm) + C [32-bit] ADCS Xd|XZR, Xn|XZR, Xm|XZR // + set flags [64-bit] ADCS Wd|WZR, Wn|WZR, Wm|WZR // + set flags [32-bit] SBCS Xd|XZR, Xn|XZR, Xm|XZR // + set flags [64-bit] SBCS Wd|WZR, Wn|WZR, Wm|WZR // + set flags [32-bit] ``` Here, `C` is the carry flag value (0 or 1) from the PSTATE condition flags, as set by the most recent flag-setting instruction. No shift or immediate forms exist for ADC/SBC. These are essential for **multi-word arithmetic** (e.g., 128-bit addition): ```asm // 128-bit addition: (X1:X0) + (X3:X2) -> (X1:X0) ADDS X0, X0, X2 // add low 64 bits, set carry ADC X1, X1, X3 // add high 64 bits + carry ``` **What this REALLY does — traced with values:** ```asm // Add 0x00000000_00000001:FFFFFFFF_FFFFFFFE + 0x00000000_00000000:00000000_00000005 // X1:X0 = 0x0000000000000001 : 0xFFFFFFFFFFFFFFFE // X3:X2 = 0x0000000000000000 : 0x0000000000000005 ADDS X0, X0, X2 // 0xFFFFFFFFFFFFFFFE + 5 = 0x0000000000000003 (wraps! C=1) ADC X1, X1, X3 // 0x0000000000000001 + 0 + 1(carry) = 0x0000000000000002 // Result: X1:X0 = 0x0000000000000002:0000000000000003 = correct 128-bit sum ``` SBC is used similarly for multi-word subtraction. Note that SBC uses the carry flag in the same inverted sense as ARM subtraction: C=1 means no borrow. 
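The carry chaining for both directions can be modeled in a few lines. This Python sketch uses one 64-bit add-with-carry primitive for `ADDS`/`ADC` and for `SUBS`/`SBC`; the helper names (`add_carry64`, `add128`, `sub128`) are invented for this example.

```python
MASK64 = (1 << 64) - 1


def add_carry64(x, y, carry_in):
    """64-bit add returning (result, carry_out), as ADCS would produce."""
    total = (x & MASK64) + (y & MASK64) + carry_in
    return total & MASK64, total >> 64


def add128(hi_a, lo_a, hi_b, lo_b):
    lo, c = add_carry64(lo_a, lo_b, 0)                # ADDS: carry-in is 0
    hi, _ = add_carry64(hi_a, hi_b, c)                # ADC:  carry-in is C
    return hi, lo


def sub128(hi_a, lo_a, hi_b, lo_b):
    lo, c = add_carry64(lo_a, ~lo_b & MASK64, 1)      # SUBS: Xn + NOT(Xm) + 1
    hi, _ = add_carry64(hi_a, ~hi_b & MASK64, c)      # SBC:  Xn + NOT(Xm) + C
    return hi, lo
```

Running `add128(1, 0xFFFFFFFFFFFFFFFE, 0, 5)` reproduces the traced addition above and returns `(2, 3)`. In `sub128`, a low-word borrow shows up as `c == 0`, which makes the high word come out one lower.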
**What SBC REALLY does — traced:**

```asm
// 128-bit subtraction: (X1:X0) - (X3:X2) → (X1:X0)
// Subtract 0x0000000000000000:0000000000000002 - 0x0000000000000000:0000000000000005
// X1:X0 = 0x0000000000000000 : 0x0000000000000002
// X3:X2 = 0x0000000000000000 : 0x0000000000000005
SUBS X0, X0, X2    // 0x0000000000000002 - 5 = wraps to 0xFFFFFFFFFFFFFFFD
                   // C=0 (borrow occurred: 2 < 5)
SBC  X1, X1, X3    // X1 + NOT(X3) + C = 0 + NOT(0) + 0 = 0 + 0xFFFFFFFFFFFFFFFF + 0
                   // = 0xFFFFFFFFFFFFFFFF
// Result: 0xFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFD = correct (2 - 5 = -3 as 128-bit signed)
```

### 5.5 NEG / NEGS — Negate

`NEG` computes the two's complement negation of a value — it flips the sign. `NEG Xd, Xm` is equivalent to `0 - Xm`. It is an alias for `SUB Xd, XZR, Xm` (subtracting the value from the zero register).

```
NEG  Xd|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63}   // Alias: SUB Xd|XZR, XZR, Xm|XZR{, shift}  [64-bit]
NEG  Wd|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31}   // Alias: SUB Wd|WZR, WZR, Wm|WZR{, shift}  [32-bit]
NEGS Xd|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63}   // Alias: SUBS Xd|XZR, XZR, Xm|XZR{, shift}
NEGS Wd|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31}   // Alias: SUBS Wd|WZR, WZR, Wm|WZR{, shift}
```

### 5.6 NGC / NGCS — Negate with Carry

`NGC` negates a value while incorporating the carry flag, used for multi-word negation. It is an alias for `SBC Xd, XZR, Xm`, which computes `0 + NOT(Xm) + C`. In a multi-word negate, the first word uses `NEGS` (which sets the carry), and subsequent words use `NGC` to propagate the borrow.

```
NGC  Xd|XZR, Xm|XZR    // Alias for: SBC Xd|XZR, XZR, Xm|XZR   [64-bit]
NGC  Wd|WZR, Wm|WZR    // Alias for: SBC Wd|WZR, WZR, Wm|WZR   [32-bit]
NGCS Xd|XZR, Xm|XZR    // Alias for: SBCS Xd|XZR, XZR, Xm|XZR
NGCS Wd|WZR, Wm|WZR    // Alias for: SBCS Wd|WZR, WZR, Wm|WZR
```

---

## 6. Data Processing — Logical

Bitwise operations for masking, setting, clearing, and toggling bits. The immediate forms (§6.5) use a special "bitmask immediate" encoding that can represent many (but not all) bit patterns.
### 6.1 Basic Logical Instructions These perform bitwise operations — they operate on each bit position independently: - `AND`: Result bit is 1 only if **both** input bits are 1. Used for masking (extracting specific bits). - `ORR`: Result bit is 1 if **either or both** input bits are 1 (inclusive OR). Used for setting specific bits. - `EOR`: Result bit is 1 if the input bits are **different** (exclusive OR). Used for toggling bits and simple encryption. - `ANDS`: Same as AND, but also updates the condition flags (N, Z, C=0, V=0). ``` AND Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Xd = Xn & (shifted Xm) [64-bit] AND Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} // Wd = Wn & (shifted Wm) [32-bit] ORR Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Xd = Xn | (shifted Xm) [64-bit] ORR Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} EOR Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Xd = Xn ^ (shifted Xm) [64-bit] EOR Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} ANDS Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // AND + set flags [64-bit] ANDS Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} // AND + set flags [32-bit] ``` **Why all four shifts?** Unlike arithmetic instructions (ADD/SUB, which only allow LSL/LSR/ASR), logical instructions also allow **ROR** because rotate-and-mask patterns are common in cryptography, hash functions, and CRC computations. There is **no ORRS or EORS** instruction in AArch64. Only `ANDS` and `BICS` have flag-setting variants. If you need flags after ORR/EOR, follow with `TST` or `CMP`. ### 6.2 BIC — Bit Clear `BIC` stands for "Bit Clear." It ANDs the first operand with the bitwise NOT of the second — every bit that is 1 in Xm gets cleared (set to 0) in the result. Think of it as using Xm as a mask of which bits to turn off. 
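As a quick check of the "mask of bits to turn off" idea, BIC can be modeled as an AND with the inverted mask. This is a minimal Python sketch of the unshifted 64-bit form; the function name `bic` is just the mnemonic reused for illustration.

```python
MASK64 = (1 << 64) - 1


def bic(xn, xm):
    """Model of BIC Xd, Xn, Xm: clear every bit of xn that is set in xm."""
    return xn & ~xm & MASK64


print(hex(bic(0x1234, 0xFF)))   # low byte cleared
```

Clearing with the mask `0xFF` is the same operation as ANDing with `0xFFFFFFFFFFFFFF00`, so `bic(x, 0xFF) == x & 0xFFFFFFFFFFFFFF00` for any 64-bit `x`.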
``` BIC Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Xd = Xn & ~(shifted Xm) [64-bit] BIC Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} // Wd = Wn & ~(shifted Wm) [32-bit] BICS Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // same + set flags [64-bit] BICS Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} ``` **Important**: `BIC` only has a shifted-register form, not an immediate form. To clear bits by immediate, use `AND` with the inverted bitmask: ```asm // BIC X0, X0, #0xFF ← ILLEGAL, no BIC immediate form AND X0, X0, #0xFFFFFFFFFFFFFF00 // Correct: AND with inverted mask ``` ### 6.3 ORN / EON — OR-NOT / XOR-NOT `ORN` performs OR with the bitwise NOT of the second operand: it flips every bit of Xm, then ORs the result with Xn (`Xd = Xn | ~Xm`). `EON` does the same but with XOR: `Xd = Xn ^ ~Xm`. These save an instruction when you need to NOT a value before ORing or XORing it. ``` ORN Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Xd = Xn | ~(shifted Xm) [64-bit] ORN Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} EON Xd|XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Xd = Xn ^ ~(shifted Xm) [64-bit] EON Wd|WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} ``` These have no flag-setting forms and no immediate forms. ### 6.4 MVN — Move NOT (Bitwise NOT) `MVN` (Move NOT) flips every bit in the source register: every 0 becomes 1, every 1 becomes 0. This is called a bitwise NOT (written as `~Xm` in C). It is an alias for `ORN Xd, XZR, Xm` — ORing zero with `~Xm` just gives `~Xm`. ``` MVN Xd|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} // Alias: ORN Xd|XZR, XZR, Xm|XZR{, shift} [64-bit] MVN Wd|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} // Alias: ORN Wd|WZR, WZR, Wm|WZR{, shift} [32-bit] ``` **32-bit note**: `MVN W0, W1` inverts 32 bits and zeroes the upper 32 of X0. 
Different from `MVN X0, X1` which inverts all 64 bits. ### 6.5 Logical — Immediate Form The immediate form performs bitwise AND/ORR/EOR with a constant encoded in the instruction. Unlike ADD/SUB immediate (which uses a simple 12-bit value), logical immediate uses a special **bitmask encoding** that can represent many useful bit patterns (masks, alternating bits, aligned ranges) but NOT arbitrary constants. Register 31 in Rd means **SP** (not XZR) for the non-flag-setting forms, making `AND SP, X0, #0xFFF...` valid. Register 31 in Rn means **XZR**. ``` AND Xd|SP, Xn|XZR, #bitmask_imm // 64-bit; Rd is SP, Rn is XZR AND Wd|WSP, Wn|WZR, #bitmask_imm ORR Xd|SP, Xn|XZR, #bitmask_imm ORR Wd|WSP, Wn|WZR, #bitmask_imm EOR Xd|SP, Xn|XZR, #bitmask_imm EOR Wd|WSP, Wn|WZR, #bitmask_imm ANDS Xd|XZR, Xn|XZR, #bitmask_imm // flag-setting; Rd is XZR not SP [64-bit] ANDS Wd|WZR, Wn|WZR, #bitmask_imm ``` The bitmask immediate is **not** an arbitrary constant. Since every instruction must fit in 32 bits and the opcode and register fields already take up most of that space, there aren't enough bits left to store an arbitrary 64-bit constant. Instead, ARM uses a clever encoding that can represent a useful subset of bit patterns — things like masks, alternating bits, and aligned ranges — using only 13 bits. **How it works, step by step:** 1. **Pick an element size** `e`: must be 2, 4, 8, 16, 32, or 64 bits. 2. **Start with** `s` consecutive 1-bits at the bottom of an `e`-bit element, where `s` is at least 1 and at most `e-1`. (You can't have all zeros or all ones within the element.) For example, with `e=8` and `s=3`, you start with `00000111`. 3. **Right-rotate** that pattern within the `e`-bit element by `r` positions, where `r` is 0 to `e-1`. Bits that fall off the right wrap around to the left. For example, rotating `00000111` right by 1 gives `10000011` (the bottom 1 wraps to the top). 4. **Replicate** the `e`-bit element across the full 64-bit (or 32-bit) register. 
For example, with `e=8` and pattern `10000011`, you get `10000011_10000011_10000011_10000011_10000011_10000011_10000011_10000011` = `0x8383838383838383`. **Worked examples:** ``` // Example 1: e=64, s=8, r=0 // Step 2: 64-bit element with 8 ones at bottom: 0x00000000000000FF // Step 3: rotate right by 0 → unchanged: 0x00000000000000FF // Step 4: element is already 64 bits, no replication needed // Result: 0x00000000000000FF (= 0xFF) // Example 2: e=64, s=8, r=8 // Step 2: 8 ones at bottom: 0x00000000000000FF // Step 3: rotate right by 8 → 0xFF00000000000000 // (the 8 set bits at positions 0-7 wrap around to positions 56-63) // Step 4: no replication (64-bit element) // Result: 0xFF00000000000000 // Example 3: e=8, s=4, r=0 // Step 2: 8-bit element with 4 ones at bottom: 00001111 // Step 3: rotate right by 0 → 00001111 // Step 4: replicate 8 times: 0x0F0F0F0F0F0F0F0F // Result: 0x0F0F0F0F0F0F0F0F // Example 4: e=2, s=1, r=0 // Step 2: 2-bit element with 1 one at bottom: 01 // Step 3: rotate right by 0 → 01 // Step 4: replicate 32 times: 01010101...01 = 0x5555555555555555 // Result: 0x5555555555555555 // Example 5: e=2, s=1, r=1 // Step 2: 2-bit element with 1 one at bottom: 01 // Step 3: rotate right by 1 → 10 (the 1 wraps from bottom to top) // Step 4: replicate 32 times: 10101010...10 = 0xAAAAAAAAAAAAAAAA // Result: 0xAAAAAAAAAAAAAAAA // Example 6: e=64, s=32, r=32 // Step 2: 32 ones at bottom: 0x00000000FFFFFFFF // Step 3: rotate right by 32 → 0xFFFFFFFF00000000 // Step 4: no replication // Result: 0xFFFFFFFF00000000 ``` **What the hardware actually encodes**: The instruction stores these three parameters in 13 bits: 6 bits for `imms` (encodes both `e` and `s`), 6 bits for `immr` (encodes `r`), and 1 bit called `N` (helps determine `e`). The exact encoding is complex — consult the ARM ARM for the decode table — but the concept above is what it represents. 
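The four construction steps can be sketched in a few lines of Python. This is a conceptual model only: it works from the `(e, s, r)` parameters used in the explanation above, not from the real `N:immr:imms` field encoding, and the function names `bitmask_imm` and `is_encodable` are made up for this illustration.

```python
def bitmask_imm(e, s, r, width=64):
    """Build the constant for element size e, run length s, rotation r."""
    if e not in (2, 4, 8, 16, 32, 64) or e > width:
        raise ValueError("element size must be 2, 4, ..., up to the register width")
    if not 1 <= s <= e - 1:
        raise ValueError("run length must be between 1 and e-1")
    if not 0 <= r <= e - 1:
        raise ValueError("rotation must be between 0 and e-1")
    elem_mask = (1 << e) - 1
    pattern = (1 << s) - 1                       # step 2: s ones at the bottom
    pattern = ((pattern >> r) | (pattern << (e - r))) & elem_mask  # step 3: rotate right
    value = 0
    for i in range(width // e):                  # step 4: replicate across the register
        value |= pattern << (i * e)
    return value


def is_encodable(value, width=64):
    """Brute-force check: does any (e, s, r) combination produce this value?"""
    for e in (2, 4, 8, 16, 32, 64):
        if e > width:
            break
        for s in range(1, e):
            for r in range(e):
                if bitmask_imm(e, s, r, width) == value:
                    return True
    return False


print(hex(bitmask_imm(8, 3, 1)))     # the e=8, s=3, r=1 pattern from step 4
print(hex(bitmask_imm(64, 32, 32)))  # worked example 6
print(is_encodable(0xFF), is_encodable(5), is_encodable(0x12345678))
```

The brute-force search tries only a few thousand combinations, so it is fast enough for checking constants by hand. A real assembler derives the fields directly instead of searching.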
**32-bit vs 64-bit difference**: For Wd forms, the element size can only go up to 32 (not 64), and the pattern replicates to fill 32 bits. This means some bitmask immediates valid for Xd are NOT valid for Wd:

```asm
ORR X0, XZR, #0xFFFFFFFF00000000   // VALID: element=64, ones=32, rotate=32
ORR W0, WZR, #0xFFFFFFFF00000000   // ILLEGAL: needs element=64, but 32-bit only allows up to 32
ORR W0, WZR, #0xFFFF0000           // VALID: element=32, ones=16, rotate=16
```

**Quick test: is my value encodable?** Ask: can I describe it as a run of consecutive 1-bits, optionally rotated, within a 2/4/8/16/32/64-bit chunk, tiled across the register? If yes, it is encodable. Not encodable: `0x12345678`, `5`, `0x1234`, all zeros, all ones (the run must be between 1 and `e-1` bits long), and anything else without a repeating rotated-ones pattern.

Note that `AND`/`ORR`/`EOR` immediate forms accept **SP** as the destination register (not XZR), while `ANDS` immediate accepts XZR but NOT SP. This is because register 31 means different things in different contexts: in most instructions it means XZR (the zero register), but in certain instructions like `ADD` immediate and logical immediate (non-flag-setting), it means SP (the stack pointer). The hardware uses the opcode to decide which interpretation to use.

---

## 7. Shift & Rotate Operations

Shift instructions move all bits in a register left or right by a specified number of positions. They are fundamental to assembly: used for multiplication/division by powers of 2, bit extraction, and building complex values.

**Why shifts matter**: Shifting left by N multiplies by 2^N, and a shift is typically at least as cheap as a multiply (multiplies usually take a few cycles on current cores). Shifting right by N divides by 2^N (see §7.3 for how signed values round). Compilers use shifts extensively: `x * 12` can become `(x << 3) + (x << 2)` (two shifts and an add), which can be cheaper than a multiply on some cores. Shifts are also how you access individual bits and build/parse packed data formats (network headers, pixel formats, bitfields).
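The `x * 12` decomposition and the bit-extraction use of shifts can be checked with a small Python model. This is a sketch with 64-bit wraparound made explicit through a mask; `lsl`, `lsr`, `times12`, and `extract_field` are names invented for this illustration.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def lsl(x, n):
    return (x << n) & MASK64       # bits shifted past bit 63 are lost


def lsr(x, n):
    return (x & MASK64) >> n       # vacated high bits fill with zero


def times12(x):
    # x * 12 = x * 8 + x * 4 = (x << 3) + (x << 2), truncated to 64 bits
    return (lsl(x, 3) + lsl(x, 2)) & MASK64


def extract_field(x, low_bit, width):
    # Shift the field down to bit 0, then mask off everything above it
    return lsr(x, low_bit) & ((1 << width) - 1)


print(times12(7))                          # same as 7 * 12
print(hex(extract_field(0xABCD, 8, 4)))    # bits [11:8] of 0xABCD
```

The identity also holds when the product overflows, because both sides are reduced modulo 2^64.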
### 7.1 Dedicated Shift Instructions

These shift a register by a constant amount known at assemble time. They are all **aliases**, not separate opcodes: the hardware encodes them as bitfield (UBFM/SBFM) or extract (EXTR) instructions. The assembler and disassembler translate between the friendly names and the raw encodings automatically.

```
LSL Xd|XZR, Xn|XZR, #0-63    // Logical Shift Left (immediate)       [alias: UBFM]
LSL Wd|WZR, Wn|WZR, #0-31
LSR Xd|XZR, Xn|XZR, #0-63    // Logical Shift Right (immediate)      [alias: UBFM]
LSR Wd|WZR, Wn|WZR, #0-31
ASR Xd|XZR, Xn|XZR, #0-63    // Arithmetic Shift Right (immediate)   [alias: SBFM]
ASR Wd|WZR, Wn|WZR, #0-31
ROR Xd|XZR, Xn|XZR, #0-63    // Rotate Right (immediate)             [alias: EXTR]
ROR Wd|WZR, Wn|WZR, #0-31
```

| Instruction | Actually encodes as |
|---|---|
| `LSL Xd, Xn, #s` | `UBFM Xd, Xn, #(-s MOD 64), #(63-s)` |
| `LSR Xd, Xn, #s` | `UBFM Xd, Xn, #s, #63` |
| `ASR Xd, Xn, #s` | `SBFM Xd, Xn, #s, #63` |
| `ROR Xd, Xn, #s` | `EXTR Xd, Xn, Xn, #s` |
| `LSL Wd, Wn, #s` | `UBFM Wd, Wn, #(-s MOD 32), #(31-s)` |
| `LSR Wd, Wn, #s` | `UBFM Wd, Wn, #s, #31` |
| `ASR Wd, Wn, #s` | `SBFM Wd, Wn, #s, #31` |
| `ROR Wd, Wn, #s` | `EXTR Wd, Wn, Wn, #s` |

### 7.2 Variable (Register) Shifts

These shift a register by an amount stored in another register (determined at runtime). The shift amount is taken modulo the register width: `Xm MOD 64` for 64-bit, `Wm MOD 32` for 32-bit, so shifting by 65 is the same as shifting by 1. The real instruction mnemonics are LSLV/LSRV/ASRV/RORV; the assembler accepts LSL/LSR/ASR/ROR with three register operands as aliases.
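The modulo rule for register-specified shifts is easy to get wrong when porting from languages where an oversized shift count is undefined, so here is a small Python model of it. This is a sketch: the function names mirror the mnemonics, but the `width` parameter and the helpers themselves are illustrative only.

```python
def lslv(xn, xm, width=64):
    mask = (1 << width) - 1
    amount = xm % width                 # only the low 6 (or 5) bits of Xm count
    return (xn << amount) & mask


def lsrv(xn, xm, width=64):
    mask = (1 << width) - 1
    return (xn & mask) >> (xm % width)


def asrv(xn, xm, width=64):
    mask = (1 << width) - 1
    xn &= mask
    if xn >> (width - 1):               # reinterpret the bit pattern as signed
        xn -= 1 << width
    return (xn >> (xm % width)) & mask  # Python's >> on negatives copies the sign


def rorv(xn, xm, width=64):
    mask = (1 << width) - 1
    xn &= mask
    amount = xm % width
    return ((xn >> amount) | (xn << (width - amount))) & mask


print(lslv(1, 65))                      # shift by 65 behaves like shift by 1
print(lslv(1, 33, width=32))            # 32-bit form: 33 MOD 32 = 1
print(hex(rorv(1, 1)))                  # bit 0 wraps around to bit 63
```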
```
LSL Xd|XZR, Xn|XZR, Xm|XZR    // Alias for: LSLV Xd|XZR, Xn|XZR, Xm|XZR  [64-bit]
LSL Wd|WZR, Wn|WZR, Wm|WZR    // Alias for: LSLV Wd|WZR, Wn|WZR, Wm|WZR  [32-bit]
LSR Xd|XZR, Xn|XZR, Xm|XZR    // Alias for: LSRV Xd|XZR, Xn|XZR, Xm|XZR
LSR Wd|WZR, Wn|WZR, Wm|WZR
ASR Xd|XZR, Xn|XZR, Xm|XZR    // Alias for: ASRV Xd|XZR, Xn|XZR, Xm|XZR
ASR Wd|WZR, Wn|WZR, Wm|WZR
ROR Xd|XZR, Xn|XZR, Xm|XZR    // Alias for: RORV Xd|XZR, Xn|XZR, Xm|XZR
ROR Wd|WZR, Wn|WZR, Wm|WZR
```

`LSLV`, `LSRV`, `ASRV`, `RORV` are the real instruction mnemonics (all register fields use XZR for register 31, never SP). The assembler resolves `LSL Xd, Xn, Xm` (three registers) as `LSLV` and `LSL Xd, Xn, #imm` (register + immediate) as `UBFM`.

The shift amount uses only the lower 6 bits of `Xm` (for 64-bit) or the lower 5 bits of `Wm` (for 32-bit), which is the same as `Xm MOD 64` or `Wm MOD 32`.

**32-bit note**: Even though the shift register is `Wm`, only its low 5 bits matter. `LSL W0, W1, W2` where W2=33 shifts by 33 MOD 32 = 1.

### 7.3 Shift Semantics

- **LSL #n**: Shifts left, filling vacated bits with 0. Bits shifted out of the MSB are lost. Equivalent to multiplying by 2^n, truncated to the register width.
- **LSR #n**: Shifts right, filling vacated bits with 0. Equivalent to unsigned divide by 2^n (truncating toward zero).
- **ASR #n**: Shifts right, filling vacated bits with copies of the original MSB (sign bit). Equivalent to signed divide by 2^n rounding toward negative infinity (floor), NOT toward zero. This differs from C's `/` operator for negative numbers.
- **ROR #n**: Rotates right: bits shifted out the bottom re-enter at the top. No information is lost.

```asm
// ASR rounding toward -infinity example:
// If X0 = -7 (0xFFFFFFFFFFFFFFF9)
ASR X1, X0, #1    // X1 = -4 (not -3!)
                  // -7 / 2 = -3.5, rounded toward -infinity = -4
                  // C's -7/2 = -3 (rounded toward 0)
```

---

## 8. Shifted Register & Extended Register Forms

Many instructions accept a modified second operand, shifted or extended before the operation.
The shift or extension is folded into the same instruction (the "barrel shifter"), so no separate shift instruction is needed. The latency cost depends on the core: some implementations add a cycle for shifted or extended operands, so check the optimization guide for your target.

### 8.1 Shifted Register Operand

Many data-processing instructions accept a final operand of the form `Xm, <shift> #<amount>`. In **all** shifted register encodings, register 31 means **XZR** (never SP):

```
<op> Xd|XZR, Xn|XZR, Xm|XZR, LSL #amount
<op> Xd|XZR, Xn|XZR, Xm|XZR, LSR #amount
<op> Xd|XZR, Xn|XZR, Xm|XZR, ASR #amount
<op> Xd|XZR, Xn|XZR, Xm|XZR, ROR #amount    // Only for logical ops (AND/ORR/EOR/BIC/ORN/EON)
```

The shift is applied to `Xm` before the operation. This is often called a "barrel shift": the shifter sits in the data path, so the shift needs no separate instruction. On some cores a small `LSL` is free while other shifts add a cycle of latency.

**Which instructions support which shifts:**

| Instruction class | LSL | LSR | ASR | ROR |
|---|---|---|---|---|
| ADD/SUB (shifted reg) | ✓ | ✓ | ✓ | ✗ |
| AND/ORR/EOR/BIC/ORN/EON | ✓ | ✓ | ✓ | ✓ |
| CMP/CMN (shifted reg form, alias of SUBS/ADDS) | ✓ | ✓ | ✓ | ✗ |
| TST (shifted reg form, alias of ANDS) | ✓ | ✓ | ✓ | ✓ |
| NEG/NEGS (alias of SUB/SUBS with XZR) | ✓ | ✓ | ✓ | ✗ |
| MVN (alias of ORN with XZR) | ✓ | ✓ | ✓ | ✓ |

### 8.2 Extended Register Operand

Only ADD/SUB (and their S variants) and CMP/CMN support extended register. In the extended register encoding, register 31 in the `Xd` and `Xn` positions means **SP** (not XZR), while register 31 in the `Rm` position means **XZR**. For the flag-setting variants (ADDS/SUBS/CMP/CMN), `Xd` uses XZR instead of SP:

```
ADD Xd|SP, Xn|SP, Wm|WZR, UXTB {#0-4}    // Zero-extend byte from Wm, then shift left by 0–4
ADD Xd|SP, Xn|SP, Wm|WZR, SXTW {#0-4}    // Sign-extend word from Wm, then shift left by 0–4
SUB Xd|SP, Xn|SP, Wm|WZR, SXTW {#0-4}    // Subtract the extended value
CMP Xn|SP, Wm|WZR, SXTW {#0-4}           // Compare with extended value (SUBS XZR, ...)
```

The `{#0-4}` shift is applied **after** extension: `#0` = no shift (×1), `#1` = ×2, `#2` = ×4, `#3` = ×8, `#4` = ×16. The amount is optional; when omitted, the hardware encodes it as `#0` (the encoding always has a 3-bit shift field, and "omitting" just means that field is zero). These scale factors cover the common C element sizes, which is what the form is for: array indexing and address arithmetic.

**UXTB vs UXTH vs UXTW vs UXTX**: Each extracts a different-width chunk from the bottom of the register. UXTB takes 8 bits, UXTH takes 16, UXTW takes 32, UXTX takes all 64 (effectively just a shift). The S variants (SXTB, SXTH, SXTW, SXTX) do the same but sign-extend instead of zero-extend.

The same encoding viewed per-instruction (a per-extend semantics table, a fully traced `SXTW #3` walk-through with `W2 = 0xFFFFFFFE`, and the LSL-aliases-UXTX/UXTW disassembly story) lives in **§5.3 (ADD/SUB — Extended Register Form)**; this section is just that mechanism described as a generic operand modifier. When the *same* extends appear as standalone instructions (e.g. `SXTW Xd, Wn` with no base register), they are `SBFM`/`UBFM` aliases covered in **§12 (Sign/Zero Extension)** and **§13 (Bitfield Operations)**. So "extend" shows up in three places: as an operand modifier here, as a standalone alias in §12, and as the bitfield encoding underneath in §13.

### 8.3 When to Use Which Form

- **Shifted register**: When you need to combine an ALU operation with a shift (common in hash functions, crypto, bitfield manipulation).
- **Extended register**: When mixing 32-bit and 64-bit values, or computing addresses from an index that's smaller than 64 bits.
- **Immediate**: When the constant fits the encoding constraints.

**Side-by-side: three ways to compute `base + offset * 8`:**

```asm
// All three compute X0 = X1 + index*8, but suit different situations:

// 1. Shifted register (when X2 is already 64-bit):
ADD X0, X1, X2, LSL #3       // X0 = X1 + (X2 << 3)
                             // Use when: X2 is a 64-bit value

// 2. Extended register (when index is 32-bit):
ADD X0, X1, W2, SXTW #3      // X0 = X1 + (sign_extend(W2) << 3)
                             // Use when: W2 is a 32-bit signed index (like C's int)
ADD X0, X1, W2, UXTW #3      // X0 = X1 + (zero_extend(W2) << 3)
                             // Use when: W2 is a 32-bit unsigned index (like C's unsigned)

// 3. Immediate (when offset is a constant):
ADD X0, X1, #40              // X0 = X1 + 40
                             // Use when: the index is known at compile time (here 5, so 40 = 5*8)
```

**What shifted register REALLY does, traced:**

```asm
// ADD X0, X1, X2, LSL #3
// "Add X1 and (X2 shifted left by 3)" in one instruction
//
// If X1 = 0x1000 (base address) and X2 = 5 (index):
//   X2 LSL 3 = 5 × 8 = 40 = 0x28
//   X0 = 0x1000 + 0x28 = 0x1028
// This computes &array[5] for an 8-byte element array in one instruction.

// AND X0, X1, X2, ROR #16
// "AND X1 with (X2 rotated right by 16)"
//
// If X1 = 0xFFFF0000FFFF0000 and X2 = 0x00FF00FF00FF00FF:
//   X2 ROR 16 = 0x00FF00FF00FF00FF (unchanged: the pattern repeats every 16 bits)
//   X0 = X1 & (X2 ROR 16) = 0x00FF000000FF0000
// Rotate-then-mask steps like this are useful in crypto/hash mixing.
```

---

## 9. Move Instructions & Aliases

`MOV` is the most heavily aliased instruction in AArch64: it maps to different real instructions depending on the operand. Understanding these aliases is essential for reading disassembly.

### 9.1 MOV — The Most Aliased Instruction

`MOV` in AArch64 is **never** its own instruction.
It always assembles as something else:

| What you write | What it actually is | When |
|---|---|---|
| `MOV Xd\|XZR, Xm\|XZR` | `ORR Xd\|XZR, XZR, Xm\|XZR` | Register-to-register move (shifted-reg encoding; reg 31 = XZR) |
| `MOV Wd\|WZR, Wm\|WZR` | `ORR Wd\|WZR, WZR, Wm\|WZR` | 32-bit (zeroes upper 32) |
| `MOV Xd\|XZR, #imm` | `MOVZ Xd\|XZR, #imm` | If imm fits in 16 bits at some position |
| `MOV Wd\|WZR, #imm` | `MOVZ Wd\|WZR, #imm` | Same (only 2 positions: LSL #0, #16) |
| `MOV Xd\|XZR, #imm` | `MOVN Xd\|XZR, #adjusted` | If NOT(imm) fits in 16 bits |
| `MOV Wd\|WZR, #imm` | `MOVN Wd\|WZR, #adjusted` | NOT applied at 32-bit width |
| `MOV Xd\|SP, #imm` | `ORR Xd\|SP, XZR, #bitmask_imm` | If imm is a valid bitmask immediate (note: Rd = **SP** here, not XZR!) |
| `MOV Wd\|WSP, #imm` | `ORR Wd\|WSP, WZR, #bitmask_imm` | 32-bit bitmask immediate |
| `MOV Xd\|SP, SP` | `ADD Xd\|SP, SP, #0` | Moving from SP (immediate encoding; reg 31 = SP in both Rd and Rn) |
| `MOV SP, Xn\|SP` | `ADD SP, Xn\|SP, #0` | Moving to SP |

### 9.2 MOVZ — Move Wide with Zero

```
MOVZ Xd|XZR, #imm16{, LSL #0|#16|#32|#48}    // 64-bit: place 16-bit value at one of 4 positions
MOVZ Wd|WZR, #imm16{, LSL #0|#16}            // 32-bit: only 2 positions available
```

Places a 16-bit immediate into the specified 16-bit slot and **zeroes** all other bits. When the `LSL` part is omitted, the immediate is placed at bits [15:0] (the lowest slot): the encoding has a 2-bit `hw` field that selects the slot, and omitting `LSL` sets `hw = 00`.

**32-bit constraint**: The Wd form only allows `LSL #0` or `LSL #16` (two 16-bit slots in a 32-bit register). The Xd form allows `LSL #0`, `#16`, `#32`, or `#48` (four slots).
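The "fits in 16 bits at some position" tests from the §9.1 table, together with the MOVZ slot constraint, can be modeled as below. This is a simplified sketch: a real assembler applies extra tie-breaking rules when several encodings fit, and the names `movz`, `fits_movz`, and `fits_movn` are invented here.

```python
def movz(imm16, shift=0, width=64):
    """Model MOVZ: place imm16 in one 16-bit slot, all other bits zero."""
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    if shift not in range(0, width, 16):
        raise ValueError("shift must be a multiple of 16 below the register width")
    return imm16 << shift


def fits_movz(value, width=64):
    """True if value is a 16-bit chunk in one slot with zeros everywhere else."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit the register width")
    return any(value == value & (0xFFFF << shift) for shift in range(0, width, 16))


def fits_movn(value, width=64):
    """True if the bitwise NOT of value (at register width) fits the MOVZ shape."""
    mask = (1 << width) - 1
    return fits_movz(~value & mask, width)


print(hex(movz(0xABCD, 16)))                     # 16-bit value in the second slot
print(fits_movz(0xABCD0000), fits_movz(0x12345678))
print(fits_movn(0xFFFFFFFFFFFFFFFE))             # -2 as a 64-bit pattern

try:
    movz(0xABCD, 32, width=32)                   # no such slot in a W register
except ValueError as err:
    print("rejected:", err)
```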
```asm
MOVZ X0, #0xABCD, LSL #16    // X0 = 0x00000000ABCD0000
MOVZ W0, #0xABCD, LSL #16    // W0 = 0xABCD0000, X0 = 0x00000000ABCD0000 (upper zeroed)
MOVZ W0, #0xABCD, LSL #32    // ILLEGAL: only 0/16 for Wd
```

### 9.3 MOVK — Move Wide with Keep

```
MOVK Xd|XZR, #imm16{, LSL #0|#16|#32|#48}    // 64-bit: insert 16-bit value at one of 4 positions
MOVK Wd|WZR, #imm16{, LSL #0|#16}            // 32-bit: only 2 positions available
```

Places a 16-bit immediate into the specified slot, **keeping** all other bits unchanged. The operand format is the same as MOVZ: when the `LSL` part is omitted, the immediate targets bits [15:0] (`hw = 00`).

**32-bit constraint**: Same as MOVZ; the Wd form only has two 16-bit slots. Building a 32-bit constant requires at most 2 instructions:

```asm
// Load 0x12345678 into W0:
MOVZ W0, #0x5678              // W0 = 0x00005678
MOVK W0, #0x1234, LSL #16     // W0 = 0x12345678

// Load 0x123456789ABCDEF0 into X0 (needs 4):
MOVZ X0, #0xDEF0              // X0 = 0x000000000000DEF0
MOVK X0, #0x9ABC, LSL #16     // X0 = 0x000000009ABCDEF0
MOVK X0, #0x5678, LSL #32     // X0 = 0x00005678_9ABCDEF0
MOVK X0, #0x1234, LSL #48     // X0 = 0x12345678_9ABCDEF0
```

### 9.4 MOVN — Move Wide with NOT

```
MOVN Xd|XZR, #imm16{, LSL #0|#16|#32|#48}    // 64-bit: place, then bitwise-NOT all 64 bits
MOVN Wd|WZR, #imm16{, LSL #0|#16}            // 32-bit: NOT applies to 32-bit result, upper 32 zeroed
```

Like MOVZ but inverts **all** bits after placing the immediate. When the `LSL` part is omitted, the immediate targets bits [15:0] (`hw = 00`), then the entire register-width result is bitwise-NOTed. Useful for loading values like -1, -2, etc.

**32-bit form**: The NOT applies to the 32-bit result, and the upper 32 bits of Xd are zeroed (standard W-register write behavior):

```asm
MOVN X0, #0     // X0 = ~0x0000000000000000 = 0xFFFFFFFFFFFFFFFF = -1
MOVN X0, #1     // X0 = ~0x0000000000000001 = 0xFFFFFFFFFFFFFFFE = -2
MOVN W0, #0     // W0 = ~0x00000000 = 0xFFFFFFFF, X0 = 0x00000000FFFFFFFF (NOT -1 as 64-bit!)
MOVN W0, #1     // W0 = ~0x00000001 = 0xFFFFFFFE, X0 = 0x00000000FFFFFFFE
```

**RE (reverse engineering) trap**: `MOVN W0, #0` gives `X0 = 0x00000000FFFFFFFF`, NOT `0xFFFFFFFFFFFFFFFF`. If the compiler wants 64-bit -1, it uses `MOVN X0, #0`. Seeing `MOVN W0` in disassembly is a strong hint (not proof) that the original code was working with a 32-bit type.

### 9.5 MOV (bitmask immediate)

When a `MOV Xd, #imm` has an immediate that is a valid bitmask immediate, the assembler encodes it as:

```
ORR Xd|SP, XZR, #bitmask_imm     // Note: Rd=SP in logical immediate (non-S)!
ORR Wd|WSP, WZR, #bitmask_imm
```

```asm
MOV X0, #0xFF                    // → ORR X0, XZR, #0xFF
MOV X0, #0xAAAAAAAAAAAAAAAA      // → ORR X0, XZR, #0xAAAAAAAAAAAAAAAA
```

### 9.6 LDR (literal) for Arbitrary Constants

When no single-instruction encoding trick works, you can ask the assembler for a **literal pool** load. A literal pool is a small area of constant data that the assembler places in memory near your code (usually right after a function). Instead of encoding the constant inside the instruction, the CPU loads it from this nearby data using a PC-relative load:

```asm
LDR X0, =0x123456789ABCDEF0      // Pseudo-instruction
// Assembler places the constant in a nearby literal pool and generates
// a PC-relative load, conceptually: LDR X0, [PC, #offset_to_literal]
```

The `=` syntax is a GNU assembler (gas) convenience. Some assemblers use different syntax.

**Why literal pools exist**: AArch64 instructions are fixed at 32 bits, so there's simply not enough room to embed a full 64-bit constant. The best the ISA can do inline is MOVZ+MOVK (up to 4 instructions = 16 bytes of code). A literal pool load uses just 1 instruction (4 bytes of code) + 8 bytes of data, which is smaller for complex constants. It is not automatically faster, though: the load goes through the data cache and can miss, which is one reason compilers often prefer a MOVZ/MOVK sequence.

**How the assembler decides**: When you write `LDR X0, =val`, the assembler checks if `val` can be encoded more efficiently as a `MOV` (via MOVZ, MOVN, or bitmask immediate). If so, it emits the `MOV` instead. Only if no single-instruction encoding works does it fall back to a literal pool load.
Some assemblers (like LLVM's integrated assembler) are smarter than others about this. **Literal pool range**: The LDR (literal) instruction uses a 19-bit signed offset (±1 MB). The assembler must place the literal pool close enough to the load. For large functions, it may need to insert pools mid-function (after unconditional branches, so execution doesn't fall into the data). **Multiple loads of the same constant**: The assembler typically deduplicates — if you write `LDR X0, =0x1234` in three places, only one copy of 0x1234 appears in the literal pool. ### 9.7 ADR / ADRP — PC-Relative Address Loading ``` ADR Xd|XZR, label // Xd = PC + offset (±1 MB range, byte-aligned) ADRP Xd|XZR, label // Xd = (PC & ~0xFFF) + (offset << 12) (±4 GB, page-aligned) ``` `ADR` loads the exact address of a label into a register, using a 21-bit signed offset from PC (±1 MB range). `ADRP` loads the address of the **4 KB page** containing the label. Memory is divided into 4096-byte (0x1000) pages. `ADRP` zeroes the bottom 12 bits of PC (the `& ~0xFFF` part — `~0xFFF` is `0xFFFFFFFFFFFFF000`, a mask that clears the low 12 bits) and then adds a page-granularity offset. This gives ±4 GB range but only page-level precision. You then use `ADD` with `:lo12:` to add back the offset within the page: ```asm ADRP X0, my_global // X0 = page containing my_global ADD X0, X0, :lo12:my_global // X0 = exact address of my_global LDR X1, [X0] // X1 = value at my_global ``` This ADRP+ADD pattern is the standard way to access global variables in position-independent code (PIC) — code that works correctly regardless of where the OS loads it in memory. Since ADRP computes addresses relative to PC, the code doesn't contain any hardcoded absolute addresses. --- ## 10. 
Comparison & Test Instructions **Why separate comparison instructions exist**: You could compare two values using `SUBS` and ignoring the result, but the comparison instructions (`CMP`, `CMN`, `TST`) make intent clear and — crucially — write to the zero register instead of a GPR. This means they don't consume a register for an unwanted result. `CMP X0, X1` is literally `SUBS XZR, X0, X1` — the subtraction happens, flags are set, and the result is discarded into XZR. ### 10.1 CMP — Compare `CMP` subtracts the second operand from the first and sets the condition flags (N, Z, C, V) based on the result, but **discards the result** — it writes to the zero register. It is used before conditional branches or conditional selects to set up the flags. `CMP Xn, Xm` is an alias for `SUBS XZR, Xn, Xm`. ``` CMP Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} // Alias: SUBS XZR, Xn|XZR, Xm|XZR{, shift} [shifted-reg] CMP Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} // 32-bit shifted-reg CMP Xn|SP, #imm12{, LSL #12} // Alias for: SUBS XZR, Xn|SP, #imm12{, LSL #12} [immediate] CMP Wn|WSP, #imm12{, LSL #12} // 32-bit immediate CMP Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} // extended-reg CMP Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} // extended-reg, 64-bit Rm CMP Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} // 32-bit extended-reg ``` The result is discarded (written to XZR/WZR), only flags are kept. **32-bit note**: `CMP Wn, Wm` sets flags based on 32-bit subtraction: N = bit 31 of result, C/V from 32-bit arithmetic. This matters for signed comparisons — `CMP W0, W1; B.GT` checks if W0 > W1 as signed 32-bit values, regardless of the upper 32 bits of X0/X1. ### 10.2 CMN — Compare Negative `CMN` ("Compare Negative") **adds** the two operands and sets flags, discarding the result. It is an alias for `ADDS XZR, Xn, Xm`. 
It is useful when you want to compare against a negative number: the `CMP` immediate field only holds non-negative values, so `CMN X0, #5` is how you test whether `X0 == -5`.

```
CMN Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63}     // Alias: ADDS XZR, Xn|XZR, Xm|XZR{, shift}  [shifted-reg]
CMN Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31}     // 32-bit shifted-reg
CMN Xn|SP, #imm12{, LSL #12}                            // Alias for: ADDS XZR, Xn|SP, #imm12{, LSL #12}  [immediate]
CMN Wn|WSP, #imm12{, LSL #12}                           // 32-bit immediate
CMN Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4}    // extended-reg
CMN Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4}              // extended-reg, 64-bit Rm
CMN Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4}   // 32-bit extended-reg
```

`CMN` is like `CMP` but adds instead of subtracts. Comparing `Xn` with `CMN Xn, Xm` gives the same N and Z as comparing `Xn` against the negation of `Xm`, but the flags are NOT identical in every case: C differs when `Xm` is 0, and V differs when `Xm` is the most negative value (whose negation does not fit). Use case: `CMP X0, #-5` has no direct encoding (negative immediate), but `CMN X0, #5` does; many assemblers accept the `CMP` spelling and emit the `CMN`.

### 10.3 TST — Test Bits

`TST` performs a bitwise AND of two operands and sets flags, discarding the result. It is an alias for `ANDS XZR, Xn, op2`. Used to check if specific bits are set: after `TST X0, #1`, the zero flag (Z) tells you whether bit 0 was set (Z=0 means the bit was set; Z=1 means it was clear).

```
TST Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63}    // Alias: ANDS XZR, Xn|XZR, Xm|XZR{, shift}  [shifted-reg]
TST Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31}    // 32-bit shifted-reg
TST Xn|XZR, #bitmask_imm                                          // Alias for: ANDS XZR, Xn|XZR, #bitmask_imm  [immediate]
TST Wn|WZR, #bitmask_imm                                          // 32-bit immediate
```

Sets N and Z based on the AND result (C and V are cleared).

```asm
TST X0, #1          // Test bit 0 (is X0 odd?)
B.NE is_odd         // branch if bit was set (Z==0)

TST X0, #0xF        // Test lower nibble
B.EQ lower_zero     // branch if lower nibble is all zeros
```

### 10.4 CCMP / CCMN — Conditional Compare

`CCMP` checks a condition (from the current flags), and only performs its comparison if the condition is true. If the condition is false, it sets the flags to a value you choose via `#nzcv` (a 4-bit constant: bit 3=N, bit 2=Z, bit 1=C, bit 0=V). This lets you chain multiple comparisons into compound boolean expressions (AND / OR) without any branches. `CCMN` is the same but adds instead of subtracts (like CMN vs CMP).

```
CCMP Xn|XZR, Xm|XZR, #nzcv, cond     // If cond true: compare Xn, Xm. Else: flags = #nzcv.  [64-bit]
CCMP Wn|WZR, Wm|WZR, #nzcv, cond
CCMP Xn|XZR, #imm5, #nzcv, cond      // Same with 5-bit immediate (0–31)  [64-bit]
CCMP Wn|WZR, #imm5, #nzcv, cond
CCMN Xn|XZR, Xm|XZR, #nzcv, cond     // Conditional CMN  [64-bit]
CCMN Wn|WZR, Wm|WZR, #nzcv, cond
CCMN Xn|XZR, #imm5, #nzcv, cond      // immediate form  [64-bit]
CCMN Wn|WZR, #imm5, #nzcv, cond
```

**This is one of the most distinctive instructions in AArch64.** Because the false case writes a constant of your choosing into the flags, you decide what the final branch sees when the earlier condition has already settled the outcome.

**Gotcha**: CCMP reads the current NZCV flags to evaluate `cond`, so a flag-setting instruction (CMP, SUBS, ANDS, TST, or another CCMP) **must** come before it. CCMP without a prior flag-setting instruction reads whatever stale flags happen to be in PSTATE, a bug that's hard to catch because it might work by accident during testing.
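To make the "compare if the condition holds, otherwise load a constant into the flags" behaviour concrete, here is a small Python model. It is a sketch only: flags are packed as a 4-bit integer in N, Z, C, V order, only a handful of condition codes are modeled, and all function names are invented for this illustration.

```python
def cmp_flags(a, b, width=64):
    """Return NZCV (as a 4-bit int) for CMP a, b, i.e. the flags of a - b."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = (a - b) & mask
    sign = width - 1
    n = (result >> sign) & 1
    z = 1 if result == 0 else 0
    c = 1 if a >= b else 0                       # C=1 means "no borrow"
    v = (((a ^ b) & (a ^ result)) >> sign) & 1   # signed overflow
    return (n << 3) | (z << 2) | (c << 1) | v


def cond_holds(cond, nzcv):
    n, z, c, v = (nzcv >> 3) & 1, (nzcv >> 2) & 1, (nzcv >> 1) & 1, nzcv & 1
    table = {
        "EQ": z == 1, "NE": z == 0,
        "HS": c == 1, "LO": c == 0,
        "GE": n == v, "LT": n != v,
        "GT": z == 0 and n == v, "LE": z == 1 or n != v,
    }
    return table[cond]


def ccmp(a, b, nzcv_if_false, cond, flags, width=64):
    """Model CCMP: compare only if cond holds for the incoming flags."""
    if cond_holds(cond, flags):
        return cmp_flags(a, b, width)
    return nzcv_if_false


def both_match(x, y):       # if (x == 5 && y == 10)
    flags = cmp_flags(x, 5)
    flags = ccmp(y, 10, 0b0000, "EQ", flags)
    return cond_holds("EQ", flags)


def either_match(x, y):     # if (x == 5 || y == 10)
    flags = cmp_flags(x, 5)
    flags = ccmp(y, 10, 0b0100, "NE", flags)
    return cond_holds("EQ", flags)


print(both_match(5, 10), both_match(5, 9), both_match(4, 10))
print(either_match(5, 0), either_match(4, 10), either_match(4, 9))
```

The only difference between the AND and OR versions is the pair (condition, fallback flags): AND continues on success and forces failure otherwise, OR continues on failure and forces success otherwise.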
```asm // Equivalent of: if (x == 5 && y == 10) CMP X0, #5 CCMP X1, #10, #0, EQ // Only compare X1 with 10 if X0==5; else set flags=0000 (NE) B.EQ both_match // Walkthrough: // If X0 == 5: EQ is true → CCMP compares X1 vs 10 → B.EQ taken only if X1 == 10 // If X0 != 5: EQ is false → flags set to #0 (all zero, Z=0) → B.EQ not taken (Z must be 1 for EQ) // Equivalent of: if (x == 5 || y == 10) CMP X0, #5 CCMP X1, #10, #0b0100, NE // Only compare X1 with 10 if X0!=5; else set Z=1 (EQ) B.EQ either_match // Walkthrough: // If X0 == 5: NE is false → flags set to #0b0100 (Z=1) → B.EQ taken (first condition was true) // If X0 != 5: NE is true → CCMP compares X1 vs 10 → B.EQ taken only if X1 == 10 ``` The `#nzcv` operand is a 4-bit value specifying the flag state if the condition is false: bit 3 = N, bit 2 = Z, bit 1 = C, bit 0 = V. CCMP chains can implement arbitrary boolean combinations of comparisons without branching. --- ## 11. Multiply & Divide ARM multiply and divide instructions. Unlike x86, ARM division never traps — divide-by-zero returns 0, and there is no remainder instruction (you compute it with MSUB). ### 11.1 MUL / MADD / MSUB `MUL` multiplies two registers and stores the low 64 (or 32) bits of the result. For multiplying two 64-bit values, the full mathematical result could be 128 bits, but `MUL` only keeps the low 64 — the same bits whether the inputs are signed or unsigned. `MADD` (Multiply-Add) computes `Xa + Xn × Xm` in one instruction. `MSUB` (Multiply-Subtract) computes `Xa - Xn × Xm`. `MUL` is actually an alias for `MADD` with the accumulator set to the zero register. 
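The relationships just described (MUL as MADD with a zero accumulator, low bits identical for signed and unsigned inputs, remainder via MSUB, divide-by-zero returning 0) can be checked with a short Python model. This is a sketch with 64-bit wraparound made explicit; the helper names are illustrative, and `urem` is not a real instruction.

```python
MASK64 = (1 << 64) - 1


def madd(xn, xm, xa):
    return (xa + xn * xm) & MASK64     # exact result, truncated to 64 bits


def msub(xn, xm, xa):
    return (xa - xn * xm) & MASK64


def mul(xn, xm):
    return madd(xn, xm, 0)             # MUL is MADD with the zero register


def udiv(xn, xm):
    return 0 if xm == 0 else xn // xm  # divide-by-zero yields 0, no trap


def urem(xn, xm):
    quotient = udiv(xn, xm)
    return msub(quotient, xm, xn)      # dividend - quotient * divisor


minus3 = -3 & MASK64                   # 64-bit pattern of signed -3
print(hex(mul(minus3, 7)))             # same bits as signed -21
print(urem(17, 5))
print(urem(17, 0))                     # quotient is 0, so the dividend comes back
```

Note the last case: with a zero divisor the UDIV+MSUB pair returns the dividend unchanged, so code that needs a different result for `x % 0` has to test for it explicitly.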
``` MUL Xd|XZR, Xn|XZR, Xm|XZR // Xd = Xn * Xm (low 64 bits) [64-bit] MUL Wd|WZR, Wn|WZR, Wm|WZR // Wd = Wn * Wm (low 32 bits) [32-bit] // Both alias for: MADD Rd, Rn, Rm, RZR MADD Xd|XZR, Xn|XZR, Xm|XZR, Xa|XZR // Xd = Xa + (Xn * Xm) [64-bit] MADD Wd|WZR, Wn|WZR, Wm|WZR, Wa|WZR // Wd = Wa + (Wn * Wm) [32-bit] MSUB Xd|XZR, Xn|XZR, Xm|XZR, Xa|XZR // Xd = Xa - (Xn * Xm) [64-bit] MSUB Wd|WZR, Wn|WZR, Wm|WZR, Wa|WZR // Wd = Wa - (Wn * Wm) [32-bit] MNEG Xd|XZR, Xn|XZR, Xm|XZR // Xd = -(Xn * Xm) Alias: MSUB Xd|XZR, Xn|XZR, Xm|XZR, XZR MNEG Wd|WZR, Wn|WZR, Wm|WZR // Wd = -(Wn * Wm) Alias: MSUB Wd|WZR, Wn|WZR, Wm|WZR, WZR ``` These produce the **low** 64 bits of the 128-bit product. They work for both signed and unsigned (the low bits are the same for both). **None of these set flags.** There is no `MULS` in AArch64. **Overflow behavior**: All multiply instructions silently wrap on overflow — there is no trap, no flag, no indication. `MADD` computes the mathematically exact `Xa + (Xn × Xm)` and then truncates to the low 64 (or 32) bits. If you need to detect multiply overflow, use `UMULH`/`SMULH` (§11.2) and check whether the high half is zero (unsigned) or all-sign-bits (signed). ### 11.2 Wide Multiply (64×64→128) When you multiply two 64-bit numbers, the result can be up to 128 bits. `MUL` gives you the low 64 bits. `SMULH` (Signed Multiply High) and `UMULH` (Unsigned Multiply High) give you the **upper** 64 bits. Together, `MUL` + `UMULH` (or `SMULH`) give you the full 128-bit product. ``` SMULH Xd|XZR, Xn|XZR, Xm|XZR // Xd = high 64 bits of signed(Xn) * signed(Xm) UMULH Xd|XZR, Xn|XZR, Xm|XZR // Xd = high 64 bits of unsigned(Xn) * unsigned(Xm) ``` To get a full 128-bit product: ```asm // Unsigned 128-bit: X1:X0 = X2 * X3 MUL X0, X2, X3 // low 64 bits UMULH X1, X2, X3 // high 64 bits ``` ### 11.3 Long Multiply (32×32→64) These multiply two 32-bit values and produce a full 64-bit result, with no overflow possible. 
`SMULL` treats the inputs as signed; `UMULL` treats them as unsigned. The result is always in a 64-bit X register. Useful when you know the inputs are 32-bit but need the full product. **Alias vs real**: `SMULL` and `UMULL` are assembler **aliases** for `SMADDL`/`UMADDL` with the accumulator wired to `XZR` (so `SMADDL Xd|XZR, Wn|WZR, Wm|WZR, XZR` is what the CPU actually sees, which is `0 + Wn×Wm`). The underlying `SMADDL`/`UMADDL`/`SMSUBL`/`UMSUBL` are the real instructions — all four accept a full 4-operand form `Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR`. Correspondingly, `SMNEGL`/`UMNEGL` are aliases for `SMSUBL`/`UMSUBL` with `Xa=XZR` (computing `0 - Wn×Wm`). ``` SMULL Xd|XZR, Wn|WZR, Wm|WZR // Xd = sign_extend(Wn) * sign_extend(Wm) // Alias for: SMADDL Xd|XZR, Wn|WZR, Wm|WZR, XZR UMULL Xd|XZR, Wn|WZR, Wm|WZR // Xd = zero_extend(Wn) * zero_extend(Wm) // Alias for: UMADDL Xd|XZR, Wn|WZR, Wm|WZR, XZR SMADDL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa + sign_extend(Wn) * sign_extend(Wm) UMADDL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa + zero_extend(Wn) * zero_extend(Wm) SMSUBL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa - sign_extend(Wn) * sign_extend(Wm) UMSUBL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa - zero_extend(Wn) * zero_extend(Wm) SMNEGL Xd|XZR, Wn|WZR, Wm|WZR // Alias for: SMSUBL Xd|XZR, Wn|WZR, Wm|WZR, XZR UMNEGL Xd|XZR, Wn|WZR, Wm|WZR // Alias for: UMSUBL Xd|XZR, Wn|WZR, Wm|WZR, XZR ``` ### 11.4 Division `UDIV` divides unsigned integers. `SDIV` divides signed integers. Both truncate toward zero (drop the fractional part). Unlike x86, ARM division **never raises an exception** — dividing by zero simply returns 0. There is no remainder instruction; you compute it as `remainder = dividend - (quotient * divisor)` using `MSUB`. **The other x86→ARM division gotcha — SDIV INT_MIN ÷ (−1)**: On x86, `IDIV` of the most-negative signed value by −1 overflows (the mathematical result `+2^63` doesn't fit in a signed 64-bit register) and raises `#DE` → SIGFPE. 
On AArch64, per ARM ARM C3.4.8, `SDIV` of INT_MIN by −1 **silently returns INT_MIN** with no exception, no flag, no indication of any kind. This is a real porting hazard: C code like `if (b == 0) error(); else return a / b;` is guarded against divide-by-zero but still produces a silently wrong result for `INT_MIN / -1` on ARM if you were relying on x86's signal (the C standard makes `INT_MIN / -1` undefined behavior, so neither outcome is guaranteed by the language). To be correct cross-platform, guard both cases explicitly: `if (b == 0 || (a == INT_MIN && b == -1)) error();`.

**Why no remainder instruction?** Likely rationale (inference, not documented ARM design intent): a divider produces quotient and remainder together, but returning both from one instruction would need two destination registers, and the A64 integer data-processing encodings have a single `Rd` field. Instead, compilers emit `UDIV` + `MSUB`. Whether a given core handles that pair specially is implementation-specific.

**Why no flags from multiply/divide?** The fact: MUL/MADD/MSUB/UMULH/SMULH/UDIV/SDIV don't set flags in AArch64. Likely rationale (inference, not documented ARM design intent): multiply overflow is ambiguous (do you mean the low-half overflowed, or the full product didn't fit?), and divide-by-zero is handled by returning 0 in hardware rather than raising a flag. If you need overflow detection, check explicitly with `UMULH`/`SMULH` and compare against zero or against the sign extension of the low half.

```
UDIV Xd|XZR, Xn|XZR, Xm|XZR    // Xd = Xn / Xm (unsigned, truncate toward zero)   [64-bit]
UDIV Wd|WZR, Wn|WZR, Wm|WZR    // Wd = Wn / Wm                                    [32-bit]
SDIV Xd|XZR, Xn|XZR, Xm|XZR    // Xd = Xn / Xm (signed, truncate toward zero)     [64-bit]
SDIV Wd|WZR, Wn|WZR, Wm|WZR    // Wd = Wn / Wm                                    [32-bit]
```

**No flags are set. No exceptions on divide-by-zero.** Division by zero returns **0** in AArch64.

**32-bit overflow**: `SDIV Wd` of `INT32_MIN / -1` returns `INT32_MIN` (0x80000000). Same wrapping behavior as 64-bit.

To get the remainder (modulo), there is no `MOD` instruction.
Use: ```asm // X0 = X1 % X2 (unsigned) UDIV X3, X1, X2 // X3 = X1 / X2 MSUB X0, X3, X2, X1 // X0 = X1 - (X3 * X2) = remainder ``` **What MSUB REALLY does here — traced:** ```asm // Compute 17 % 5: // X1 = 17, X2 = 5 UDIV X3, X1, X2 // X3 = 17 / 5 = 3 (truncated) MSUB X0, X3, X2, X1 // X0 = X1 - (X3 * X2) = 17 - (3 * 5) = 17 - 15 = 2 // X0 = 2 ✓ (17 mod 5 = 2) ``` `MSUB Xd, Xn, Xm, Xa` computes `Xa - (Xn × Xm)`. The accumulator `Xa` is the dividend, and `Xn × Xm` is the quotient times divisor — subtracted from the dividend gives the remainder. **Signed division overflow**: `SDIV` of `INT64_MIN / -1` returns `INT64_MIN` (not an exception). The mathematically correct answer (+2^63) doesn't fit in a signed 64-bit integer, so it wraps. --- ## 12. Sign Extension & Zero Extension When you have a small value (e.g., an 8-bit byte) and need to put it in a larger register (e.g., 64-bit), you need to "extend" it. **Zero extension** fills the upper bits with zeros — used for unsigned values. **Sign extension** fills the upper bits with copies of the value's sign bit (the MSB) — used for signed values, preserving the negative/positive meaning. **Why extension is needed**: Registers are 64 bits wide, but data types in real programs are often 8, 16, or 32 bits. When you load a byte from memory into a 64-bit register, the hardware must decide what to put in the other 56 bits. For unsigned values, zeros make the register hold the correct unsigned interpretation (e.g., byte 0xFF = 255). For signed values, sign-extending preserves the mathematical value (e.g., signed byte 0xFF = -1, which sign-extended to 64 bits is 0xFFFFFFFFFFFFFFFF = -1). Using the wrong extension is a common source of bugs — this is why ARM provides both `LDR` (zero-extending) and `LDRSW`/`LDRSH`/`LDRSB` (sign-extending) load instructions. 
For example, the byte `0x80` (which is -128 as a signed byte): zero-extending gives `0x0000000000000080` (128 unsigned), but sign-extending gives `0xFFFFFFFFFFFFFF80` (-128 signed).

### 12.1 Dedicated Extend Aliases

**64-bit destination (extend to 64 bits):**

```
SXTB Xd|XZR, Wn|WZR    // Sign-extend byte → 64 bits.     Alias: SBFM Xd|XZR, Xn|XZR, #0, #7
SXTH Xd|XZR, Wn|WZR    // Sign-extend halfword → 64.      Alias: SBFM Xd|XZR, Xn|XZR, #0, #15
SXTW Xd|XZR, Wn|WZR    // Sign-extend word → 64.          Alias: SBFM Xd|XZR, Xn|XZR, #0, #31
UXTW Xd|XZR, Wn|WZR    // Zero-extend word → 64 bits.     Assembled as: UBFM Xd|XZR, Xn|XZR, #0, #31
                       // Rarely needed: any W-register write already zeroes bits [63:32]
```

**32-bit destination (extend to 32 bits):**

```
SXTB Wd|WZR, Wn|WZR    // Sign-extend byte → 32 bits.     Alias: SBFM Wd|WZR, Wn|WZR, #0, #7
SXTH Wd|WZR, Wn|WZR    // Sign-extend halfword → 32.      Alias: SBFM Wd|WZR, Wn|WZR, #0, #15
UXTB Wd|WZR, Wn|WZR    // Zero-extend byte → 32 bits.     Alias: UBFM Wd|WZR, Wn|WZR, #0, #7
                       // (same result as AND Wd, Wn, #0xFF)
UXTH Wd|WZR, Wn|WZR    // Zero-extend halfword → 32.      Alias: UBFM Wd|WZR, Wn|WZR, #0, #15
```

**RE note: SXTB Wd vs SXTB Xd.** Both exist and produce different results when the byte's sign bit (bit 7) is set:

```asm
// If W1 low byte = 0x80:
SXTB W0, W1    // W0 = 0xFFFFFF80, X0 = 0x00000000FFFFFF80  (sign to 32, zero to 64)
SXTB X0, W1    // X0 = 0xFFFFFFFFFFFFFF80                   (sign-extended all the way to 64)
```

The Wd form sign-extends within 32 bits, then the W-register write zeroes the upper 32, so you get a positive 64-bit value with a negative 32-bit interpretation.

**Where are UXTW and UXTX?**

- `UXTW` as a standalone mnemonic **is accepted by assemblers**: GNU as and LLVM take `UXTW Xd, Wn` and emit `UBFM Xd, Xn, #0, #31` (extract bits [31:0], zero-extend to 64 bits). This is toolchain behavior; the ARM ARM lists only `UXTB` and `UXTH` as UBFM extend aliases. It is **rarely needed** because writing to a W register **automatically** zero-extends to 64 bits. So `MOV W0, W1` already achieves the same result as `UXTW X0, W1`.
`UXTW` also appears as a modifier in extended register forms (see §8.2 — the `ADD Xd, Xn, Wm, UXTW` form, where the extension happens as part of the address/arithmetic computation). - `UXTX` is effectively a no-op (64-bit to 64-bit zero extension). **Where is SXTW Wd, Wn?** — It doesn't exist. `SXTW` is inherently a 32→64 operation; extending a 32-bit value to 32 bits is a no-op. ### 12.2 Implicit Extension Remember the fundamental rule: any instruction writing to `Wd` automatically zero-extends the result into `Xd`. This means: ```asm ADD W0, W1, W2 // Result in W0 → upper 32 bits of X0 are zeroed LDR W0, [X1] // Loads 32 bits → upper 32 bits of X0 are zeroed ``` For **sign** extension, you must be explicit: ```asm LDRSW X0, [X1] // Load 32-bit signed, sign-extend to 64 bits LDRSH X0, [X1] // Load 16-bit signed, sign-extend to 64 bits LDRSB X0, [X1] // Load 8-bit signed, sign-extend to 64 bits ``` --- ## 13. Bitfield Operations (BFM family) The BFM (Bitfield Move) family is the Swiss Army knife of ARM. Many instructions you use daily — shifts, extends, bitfield extracts — are actually aliases for these three base instructions. Understanding BFM helps you read disassembly where the disassembler shows the raw instruction instead of the friendly alias. **Why ARM uses this design**: Instead of having separate opcodes for LSL, LSR, ASR, SXTB, UXTB, UBFX, SBFIZ, and a dozen more, ARM encodes them all as variants of three base instructions (UBFM, SBFM, BFM) with different immediate parameters. This saves precious opcode space (remember, everything must fit in 32 bits) and means the hardware only needs one circuit for all bitfield operations. The downside: the relationship between the friendly alias and the actual `immr`/`imms` encoding is confusing. That's what this section explains. A "bitfield" is a contiguous range of bits within a register. These instructions extract, insert, or move bitfields with optional sign or zero extension. 
The dedicated sign/zero-extend aliases (`SXTB`/`SXTH`/`SXTW`, `UXTB`/`UXTH`/`UXTW`) are `SBFM`/`UBFM` special cases — they get their own treatment in **§12 (Sign/Zero Extension)**, and the *operand-modifier* form of the same extends is in **§8.2**. ### 13.1 The Simple Mental Model **Forget immr/imms for a moment.** At a high level, the BFM family does just two things: 1. **Extract**: Pull a range of bits out of a register, put them at bit 0, and fill the rest (with zeros, sign bits, or leave unchanged). 2. **Insert/Shift**: Take some low bits from a register, shift them left to a new position, and fill the rest. That's it. Every BFM alias — LSL, LSR, ASR, SXTB, UBFX, BFI, etc. — is one of these two operations. The three flavors differ only in what they do with the bits OUTSIDE the field: | Instruction | Bits outside the field | |---|---| | **UBFM** | Filled with **zeros** | | **SBFM** | Filled with copies of the field's **sign bit** (MSB of the extracted field) | | **BFM** | **Left unchanged** from Xd's previous value (insert into existing register) | **The friendly aliases you actually write — with their exact ISA-level encoding formulas**: All of these use the XZR-form for both Rd and Rn (register 31 = XZR, never SP; bitfield ops don't touch SP). | Alias form (what you write) | What it does | Underlying encoding (what the CPU actually sees) | |---|---|---| | UBFX Xd|XZR, Xn|XZR, #lsb, #width | Extract `width` bits starting at bit `lsb`; zero the rest. `lsb+width ≤ 64`; `width ≥ 1`. | UBFM Xd|XZR, Xn|XZR, #lsb, #(lsb + width − 1) | | SBFX Xd|XZR, Xn|XZR, #lsb, #width | Extract, sign-extend. Same constraints. | SBFM Xd|XZR, Xn|XZR, #lsb, #(lsb + width − 1) | | BFXIL Xd|XZR, Xn|XZR, #lsb, #width | Extract, insert at bit 0 of Xd **leaving other Xd bits unchanged**. | BFM Xd|XZR, Xn|XZR, #lsb, #(lsb + width − 1) | | UBFIZ Xd|XZR, Xn|XZR, #lsb, #width | Take low `width` bits of Xn, shift left by `lsb`, zero the rest. `lsb+width ≤ 64`. 
| UBFM Xd|XZR, Xn|XZR, #(−lsb MOD 64), #(width − 1) | | SBFIZ Xd|XZR, Xn|XZR, #lsb, #width | Same but sign-extend from bit (lsb+width−1). | SBFM Xd|XZR, Xn|XZR, #(−lsb MOD 64), #(width − 1) | | BFI Xd|XZR, Xn|XZR, #lsb, #width | Take low `width` bits of Xn, shift left by `lsb`, **insert** into Xd. | BFM Xd|XZR, Xn|XZR, #(−lsb MOD 64), #(width − 1) | | BFC Xd|XZR, #lsb, #width (ARMv8.2) | Clear `width` bits of Xd starting at bit `lsb`. Alias for BFI with `Xn = XZR`. | BFM Xd|XZR, XZR, #(−lsb MOD 64), #(width − 1) | | LSL Xd|XZR, Xn|XZR, #s | Logical shift left by `s` (0..63). | UBFM Xd|XZR, Xn|XZR, #((−s) MOD 64), #(63 − s) | | LSR Xd|XZR, Xn|XZR, #s | Logical shift right by `s` (0..63). | UBFM Xd|XZR, Xn|XZR, #s, #63 | | ASR Xd|XZR, Xn|XZR, #s | Arithmetic shift right by `s` (0..63). | SBFM Xd|XZR, Xn|XZR, #s, #63 | | SXTB Xd|XZR, Wn|WZR | Sign-extend byte to 64. | SBFM Xd|XZR, Xn|XZR, #0, #7 | | SXTH Xd|XZR, Wn|WZR | Sign-extend halfword to 64. | SBFM Xd|XZR, Xn|XZR, #0, #15 | | SXTW Xd|XZR, Wn|WZR | Sign-extend word to 64. | SBFM Xd|XZR, Xn|XZR, #0, #31 | | UXTB Wd|WZR, Wn|WZR | Zero-extend byte to 32. | UBFM Wd|WZR, Wn|WZR, #0, #7 | | UXTH Wd|WZR, Wn|WZR | Zero-extend halfword to 32. | UBFM Wd|WZR, Wn|WZR, #0, #15 | For **32-bit** (Wd/Wn) variants of UBFX/SBFX/BFI/BFXIL/UBFIZ/SBFIZ/LSL/LSR/ASR, replace 64 with 32 and 63 with 31 in the formulas; all immr/imms values must be in 0..31 (sf=0 with immr>31 or imms>31 is UNDEFINED per ARM ARM). **Why this matters**: the alias constrains the legal `immr`/`imms` combinations more tightly than the raw instruction. The raw `UBFM Xd|XZR, Xn|XZR, #immr, #imms` accepts any immr, imms ∈ 0..63 — that's `64 × 64 = 4096` encodings per register pair. Only some of those encodings are expressible as `UBFX` (requires imms ≥ immr) and a different subset as `UBFIZ` (requires imms < immr). 
If you try `UBFX Xd|XZR, Xn|XZR, #lsb, #0`, the assembler will reject it because width must be ≥ 1, even though the raw UBFM with the immr/imms values that the formula would produce assembles fine (as a different operation). The same applies to a negative `lsb` or an `lsb+width` larger than the register size.

**You almost never write UBFM/SBFM/BFM directly.** You write the aliases. But disassemblers sometimes show the raw form, so you need to understand how `immr` and `imms` map to the aliases. That's the next subsection.

### 13.2 The immr/imms Encoding (How the Hardware Sees It)

The hardware doesn't know about "UBFX" or "LSL"; it only sees `UBFM Xd, Xn, #immr, #imms`. The field names are commonly read as `immr` = **immediate rotate** (how far to rotate the source right) and `imms` = **immediate mask size** (roughly, how many bits to keep). The ARM ARM describes `immr` as the right rotate amount and `imms` as the leftmost bit number to be moved from the source.

```
SBFM Xd|XZR, Xn|XZR, #immr, #imms    // Signed Bitfield Move      [64-bit, immr/imms: 0–63]
UBFM Xd|XZR, Xn|XZR, #immr, #imms    // Unsigned Bitfield Move    [64-bit, immr/imms: 0–63]
BFM  Xd|XZR, Xn|XZR, #immr, #imms    // Bitfield Move (insert)    [64-bit, immr/imms: 0–63]
SBFM Wd|WZR, Wn|WZR, #immr, #imms    //                           [32-bit, immr/imms: 0–31]
UBFM Wd|WZR, Wn|WZR, #immr, #imms    //                           [32-bit, immr/imms: 0–31]
BFM  Wd|WZR, Wn|WZR, #immr, #imms    //                           [32-bit, immr/imms: 0–31]
```

The behavior depends on the relationship between `immr` and `imms`:

**When imms >= immr** (the **extract** case):

- Extract bits [imms:immr] from Xn, starting at bit position `immr`, up through bit `imms`. Width = `imms - immr + 1`.
- Place the field at bit 0 of Xd.
- UBFM: zero the rest. SBFM: sign-extend from bit [imms]. BFM: leave Xd's other bits unchanged.

**When imms < immr** (the **shift/insert** case):

- Take the low `imms + 1` bits from Xn.
- Place them at bit position `64 - immr` (effectively shift left by `64 - immr`).
- UBFM: zero the rest. SBFM: sign-extend from the top bit of the moved field. BFM: leave Xd's other bits unchanged.

**Why two cases?** Conceptually, the hardware does one thing: rotate Xn right by `immr`, then keep a field of `imms+1` bits selected by masks derived from `immr` and `imms`.
Depending on how the rotation and mask interact, this looks like either an "extract from the middle" or a "shift up from the bottom." The two cases are just the two ways the mask can land relative to the rotation. **The raw instructions traced with hex values:** ```asm // ═══════════════════════════════════════════════════════════ // UBFM — Unsigned Bitfield Move (zero-fills non-field bits) // ═══════════════════════════════════════════════════════════ // Case 1: imms >= immr (EXTRACT case) // UBFM X0, X1, #8, #15 (immr=8, imms=15) // Extract bits [15:8] from X1 (width = 15-8+1 = 8 bits), place at bit 0, zero rest // // If X1 = 0x00000000_0000ABCD: // Bits [15:8] of 0xABCD: 0xAB (binary: 10101011) // Place at bit 0: 0x000000AB // Zero the rest: 0x00000000_000000AB // X0 = 0x00000000_000000AB // This is what the assembler shows as: UBFX X0, X1, #8, #8 // Case 2: imms < immr (INSERT-IN-ZERO / SHIFT case) // UBFM X0, X1, #56, #7 (immr=56, imms=7) // Take bits [7:0] from X1 (width = 7+1 = 8 bits), place at bit 64-56 = 8 // // If X1 = 0x00000000_000000FF: // Bits [7:0]: 0xFF // Place at bit 8: 0x0000FF00 // Zero the rest: 0x00000000_0000FF00 // X0 = 0x00000000_0000FF00 // This is what the assembler shows as: UBFIZ X0, X1, #8, #8 // (or equivalently: LSL X0, X1, #8 if width covered the whole register) ``` ```asm // ═══════════════════════════════════════════════════════════ // SBFM — Signed Bitfield Move (sign-extends from field's top bit) // ═══════════════════════════════════════════════════════════ // Case 1: imms >= immr (EXTRACT + SIGN-EXTEND case) // SBFM X0, X1, #8, #15 (immr=8, imms=15) // Extract bits [15:8], place at bit 0, sign-extend from bit [imms] = bit 15 // // If X1 = 0x00000000_0000ABCD: // Bits [15:8]: 0xAB (bit 7 of the field = 1 → "negative") // Sign-extend: 0xFFFFFFFF_FFFFFFAB // X0 = 0xFFFFFFFF_FFFFFFAB // This is: SBFX X0, X1, #8, #8 // // If X1 = 0x00000000_00001234: // Bits [15:8]: 0x12 (bit 7 of the field = 0 → "positive") // Sign-extend (no 
change): 0x00000000_00000012 // This is: SBFX X0, X1, #8, #8 // Special case: SBFM X0, X1, #0, #7 = SXTB X0, W1 // Extract bits [7:0], sign-extend from bit 7 // // If X1 = 0x00000000_000000C0: // Bits [7:0]: 0xC0 (bit 7 = 1 → negative byte) // Sign-extend to 64: 0xFFFFFFFF_FFFFFFC0 = -64 (signed) // Special case: SBFM X0, X1, #0, #31 = SXTW X0, W1 // Extract bits [31:0], sign-extend from bit 31 // // If X1 = 0x00000000_80000000: // Bits [31:0]: 0x80000000 (bit 31 = 1 → negative word) // Sign-extend: 0xFFFFFFFF_80000000 = INT32_MIN as 64-bit ``` ```asm // ═══════════════════════════════════════════════════════════ // BFM — Bitfield Move (INSERT: modifies only the target field in Xd) // ═══════════════════════════════════════════════════════════ // Unlike UBFM/SBFM which write ALL bits of Xd, BFM only modifies the // destination bitfield and leaves all other bits of Xd UNCHANGED. // This is why BFM is used for "insert" operations. // Case 1: imms >= immr (EXTRACT-AND-INSERT-LOW case) // BFM X0, X1, #8, #15 (immr=8, imms=15) // Extract bits [15:8] from X1, insert at bits [7:0] of X0 (other bits unchanged) // // If X0 = 0xDEADBEEF_DEADBEEF and X1 = 0x00000000_0000ABCD: // Bits [15:8] of X1: 0xAB // Replace bits [7:0] of X0 with 0xAB: // X0 = 0xDEADBEEF_DEADBEAB (only low 8 bits changed!) // This is: BFXIL X0, X1, #8, #8 // Case 2: imms < immr (INSERT-AT-POSITION case) // BFM X0, X1, #56, #7 (immr=56, imms=7) // Take bits [7:0] from X1, insert at bit 8 of X0 (other bits unchanged) // // If X0 = 0xDEADBEEF_DEADBEEF and X1 = 0x00000000_000000FF: // Bits [7:0] of X1: 0xFF // Insert at bits [15:8] of X0: // X0 = 0xDEADBEEF_DEADFFEF (only bits [15:8] changed!) 
// This is: BFI X0, X1, #8, #8 ``` **Summary: how the three differ on the SAME operation:** ``` // All three extract bits [15:8] from X1 (= 0xAB from 0xABCD) and place at bit 0: // X0 starts as 0xDEADBEEF_DEADBEEF for BFM, doesn't matter for UBFM/SBFM UBFM X0, X1, #8, #15 // X0 = 0x00000000_000000AB (zero-filled) SBFM X0, X1, #8, #15 // X0 = 0xFFFFFFFF_FFFFFFAB (sign-extended, because 0xAB has bit 7 set) BFM X0, X1, #8, #15 // X0 = 0xDEADBEEF_DEADBEAB (only bits [7:0] replaced, rest kept) ``` ### 13.3 Aliases of UBFM **Don't memorize these tables** — use them as a reference when you see raw UBFM/SBFM/BFM in a disassembler and need to figure out which friendly instruction it corresponds to. | Alias | Actual encoding (64-bit) | Actual encoding (32-bit) | |---|---|---| | `LSL Rd, Rn, #s` | `UBFM Xd, Xn, #(-s MOD 64), #(63-s)` | `UBFM Wd, Wn, #(-s MOD 32), #(31-s)` | | `LSR Rd, Rn, #s` | `UBFM Xd, Xn, #s, #63` | `UBFM Wd, Wn, #s, #31` | | `UBFX Rd, Rn, #lsb, #w` | `UBFM Xd, Xn, #lsb, #(lsb+w-1)` | `UBFM Wd, Wn, #lsb, #(lsb+w-1)` | | `UBFIZ Rd, Rn, #lsb, #w` | `UBFM Xd, Xn, #(-lsb MOD 64), #(w-1)` | `UBFM Wd, Wn, #(-lsb MOD 32), #(w-1)` | | `UXTB Wd, Wn` | — | `UBFM Wd, Wn, #0, #7` | | `UXTH Wd, Wn` | — | `UBFM Wd, Wn, #0, #15` | ### 13.4 Aliases of SBFM | Alias | Actual encoding (64-bit) | Actual encoding (32-bit) | |---|---|---| | `ASR Rd, Rn, #s` | `SBFM Xd, Xn, #s, #63` | `SBFM Wd, Wn, #s, #31` | | `SBFX Rd, Rn, #lsb, #w` | `SBFM Xd, Xn, #lsb, #(lsb+w-1)` | `SBFM Wd, Wn, #lsb, #(lsb+w-1)` | | `SBFIZ Rd, Rn, #lsb, #w` | `SBFM Xd, Xn, #(-lsb MOD 64), #(w-1)` | `SBFM Wd, Wn, #(-lsb MOD 32), #(w-1)` | | `SXTB Wd, Wn` | — | `SBFM Wd, Wn, #0, #7` | | `SXTB Xd, Wn` | `SBFM Xd, Xn, #0, #7` | — | | `SXTH Wd, Wn` | — | `SBFM Wd, Wn, #0, #15` | | `SXTH Xd, Wn` | `SBFM Xd, Xn, #0, #15` | — | | `SXTW Xd, Wn` | `SBFM Xd, Xn, #0, #31` | — (no 32-bit form; SXTW is inherently 32→64) | **RE note**: A disassembler may show `SBFM W0, W1, #0, #7` — that's just `SXTB W0, W1` (sign-extend 
byte to 32 bits). But `SBFM X0, X1, #0, #7` is `SXTB X0, W1` (sign-extend byte to 64 bits). The register width tells you the target size of the extension. ### 13.5 Aliases of BFM | Alias | Actual encoding (64-bit) | Actual encoding (32-bit) | |---|---|---| | `BFI Rd, Rn, #lsb, #w` | `BFM Xd, Xn, #(-lsb MOD 64), #(w-1)` | `BFM Wd, Wn, #(-lsb MOD 32), #(w-1)` | | `BFC Rd, #lsb, #w` (ARMv8.2) | `BFM Xd, XZR, #(-lsb MOD 64), #(w-1)` | `BFM Wd, WZR, #(-lsb MOD 32), #(w-1)` | | `BFXIL Rd, Rn, #lsb, #w` | `BFM Xd, Xn, #lsb, #(lsb+w-1)` | `BFM Wd, Wn, #lsb, #(lsb+w-1)` | ### 13.6 Practical BFM Examples **What each instruction REALLY does — traced with concrete values:** ```asm // ═══ UBFX — Unsigned Bitfield Extract ═══ // "Pull out a range of bits, zero-extend the rest" // UBFX X0, X1, #4, #8 → extract 8 bits starting at bit 4 // // If X1 = 0x00000000_0000ABCD: // Binary of low 16 bits: 1010_1011_1100_1101 // Bits [11:4]: 1011_1100 // Zero-extend to 64 bits: 0x00000000_000000BC // X0 = 0x00000000_000000BC UBFX X0, X1, #4, #8 // ═══ SBFX — Signed Bitfield Extract ═══ // "Pull out a range of bits, sign-extend from the top bit of the field" // // If X1 = 0x00000000_0000ABCD (same value): // Bits [11:4]: 1011_1100 (bit 11 = 1, so the field is "negative") // Sign-extend to 64 bits: 0xFFFFFFFF_FFFFFFBC = -68 (signed) // // If X1 = 0x00000000_00001234: // Bits [11:4]: 0010_0011 (bit 11 = 0, so "positive") // X0 = 0x00000000_00000023 = 35 SBFX X0, X1, #4, #8 ``` ```asm // ═══ BFI — Bitfield Insert ═══ // "Take low bits from source, plug them into a specific position in destination" // BFI X0, X1, #8, #8 → take low 8 bits of X1, insert at bits [15:8] of X0 // // If X0 = 0x00000000_12345678 and X1 = 0x00000000_000000FF: // Low 8 bits of X1: 0xFF // Insert at bits [15:8] of X0: replace the "56" in 0x12345678 // X0 = 0x00000000_1234FF78 BFI X0, X1, #8, #8 // ═══ BFXIL — Bitfield Extract and Insert Low ═══ // "Extract a range from source, insert at bit 0 of destination, keep upper 
bits" // // If X0 = 0xAAAAAAAA_AAAAAAAA and X1 = 0x00000000_00AB0000: // Bits [23:16] of X1: 0xAB // Insert at bits [7:0] of X0: 0xAAAAAAAA_AAAAAAAB (only low 8 bits changed) BFXIL X0, X1, #16, #8 ``` ```asm // ═══ UBFIZ — Unsigned Bitfield Insert in Zero ═══ // "Take low bits from source, shift them left, zero everything else" // // If X1 = 0x00000000_000000AB: // Low 8 bits: 0xAB, shift left by 16 → X0 = 0x00000000_00AB0000 UBFIZ X0, X1, #16, #8 // ═══ SBFIZ — Signed Bitfield Insert in Zero ═══ // "Take low bits, shift left, sign-extend from the top bit of the field" // // If X1 = 0x00000000_000000FF (low 8 bits: 0xFF, bit 7=1 → "negative"): // Shift left by 16: 0x00000000_00FF0000 // Sign-extend from bit 23: X0 = 0xFFFFFFFF_FFFF0000 // // If X1 = 0x00000000_0000007F (bit 7=0 → "positive"): // Shift left by 16: X0 = 0x00000000_007F0000 (no sign-extension needed) SBFIZ X0, X1, #16, #8 // ═══ Clearing a bitfield ═══ // BFI X0, XZR, #8, #8 → insert 8 zero bits at bits [15:8] // BFC X0, #8, #8 → exactly the same; BFC is the dedicated ARMv8.2 alias // (BFC Xd, #lsb, #width ≡ BFI Xd, XZR, #lsb, #width) // If X0 = 0x00000000_FFFFFFFF → X0 = 0x00000000_FFFF00FF BFI X0, XZR, #8, #8 BFC X0, #8, #8 // Identical encoding; preferred disassembly on ARMv8.2+ ``` **How to read BFM in disassembly**: If you see raw `UBFM X0, X1, #4, #11`, check: is imms >= immr? Yes (11 >= 4), so it's an extract: bits [11:4], width = 11-4+1 = 8. This is `UBFX X0, X1, #4, #8`. If you see `UBFM X0, X1, #60, #3`, check: imms < immr? Yes (3 < 60), so it's an insert-in-zero: low 4 bits shifted left by 64-60 = 4. This is `UBFIZ X0, X1, #4, #4`. 
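The decoding rule above is mechanical enough to sketch in C. What follows is a minimal reference model, not the ARM ARM pseudocode: `bf_alias`, `ubfm64_to_alias`, and `ubfm64` are hypothetical names invented for this illustration, and the classifier only distinguishes the generic `UBFX`/`UBFIZ` shapes (a real disassembler would prefer more specific aliases such as `LSR`, `LSL`, `UXTB`, or `UXTH` where they apply).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which generic alias shape a raw UBFM maps to. */
typedef struct {
    int      is_extract; /* 1 = UBFX shape (imms >= immr), 0 = UBFIZ shape */
    unsigned lsb;        /* #lsb operand of the alias   */
    unsigned width;      /* #width operand of the alias */
} bf_alias;

/* Map raw 64-bit UBFM #immr, #imms to the operands of UBFX or UBFIZ. */
bf_alias ubfm64_to_alias(unsigned immr, unsigned imms)
{
    assert(immr < 64 && imms < 64);
    bf_alias a;
    if (imms >= immr) {            /* extract case */
        a.is_extract = 1;
        a.lsb   = immr;
        a.width = imms - immr + 1;
    } else {                       /* insert-in-zero case */
        a.is_extract = 0;
        a.lsb   = 64 - immr;
        a.width = imms + 1;
    }
    return a;
}

/* Reference model of the value a 64-bit UBFM writes to Xd. */
uint64_t ubfm64(uint64_t xn, unsigned immr, unsigned imms)
{
    bf_alias a = ubfm64_to_alias(immr, imms);
    /* width is 1..64; avoid the undefined 64-bit shift when width == 64 */
    uint64_t mask = (a.width == 64) ? ~UINT64_C(0)
                                    : ((UINT64_C(1) << a.width) - 1);
    if (a.is_extract)
        return (xn >> a.lsb) & mask;   /* field moved down to bit 0   */
    return (xn & mask) << a.lsb;       /* low bits moved up to bit lsb */
}
```

With this model, `ubfm64_to_alias(4, 11)` yields the `UBFX #4, #8` shape and `ubfm64_to_alias(60, 3)` yields the `UBFIZ #4, #4` shape, matching the two cases worked through above.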
```asm // === 32-bit equivalents === // Same operations but with 32-bit registers — upper 32 of Xd always zeroed // If W1 = 0x0000ABCD: UBFX W0, W1, #4, #8 // W0 = 0x000000BC, X0 = 0x00000000_000000BC SBFX W0, W1, #4, #8 // W0 = 0xFFFFFFBC (sign-extend to 32), X0 = 0x00000000_FFFFFFBC // RE trap: SBFX W0 vs SBFX X0 // SBFX W0, W1, #4, #8 → sign-extends to bit 31, then upper 32 of X0 zeroed // SBFX X0, X1, #4, #8 → sign-extends all the way to bit 63 // These give DIFFERENT results when the sign bit (bit 11) is set! // If X1 = 0x00000000_0000ABCD: // SBFX W0 → W0 = 0xFFFFFFBC, X0 = 0x00000000_FFFFFFBC (positive 64-bit!) // SBFX X0 → X0 = 0xFFFFFFFF_FFFFFFBC (negative 64-bit!) ``` ### 13.7 EXTR — Extract from Pair ``` EXTR Xd|XZR, Xn|XZR, Xm|XZR, #0-63 // 64-bit: treat Xn:Xm as a 128-bit value (Xn is the high half, // Xm is the low half), then extract 64 bits starting at bit #lsb EXTR Wd|WZR, Wn|WZR, Wm|WZR, #0-31 // 32-bit: treat Wn:Wm as a 64-bit value, extract 32 bits at #lsb ``` `#lsb` is the bit position in the low register (Xm/Wm) where extraction starts (0–63 for 64-bit, 0–31 for 32-bit). The result is bits [lsb+63 : lsb] of the 128-bit concatenation (wrapping from Xm into Xn). When `Xn == Xm` (or `Wn == Wm`), this is `ROR Rd, Rn, #lsb` (rotate right). **Traced example:** ```asm // If X1 = 0x00000000_000ABCDE and X2 = 0x12345678_9ABC0000: EXTR X0, X1, X2, #20 // Concatenation X1:X2 = 0x00000000000ABCDE:123456789ABC0000 (128 bits) // Extract 64 bits starting at bit 20 of the low register (X2): // Bottom 44 bits: X2[63:20] = 0x12345678_9ABC0000 >> 20 = 0x123456789AB // Top 20 bits: X1[19:0] = 0xABCDE // Combined: 0xABCDE_123456789AB → X0 = 0xABCDE123456789AB // In practice: EXTR shifts X2 right by 20, and fills the vacated top 20 bits // with the bottom 20 bits of X1. 
// Rotate right (Xn == Xm): // If X0 = 0x00000000_0000000F: EXTR X0, X0, X0, #4 // ROR X0, X0, #4 // Bits 3:0 (= 0xF) rotate to bits 63:60 // X0 = 0xF000000000000000 ``` ```asm // 64-bit rotate right X0 by 5: EXTR X0, X0, X0, #5 // same as ROR X0, X0, #5 // 32-bit rotate right W0 by 5: EXTR W0, W0, W0, #5 // same as ROR W0, W0, #5 — upper 32 of X0 zeroed // Extract 64 bits from the middle of X1:X2 EXTR X0, X1, X2, #20 // bits [83:20] of the 128-bit value X1:X2 ``` --- ## 14. Bit Manipulation Instructions Instructions for counting, reversing, and manipulating individual bits. These are essential for bitmap operations, hash functions, and low-level data structure manipulation. ### 14.1 CLZ — Count Leading Zeros `CLZ` counts how many consecutive zero bits there are starting from the most significant bit (left side). For example, `CLZ` of `0x00F0...` would be 8 (eight zeros before the first 1). If the entire register is zero, the result is 64 (or 32 for Wd). Useful for finding the position of the highest set bit. ``` CLZ Xd|XZR, Xn|XZR // Xd = number of leading zero bits in Xn (0-64) CLZ Wd|WZR, Wn|WZR // Wd = number of leading zero bits in Wn (0-32) ``` If `Xn == 0`, result is 64 (or 32 for Wn). Use case: finding the highest set bit, computing floor(log2(x)): ```asm // floor(log2(X0)) = 63 - CLZ(X0), for X0 > 0 CLZ X1, X0 MOV X2, #63 SUB X1, X2, X1 // X1 = floor(log2(X0)) ``` ### 14.2 CLS — Count Leading Sign Bits `CLS` counts the number of leading sign bits in a register, minus 1. A "leading sign bit" is a bit that matches the MSB, counting from the top. For positive numbers (MSB=0), it counts leading zeros minus 1. For negative numbers (MSB=1), it counts leading ones minus 1. The result tells you how many redundant sign bits there are — useful for determining how many bits are actually needed to represent a value. 
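To make the "minus 1" concrete, here is a small C model of the 64-bit form. It is a sketch of the behavior described above rather than ARM's pseudocode, and `cls64` and `min_signed_bits` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of CLS Xd, Xn: count the bits below bit 63 that equal bit 63,
   scanning downward and stopping at the first mismatch. */
unsigned cls64(uint64_t x)
{
    unsigned sign = (unsigned)(x >> 63);
    unsigned n = 0;
    for (int bit = 62; bit >= 0; --bit) {
        if (((x >> bit) & 1u) != sign)
            break;
        ++n;
    }
    return n;   /* 0..63 */
}

/* Smallest two's-complement width (in bits) that still holds v:
   every redundant sign bit reported by CLS can be dropped. */
unsigned min_signed_bits(int64_t v)
{
    unsigned bits = 64 - cls64((uint64_t)v);
    assert(bits >= 1 && bits <= 64);
    return bits;
}
```

For example, `min_signed_bits(-128)` and `min_signed_bits(127)` both give 8, while 128 needs 9 bits because a ninth bit is required to keep the sign bit clear.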
``` CLS Xd|XZR, Xn|XZR // Count leading bits that match the sign bit, minus 1 (range 0–63) CLS Wd|WZR, Wn|WZR // Same for 32-bit (range 0–31) ``` ### 14.3 RBIT — Reverse Bits `RBIT` reverses the order of all bits in a register — bit 0 swaps with bit 63, bit 1 with bit 62, etc. The main use case is computing a count of trailing zeros (CTZ): reverse the bits with `RBIT`, then count leading zeros with `CLZ` — the leading zeros of the reversed value equal the trailing zeros of the original. ``` RBIT Xd|XZR, Xn|XZR // Reverse all 64 bits (bit 0 ↔ bit 63, etc.) RBIT Wd|WZR, Wn|WZR // Reverse all 32 bits ``` Useful for CRC calculations and trailing-zero counts: ```asm // Count trailing zeros (CTZ) — baseline approach: RBIT X1, X0 // Reverse bits CLZ X1, X1 // Count leading zeros of reversed = trailing zeros of original ``` **With FEAT_CSSC:** A dedicated `CTZ Xd, Xn` / `CTZ Wd, Wn` instruction exists, eliminating the RBIT+CLZ sequence. ### 14.4 REV — Reverse Bytes `REV` reverses the byte order of a register — this converts between little-endian and big-endian. `REV16` reverses bytes within each 16-bit halfword. `REV32` reverses bytes within each 32-bit word (Xd form only, since `REV Wd` already does 32-bit reversal). ``` REV Xd|XZR, Xn|XZR // Reverse byte order (64-bit endian swap) REV Wd|WZR, Wn|WZR // Reverse byte order (32-bit endian swap) REV16 Xd|XZR, Xn|XZR // Reverse bytes within each 16-bit halfword (64-bit) REV16 Wd|WZR, Wn|WZR // Reverse bytes within each 16-bit halfword (32-bit) REV32 Xd|XZR, Xn|XZR // Reverse bytes within each 32-bit word (64-bit ONLY — no Wd form) ``` **Note**: `REV32` only has an Xd form because `REV Wd, Wn` already does a 32-bit byte swap. `REV32 Xd, Xn` swaps bytes within each 32-bit half independently. 
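The relationship between the three byte-reversal granularities can be sketched in C, building each wider swap from the narrower one. This is an illustrative model only; the function names (`rev16_x`, `rev32_x`, `rev_x`, `rev_w`) are hypothetical, and `rev_demo` checks the model against the sample value `0x0102030405060708`.

```c
#include <assert.h>
#include <stdint.h>

/* REV16 Xd, Xn: swap the two bytes inside every 16-bit halfword. */
uint64_t rev16_x(uint64_t x)
{
    return ((x & UINT64_C(0xFF00FF00FF00FF00)) >> 8) |
           ((x & UINT64_C(0x00FF00FF00FF00FF)) << 8);
}

/* REV32 Xd, Xn: reverse the four bytes inside each 32-bit word
   (byte swap per halfword, then halfword swap per word). */
uint64_t rev32_x(uint64_t x)
{
    x = rev16_x(x);
    return ((x & UINT64_C(0xFFFF0000FFFF0000)) >> 16) |
           ((x & UINT64_C(0x0000FFFF0000FFFF)) << 16);
}

/* REV Xd, Xn: reverse all eight bytes (REV32, then swap the two words). */
uint64_t rev_x(uint64_t x)
{
    x = rev32_x(x);
    return (x >> 32) | (x << 32);
}

/* REV Wd, Wn: reverse the low four bytes; the W write zeroes bits [63:32]. */
uint64_t rev_w(uint64_t x)
{
    return rev32_x(x & UINT64_C(0xFFFFFFFF));
}

/* Self-check against a hand-computed sample value. */
void rev_demo(void)
{
    const uint64_t x = UINT64_C(0x0102030405060708);
    assert(rev_x(x)   == UINT64_C(0x0807060504030201));
    assert(rev_w(x)   == UINT64_C(0x0000000008070605));
    assert(rev16_x(x) == UINT64_C(0x0201040306050807));
    assert(rev32_x(x) == UINT64_C(0x0403020108070605));
}
```

The layering shows why a W form of `REV32` would be redundant: on a 32-bit value, "reverse bytes within each 32-bit word" and "reverse all bytes" are the same operation.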
```asm // If X0 = 0x0102030405060708: REV X1, X0 // X1 = 0x0807060504030201 (full 64-bit byte swap) REV W1, W0 // W1 = 0x08070605 (32-bit byte swap of low word) // X1 = 0x0000000008070605 (upper zeroed) REV16 X1, X0 // X1 = 0x0201040306050807 (swap within each 16-bit chunk) REV32 X1, X0 // X1 = 0x0403020108070605 (swap within each 32-bit chunk) ``` ### 14.5 CNT — Population Count **With FEAT_CSSC (optional from ARMv8.7-A):** A scalar `CNT` instruction exists: ```asm CNT Xd|XZR, Xn|XZR // Xd = popcount(Xn) CNT Wd|WZR, Wn|WZR // Wd = popcount(Wn) ``` **Without FEAT_CSSC (baseline AArch64):** No scalar popcount exists. Use the NEON workaround: ```asm // Count set bits in X0: FMOV D0, X0 // Move X0 into SIMD register D0 CNT V0.8B, V0.8B // Count bits in each byte (NEON vector CNT) ADDV B0, V0.8B // Sum all byte counts UMOV W1, V0.B[0] // Move result to GPR ``` ### 14.6 FEAT_CSSC — Common Short Sequence Compression FEAT_CSSC (optional from ARMv8.7-A / ARMv9.2-A, mandatory from ARMv8.9-A / ARMv9.4-A) adds scalar instructions that previously required multi-instruction sequences. These exist because compilers kept generating the same 2-4 instruction patterns, so ARM added single instructions to replace them. 
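As a portable illustration of the multi-instruction patterns that FEAT_CSSC collapses, the sketch below models two of them in C: trailing-zero count as "reverse bits, then count leading zeros" (§14.3) and population count as "count bits per byte, then sum the bytes" (§14.5). All function names are hypothetical; the code models the arithmetic only and says nothing about how any compiler or core implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Model of RBIT Xd, Xn: bit i moves to bit 63 - i. */
uint64_t rbit64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i)
        if ((x >> i) & 1)
            r |= UINT64_C(1) << (63 - i);
    return r;
}

/* Model of CLZ Xd, Xn: returns 64 when x == 0. */
unsigned clz64(uint64_t x)
{
    unsigned n = 0;
    for (int bit = 63; bit >= 0 && !((x >> bit) & 1); --bit)
        ++n;
    return n;
}

/* Baseline two-instruction sequence: CTZ(x) = CLZ(RBIT(x)). */
unsigned ctz64_via_rbit_clz(uint64_t x)
{
    return clz64(rbit64(x));
}

/* Baseline NEON-style popcount: a per-byte bit count (what CNT V0.8B
   produces), followed by a horizontal add of the eight counts (ADDV). */
unsigned popcount64_bytewise(uint64_t x)
{
    unsigned total = 0;
    for (int byte = 0; byte < 8; ++byte) {
        unsigned b = (unsigned)(x >> (8 * byte)) & 0xFFu;
        unsigned count = 0;
        for (; b != 0; b >>= 1)
            count += b & 1u;
        assert(count <= 8);
        total += count;
    }
    return total;   /* 0..64 */
}
```

Both helpers return 64 for the all-zeros and all-ones inputs respectively (`ctz64_via_rbit_clz(0)` and `popcount64_bytewise(~0)`), which matches the 0-64 result range of the 64-bit instructions.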
``` // Absolute value (previously: CMP + CNEG, 2 instructions) ABS Xd|XZR, Xn|XZR // Xd = |Xn| (signed absolute value) ABS Wd|WZR, Wn|WZR // Min/Max (previously: CMP + CSEL, 2 instructions each) SMAX Xd|XZR, Xn|XZR, Xm|XZR // Xd = max(Xn, Xm) signed SMAX Wd|WZR, Wn|WZR, Wm|WZR SMIN Xd|XZR, Xn|XZR, Xm|XZR // Xd = min(Xn, Xm) signed SMIN Wd|WZR, Wn|WZR, Wm|WZR UMAX Xd|XZR, Xn|XZR, Xm|XZR // Xd = max(Xn, Xm) unsigned UMAX Wd|WZR, Wn|WZR, Wm|WZR UMIN Xd|XZR, Xn|XZR, Xm|XZR // Xd = min(Xn, Xm) unsigned UMIN Wd|WZR, Wn|WZR, Wm|WZR // Also with immediate: SMAX Xd|XZR, Xn|XZR, #simm8 // Signed max with 8-bit signed immediate (-128 to 127) SMAX Wd|WZR, Wn|WZR, #simm8 SMIN Xd|XZR, Xn|XZR, #simm8 // Signed min with 8-bit signed immediate (-128 to 127) SMIN Wd|WZR, Wn|WZR, #simm8 UMAX Xd|XZR, Xn|XZR, #uimm8 // Unsigned max with 8-bit unsigned immediate (0 to 255) UMAX Wd|WZR, Wn|WZR, #uimm8 UMIN Xd|XZR, Xn|XZR, #uimm8 // Unsigned min with 8-bit unsigned immediate (0 to 255) UMIN Wd|WZR, Wn|WZR, #uimm8 // Count trailing zeros (previously: RBIT + CLZ, 2 instructions) CTZ Xd|XZR, Xn|XZR // Xd = number of trailing zeros (0-64) CTZ Wd|WZR, Wn|WZR // Wd = number of trailing zeros (0-32) // Scalar population count (previously: FMOV + CNT + ADDV + UMOV, 4 instructions) CNT Xd|XZR, Xn|XZR // Xd = popcount(Xn) CNT Wd|WZR, Wn|WZR ``` All Wd forms follow the standard W-register rule: upper 32 bits of Xd are zeroed. None of these set flags. **Why these exist**: Compilers emit CMP+CSEL for min/max thousands of times in typical code. SMAX/SMIN/UMAX/UMIN cut the count in half, improving both code size and throughput. Similarly, CTZ (count trailing zeros) is used in every `ffs()`-style operation and bitmap scanner. --- ## 15. Load & Store Instructions Loads copy data from memory into a register. Stores copy data from a register into memory. The syntax `[Xn]` means "the memory address stored in register Xn." Think of the square brackets as a dereference — like `*ptr` in C. 
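The `*ptr` analogy can be pushed one step further: the load mnemonic fixes both the access size and how the value is widened to fill the register. The C sketch below models that for a few loads. The function names are hypothetical, and `memcpy` simply stands in for the memory access; alignment rules, faults, and memory ordering are not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* LDR Xt, [Xn]: 64-bit access, nothing to extend. */
uint64_t ldr_x(const void *xn)
{
    assert(xn != NULL);
    uint64_t v;
    memcpy(&v, xn, sizeof v);
    return v;
}

/* LDRB Wt, [Xn]: 8-bit access, zero-extended (bits [63:8] become 0). */
uint64_t ldrb_w(const void *xn)
{
    assert(xn != NULL);
    uint8_t v;
    memcpy(&v, xn, sizeof v);
    return v;
}

/* LDRSB Xt, [Xn]: 8-bit access, sign-extended to 64 bits. */
uint64_t ldrsb_x(const void *xn)
{
    assert(xn != NULL);
    int8_t v;
    memcpy(&v, xn, sizeof v);
    return (uint64_t)(int64_t)v;
}

/* LDRSB Wt, [Xn]: sign-extended to 32 bits, then the W write
   zeroes bits [63:32]. */
uint64_t ldrsb_w(const void *xn)
{
    assert(xn != NULL);
    int8_t v;
    memcpy(&v, xn, sizeof v);
    return (uint32_t)(int32_t)v;
}

/* LDRSW Xt, [Xn]: 32-bit access, sign-extended to 64 bits. */
uint64_t ldrsw_x(const void *xn)
{
    assert(xn != NULL);
    int32_t v;
    memcpy(&v, xn, sizeof v);
    return (uint64_t)(int64_t)v;
}
```

For a byte in memory holding `0x80`, the model returns `0x80` for `ldrb_w`, `0xFFFFFFFFFFFFFF80` for `ldrsb_x`, and `0x00000000FFFFFF80` for `ldrsb_w`: the same byte, three different register contents.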
### 15.1 Basic Loads These read data from memory at the address in `Xn` and place it into the destination register. The base register `Xn` can be **SP** (stack pointer) — this is how stack-relative loads work (e.g., `LDR X0, [SP, #8]`). The destination `Xt` can be **XZR** — loading into XZR discards the value (used for prefetch side-effects or consuming cache lines). Smaller loads (byte, halfword, word) are automatically zero-extended or sign-extended to fill the full register. **LDR — Load Register (64-bit):** ``` LDR Xt|XZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 8, 0–32760; when omitted, offset is 0) LDR Xt|XZR, [Xn|SP, #simm9]! // Pre-index (−256 to +255) LDR Xt|XZR, [Xn|SP], #simm9 // Post-index (−256 to +255) LDR Xt|XZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #3|SXTX #0|SXTX #3}] // Register offset (64-bit index) LDR Xt|XZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #3|UXTW #0|UXTW #3] // Extended register (32-bit index) LDR Xt|XZR, label // PC-relative literal (±1 MB) ``` **LDR — Load Register (32-bit, zero-extends to 64):** ``` LDR Wt|WZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 4, 0–16380; when omitted, offset is 0) LDR Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDR Wt|WZR, [Xn|SP], #simm9 // Post-index LDR Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #2|SXTX #0|SXTX #2}] // Register offset LDR Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #2|UXTW #0|UXTW #2] // Extended register LDR Wt|WZR, label // PC-relative literal ``` **LDRH — Load Halfword (16-bit, zero-extends to 32/64):** ``` LDRH Wt|WZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 2, 0–8190; when omitted, offset is 0) LDRH Wt|WZR, [Xn|SP, #simm9]! 
// Pre-index LDRH Wt|WZR, [Xn|SP], #simm9 // Post-index LDRH Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #1|SXTX #0|SXTX #1}] // Register offset LDRH Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #1|UXTW #0|UXTW #1] // Extended register ``` **LDRB — Load Byte (8-bit, zero-extends to 32/64):** ``` LDRB Wt|WZR, [Xn|SP{, #pimm}] // Unsigned offset (0–4095, no scaling; when omitted, offset is 0) LDRB Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDRB Wt|WZR, [Xn|SP], #simm9 // Post-index LDRB Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset LDRB Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|UXTW #0] // Extended register ``` **LDRSW — Load Signed Word (32-bit, sign-extends to 64):** ``` LDRSW Xt|XZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 4, 0–16380; when omitted, offset is 0) LDRSW Xt|XZR, [Xn|SP, #simm9]! // Pre-index LDRSW Xt|XZR, [Xn|SP], #simm9 // Post-index LDRSW Xt|XZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #2|SXTX #0|SXTX #2}] // Register offset LDRSW Xt|XZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #2|UXTW #0|UXTW #2] // Extended register LDRSW Xt|XZR, label // PC-relative literal ``` **LDRSH — Load Signed Halfword (16-bit, sign-extends to 32 or 64):** ``` LDRSH Xt|XZR, [Xn|SP{, #pimm}] // Sign-extend to 64 (multiple of 2, 0–8190; when omitted, offset is 0) LDRSH Xt|XZR, [Xn|SP, #simm9]! // Pre-index LDRSH Xt|XZR, [Xn|SP], #simm9 // Post-index LDRSH Xt|XZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #1|SXTX #0|SXTX #1}] // Register offset LDRSH Xt|XZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #1|UXTW #0|UXTW #1] // Extended register LDRSH Wt|WZR, [Xn|SP{, #pimm}] // Sign-extend to 32 (multiple of 2, 0–8190; when omitted, offset is 0) LDRSH Wt|WZR, [Xn|SP, #simm9]! 
// Pre-index LDRSH Wt|WZR, [Xn|SP], #simm9 // Post-index LDRSH Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #1|SXTX #0|SXTX #1}] // Register offset LDRSH Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #1|UXTW #0|UXTW #1] // Extended register ``` **LDRSB — Load Signed Byte (8-bit, sign-extends to 32 or 64):** ``` LDRSB Xt|XZR, [Xn|SP{, #pimm}] // Sign-extend to 64; unsigned offset (0–4095; when omitted, offset is 0) LDRSB Xt|XZR, [Xn|SP, #simm9]! // Pre-index LDRSB Xt|XZR, [Xn|SP], #simm9 // Post-index LDRSB Xt|XZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset LDRSB Xt|XZR, [Xn|SP, Wm|WZR, SXTW #0|UXTW #0] // Extended register LDRSB Wt|WZR, [Xn|SP{, #pimm}] // Sign-extend to 32; unsigned offset (0–4095; when omitted, offset is 0) LDRSB Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDRSB Wt|WZR, [Xn|SP], #simm9 // Post-index LDRSB Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset LDRSB Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|UXTW #0] // Extended register ``` **SIMD/FP Loads (all widths × all addressing modes):** ``` // 32-bit single-precision (St) — all addressing modes: LDR St, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 4, 0–16380; when omitted, offset is 0) LDR St, [Xn|SP, #simm9]! // Pre-index (−256 to +255) LDR St, [Xn|SP], #simm9 // Post-index LDR St, [Xn|SP, Xm|XZR{, LSL #0|LSL #2|SXTX #0|SXTX #2}] // Register offset LDR St, [Xn|SP, Wm|WZR, SXTW #0|SXTW #2|UXTW #0|UXTW #2] // Extended register LDR St, label // PC-relative literal // 64-bit double-precision (Dt) — same modes, different scaling: LDR Dt, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 8, 0–32760; when omitted, offset is 0) LDR Dt, [Xn|SP, #simm9]! 
// Pre-index LDR Dt, [Xn|SP], #simm9 // Post-index LDR Dt, [Xn|SP, Xm|XZR{, LSL #0|LSL #3|SXTX #0|SXTX #3}] // Register offset (#3 = scale by 8) LDR Dt, [Xn|SP, Wm|WZR, SXTW #0|SXTW #3|UXTW #0|UXTW #3] // Extended register LDR Dt, label // PC-relative literal // 128-bit quad (Qt) — same modes, 16-byte scaling: LDR Qt, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 16, 0–65520; when omitted, offset is 0) LDR Qt, [Xn|SP, #simm9]! // Pre-index LDR Qt, [Xn|SP], #simm9 // Post-index LDR Qt, [Xn|SP, Xm|XZR{, LSL #0|LSL #4|SXTX #0|SXTX #4}] // Register offset (#4 = scale by 16) LDR Qt, [Xn|SP, Wm|WZR, SXTW #0|SXTW #4|UXTW #0|UXTW #4] // Extended register LDR Qt, label // PC-relative literal // 8-bit (Bt) — all addressing modes: LDR Bt, [Xn|SP{, #pimm}] // Unsigned offset (0–4095, no scaling; when omitted, offset is 0) LDR Bt, [Xn|SP, #simm9]! // Pre-index LDR Bt, [Xn|SP], #simm9 // Post-index LDR Bt, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset (no shift — byte access) LDR Bt, [Xn|SP, Wm|WZR, SXTW #0|UXTW #0] // Extended register // 16-bit (Ht) — all addressing modes: LDR Ht, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 2, 0–8190; when omitted, offset is 0) LDR Ht, [Xn|SP, #simm9]! // Pre-index LDR Ht, [Xn|SP], #simm9 // Post-index LDR Ht, [Xn|SP, Xm|XZR{, LSL #0|LSL #1|SXTX #0|SXTX #1}] // Register offset (#1 = scale by 2) LDR Ht, [Xn|SP, Wm|WZR, SXTW #0|SXTW #1|UXTW #0|UXTW #1] // Extended register ``` Note: `LDRSH Wd` vs `LDRSH Xd` — the register width determines whether sign extension goes to 32 or 64 bits. The `Xd` variant sign-extends all the way to 64 bits; the `Wd` variant sign-extends to 32, then the W-register write zeroes the upper 32. ### 15.2 Basic Stores These write data from a register into memory. Only the relevant low bytes are written — there is no sign extension for stores. **STR — Store Register (64-bit):** ``` STR Xt|XZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 8, 0–32760; when omitted, offset is 0) STR Xt|XZR, [Xn|SP, #simm9]! 
// Pre-index (−256 to +255) STR Xt|XZR, [Xn|SP], #simm9 // Post-index (−256 to +255) STR Xt|XZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #3|SXTX #0|SXTX #3}] // Register offset STR Xt|XZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #3|UXTW #0|UXTW #3] // Extended register ``` **STR — Store Register (32-bit):** ``` STR Wt|WZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 4, 0–16380; when omitted, offset is 0) STR Wt|WZR, [Xn|SP, #simm9]! // Pre-index STR Wt|WZR, [Xn|SP], #simm9 // Post-index STR Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #2|SXTX #0|SXTX #2}] // Register offset STR Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #2|UXTW #0|UXTW #2] // Extended register ``` **STRH — Store Halfword (16-bit):** ``` STRH Wt|WZR, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 2, 0–8190; when omitted, offset is 0) STRH Wt|WZR, [Xn|SP, #simm9]! // Pre-index STRH Wt|WZR, [Xn|SP], #simm9 // Post-index STRH Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|LSL #1|SXTX #0|SXTX #1}] // Register offset STRH Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|SXTW #1|UXTW #0|UXTW #1] // Extended register ``` **STRB — Store Byte (8-bit):** ``` STRB Wt|WZR, [Xn|SP{, #pimm}] // Unsigned offset (0–4095, no scaling; when omitted, offset is 0) STRB Wt|WZR, [Xn|SP, #simm9]! // Pre-index STRB Wt|WZR, [Xn|SP], #simm9 // Post-index STRB Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset STRB Wt|WZR, [Xn|SP, Wm|WZR, SXTW #0|UXTW #0] // Extended register ``` **SIMD/FP Stores (all widths × all addressing modes):** ``` // 32-bit single-precision (St): STR St, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 4, 0–16380; when omitted, offset is 0) STR St, [Xn|SP, #simm9]! // Pre-index STR St, [Xn|SP], #simm9 // Post-index STR St, [Xn|SP, Xm|XZR{, LSL #0|LSL #2|SXTX #0|SXTX #2}] // Register offset STR St, [Xn|SP, Wm|WZR, SXTW #0|SXTW #2|UXTW #0|UXTW #2] // Extended register // 64-bit double-precision (Dt): STR Dt, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 8, 0–32760; when omitted, offset is 0) STR Dt, [Xn|SP, #simm9]! 
// Pre-index STR Dt, [Xn|SP], #simm9 // Post-index STR Dt, [Xn|SP, Xm|XZR{, LSL #0|LSL #3|SXTX #0|SXTX #3}] // Register offset STR Dt, [Xn|SP, Wm|WZR, SXTW #0|SXTW #3|UXTW #0|UXTW #3] // Extended register // 128-bit quad (Qt): STR Qt, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 16, 0–65520; when omitted, offset is 0) STR Qt, [Xn|SP, #simm9]! // Pre-index STR Qt, [Xn|SP], #simm9 // Post-index STR Qt, [Xn|SP, Xm|XZR{, LSL #0|LSL #4|SXTX #0|SXTX #4}] // Register offset STR Qt, [Xn|SP, Wm|WZR, SXTW #0|SXTW #4|UXTW #0|UXTW #4] // Extended register // 8-bit (Bt) and 16-bit (Ht): // 8-bit (Bt): STR Bt, [Xn|SP{, #pimm}] // Unsigned offset (0–4095, no scaling; when omitted, offset is 0) STR Bt, [Xn|SP, #simm9]! // Pre-index STR Bt, [Xn|SP], #simm9 // Post-index STR Bt, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset (no shift — byte access) STR Bt, [Xn|SP, Wm|WZR, SXTW #0|UXTW #0] // Extended register // 16-bit (Ht): STR Ht, [Xn|SP{, #pimm}] // Unsigned offset (multiple of 2, 0–8190; when omitted, offset is 0) STR Ht, [Xn|SP, #simm9]! // Pre-index STR Ht, [Xn|SP], #simm9 // Post-index STR Ht, [Xn|SP, Xm|XZR{, LSL #0|LSL #1|SXTX #0|SXTX #1}] // Register offset (#1 = scale by 2) STR Ht, [Xn|SP, Wm|WZR, SXTW #0|SXTW #1|UXTW #0|UXTW #1] // Extended register ``` There are no "sign-extending" stores — stores just write the low bytes, sign doesn't matter. ### 15.3 Addressing Modes All loads and stores need a memory address. AArch64 provides several ways to compute that address, called **addressing modes**. The syntax `[Xn, #offset]` means "the address in Xn plus offset." In all load/store addressing modes, the base register `Xn` can be **SP** (register 31 = SP in this context). The offset register `Xm` or `Wm` uses XZR/WZR for register 31. | Mode | Syntax | Effective address | Base updated? | |---|---|---|---| | Base register | [Xn|SP] | Xn | No | | Immediate offset | [Xn|SP, #imm] | Xn + imm | No | | Pre-index | [Xn|SP, #imm]! 
| Xn + imm | Yes, **before** access | | Post-index | [Xn|SP], #imm | Xn | Yes, **after** access | | Register offset | [Xn|SP, Xm|XZR] | Xn + Xm | No | | Shifted register | [Xn|SP, Xm|XZR, LSL #s] | Xn + (Xm << s) | No | | Extended register | [Xn|SP, Wm|WZR, SXTW {#s}] | Xn + sign_extend(Wm) << s | No | | PC-relative literal | `label` | PC + offset | No | **Note** on `[Xn]` vs `[Xn, #0]`: For LDR/STR/LDP/STP and their variants, writing `[Xn|SP]` without an offset is assembler shorthand for `[Xn|SP, #0]` — the hardware always encodes an offset field, it's just set to zero. There is no separate "base register" encoding. However, for exclusive loads/stores (LDXR/STXR), acquire/release (LDAR/STLR), and LSE atomics (LDADD/CAS/SWP), `[Xn|SP]` IS the real and only form — these instructions genuinely have no offset field in the encoding. **Immediate offset** (`LDR Xd, [Xn, #imm]`): The most common form. The offset is added to the base register to compute the address. **What the hardware actually encodes**: The instruction has a 12-bit unsigned offset field, but the value stored is **divided by the access size**. For 64-bit LDR, the hardware stores `offset ÷ 8`, so the byte offset you write must be a multiple of 8 (range 0–32760). For 32-bit LDR, it stores `offset ÷ 4` (range 0–16380). For LDRH, `offset ÷ 2` (range 0–8190). For LDRB, the offset is unscaled (range 0–4095). This is why `LDR X0, [X1, #7]` is illegal — 7 is not a multiple of 8. Use `LDUR` for non-multiples and negative offsets (see §15.4). 
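
The scaling rule described above can be modelled in a few lines of C. This is a sketch, not the logic of any real assembler: `encode_scaled_imm12` and `decode_scaled_imm12` are hypothetical helpers that convert between the byte offset you write and the value held in the 12-bit field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the scaled unsigned-offset form.
   access_size is the access width in bytes: 1, 2, 4, 8 or 16.
   Returns the imm12 field value, or -1 when the byte offset cannot be
   encoded (negative, not a multiple of the size, or out of range). */
int encode_scaled_imm12(int64_t byte_offset, int access_size) {
    assert(access_size == 1 || access_size == 2 || access_size == 4 ||
           access_size == 8 || access_size == 16);
    if (byte_offset < 0) return -1;                 /* field is unsigned */
    if (byte_offset % access_size != 0) return -1;  /* must be a multiple */
    int64_t field = byte_offset / access_size;
    if (field > 4095) return -1;                    /* 12 bits: 0..4095 */
    return (int)field;
}

/* Inverse direction: the byte offset the hardware adds to the base. */
int64_t decode_scaled_imm12(int imm12, int access_size) {
    return (int64_t)imm12 * access_size;
}
```

For example, `encode_scaled_imm12(8, 8)` yields 1 and `encode_scaled_imm12(7, 8)` yields -1, matching the legal and illegal `LDR X0` offsets discussed above.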
```asm // What you write: // What the hardware encodes: LDR X0, [X1, #0] // imm12 = 0 → address = X1 + 0 LDR X0, [X1, #8] // imm12 = 1 → address = X1 + (1 × 8) = X1 + 8 LDR X0, [X1, #32760] // imm12 = 4095 → address = X1 + (4095 × 8) LDR X0, [X1, #7] // ERROR: 7 is not a multiple of 8 LDR W0, [X1, #4] // imm12 = 1 → address = X1 + (1 × 4) = X1 + 4 LDRB W0, [X1, #100] // imm12 = 100 → address = X1 + 100 (byte, no scaling) ``` For negative offsets or offsets that aren't multiples of the access size, use `LDUR`/`STUR` instead (§15.4) — they use an unscaled signed 9-bit offset. **Pre-index** (`[Xn, #imm]!`): The `!` means "update the base register." The base is updated to `Xn + imm` **before** the memory access. Used for "push" operations. The offset is a signed 9-bit value (−256 to +255), NOT the scaled 12-bit field — this is why you can write `STR X0, [SP, #-16]!` with a negative offset. **Post-index** (`[Xn], #imm`): The offset is outside the brackets. The memory access uses the original Xn, then Xn is updated to `Xn + imm` **after** the access. Used for "pop" operations. Same signed 9-bit range as pre-index. ```asm // Pre-index: update base BEFORE the access LDR X0, [X1, #16]! // X1 = X1 + 16, then load from new X1 // Post-index: update base AFTER the access LDR X0, [X1], #16 // Load from X1, then X1 = X1 + 16 ``` **Stack push/pop patterns:** ```asm // Push X0 onto stack (pre-decrement): STR X0, [SP, #-16]! // SP -= 16, then store X0 // Pop X0 from stack (post-increment): LDR X0, [SP], #16 // Load X0, then SP += 16 ``` **Traced stack walkthrough:** ``` // Initial: SP = 0x1000, X0 = 0xDEAD, X1 = 0xBEEF STR X0, [SP, #-16]! // Step 1: SP = 0x1000 - 16 = 0xFF0 // Step 2: Store 0xDEAD at address 0xFF0 // Memory at 0xFF0: 0xDEAD STR X1, [SP, #-16]! 
// SP = 0xFF0 - 16 = 0xFE0, store 0xBEEF at 0xFE0 // Stack: 0xFE0→0xBEEF, 0xFF0→0xDEAD LDR X2, [SP], #16 // Load from 0xFE0 → X2 = 0xBEEF, then SP = 0xFE0 + 16 = 0xFF0 LDR X3, [SP], #16 // Load from 0xFF0 → X3 = 0xDEAD, then SP = 0xFF0 + 16 = 0x1000 // Stack restored: SP back to 0x1000, values popped in reverse order ``` **Register offset with shift/extend**: In the register-offset and extended-register forms, the shift `#s` can be either **0** (unshifted) or the **log2 of the access size** (scaled). No other values are encodable — the encoding uses a single bit (`S`): `S=0` means unshifted, `S=1` means scaled by access size. The shift amount can also be omitted entirely (e.g., `SXTW` instead of `SXTW #0`) — the encoding just sets `S=0`. | Access | Scaled shift | Valid `#s` values | |---|---|---| | LDR Xt (64-bit) | `#3` (×8) | `#0` or `#3` | | LDR Wt (32-bit) | `#2` (×4) | `#0` or `#2` | | LDRH (16-bit) | `#1` (×2) | `#0` or `#1` | | LDRB (8-bit) | `#0` (×1) | `#0` only | ```asm // Scaled: X2 is an element index, hardware multiplies by element size LDR X0, [X1, X2, LSL #3] // X0 = mem[X1 + X2*8] // Unshifted: X2 is a raw byte offset LDR X0, [X1, X2] // X0 = mem[X1 + X2] // Extended register with scaling (32-bit index into 64-bit address space): LDR W0, [X1, W3, SXTW #2] // W0 = mem[X1 + sign_extend(W3)*4] // Extended register without scaling: LDR W0, [X1, W3, SXTW] // W0 = mem[X1 + sign_extend(W3)] ``` ### 15.4 LDUR / STUR — Unscaled Offset **The problem LDUR solves**: Regular `LDR Xd, [Xn, #offset]` uses a **scaled** 12-bit unsigned offset — the hardware stores `offset ÷ access_size`, so for a 64-bit load the byte offset must be a multiple of 8, for a 32-bit load a multiple of 4, etc. This means `LDR X0, [X1, #5]` is **illegal** — 5 is not a multiple of 8. Similarly, `LDR X0, [X1, #-8]` is illegal because the 12-bit field is unsigned (no negatives). 
`LDUR` and `STUR` solve both problems: they use an **unscaled** signed 9-bit offset, meaning the offset is a raw byte count (not divided by anything) and can be negative. ``` LDUR Xt|XZR, [Xn|SP{, #simm9}] // Load 64 bits from address Xn + offset (−256 to +255; when omitted, offset is 0) LDUR Wt|WZR, [Xn|SP{, #simm9}] // Load 32 bits, unscaled offset (−256 to +255; zero-extends to 64; when omitted, offset is 0) STUR Xt|XZR, [Xn|SP{, #simm9}] // Store 64 bits to address Xn + offset (−256 to +255; when omitted, offset is 0) STUR Wt|WZR, [Xn|SP{, #simm9}] // Store 32 bits, unscaled offset (−256 to +255; when omitted, offset is 0) LDURB Wt|WZR, [Xn|SP{, #simm9}] // Load byte, unscaled offset (−256 to +255; when omitted, offset is 0) LDURH Wt|WZR, [Xn|SP{, #simm9}] // Load halfword, unscaled offset (−256 to +255; when omitted, offset is 0) LDURSW Xt|XZR, [Xn|SP{, #simm9}] // Load signed word, sign-extend to 64, unscaled (−256 to +255; when omitted, offset is 0) LDURSB Xt|XZR, [Xn|SP{, #simm9}] // Load signed byte, sign-extend to 64, unscaled (−256 to +255; when omitted, offset is 0) LDURSB Wt|WZR, [Xn|SP{, #simm9}] // Load signed byte, sign-extend to 32, unscaled (−256 to +255; when omitted, offset is 0) LDURSH Xt|XZR, [Xn|SP{, #simm9}] // Load signed halfword, sign-extend to 64, unscaled (−256 to +255; when omitted, offset is 0) LDURSH Wt|WZR, [Xn|SP{, #simm9}] // Load signed halfword, sign-extend to 32, unscaled (−256 to +255; when omitted, offset is 0) STURB Wt|WZR, [Xn|SP{, #simm9}] // Store byte, unscaled offset (−256 to +255; when omitted, offset is 0) STURH Wt|WZR, [Xn|SP{, #simm9}] // Store halfword, unscaled offset (−256 to +255; when omitted, offset is 0) // FP/SIMD unscaled: LDUR Bt, [Xn|SP{, #simm9}] // Load 8-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) LDUR Ht, [Xn|SP{, #simm9}] // Load 16-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) LDUR St, [Xn|SP{, #simm9}] // Load 32-bit FP/SIMD, unscaled (−256 to +255; when 
omitted, offset is 0) LDUR Dt, [Xn|SP{, #simm9}] // Load 64-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) LDUR Qt, [Xn|SP{, #simm9}] // Load 128-bit SIMD, unscaled (−256 to +255; when omitted, offset is 0) STUR Bt, [Xn|SP{, #simm9}] // Store 8-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) STUR Ht, [Xn|SP{, #simm9}] // Store 16-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) STUR St, [Xn|SP{, #simm9}] // Store 32-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) STUR Dt, [Xn|SP{, #simm9}] // Store 64-bit FP/SIMD, unscaled (−256 to +255; when omitted, offset is 0) STUR Qt, [Xn|SP{, #simm9}] // Store 128-bit SIMD, unscaled (−256 to +255; when omitted, offset is 0) ``` **Traced example — when you NEED LDUR:** ```asm // Struct with packed/unaligned fields: // struct { uint8_t type; uint64_t value; } __attribute__((packed)); // type at offset 0 (1 byte), value at offset 1 (NOT aligned to 8!) // // X0 = pointer to struct LDRB W1, [X0] // W1 = type (offset 0 — fine, byte access is always aligned) LDUR X2, [X0, #1] // X2 = value (offset 1 — NOT a multiple of 8, so LDR can't encode it) // LDUR uses raw byte offset: address = X0 + 1 // Accessing stack locals at negative offsets: // X29 (FP) points to saved frame, locals are below it LDUR X1, [X29, #-8] // Load local at FP-8 (negative offset, LDR can't encode negative) LDUR W2, [X29, #-20] // Load local at FP-20 // Compare: what LDR can and can't do: LDR X0, [X1, #8] // OK: 8 is a multiple of 8, encodable as imm12=1 LDR X0, [X1, #32760] // OK: max scaled offset (4095 × 8) // LDR X0, [X1, #5] // ILLEGAL: 5 is not a multiple of 8 // LDR X0, [X1, #-8] // ILLEGAL: LDR immediate offset is unsigned (no negatives) LDUR X0, [X1, #5] // OK: unscaled, raw byte offset 5 LDUR X0, [X1, #-8] // OK: unscaled, signed offset -8 ``` **Why two separate instructions?** Encoding efficiency. 
LDR's scaled 12-bit unsigned offset covers a large range (0 to 32,760 for 64-bit) which handles the vast majority of struct field and array accesses. LDUR's 9-bit signed offset covers the remaining cases (negative offsets, offsets that aren't multiples of the access size) in a smaller range (−256 to +255). Having both means common cases (positive, naturally-scaled) get the big range, and uncommon cases still work. Note: "unscaled" refers to the offset encoding, not memory alignment. Whether an access faults on an unaligned address depends on `SCTLR_ELx.A`, not on whether you used LDR or LDUR.

**How the assembler handles the overlap**: For offsets that BOTH can encode (e.g., `#0`, `#8`, `#16`), the assembler typically picks LDR (the scaled form). For negative offsets or non-multiples, it picks LDUR. GNU `as` does this automatically if you just write `LDR X0, [X1, #-8]`: it silently emits LDUR. But in disassembly, you'll see the explicit `LDUR` mnemonic.

**Note on pre-index and post-index**: The `[Xn, #imm]!` (pre-index) and `[Xn], #imm` (post-index) forms also use unscaled signed 9-bit offsets, the same kind of offset field as LDUR/STUR. So `STR X0, [SP, #-16]!` works with a negative offset because pre-index uses the 9-bit signed field, not the 12-bit unsigned field.

**Writeback with base == destination**: For loads with writeback (`LDR Xt, [Xn, #imm]!` or `LDR Xt, [Xn], #imm`) where Xt and Xn are the **same register**, the behavior is CONSTRAINED UNPREDICTABLE: the architecture allows an implementation to treat the instruction as UNDEFINED, execute it as a NOP, perform the load without the writeback, or leave an UNKNOWN value in the register. Stores with writeback where Xt == Xn are likewise CONSTRAINED UNPREDICTABLE; the value stored may be the original register value or UNKNOWN. A particular CPU may appear to follow "the loaded value wins" for loads or "the old value is stored" for stores, but another CPU may not. Avoid these forms. (Register number 31 is not affected, because it names SP as the base and XZR as the data register, which are different registers.)
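
The overlap rule described earlier in this section can be sketched as a small classifier. This is a hypothetical model of the choice (prefer the scaled form, fall back to the unscaled form), not the source of any real assembler; the enum and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    FORM_SCALED_LDR,     /* unsigned 12-bit field holding offset / access size */
    FORM_UNSCALED_LDUR,  /* signed 9-bit field holding the raw byte offset */
    FORM_NOT_ENCODABLE   /* needs a separate ADD/SUB or a register offset */
} offset_form;

/* access_size is the access width in bytes: 1, 2, 4, 8 or 16. */
offset_form pick_offset_form(int64_t byte_offset, int access_size) {
    assert(access_size == 1 || access_size == 2 || access_size == 4 ||
           access_size == 8 || access_size == 16);

    /* Scaled form first: non-negative multiple of the size, field <= 4095. */
    if (byte_offset >= 0 &&
        byte_offset % access_size == 0 &&
        byte_offset / access_size <= 4095) {
        return FORM_SCALED_LDR;
    }
    /* Otherwise the unscaled form, if the raw offset fits in 9 signed bits. */
    if (byte_offset >= -256 && byte_offset <= 255) {
        return FORM_UNSCALED_LDUR;
    }
    return FORM_NOT_ENCODABLE;
}
```

With an 8-byte access, offset 8 classifies as the scaled form, offsets 5 and -8 as the unscaled form, and offset 260 as not encodable in either immediate form.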
**LDR vs LDUR at a glance (for 64-bit access):** | | LDR (scaled) | LDUR (unscaled) | Pre/Post-index | |---|---|---|---| | Offset field | 12-bit unsigned | 9-bit signed | 9-bit signed | | Stored as | offset ÷ 8 | raw bytes | raw bytes | | Range | 0 to +32,760 | −256 to +255 | −256 to +255 | | Must be multiple of 8? | **Yes** | No | No | | Negative offset? | **No** | Yes | Yes | | Updates base register? | No | No | Yes | | Use case | Most struct/array access | Packed structs, negative offsets | Push/pop, walking memory | **LDTR / STTR — Unprivileged Load/Store:** `LDTR` and `STTR` perform loads and stores using the permissions of **EL0 (user mode)**, even when executing at EL1 (kernel). This is how the kernel safely accesses user-provided pointers — if the address is invalid or user-inaccessible, LDTR generates a fault that the kernel can catch, instead of silently accessing kernel memory. ```asm LDTR Xt|XZR, [Xn|SP{, #simm9}] // Load 64-bit with EL0 permission check (offset −256 to +255; when omitted, offset is 0) LDTR Wt|WZR, [Xn|SP{, #simm9}] // Load 32-bit with EL0 permission check (offset −256 to +255; when omitted, offset is 0) STTR Xt|XZR, [Xn|SP{, #simm9}] // Store 64-bit with EL0 permission check (offset −256 to +255; when omitted, offset is 0) STTR Wt|WZR, [Xn|SP{, #simm9}] // Store 32-bit with EL0 permission check (offset −256 to +255; when omitted, offset is 0) LDTRB Wt|WZR, [Xn|SP{, #simm9}] // Byte version (offset −256 to +255; when omitted, offset is 0) LDTRH Wt|WZR, [Xn|SP{, #simm9}] // Halfword version (offset −256 to +255; when omitted, offset is 0) LDTRSW Xt|XZR, [Xn|SP{, #simm9}] // Sign-extending word → 64-bit (offset −256 to +255; when omitted, offset is 0) LDTRSB Xt|XZR, [Xn|SP{, #simm9}] // Sign-extending byte → 64-bit (offset −256 to +255; when omitted, offset is 0) LDTRSB Wt|WZR, [Xn|SP{, #simm9}] // Sign-extending byte → 32-bit (offset −256 to +255; when omitted, offset is 0) LDTRSH Xt|XZR, [Xn|SP{, #simm9}] // Sign-extending halfword → 
64-bit (offset −256 to +255; when omitted, offset is 0) LDTRSH Wt|WZR, [Xn|SP{, #simm9}] // Sign-extending halfword → 32-bit (offset −256 to +255; when omitted, offset is 0) STTRB Wt|WZR, [Xn|SP{, #simm9}] // Store byte, unprivileged (offset −256 to +255; when omitted, offset is 0) STTRH Wt|WZR, [Xn|SP{, #simm9}] // Store halfword, unprivileged (offset −256 to +255; when omitted, offset is 0) ``` **Why LDTR exists**: When a user passes a pointer to a syscall, the kernel must validate it. Using regular `LDR` would access the address with kernel privileges — if the user passes a kernel address, the load succeeds and leaks kernel data. `LDTR` uses user-mode permissions, so invalid or privileged addresses fault safely. ### 15.5 LDR (literal) — PC-Relative Load Loads a value from a fixed address relative to the current instruction. The assembler computes the offset from PC to the label automatically. Used to load constants from "literal pools" — small data areas placed near the code. ``` LDR Xt|XZR, label // Load 64 bits from PC + offset (±1 MB) LDR Wt|WZR, label // Load 32 bits from PC + offset LDR Sd, label // Load single-precision FP from PC + offset LDR Dd, label // Load double-precision FP from PC + offset LDR Qd, label // Load 128-bit SIMD from PC + offset ``` ### 15.6 Alignment Requirements AArch64 is generally more tolerant of unaligned access than older ARM, but alignment still matters: **Default behavior**: Loads and stores to naturally-aligned addresses always work. For unaligned accesses, the behavior depends on the `SCTLR_EL1.A` bit (Alignment check enable). When A=0 (the default on Linux), most unaligned accesses work but are potentially slower — the CPU may split them into multiple bus transactions. When A=1, unaligned accesses generate an alignment fault exception. **What "naturally aligned" means**: An N-byte access is naturally aligned when the address is a multiple of N. 
So a 4-byte LDR W needs address % 4 == 0, an 8-byte LDR X needs address % 8 == 0, and a 16-byte LDP of two X registers needs address % 8 == 0 (aligned to the element size, not the pair size).

**Always require alignment** (checked independently of SCTLR.A, so these can fault even with general alignment checking disabled):
- `LDXR`/`STXR` (exclusive): must be naturally aligned or you get an alignment fault. The hardware's exclusive monitor only tracks aligned addresses.
- `LDAR`/`STLR` (acquire/release): must be naturally aligned.
- SP must be 16-byte aligned whenever it is used as a base address. This check has its own enable bits rather than following SCTLR.A: `SCTLR_EL1.SA0` (for EL0) and `SCTLR_ELx.SA` (for the current EL). Linux enables SA0 by default, so EL0 code faults on unaligned SP. Bare-metal or custom kernels may have it disabled.
- Atomic instructions (LSE: `LDADD`, `CAS`, `SWP`, etc.): must be naturally aligned.

**Follow SCTLR.A** (alignment-checked only when A=1):
- `LDP`/`STP`: the address should be aligned to the element size (8 for Xt, 4 for Wt). With SCTLR.A=0 (default on Linux), unaligned LDP/STP to Normal memory is architecturally permitted but may be slower or non-atomic. With SCTLR.A=1, unaligned LDP/STP faults. Best practice: always align.

**Why alignment matters for atomics**: The CPU guarantees atomicity only for aligned accesses at the natural size. An 8-byte store to an 8-byte-aligned address is guaranteed to be visible to other cores as a single atomic write. An unaligned 8-byte store might be split into two 4-byte stores, and another core could observe half the old value and half the new value (a torn read).

### 15.7 PRFM — Prefetch Memory

`PRFM` (Prefetch Memory) is a **hint** that tells the CPU to start loading data into cache before you actually need it. This can hide memory latency for predictable access patterns.
The CPU is free to ignore the hint — it should not be relied on for architectural effects, and in normal operation it does not generate visible faults for invalid addresses, though the ARM ARM does not guarantee this unconditionally across all implementations and configurations. ``` PRFM <prfop>, [Xn|SP{, #pimm}] // Unsigned offset (scaled by 8, range 0–32760; when omitted, offset is 0) PRFM <prfop>, [Xn|SP, Xm|XZR{, LSL #0|LSL #3|SXTX #0|SXTX #3}] // Register offset PRFM <prfop>, [Xn|SP, Wm|WZR, SXTW #0|SXTW #3|UXTW #0|UXTW #3] // Extended register PRFM <prfop>, label // PC-relative literal (±1 MB) PRFUM <prfop>, [Xn|SP{, #simm9}] // Unscaled signed offset (−256 to +255; when omitted, offset is 0) ``` `<prfop>` must be one of the values from this table: | Operation | Meaning | |---|---| | **PLD — Prefetch for Load (data)** | | | `PLDL1KEEP` | Load, L1, temporal (expect reuse) | | `PLDL1STRM` | Load, L1, streaming (one-time use) | | `PLDL2KEEP` | Load, L2, temporal | | `PLDL2STRM` | Load, L2, streaming | | `PLDL3KEEP` | Load, L3, temporal | | `PLDL3STRM` | Load, L3, streaming | | **PLI — Prefetch for Instruction fetch** | | | `PLIL1KEEP` | Instruction, L1, temporal | | `PLIL1STRM` | Instruction, L1, streaming | | `PLIL2KEEP` | Instruction, L2, temporal | | `PLIL2STRM` | Instruction, L2, streaming | | `PLIL3KEEP` | Instruction, L3, temporal | | `PLIL3STRM` | Instruction, L3, streaming | | **PST — Prefetch for Store (get exclusive ownership)** | | | `PSTL1KEEP` | Store, L1, temporal | | `PSTL1STRM` | Store, L1, streaming | | `PSTL2KEEP` | Store, L2, temporal | | `PSTL2STRM` | Store, L2, streaming | | `PSTL3KEEP` | Store, L3, temporal | | `PSTL3STRM` | Store, L3, streaming | The encoding is a 5-bit value: bits [4:3] select the type (00=PLD, 01=PLI, 10=PST), bits [2:1] select the cache level (00=L1, 01=L2, 10=L3), bit [0] selects the policy (0=KEEP temporal, 1=STRM streaming). 
Values not listed above (6–7, 14–15, 22–31) are reserved and should use the numeric `#imm5` form.

**When to use**: Prefetch helps when you know you'll access memory in a predictable pattern (e.g., walking an array) and the hardware prefetcher is not already keeping up with it. Typical use: prefetch a fixed distance ahead of the current position in a loop. The best distance depends on memory latency and on how much work each iteration does, so measure rather than guess.

```asm
// Prefetch 256 bytes ahead while processing an array:
loop:
    PRFM PLDL1KEEP, [X0, #256]   // Hint: fetch data 256 bytes ahead
    LDR  X1, [X0]                // Process current element
    // ... work with X1 ...
    ADD  X0, X0, #8
    CMP  X0, X2
    B.LO loop                    // Unsigned compare: X0 and X2 are addresses
```

**Why PST (store prefetch) exists**: When you're about to write to a cache line, the CPU needs exclusive ownership of it (the MESI/MOESI "E" or "M" state). PST tells the CPU to acquire ownership early, avoiding a stall when the store actually happens. Useful for zeroing large buffers or initializing arrays.

**PRFM IR: intent-to-read prefetch (FEAT_PCDPHINT, optional from Armv9.0-A)**: DDI 0487 M.b adds an `IR` ("intent to read") variant to `PRFM (immediate)`, governed by FEAT_PCDPHINT. The ordinary prefetch hints just say "I'll touch this address soon, pull it toward this cache level." The `IR` hint instead says the current PE *will read* a location that **might not yet hold the value it needs**, and the memory system should optimize for **low-latency delivery once another observer's write lands**. It is a producer/consumer data-placement hint: the consumer marks the line so the eventual producer write is forwarded quickly. The `IR` type carries no `<target>` or `<policy>` (its `Rt[2:0]` are `000`), and on a core without FEAT_PCDPHINT it executes as a NOP. FEAT_PCDPHINT also adds the `STSHH` store hint; PMU event `0x00B3 PRFM_IR_SPEC` counts IR prefetches.
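
The 5-bit `<prfop>` layout described earlier in this section (type in bits [4:3], cache level in bits [2:1], policy in bit [0]) can be sketched in C as follows. The enum and function names are hypothetical and exist only for this illustration; the sketch covers the named operations in the table and does not model the `IR` variant.

```c
#include <assert.h>
#include <stdint.h>

enum prf_type   { PRF_PLD = 0, PRF_PLI = 1, PRF_PST = 2 };  /* bits [4:3] */
enum prf_policy { PRF_KEEP = 0, PRF_STRM = 1 };             /* bit [0]    */

/* Build the 5-bit prfop value. level is the cache level, 1 to 3. */
unsigned encode_prfop(enum prf_type type, int level, enum prf_policy policy) {
    assert(level >= 1 && level <= 3);
    return ((unsigned)type << 3) | ((unsigned)(level - 1) << 1) | (unsigned)policy;
}

/* Returns 1 if the 5-bit value corresponds to a named operation in the
   table, 0 if it falls in one of the reserved ranges. */
int prfop_is_named(unsigned prfop) {
    assert(prfop <= 31);
    unsigned type  = (prfop >> 3) & 3;
    unsigned level = (prfop >> 1) & 3;
    return type != 3 && level != 3;
}
```

Under this model `PLDL1KEEP` encodes as 0, `PLIL1KEEP` as 8, and `PSTL3STRM` as 21, and the values the text lists as reserved are exactly those with type `11` or level `11`.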
### 15.8 MOPS — Hardware memcpy / memset (FEAT_MOPS) **FEAT_MOPS** (optional from ARMv8.7-A, mandatory from ARMv8.8-A / ARMv9.3-A) adds dedicated instructions that implement `memcpy()`, `memmove()`, and `memset()` in hardware — avoiding the code-size and branch-prediction cost of a hand-tuned software loop. Each logical operation is broken into a **three-instruction sequence** (Prologue → Main → Epilogue) that must appear consecutively in program order. The CPU is allowed to do implementation-defined amounts of work in each step; the instructions are **interruptible** between bytes (architectural state is kept coherent in the GPR operands, so the OS's interrupt handler can save/restore and resume correctly). There are four mnemonic families: | Family | Purpose | Direction | |---|---|---| | `CPYFP` / `CPYFM` / `CPYFE` | memcpy, **forward-only** — no overlap OR dst < src | forward | | `CPYP` / `CPYM` / `CPYE` | memcpy / memmove, **overlap-safe** — CPU picks fwd/rev from the args | either | | `SETP` / `SETM` / `SETE` | memset | byte-at-a-time, forward | | `SETGP` / `SETGM` / `SETGE` | memset with MTE tag update (requires FEAT_MTE) | forward | All registers use **writeback addressing** (the `!` syntax) — the instruction updates its operands in place so the Main stage can pick up exactly where Prologue left off. ```asm // Forward-only memcpy. Use when src and dst are known not to overlap (or dst < src). CPYFP [Xd]!, [Xs]!, Xn! // Prologue: preconditions args, copies an impl-def amount CPYFM [Xd]!, [Xs]!, Xn! // Main: the bulk of the copy (impl-def amount) CPYFE [Xd]!, [Xs]!, Xn! // Epilogue: finishes the copy, clears remaining Xn to 0 // After the sequence: Xd = orig_dst + size, Xs = orig_src + size, Xn = 0. // Overlap-safe memcpy / memmove. CPU inspects src,dst,size and picks forward or reverse. CPYP [Xd]!, [Xs]!, Xn! // Prologue: determine direction, precondition CPYM [Xd]!, [Xs]!, Xn! // Main CPYE [Xd]!, [Xs]!, Xn! 
// Epilogue

// memset: Xs is the byte value to splat (only low 8 bits used). Xs is read-only
// (unlike Xd/Xn which are writeback), so Xs=XZR is legitimate and encodes memset-to-zero.
SETP  [Xd]!, Xn!, Xs|XZR     // Prologue
SETM  [Xd]!, Xn!, Xs|XZR     // Main
SETE  [Xd]!, Xn!, Xs|XZR     // Epilogue

// Tagged memset (FEAT_MOPS + FEAT_MTE): also writes the allocation tag for each 16-byte granule.
SETGP [Xd]!, Xn!, Xs|XZR
SETGM [Xd]!, Xn!, Xs|XZR
SETGE [Xd]!, Xn!, Xs|XZR
```

**Register constraints**:
- `Xd`, `Xs`, `Xn` must be **three distinct registers**. Any overlap is CONSTRAINED UNPREDICTABLE.
- The writeback operands cannot be `XZR`, since writing back to XZR makes no sense. That covers `Xd`, `Xs`, and `Xn` for the copy families, and `Xd` and `Xn` for the set families. The one exception is the read-only data operand `Xs` of `SETP`/`SETM`/`SETE` (and the `SETG*` forms), which may be `XZR` as shown above.
- None of them can be `SP`. (SP has no writeback form for this instruction class.)
- The prologue/main/epilogue **must** use the same three registers.

**Why the three-instruction split**: the architecture wants to allow implementations to trade off between in-order and out-of-order dispatch, cache-line alignment handling, and interrupt-responsiveness. By splitting the work across three instructions with writeback, an interrupt between (say) Main and Epilogue leaves the architectural registers holding exactly the "remaining work," so the handler can re-enter the instruction sequence and the Epilogue correctly finishes only what's left. Contrast this with a CISC-style single-instruction memcpy, where an interrupt mid-copy would require hidden microarchitectural state that's hard to virtualize.

**Option A vs Option B**: `CPYFP` / `CPYP` / `SETP` set `PSTATE.C` to 0 or 1 to select one of two IMPLEMENTATION DEFINED option encodings ("option A" or "option B") that affect exactly how the P/M/E stages update `Xd`/`Xs`/`Xn` between steps. Both options produce the same final memory state; they differ only in transient register contents.
Code that reads `Xd`/`Xs`/`Xn` **between** the three instructions (e.g., in an interrupt handler) must not assume one option or the other. Code that reads them **only after** `CPYFE`/`CPYE`/`SETE` always sees the same final values (dst += size, src += size, Xn = 0). **Detection** — `ID_AA64ISAR2_EL1.MOPS` (bits [19:16]). Linux surfaces this as `HWCAP2_MOPS`. GCC 12+ emits MOPS when built with `-march=armv8.8-a` (or `armv8.7-a+mops`); LLVM 15+ likewise with the same flags. glibc's `memcpy`/`memmove`/`memset` resolvers pick the MOPS implementation at load time when the HWCAP bit is set. **RE tip**: if you see three consecutive instructions `CPYFP`/`CPYFM`/`CPYFE` (or `SETP`/`SETM`/`SETE`) in disassembly with writeback on three distinct X-registers, it's a hardware memcpy or memset. A standalone `CPYFP` without the follow-ups is either dead code, a compiler bug, or an interrupt-interrupted copy caught mid-sequence; running it in isolation would produce IMPLEMENTATION DEFINED results. ### 15.9 LS64 — Atomic 64-Byte Loads/Stores (FEAT_LS64) **FEAT_LS64** (optional from Armv8.6-A) adds instructions that transfer a full 64-byte cache line atomically between memory and 8 consecutive general-purpose registers. The primary use case is interfacing with accelerators (GPUs, NICs, storage controllers) that expose wide MMIO registers expecting an atomic 64-byte write — previously the CPU had to issue 8 separate 8-byte stores with no guarantee that the accelerator would see them as a single unit. ```asm LD64B Xt, [Xn|SP] // Atomic 64-byte load: Xt..X(t+7) ← 64 bytes at [Xn] ST64B Xt, [Xn|SP] // Atomic 64-byte store: 64 bytes at [Xn] ← Xt..X(t+7) ST64BV Xs|XZR, Xt, [Xn|SP] // Atomic 64-byte store with status return (FEAT_LS64_V). Xs receives the // accelerator-defined status value (0 = accepted). Xt..X(t+7) is payload. ST64BV0 Xs|XZR, Xt, [Xn|SP] // Same, but carries the per-thread ACCDATA_EL1 control value alongside // the store (FEAT_LS64_ACCDATA). 
Used by virtualization to tag stores. // Constraint: Xt must be an even-numbered register AND the resulting 8-register group // Xt..X(t+7) must all be valid writable GPRs. Since X31 is XZR (not writable as a data register // in this context), Xt+7 ≤ 30 is required — so Xt ∈ {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22}. // Xt=24 would require X24..X31, trying to write the eighth register into XZR, which is rejected. // Xt=XZR(31) is odd and also rejected. Both Xn and [Xn] must be 64-byte aligned; misaligned // access is CONSTRAINED UNPREDICTABLE. // These are NOT general-purpose memcpy primitives — their utility is tied to accelerator MMIO regions // that the system has marked as LS64-capable in the memory type. Targeting ordinary DRAM works but // doesn't give any atomicity guarantee beyond the usual single-copy atomicity of aligned 8-byte accesses. ``` **Detection** — `ID_AA64ISAR1_EL1.LS64` (bits [63:60]). Values: `0b0001` = LS64 only, `0b0010` = +LS64_V (ST64BV), `0b0011` = +LS64_ACCDATA (ST64BV0). Linux surfaces as `HWCAP2_LS64` / `HWCAP2_LS64_V` / `HWCAP2_LS64_ACCDATA`. --- ## 16. Load/Store Pair, Non-Temporal & Exclusive Extensions to the basic load/store: pair operations (load/store two registers at once), non-temporal hints (bypass cache), and exclusive access (for implementing atomics). ### 16.1 LDP / STP — Load/Store Pair `LDP` (Load Pair) loads two registers from consecutive memory locations in a single instruction. `STP` (Store Pair) stores two registers. These are more efficient than two separate loads/stores, and they are the standard way to save and restore registers in function prologues and epilogues. ``` LDP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // Signed offset (multiple of 8, range −512 to +504; when omitted, offset is 0) LDP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! 
// Pre-index LDP Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index STP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // Signed offset (multiple of 8, range −512 to +504; when omitted, offset is 0) STP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! // Pre-index STP Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index ``` **32-bit pair forms:** ``` LDP Wt1|WZR, Wt2|WZR, [Xn|SP{, #simm}] // Signed offset (multiple of 4, range −256 to +252; when omitted, offset is 0) LDP Wt1|WZR, Wt2|WZR, [Xn|SP, #simm]! // Pre-index LDP Wt1|WZR, Wt2|WZR, [Xn|SP], #simm // Post-index STP Wt1|WZR, Wt2|WZR, [Xn|SP{, #simm}] // Signed offset (multiple of 4, range −256 to +252; when omitted, offset is 0) STP Wt1|WZR, Wt2|WZR, [Xn|SP, #simm]! // Pre-index STP Wt1|WZR, Wt2|WZR, [Xn|SP], #simm // Post-index ``` **LDPSW — Load Pair Signed Word (sign-extend each 32-bit value to 64):** ``` LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // Signed offset (multiple of 4, range −256 to +252; when omitted, offset is 0) LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! // Pre-index LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index ``` **FP/SIMD pair forms:** ``` // Single-precision pairs (offset: multiple of 4, range −256 to +252): LDP St1, St2, [Xn|SP{, #simm}] // Signed offset (multiple of 4, range −256 to +252; when omitted, offset is 0) LDP St1, St2, [Xn|SP, #simm]! // Pre-index LDP St1, St2, [Xn|SP], #simm // Post-index STP St1, St2, [Xn|SP{, #simm}] // Signed offset (multiple of 4, range −256 to +252; when omitted, offset is 0) STP St1, St2, [Xn|SP, #simm]! // Pre-index STP St1, St2, [Xn|SP], #simm // Post-index // Double-precision pairs (offset: multiple of 8, range −512 to +504): LDP Dt1, Dt2, [Xn|SP{, #simm}] // Signed offset (multiple of 8, range −512 to +504; when omitted, offset is 0) LDP Dt1, Dt2, [Xn|SP, #simm]! // Pre-index LDP Dt1, Dt2, [Xn|SP], #simm // Post-index STP Dt1, Dt2, [Xn|SP{, #simm}] // Signed offset (multiple of 8, range −512 to +504; when omitted, offset is 0) STP Dt1, Dt2, [Xn|SP, #simm]! 
// Pre-index STP Dt1, Dt2, [Xn|SP], #simm // Post-index // Quad pairs (offset: multiple of 16, range −1024 to +1008): LDP Qt1, Qt2, [Xn|SP{, #simm}] // Signed offset (multiple of 16, range −1024 to +1008; when omitted, offset is 0) LDP Qt1, Qt2, [Xn|SP, #simm]! // Pre-index LDP Qt1, Qt2, [Xn|SP], #simm // Post-index STP Qt1, Qt2, [Xn|SP{, #simm}] // Signed offset (multiple of 16, range −1024 to +1008; when omitted, offset is 0) STP Qt1, Qt2, [Xn|SP, #simm]! // Pre-index STP Qt1, Qt2, [Xn|SP], #simm // Post-index ``` **What the hardware actually encodes**: Like LDR, the offset is stored divided by the access size. LDP has a 7-bit signed offset field. For 64-bit pairs: the hardware stores `offset ÷ 8`, so the byte offset must be a multiple of 8 (range: −512 to +504, since a 7-bit signed value is −64 to +63, times 8). For 32-bit pairs: `offset ÷ 4` (range: −256 to +252). So `STP X29, X30, [SP, #-16]!` encodes the offset as −16 ÷ 8 = −2. **Gotcha**: `LDP Xt1|XZR, Xt2|XZR, [Xn|SP]` — the two destination registers `Xt1` and `Xt2` **must be different**. `LDP X0, X0, [X1]` is unpredictable (the CPU doesn't know which value to keep in X0). The base register can be **SP** (this is the standard prologue/epilogue pattern). The data registers can be **XZR**: for `LDP`, use it for one destination to discard that loaded value (naming XZR for both makes Rt1 = Rt2 = 31, which falls under the same must-be-different rule); for `STP`, either or both sources may be XZR to store zeros. **Writeback constraint**: For both LDP and STP with pre/post-index (`!` or post-index form), the base register `Xn` must not be the same as either data register. For LDP, this is because both the loaded value and the updated base address would try to write to the same register. For STP, the CPU might update the base before reading the data register, corrupting the stored value. Violating this is CONSTRAINED UNPREDICTABLE — it may work on one implementation and fail on another. **Function prologue/epilogue pattern:** ```asm // Prologue: save FP and LR STP X29, X30, [SP, #-16]!
// Push FP and LR, decrement SP MOV X29, SP // Set frame pointer // Epilogue: restore FP and LR LDP X29, X30, [SP], #16 // Pop FP and LR, increment SP RET ``` **Traced prologue/epilogue (what REALLY happens to memory):** ``` // Initial state: SP = 0x4000, X29 = 0xOLD_FP, X30 = 0xRETURN_ADDR STP X29, X30, [SP, #-16]! // Step 1: SP = 0x4000 - 16 = 0x3FF0 (pre-decrement) // Step 2: mem[0x3FF0] = X29 = 0xOLD_FP (first register at lower address) // mem[0x3FF8] = X30 = 0xRETURN_ADDR (second register at higher address) // SP is now 0x3FF0 MOV X29, SP // X29 = 0x3FF0 (frame pointer points to saved FP/LR pair) // ... function body uses X29-relative offsets for local variables ... LDP X29, X30, [SP], #16 // Step 1: X29 = mem[0x3FF0] = 0xOLD_FP (restore from lower address) // X30 = mem[0x3FF8] = 0xRETURN_ADDR (restore from higher address) // Step 2: SP = 0x3FF0 + 16 = 0x4000 (post-increment, SP restored) RET // Branch to X30 = 0xRETURN_ADDR ``` ### 16.2 LDNP / STNP — Non-Temporal Pair "Non-temporal" means the CPU is told this data won't be needed again soon. The CPU may skip caching it, which avoids polluting the cache during large streaming operations like copying a big buffer. 
``` LDNP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // NT load pair 64 (multiple of 8, −512 to +504; when omitted, offset is 0) STNP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // NT store pair 64 (multiple of 8, −512 to +504; when omitted, offset is 0) LDNP Wt1|WZR, Wt2|WZR, [Xn|SP{, #simm}] // NT load pair 32 (multiple of 4, −256 to +252; when omitted, offset is 0) STNP Wt1|WZR, Wt2|WZR, [Xn|SP{, #simm}] // NT store pair 32 (multiple of 4, −256 to +252; when omitted, offset is 0) LDNP St1, St2, [Xn|SP{, #simm}] // NT load pair FP32 (multiple of 4, −256 to +252; when omitted, offset is 0) LDNP Dt1, Dt2, [Xn|SP{, #simm}] // NT load pair FP64 (multiple of 8, −512 to +504; when omitted, offset is 0) LDNP Qt1, Qt2, [Xn|SP{, #simm}] // NT load pair FP128 (multiple of 16, −1024 to +1008; when omitted, offset is 0) STNP St1, St2, [Xn|SP{, #simm}] // NT store pair FP32 (multiple of 4, −256 to +252; when omitted, offset is 0) STNP Dt1, Dt2, [Xn|SP{, #simm}] // NT store pair FP64 (multiple of 8, −512 to +504; when omitted, offset is 0) STNP Qt1, Qt2, [Xn|SP{, #simm}] // NT store pair FP128 (multiple of 16, −1024 to +1008; when omitted, offset is 0) ``` The offset encoding is the same as LDP/STP (7-bit signed, scaled by element size). Only the signed-offset form exists — no pre-index or post-index. ### 16.3 LDXR / STXR — Exclusive (for atomics) Exclusive loads and stores are the building blocks for lock-free atomic operations. `LDXR` (Load Exclusive) reads a value from memory and sets up an **exclusive monitor** — a hardware mechanism that watches the address. `STXR` (Store Exclusive) attempts to write back — but it only succeeds if no other CPU core has written to that address since the `LDXR`. If it fails, the status register `Ws` is set to 1; if it succeeds, `Ws` is 0. You retry the whole sequence until the store succeeds. **Why this works**: The exclusive monitor is a simple 1-bit flag per core (plus the tracked address). `LDXR` sets the flag. 
Any write to the tracked **Exclusives Reservation Granule (ERG)** by any core (including DMA devices) clears it. `STXR` checks the flag — if clear, someone else modified the data, so the store is aborted. This is how you build atomic read-modify-write without locks. **ERG size (a real gotcha)**: The ERG is the granularity at which the exclusive monitor tracks addresses. It is **IMPLEMENTATION DEFINED** — the architecture permits granule sizes from 16 bytes to 2 KB (4 to 512 words). Typically it equals the cache line size (often 64 bytes on Cortex-A cores), but you cannot assume this. Read `CTR_EL0.ERG` (bits [23:20]) to get log2(words) of the ERG at runtime. The practical consequence: if two independent lock variables happen to fall within the same ERG, exclusive operations on one will clear the monitor for the other, causing spurious STXR failures and false sharing. Always align lock variables to at least `1 << (CTR_EL0.ERG + 2)` bytes, or a conservative 128/256 bytes for portable code. ``` LDXR Xt|XZR, [Xn|SP] // Load exclusive 64-bit (start exclusive monitor) LDXR Wt|WZR, [Xn|SP] // Load exclusive 32-bit STXR Ws|WZR, Xt|XZR, [Xn|SP] // Store exclusive 64-bit (Ws = 0 if success, 1 if failed) STXR Ws|WZR, Wt|WZR, [Xn|SP] // Store exclusive 32-bit LDXRB Wt|WZR, [Xn|SP] // Load exclusive byte STXRB Ws|WZR, Wt|WZR, [Xn|SP] // Store exclusive byte LDXRH Wt|WZR, [Xn|SP] // Load exclusive halfword STXRH Ws|WZR, Wt|WZR, [Xn|SP] // Store exclusive halfword LDXP Xt1|XZR, Xt2|XZR, [Xn|SP] // Load exclusive pair 64-bit (see atomicity note below) LDXP Wt1|WZR, Wt2|WZR, [Xn|SP] // Load exclusive pair 32-bit STXP Ws|WZR, Xt1|XZR, Xt2|XZR, [Xn|SP] // Store exclusive pair 64-bit STXP Ws|WZR, Wt1|WZR, Wt2|WZR, [Xn|SP] // Store exclusive pair 32-bit ``` **Alignment requirement**: The address `[Xn|SP]` **must be naturally aligned** — aligned to the access size. For single-register forms: 4 bytes for Wt, 8 bytes for Xt, 2 bytes for LDXRH, 1 byte for LDXRB. 
For pair forms: **8 bytes for LDXP/STXP of Wt1,Wt2** (64-bit pair), **16 bytes for LDXP/STXP of Xt1,Xt2** (128-bit pair). Unaligned exclusive access generates an alignment fault regardless of `SCTLR.A`. This is because the exclusive monitor tracks at ERG granularity (see above — typically but not necessarily cache-line-sized), and unaligned accesses could span two tracked granules, making atomicity impossible.

**Rules for the exclusive sequence** (violating these may cause the store to always fail):

1. The LDXR and STXR must target the **same address and size**.
2. Between LDXR and STXR, **avoid** accessing other memory locations — other loads/stores may cause the exclusive monitor to be cleared on some implementations, which makes the STXR fail and forces a retry. The ARM architecture permits (but does not require) the monitor to be cleared by any other memory access, so keeping the sequence to pure register operations maximizes portability and success rate.
3. Do not branch to code that might be context-switched (the OS clears the monitor on context switch via `CLREX`).
4. Keep the sequence **short** — long sequences increase the chance of another core invalidating the monitor.
5. `STXR`'s status register `Ws` **must be a different register** from both `Xt` (data) and `Xn` (address base). If they overlap, the behavior is constrained unpredictable — it might work on one CPU and fail on another. Note: `Ws` can technically be WZR (to discard the status), but then you can't check whether the store succeeded, which makes the exclusive useless. WZR with an SP base is not an overlap: both encode register number 31, but WZR and SP are different registers, and the `Rs ≠ Rn` check in the architecture pseudocode applies only when `Rn` is not 31.
6. **Don't nest LDXR**: A second `LDXR` to a different address cancels the first exclusive monitor. There's only one monitor per core — the last LDXR wins.
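The ERG and natural-alignment arithmetic from this section can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: the function names are hypothetical, and the `ctr_el0` argument is assumed to be a raw register value obtained elsewhere (for example via `MRS X0, CTR_EL0`), since Python cannot read it directly.

```python
def erg_bytes(ctr_el0: int) -> int:
    """Decode the Exclusives Reservation Granule size (in bytes) from CTR_EL0.

    Hypothetical helper: ctr_el0 is a raw register value supplied by the caller.
    """
    erg = (ctr_el0 >> 20) & 0xF      # CTR_EL0.ERG lives in bits [23:20]
    if erg == 0:
        # A zero field means the size is not reported, so assume the
        # architectural maximum of 512 words (2 KB).
        return 2048
    return 4 << erg                  # log2(words) -> bytes, 1 word = 4 bytes


def exclusive_alignment(access_bytes: int, pair: bool = False) -> int:
    """Natural alignment for an exclusive access; pair forms need twice the element size."""
    return access_bytes * 2 if pair else access_bytes


def is_valid_exclusive_address(addr: int, access_bytes: int, pair: bool = False) -> bool:
    """True if addr satisfies the natural-alignment rule for LDXR/STXR/LDXP/STXP."""
    return addr % exclusive_alignment(access_bytes, pair) == 0


# ERG field = 4 means 2**4 words = 64 bytes, the common cache-line-sized case.
typical_erg = erg_bytes(4 << 20)
# A 64-bit LDXP needs 16-byte alignment, so 0x1008 is rejected and 0x1010 accepted.
pair_at_8 = is_valid_exclusive_address(0x1008, 8, pair=True)
pair_at_16 = is_valid_exclusive_address(0x1010, 8, pair=True)
```

Padding each lock variable to `erg_bytes(...)` is what keeps two independent locks out of the same granule.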
**Why LDXR/STXR (instead of just CAS)?** Base ARMv8.0 shipped without single-instruction CAS because the exclusive pair approach is simpler in hardware — the CPU just needs a "monitor" flag per tracked granule (ERG), not a full read-modify-write pipeline. LDXR/STXR also works for arbitrary read-modify-write patterns (not just compare-and-swap). CAS was added later in ARMv8.1 (LSE) because the exclusive retry loop wastes bus bandwidth under high contention — see §24.1.

**Classic exclusive retry loop (atomic increment):**

```asm
// Atomically increment [X0]:
retry:
    LDXR  X1, [X0]        // Load exclusive
    ADD   X1, X1, #1      // Modify
    STXR  W2, X1, [X0]    // Store exclusive
    CBNZ  W2, retry       // Retry if store failed
```

**Traced execution with contention (what REALLY happens):**

```
// [X0] = 42 initially. Core A and Core B both try to increment.

Core A:                            Core B:
LDXR X1, [X0]    → X1=42           LDXR X1, [X0]    → X1=42
(monitor set on ERG)               (monitor set on ERG)
ADD  X1, X1, #1  → X1=43           ADD  X1, X1, #1  → X1=43
STXR W2, X1, [X0]
  W2=0 (success! first to store)
  [X0] = 43
                                   STXR W2, X1, [X0]
                                     W2=1 (FAIL — Core A's store cleared our monitor)
                                   CBNZ W2, retry → back to LDXR
                                   LDXR X1, [X0]    → X1=43 (sees Core A's write)
                                   ADD  X1, X1, #1  → X1=44
                                   STXR W2, X1, [X0]
                                     W2=0 (success)
                                     [X0] = 44

// Final: [X0] = 44. Both increments applied. No lost update.
```

**CLREX — Clear Exclusive Monitor:**

```
CLREX    // Clear the local exclusive monitor without storing
```

The OS kernel uses `CLREX` during context switches to ensure a thread doesn't carry a stale exclusive state from before it was scheduled out.
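To make the monitor behaviour concrete, here is a toy Python model that replays the two-core trace above. It is a sketch under stated assumptions, not a description of any real implementation: the `ToyExclusiveMemory` class, its method names, and the fixed 64-byte granule are all hypothetical, and real hardware is allowed to clear monitors in more situations than this model does.

```python
class ToyExclusiveMemory:
    """Hypothetical model: one exclusive monitor per core, tracked per granule."""

    def __init__(self, granule_bytes: int = 64):
        self.cells = {}                  # address -> value
        self.granule_bytes = granule_bytes
        self.monitors = {}               # core id -> reserved granule index

    def _granule(self, addr: int) -> int:
        return addr // self.granule_bytes

    def ldxr(self, core: str, addr: int) -> int:
        # A new LDXR replaces any earlier reservation held by this core.
        self.monitors[core] = self._granule(addr)
        return self.cells.get(addr, 0)

    def stxr(self, core: str, addr: int, value: int) -> int:
        granule = self._granule(addr)
        if self.monitors.get(core) != granule:
            self.monitors.pop(core, None)
            return 1                     # status 1: store did not happen
        self.cells[addr] = value
        # The write clears every reservation on this granule, including our own.
        for other in [c for c, g in self.monitors.items() if g == granule]:
            del self.monitors[other]
        return 0                         # status 0: store succeeded

    def clrex(self, core: str) -> None:
        self.monitors.pop(core, None)


mem = ToyExclusiveMemory()
mem.cells[0x1000] = 42

a = mem.ldxr("A", 0x1000)                          # Core A reads 42
b = mem.ldxr("B", 0x1000)                          # Core B reads 42
status_a = mem.stxr("A", 0x1000, a + 1)            # A stores 43 first
status_b = mem.stxr("B", 0x1000, b + 1)            # B lost its reservation
b = mem.ldxr("B", 0x1000)                          # B retries and sees A's write
status_b_retry = mem.stxr("B", 0x1000, b + 1)      # B stores 44
final_value = mem.cells[0x1000]
```

The model also shows why `CLREX` on a context switch is safe: calling `clrex` only makes the next `stxr` report failure, which the retry loop already handles.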
**Exclusive Pair — 128-bit atomics:**

```asm
LDXP   Xt1|XZR, Xt2|XZR, [Xn|SP]            // Load exclusive pair 64-bit (128 bits total)
LDXP   Wt1|WZR, Wt2|WZR, [Xn|SP]            // Load exclusive pair 32-bit (64 bits total)
STXP   Ws|WZR, Xt1|XZR, Xt2|XZR, [Xn|SP]    // Store exclusive pair 64-bit (Ws = 0 success, 1 fail)
STXP   Ws|WZR, Wt1|WZR, Wt2|WZR, [Xn|SP]    // Store exclusive pair 32-bit
LDAXP  Xt1, Xt2, [Xn|SP]                    // Load-acquire exclusive pair 64-bit (see XZR caveat in §24.2)
LDAXP  Wt1, Wt2, [Xn|SP]                    // Load-acquire exclusive pair 32-bit
STLXP  Ws|WZR, Xt1|XZR, Xt2|XZR, [Xn|SP]    // Store-release exclusive pair 64-bit
STLXP  Ws|WZR, Wt1|WZR, Wt2|WZR, [Xn|SP]    // Store-release exclusive pair 32-bit
```

These load/store two 64-bit registers as a pair. Used for lock-free 128-bit operations (e.g., doubly-linked list insertion where you need to atomically update both a pointer and a counter). The address must be 16-byte aligned. On ARMv8.1+ with LSE, `CASP` (compare-and-swap pair) is preferred for 128-bit CAS.

**Register-distinctness constraints (CONSTRAINED UNPREDICTABLE if violated):**

- `LDXP` / `LDAXP`: `Rt1 ≠ Rt2`. `LDXP X0, X0, [X1]` is unpredictable — the two destinations must be different registers. Using XZR for both (encoding 31,31) also violates this.
- `STXP` / `STLXP`: `Rs ≠ Rt1`, `Rs ≠ Rt2`, and `Rs ≠ Rn` (same pattern as STXR's rule 5 but extended to the second data register). `Rs = 31` (WZR) combined with `Rn = 31` (SP) does not count as an overlap: WZR and SP are different registers, and the architecture's `Rs ≠ Rn` check applies only when `Rn` is not 31.

**Atomicity caveat (important!)**: On base ARMv8.0–8.3, `LDXP` of two Xt registers is **not** single-copy atomic as a 128-bit value — each 64-bit half is atomic separately, but the pair can **tear** (another core's write could land between the two halves). True 128-bit atomicity requires either:

- **FEAT_LSE** (optional from ARMv8.0, mandatory from ARMv8.1): use `CASP` for atomic 128-bit CAS.
- **FEAT_LSE2** (optional from ARMv8.2, mandatory from ARMv8.4): 16-byte aligned `LDP`/`STP`/`LDXP`/`STXP` of 64-bit register pairs become single-copy atomic for the full 128 bits. - **FEAT_LSE128** (optional from Armv9.3-A): `LDCLRP`/`LDSETP`/`SWPP` give 128-bit atomic RMW. Note: FEAT_LSE128 implies FEAT_LSE, but not FEAT_LSE2. Raymond Chen's aarch64 series puts it plainly: "The entire 128-bit value is not loaded atomically; instead, each 64-bit portion is loaded atomically separately. You can still get tearing between the two registers. [The load is required to be fully atomic starting with ARMv8.4.]" Portable code targeting pre-8.4 cores must use the `LDXP`/`STXP` retry loop or `CASP` (LSE) to simulate atomicity via the monitor — the retry mechanism catches tears because any intervening write clears the monitor and the STXP fails. --- ## 17. Branching & Control Flow Branches change the program counter (PC) — they make the CPU jump to a different instruction instead of continuing to the next one. They implement `if/else`, loops, and function calls. **Why so many branch types?** Each has a different range, and larger ranges require more encoding bits. `B` uses 26 bits for ±128 MB — enough for jumps within any reasonable function or between nearby functions. `B.cond` uses 19 bits for ±1 MB — conditions are usually short-range (within a function). `TBZ`/`TBNZ` uses only 14 bits for ±32 KB — testing a single bit is a tight, local operation. `CBZ`/`CBNZ` exist because "compare to zero and branch" is the single most common branch pattern in compiled code, and fusing it into one instruction saves both code size and branch predictor entries. `BR`/`BLR` use a full register for unlimited range — needed for function pointers, virtual dispatch, and PLT stubs. ### 17.1 Unconditional Branches `B` is a simple jump — go to a label unconditionally. `BL` ("Branch with Link") is a function call — it saves the return address in X30 before jumping, so `RET` can get back. 
`BR`/`BLR` are the same but take the target address from a register (indirect). ``` B label // Branch (PC-relative, ±128 MB) BL label // Branch with Link: X30 = return address, then branch (±128 MB) BR Xn|XZR // Branch to address in Xn (indirect) BLR Xn|XZR // Branch with Link to address in Xn RET {Xn|XZR} // Return: branch to Xn (default X30) // Functionally identical to BR X30, but hints branch predictor ``` **B vs BL vs BR vs BLR**: `B`/`BL` use an immediate offset (PC-relative, range limited). `BR`/`BLR` use a register (any address in the 64-bit space). `BL`/`BLR` save the return address in X30; `B`/`BR` don't. `RET` is functionally `BR X30` but gives the branch predictor a hint that this is a function return (not a computed jump), improving prediction accuracy. The `{Xn|XZR}` means the operand is optional — if omitted, it defaults to X30. In the encoding, the 5-bit Rn field always exists; `RET` without an operand and `RET X30` produce the same machine code (Rn = 11110). **No conditional call**: AArch64 has **no conditional BL** (no `BL.cond`). You cannot conditionally call a function in one instruction. To call conditionally, branch around the BL: `B.NE skip; BL func; skip:`. This is a deliberate simplification from AArch32 (where almost every instruction could be conditional). **What the hardware actually encodes**: Since all AArch64 instructions are 4 bytes and 4-byte aligned, the branch target is always a multiple of 4 bytes away. So the hardware stores the offset **divided by 4** (the instruction count, not the byte count). A 26-bit signed field holding instruction counts gives a range of ±2^25 instructions = ±33,554,432 instructions × 4 bytes = ±128 MB. The same trick applies to all PC-relative branches: `B.cond` stores a 19-bit instruction count (±1 MB), `TBZ`/`TBNZ` stores a 14-bit instruction count (±32 KB). **BL vs BLR**: Both store the return address in X30 (LR). `BL` is PC-relative, `BLR` is indirect. 
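The divide-by-4 offset encoding described above can be checked with a short Python sketch. The function name `encode_branch` is hypothetical; the opcode constants are the top six bits of the `B` (`000101`) and `BL` (`100101`) encodings, and the input is the byte distance from the branch instruction to its target.

```python
def encode_branch(byte_offset: int, link: bool = False) -> int:
    """Return the 32-bit encoding of B (or BL when link=True) for a PC-relative byte offset."""
    if byte_offset % 4 != 0:
        raise ValueError("branch targets are 4-byte aligned")
    imm26 = byte_offset // 4                      # instruction count, not byte count
    if not -(1 << 25) <= imm26 < (1 << 25):
        raise ValueError("target outside the +/-128 MB range of B/BL")
    opcode = 0x94000000 if link else 0x14000000   # BL vs B
    return opcode | (imm26 & 0x03FFFFFF)          # two's-complement imm26 in bits [25:0]


forward_two = encode_branch(8)             # B to the instruction two slots ahead
back_one = encode_branch(-4)               # B to the previous instruction
call_self = encode_branch(0, link=True)    # BL with a zero offset
```

The largest forward offset that still encodes is `(1 << 27) - 4` bytes, one instruction short of 128 MB, which is why the range is usually quoted as ±128 MB rather than as an exact symmetric bound.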
**RET vs BR X30**: Both branch to X30 (by default), but `RET` tells the branch predictor this is a function return, enabling the return address stack to predict correctly. Always use `RET` for function returns.

### 17.2 Conditional Branches

`B.cond` branches only if the condition (based on the NZCV flags) is true. The flags must be set by a prior instruction like `CMP`, `ADDS`, `SUBS`, `TST`, etc. If the condition is false, execution continues to the next instruction.

```
B.cond  label    // Branch if condition is true (±1 MB range)
```

Where `cond` is any condition code from the table in section 4 (`EQ`, `NE`, `LT`, `GE`, etc.).

### 17.3 Compare and Branch

`CBZ` (Compare and Branch if Zero) and `CBNZ` (Compare and Branch if Not Zero) combine a zero-test with a branch in a single instruction. They do NOT set the condition flags — they just test the register and branch. They save you from writing a separate `CMP Xn, #0` + `B.EQ`/`B.NE` pair.

```
CBZ   Xn|XZR, label    // Branch if Xn == 0 (PC-relative, ±1 MB range)
CBNZ  Xn|XZR, label    // Branch if Xn != 0 (PC-relative, ±1 MB range)
CBZ   Wn|WZR, label    // 32-bit: branch if Wn == 0 (tests only low 32 bits)
CBNZ  Wn|WZR, label    // 32-bit: branch if Wn != 0
```

### 17.4 Test Bit and Branch

`TBZ` (Test Bit and Branch if Zero) and `TBNZ` (Test Bit and Branch if Not Zero) test a single bit of a register and branch on it. Like `CBZ`/`CBNZ`, they do not touch the condition flags.

```
TBZ   Xn|XZR, #0-63, label    // Branch if bit #bit of Xn is 0 (±32 KB range)
TBNZ  Xn|XZR, #0-63, label    // Branch if bit #bit of Xn is 1
TBZ   Wn|WZR, #0-31, label    // 32-bit form
TBNZ  Wn|WZR, #0-31, label
```

**Encoding note**: The register width determines the valid bit range. `TBZ Wn, #bit` requires bit 0–31; `TBZ Xn, #bit` allows 0–63. Some assemblers/disassemblers always show the Xn form when bit >= 32 and the Wn form when bit <= 31, even if you wrote it differently.
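The reason a disassembler can pick W or X purely from the bit number becomes visible when the encoding is built by hand. The sketch below is illustrative only: `encode_tbz` is a hypothetical helper, and it assumes the standard field layout in which the bit number is split into `b5` (bit 31 of the instruction, where `sf` sits in most other encodings) and `b40` (bits [23:19]).

```python
def encode_tbz(rt: int, bit: int, byte_offset: int, nonzero: bool = False) -> int:
    """Return the 32-bit encoding of TBZ (or TBNZ when nonzero=True)."""
    if not 0 <= rt <= 31:
        raise ValueError("register number must be 0..31")
    if not 0 <= bit <= 63:
        raise ValueError("bit number must be 0..63")
    if byte_offset % 4 != 0:
        raise ValueError("branch targets are 4-byte aligned")
    imm14 = byte_offset // 4
    if not -(1 << 13) <= imm14 < (1 << 13):
        raise ValueError("target outside the +/-32 KB range of TBZ/TBNZ")
    b5 = bit >> 5                 # top bit of the bit number: 1 only for bits 32..63
    b40 = bit & 0x1F              # low five bits of the bit number
    op = 1 if nonzero else 0      # 0 = TBZ, 1 = TBNZ
    return ((b5 << 31) | (0b011011 << 25) | (op << 24)
            | (b40 << 19) | ((imm14 & 0x3FFF) << 5) | rt)


test_bit_31 = encode_tbz(0, 31, 8)                 # b5 = 0: there is no "X" marker to record
test_bit_63 = encode_tbz(0, 63, 8)                 # b5 = 1: only meaningful for a 64-bit register
test_lsb_set = encode_tbz(0, 0, 8, nonzero=True)   # TBNZ on bit 0
```

Because no separate width field exists, `TBZ X0, #31` and `TBZ W0, #31` assemble to the same word, and the tool printing it back has to choose one spelling.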
Very useful for testing a single flag bit: ```asm TBZ X0, #31, positive // Branch if bit 31 (sign bit of 32-bit) is 0 TBNZ X0, #0, is_odd // Branch if bit 0 (LSB — least significant bit, the rightmost bit) is 1 ``` Note the smaller range (±32 KB) compared to B.cond (±1 MB) or B (±128 MB). ### 17.5 Branch Ranges Summary | Instruction | Range | |---|---| | `B` / `BL` | ±128 MB | | `B.cond` / `CBZ` / `CBNZ` | ±1 MB | | `TBZ` / `TBNZ` | ±32 KB | | `CB<cc>` / `CBH<cc>` / `CBB<cc>` (FEAT_CMPBR) | ±1 KB | | `BR` / `BLR` / `RET` | Full 64-bit address space | If a conditional branch target is out of range, the assembler/linker may invert the condition and use a trampoline: ```asm // Instead of: B.EQ far_away (out of range) B.NE skip B far_away // unconditional B has ±128 MB range skip: ``` --- ### 17.6 Compare-and-Branch (FEAT_CMPBR) `CB<cc>` fuses a compare and a conditional branch into a single instruction (optional from Armv9.3-A, mandatory from Armv9.6-A). It reads two GPRs — or one GPR and a 6-bit immediate — performs the comparison internally, and branches on the result. Like `CBZ`/`CBNZ` and `TBZ`/`TBNZ`, it does **NOT** touch the condition flags; this is a pure control-flow op. > **Errata note (DDI 0487 M.b, R25595):** the published M.b manual prints "OPTIONAL from Armv9.5"; the Known-Issues list corrects the optional floor to **Armv9.3** (it remains mandatory from Armv9.6). The same erratum adds the implication: **if FEAT_CMPBR is implemented, then FEAT_CSSC (§14.6) is also implemented** — so a core with `CB<cc>` is guaranteed to have `ABS`/`CNT`/`SMAX`/`UMIN`/etc. too. 
``` CB<cc> Xn|XZR, Xm|XZR, label // Branch if Xn <cc> Xm (±1 KB range) CB<cc> Wn|WZR, Wm|WZR, label // 32-bit register-register compare CB<cc> Xn|XZR, #0-63, label // 64-bit register-immediate compare CB<cc> Wn|WZR, #0-63, label // 32-bit register-immediate compare CBH<cc> Wn|WZR, Wm|WZR, label // Compare low 16 bits of Wn and Wm CBB<cc> Wn|WZR, Wm|WZR, label // Compare low 8 bits of Wn and Wm // (CBH/CBB have register form only — no immediate variant) ``` Where `cc` is one of ten condition codes: `GT`, `GE`, `LT`, `LE`, `HI`, `HS` (= `CS`), `LO` (= `CC`), `LS`, `EQ`, `NE`. Unlike `B.cond`, no `MI`/`PL`/`VS`/`VC` — those test the N and V flags, which don't exist for a fused compare that doesn't produce flag output. **Encoding note — register forms** (CB, CBB, CBH): at the ISA level, only 6 distinct condition encodings exist — `GT, GE, HI, HS, EQ, NE`. The other 4 are ARM-ARM-documented pseudo-instructions (not toolchain extensions) that swap the two source operands and rewrite the condition to its mirror: | Source written | Encoded as | Reason | |---|---|---| | `CBLT Xn, Xm, label` | `CBGT Xm, Xn, label` | Xn < Xm ↔ Xm > Xn | | `CBLE Xn, Xm, label` | `CBGE Xm, Xn, label` | Xn ≤ Xm ↔ Xm ≥ Xn | | `CBLO Xn, Xm, label` | `CBHI Xm, Xn, label` | Xn unsigned< Xm ↔ Xm unsigned> Xn | | `CBLS Xn, Xm, label` | `CBHS Xm, Xn, label` | Xn unsigned≤ Xm ↔ Xm unsigned≥ Xn | `EQ` and `NE` need no remapping — equality is symmetric. **Encoding note — immediate form** (CB only): at the ISA level, only 6 distinct condition encodings exist — `GT, LT, HI, LO, EQ, NE` (note: `LT`/`LO` are primitives here, not `GE`/`HS`, because operand swap isn't available against an immediate). 
The other 4 are ARM-ARM-documented pseudo-instructions rewritten by adjusting the immediate: | Source written | Encoded as | Reason | |---|---|---| | `CBGE Xn, #k, label` | `CBGT Xn, #(k−1), label` | Xn ≥ k ↔ Xn > k−1 | | `CBHS Xn, #k, label` | `CBHI Xn, #(k−1), label` | Xn unsigned≥ k ↔ Xn unsigned> k−1 | | `CBLE Xn, #k, label` | `CBLT Xn, #(k+1), label` | Xn ≤ k ↔ Xn < k+1 | | `CBLS Xn, #k, label` | `CBLO Xn, #(k+1), label` | Xn unsigned≤ k ↔ Xn unsigned< k+1 | The `k±1` adjustment shifts the accepted source-immediate range per-condition — the raw encoding field is always 6-bit unsigned (0..63), but what you can write at the source level differs: | Condition | Accepted source immediate `#k` | Encoded field | |---|---|---| | `EQ`, `NE`, `GT`, `LT`, `HI`, `LO` | `#0 .. #63` (native) | k | | `GE`, `HS` | `#1 .. #64` | k − 1 | | `LE`, `LS` | `#−1 .. #62` | k + 1 | Immediates outside these ranges have no `CB<cc>` encoding at all — not in any form. The raw encoding field is 6 bits; after the `k±1` adjustment for GE/HS/LE/LS, the boundary-inclusive source ranges above are the maximum expressible. If a source program needs a condition against an out-of-range immediate, the value must be materialized into a register and the register form used instead (or a CMP + B.cond sequence at ±1 MB range). **Edge cases and why out-of-range ones don't exist:** - `CBGE Xn, #0` (signed "Xn ≥ 0", i.e. non-negative) — not encodable as `CB<cc>` because it would need `CBGT Xn, #−1`, which the encoding can't represent. Semantically equivalent to `TBZ Xn, #63, label` (sign bit clear), which has a wider ±32 KB range anyway. - `CBLT Xn, #0` (signed "Xn < 0", i.e. negative) — IS encodable (LT is a primitive with range 0..63, so k=0 is fine). Semantically equivalent to `TBNZ Xn, #63, label`, though TBNZ's ±32 KB range may be preferable. - `CBHS Xn, #0` (unsigned "Xn ≥ 0") — a semantic tautology (always true for any Xn). 
Not encodable as `CB<cc>` at all (the HS pseudo-range starts at 1); if the branch is truly needed at runtime, emit an unconditional `B label` instead. - `CBLS Xn, #−1` (unsigned "Xn ≤ −1") — semantically vacuous (no unsigned value is ≤ −1). The branch is never taken. This form exists only because the LS pseudo-range mathematically includes −1 as a boundary; no meaningful program writes it. - `CBGE Xn, #64` / `CBLE Xn, #−1` — boundary values at the edges of the pseudo ranges. `CBGE Xn, #64` encodes as `CBGT Xn, #63`; `CBLE Xn, #−1` encodes as `CBLT Xn, #0`. - **Target outside ±1 KB** — no `CB<cc>` encoding reaches that far. The instruction has a 9-bit signed PC-relative offset × 4, giving a hard ±1 KB limit that is a property of the instruction encoding, not of the toolchain. The standard fallback idiom is `CMP Xn, #k` + `B.cond label` (B.cond's 19-bit offset reaches ±1 MB), or a two-stage trampoline beyond that. Source-to-encoding asymmetry is invisible in high-level code but matters when reading raw disassembly. Because only the 6 primitive encodings exist in the instruction stream, a disassembler reading bytes sees the primitive form, not what the programmer wrote. You'll see `CBGT Xm, Xn` for code written as `CBLT Xn, Xm`, or `CBLT Xn, #15` for code written as `CBLE Xn, #14`. Work backward through the pseudo tables to recover the original intent. Detection: `ID_AA64ISAR2_EL1.CSSC` (shared with FEAT_CSSC — both features live on the same ID register field). 
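The immediate-form pseudo tables above reduce to a small lookup plus a range check. The Python sketch below mirrors those tables as written in this document; `lower_cb_immediate` and the two constant names are hypothetical, and the function models only the condition and immediate rewrite, not the full instruction encoding.

```python
# Conditions with their own encoding in the immediate form.
NATIVE_IMM_CONDS = {"EQ", "NE", "GT", "LT", "HI", "LO"}

# Pseudo conditions: (primitive condition, adjustment applied to the immediate).
PSEUDO_IMM_CONDS = {
    "GE": ("GT", -1),   # Xn >= k           <->  Xn > k-1
    "HS": ("HI", -1),   # Xn unsigned>= k   <->  Xn unsigned> k-1
    "LE": ("LT", +1),   # Xn <= k           <->  Xn < k+1
    "LS": ("LO", +1),   # Xn unsigned<= k   <->  Xn unsigned< k+1
}


def lower_cb_immediate(cond: str, k: int) -> tuple:
    """Map a source-level CB<cc> immediate compare to (primitive cond, 6-bit field)."""
    if cond in NATIVE_IMM_CONDS:
        primitive, field = cond, k
    elif cond in PSEUDO_IMM_CONDS:
        primitive, delta = PSEUDO_IMM_CONDS[cond]
        field = k + delta
    else:
        raise ValueError(f"{cond} is not a CB<cc> condition")
    if not 0 <= field <= 63:
        raise ValueError(f"CB{cond} #{k} has no immediate encoding")
    return primitive, field


upper_edge = lower_cb_immediate("GE", 64)    # boundary of the GE pseudo range
lower_edge = lower_cb_immediate("LE", -1)    # boundary of the LE pseudo range
typical = lower_cb_immediate("LE", 14)       # what a disassembler shows as CBLT #15
```

Reading disassembly is the inverse walk: given a primitive and a field, add or subtract one to recover the pseudo form the programmer may have written.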
```asm CBGT X0, X1, loop_top // Branch if X0 > X1 (signed) CBEQ W0, #0, done // Branch if W0 == 0 (immediate form) CBHI X5, #10, overflow // Branch if X5 > 10 (unsigned) CBBNE W0, W1, mismatch // Compare low bytes, branch if not equal CBHEQ W0, W1, match // Compare low halfwords, branch if equal CBLE X0, X1, small // Pseudo: encodes as CBGE X1, X0 (operand swap) ``` **Which to use** — `CB<cc>` vs `B.cond`: | Scenario | Preferred | |---|---| | Compare-and-branch, target within ±1 KB | `CB<cc>` (one instruction) | | Target beyond ±1 KB | `CMP` + `B.cond` (branch has ±1 MB range) | | Branch on V/N from prior ADDS/SUBS | `B.cond` only (CB can't test flags) | | Multiple branches on one compare | `CMP` + multiple `B.cond` (compare once, branch many) | | Byte/halfword compare | `CBB<cc>` / `CBH<cc>` (skips the AND-mask step) | | Pre-Armv9.6 target | `CMP` + `B.cond` (`CB<cc>` doesn't exist) | **RE note**: `CB<cc>` instructions appear in binaries compiled for `-march=armv9.6-a` or later. As of early 2026 most shipping binaries don't use them; expect increasing prevalence through 2027–2028 as Armv9.6 cores (Apple successors to M4/M5, Qualcomm Oryon v3+, server Neoverse V3/N3 successors) become targetable in compiler defaults. Until then, `CMP` + `B.cond` remains the dominant idiom. --- ## 18. Conditional Select & Increment AArch64 replaces AArch32's conditional execution (predicated instructions) with conditional select instructions. These choose between two values based on the condition flags, without branching. This is how compilers implement branchless `if/else` — the CPU always executes the instruction, but the result depends on the flags. ### 18.1 CSEL — Conditional Select `CSEL` picks one of two register values based on a condition. If the condition is true, the first source is selected; otherwise, the second. Like a hardware ternary operator: `Xd = cond ? Xn : Xm`. The `cond` operand is any condition code from the table in §4 (EQ, NE, LT, GE, GT, LE, HI, LS, etc.) 
— it tests the current NZCV flags, so you typically need a CMP/TST/ADDS before the CSEL. ``` CSEL Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : Xm [64-bit] CSEL Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : Wm [32-bit, upper 32 of Xd zeroed] ``` ```asm CMP X0, X1 CSEL X2, X0, X1, LE // X2 = min(X0, X1) signed CSEL X3, X0, X1, GE // X3 = max(X0, X1) signed CSEL X4, X0, X1, HI // X4 = max(X0, X1) unsigned ``` **What CSEL REALLY does — traced:** ```asm // If X0 = 10, X1 = 20: CMP X0, X1 // 10 - 20: N=1, Z=0, V=0 → N!=V so LT; N==V false so not-GE CSEL X2, X0, X1, LE // LE true (Z=1||N!=V = 0||1 = true) → X2 = X0 = 10 (the min) ✓ CSEL X3, X0, X1, GE // GE false (N==V = 1==0 = false) → X3 = X1 = 20 (the max) ✓ ``` ### 18.2 CSINC — Conditional Select Increment `CSINC` selects the first source if the condition is true, otherwise selects the second source **plus 1**. Its most common alias is `CSET`, which sets a register to 1 if a condition is true and 0 otherwise — this is how compilers convert comparisons to boolean values (like C's `result = (a > b)`). ``` CSINC Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : (Xm + 1) [64-bit] CSINC Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : (Wm + 1) [32-bit] ``` Aliases: ``` CINC Xd|XZR, Xn|XZR, cond // Xd = cond ? Xn+1 : Xn. Encodes as: CSINC Xd|XZR, Xn|XZR, Xn|XZR, invert(cond) CINC Wd|WZR, Wn|WZR, cond CSET Xd|XZR, cond // Xd = cond ? 1 : 0. Encodes as: CSINC Xd|XZR, XZR, XZR, invert(cond) CSET Wd|WZR, cond ``` **Why the inverted condition?** This confuses everyone, but it's forced by the encoding. `CSINC Xd, Xn, Xm, cond` means "if cond is true, select Xn (unchanged); if cond is false, select Xm+1." For `CSET Rd, GT` (set to 1 if greater), we want: result=1 when GT, result=0 when not-GT. We encode this as `CSINC Rd, XZR, XZR, LE` — when LE is true (i.e., GT is false), we select XZR=0 (unchanged); when LE is false (i.e., GT is true), we select XZR+1=1. 
The inversion happens because the "interesting" operation (the +1) is on the false path of CSINC, so to make the +1 happen when our desired condition is true, we must invert it. The same logic applies to CINC, CINV, CSETM, and CNEG — they all apply their operation (increment, invert, negate) on the **false** path, so the alias inverts the condition to put the operation where you want it. ```asm CMP X0, #10 CSET W1, GT // W1 = (X0 > 10) ? 1 : 0 (common pattern for bool conversion) CSET X1, GT // X1 = same, but with a 64-bit result ``` ### 18.3 CSINV — Conditional Select Invert `CSINV` selects the first source if the condition is true, otherwise selects the bitwise NOT of the second source. `CSETM Rd, cond` (set to all-ones if true, zero if false) is the most common alias — it produces a bitmask useful for branchless bitwise selection. ``` CSINV Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : ~Xm [64-bit] CSINV Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : ~Wm [32-bit] ``` Aliases: ``` CINV Xd|XZR, Xn|XZR, cond // Xd = cond ? ~Xn : Xn. Encodes as: CSINV Xd|XZR, Xn|XZR, Xn|XZR, invert(cond) CINV Wd|WZR, Wn|WZR, cond CSETM Xd|XZR, cond // Xd = cond ? -1 : 0. Encodes as: CSINV Xd|XZR, XZR, XZR, invert(cond) CSETM Wd|WZR, cond // 32-bit (Wd = 0xFFFFFFFF, NOT 64-bit -1) ``` **32-bit note**: when the condition is true, `CSETM W0, cond` sets W0 to 0xFFFFFFFF (not 0xFFFFFFFFFFFFFFFF). The upper 32 bits of X0 are zeroed either way. ### 18.4 CSNEG — Conditional Select Negate `CSNEG` selects the first source if the condition is true, otherwise selects the two's complement negation of the second source. `CNEG Rd, Rn, cond` (negate if condition true, keep otherwise) is the key alias — it's how compilers implement branchless `abs()`. ``` CSNEG Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : -Xm [64-bit] CSNEG Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : -Wm [32-bit] ``` Alias: ``` CNEG Xd|XZR, Xn|XZR, cond // Xd = cond ? -Xn : Xn. 
Encodes as: CSNEG Xd|XZR, Xn|XZR, Xn|XZR, invert(cond) CNEG Wd|WZR, Wn|WZR, cond ``` ### 18.5 Branchless Patterns with Conditional Select ```asm // Absolute value: CMP X0, #0 CNEG X0, X0, LT // if (X0 < 0) X0 = -X0 // Clamp to range [0, 255]: CMP X0, #0 CSEL X0, XZR, X0, LT // X0 = max(X0, 0) MOV X1, #255 CMP X0, X1 CSEL X0, X1, X0, GT // X0 = min(X0, 255) // Convert bool to 0 or 1: CMP X0, #0 CSET W0, NE // W0 = (X0 != 0) ? 1 : 0 // Convert bool to 0 or -1 (all-ones mask): CMP X0, #0 CSETM W0, NE // W0 = (X0 != 0) ? 0xFFFFFFFF : 0 ``` **Traced examples for the aliases:** ```asm // ═══ CINC — Conditional Increment ═══ // CINC X0, X1, EQ = CSINC X0, X1, X1, NE (note inverted condition) // "If EQ, increment X1; otherwise keep X1 unchanged" // // If Z=1 (EQ is true): NE is false → CSINC takes false path → X0 = X1 + 1 // If Z=0 (EQ is false): NE is true → CSINC takes true path → X0 = X1 // // Concrete: X1 = 10, flags from CMP that set Z=1 (equal): // CINC X0, X1, EQ → X0 = 11 (incremented because EQ was true) // ═══ CNEG — Conditional Negate ═══ // CNEG X0, X1, LT = CSNEG X0, X1, X1, GE (inverted) // "If LT, negate X1; otherwise keep X1" // // Concrete: X1 = -5, flags from CMP that set LT true: // CNEG X0, X1, LT → X0 = 5 (negated because LT was true) // This is exactly how branchless abs() works: CMP + CNEG // ═══ CSETM — Conditional Set Mask ═══ // CSETM X0, NE = CSINV X0, XZR, XZR, EQ (inverted) // "Set to all-ones if NE, zero otherwise" // // If NE true: EQ false → CSINV takes false path → X0 = ~XZR = 0xFFFFFFFFFFFFFFFF // If NE false: EQ true → CSINV takes true path → X0 = XZR = 0 // // Why CSETM is useful: the all-ones mask (0xFFFF...F) can be used with AND/ORR // for branchless conditional operations on bitfields. In C, this is like: // mask = (cond) ? ~0ULL : 0ULL; // result = (value & mask) | (other & ~mask); ``` --- ## 19. 
System Registers & Special Instructions System registers control hardware features like interrupt masking, cache behavior, and virtual memory. They are not part of the general-purpose register file — you access them with dedicated `MRS` (read) and `MSR` (write) instructions. ### 19.1 MRS / MSR — System Register Access `MRS` (Move to Register from System) copies a system register into a general-purpose register. `MSR` (Move to System Register) copies a general-purpose register into a system register. Some system registers are read-only, some are write-only, and many are only accessible at higher exception levels (kernel, hypervisor). ``` MRS Xt|XZR, <sysreg> // Move system register to GPR MSR <sysreg>, Xt|XZR // Move GPR to system register MSR <pstatefield>, #imm // Immediate-to-PSTATE-field form. Only specific PSTATE fields: // MSR DAIFSet, #imm4 (set D/A/I/F mask bits; imm4 ∈ 0..15) // MSR DAIFClr, #imm4 (clear D/A/I/F mask bits; imm4 ∈ 0..15) // MSR SPSel, #imm1 (#0 = use SP_EL0, #1 = use SP_ELx) // MSR PAN, #imm1 (FEAT_PAN, Privileged Access Never toggle) // MSR UAO, #imm1 (FEAT_UAO, User Access Override toggle) // MSR DIT, #imm1 (FEAT_DIT, Data-Independent Timing toggle) // MSR SSBS, #imm1 (FEAT_SSBS, Speculative Store Bypass Safe toggle) // MSR TCO, #imm1 (FEAT_MTE, Tag Check Override toggle) // An arbitrary <sysreg> CANNOT be written with this immediate form — // only the pstatefields listed above accept it. 
``` Common system registers: ```asm MRS X0, NZCV // Read condition flags MSR NZCV, X0 // Write condition flags MRS X0, FPCR // Floating-point control MRS X0, FPSR // Floating-point status MRS X0, CurrentEL // Current exception level (bits [3:2]) MRS X0, DAIF // Interrupt mask flags MRS X0, CNTFRQ_EL0 // Timer frequency MRS X0, CNTVCT_EL0 // Virtual timer count (high-resolution timestamp) MRS X0, CTR_EL0 // Cache type register MRS X0, DCZID_EL0 // Data cache zero ID MRS X0, TPIDR_EL0 // Thread ID register (user-accessible, used for thread-local storage) ``` ### 19.2 NOP, YIELD, WFE, WFI, SEV ``` NOP // No operation (often used for alignment or timing) YIELD // Hint: yield to other hardware threads sharing this core (spin-lock hint) WFE // Wait For Event (low-power wait) WFI // Wait For Interrupt (deeper low-power wait) SEV // Send Event (wake up WFE waiters) SEVL // Send Event Local (wake up local core from WFE) // FEAT_WFxT (ARMv8.7-A): timeout variants of WFE/WFI. Avoid the need for a separate timer-setup // dance before a WFE/WFI — the CPU wakes on the earlier of (event/interrupt) or (system-counter // reaching Xt). The register operand holds an absolute deadline, compared against the virtual count (CNTVCT_EL0). WFET Xt|XZR // Wait For Event with Timeout: wakes on SEV, SEVL, or when CNTVCT_EL0 >= Xt WFIT Xt|XZR // Wait For Interrupt with Timeout: wakes on IRQ, or when CNTVCT_EL0 >= Xt ``` **Spin-lock pattern with WFE:** ```asm spin: LDAXR W1, [X0] // Load-acquire exclusive CBNZ W1, wait // If locked, wait STXR W2, W3, [X0] // Try to store our value (W3 = non-zero "locked" value, set up beforehand) CBNZ W2, spin // If exclusive failed, retry B got_lock wait: WFE // Low-power wait until event B spin // Try again ``` ### 19.3 SVC / HVC / SMC — Exception Generation These instructions deliberately trigger an exception to call into a higher privilege level. 
`SVC` (Supervisor Call) is how user programs make system calls to the kernel — the 16-bit immediate is captured by the hardware into `ESR_ELx.ISS[15:0]` when the exception is taken, so the handler reads it from `ESR_EL1` without needing to decode the SVC instruction itself. (Linux ignores the immediate and reads the syscall number from X8 instead, but some OSes use the immediate.) `HVC` calls the hypervisor. `SMC` calls secure firmware. `BRK` triggers a debug breakpoint with the same imm16-into-ESR behavior. ``` SVC #imm16 // Supervisor Call (EL0 → EL1 system call). imm16: 0–65535. HVC #imm16 // Hypervisor Call (EL1 → EL2) SMC #imm16 // Secure Monitor Call (EL1 → EL3) BRK #imm16 // Breakpoint (debug exception). imm16: 0–65535. HLT #imm16 // Halt (debug, external debugger) UDF #imm16 // Permanently Undefined instruction. Encoded as top-16-bits-zero // (0x0000NNNN where NNNN is imm16). Guaranteed to generate an // Undefined Instruction exception on every current and future // AArch64 implementation — this is what distinguishes UDF from // merely-unallocated encodings, which ARM is free to repurpose // in later architecture revisions. // // UDF is the right choice where a guaranteed, permanent trap is // wanted: guard words in hand-written assembly, JIT-emitted crash // sites, and padding after noreturn calls. Note that on AArch64, // GCC and Clang normally lower __builtin_trap() to BRK (for // example Clang's `BRK #1`), not UDF. The imm16 carries no // hardware meaning but can be used as a software tag. // // Consequence: a page of all-zero bytes faults cleanly as a // stream of `UDF #0` instructions. This is why jumping into a // zero-filled region is a well-defined crash rather than UB. ``` Linux system call convention: ```asm MOV X8, #64 // syscall number (e.g., 64 = write) MOV X0, #1 // fd = stdout ADR X1, message // buffer MOV X2, #14 // length SVC #0 // trigger syscall // Return value in X0 ``` ### 19.4 HINT — Hint Space **ISA-level truth**: there is a single instruction `HINT #imm7` (7-bit immediate, range 0..127). 
Every mnemonic in the table below is an alias that fixes `imm7` to a particular value. On a CPU that doesn't implement a given hint, the encoding executes as NOP — that's the whole point of hint space. The raw `HINT #n` form is always accepted by the assembler if you need an unaliased value. ``` HINT #imm7 // Raw form; imm7 ∈ 0..127. Most values are RESERVED and behave as NOP. ``` | Mnemonic | Encoding | Purpose | |---|---|---| | `NOP` | `HINT #0` | No operation (no architectural effect; its timing is not guaranteed, so don't rely on it for delays) | | `YIELD` | `HINT #1` | Hint to SMT/SMP scheduler that this thread has nothing useful to do | | `WFE` | `HINT #2` | Wait for event (suspends until an event is signaled; used on the spin-lock wait path) | | `WFI` | `HINT #3` | Wait for interrupt (deeper sleep than WFE) | | `SEV` | `HINT #4` | Send event — wakes WFE-sleeping cores | | `SEVL` | `HINT #5` | Send event local (only wakes this core's own WFE) | | `DGH` | `HINT #6` | Data Gathering Hint (FEAT_DGH): asks the core not to merge earlier Normal Non-cacheable / Device-GRE accesses with later ones | | `CSDB` | `HINT #20` | Consumption of Speculative Data Barrier (Spectre mitigation) | | `ESB` | `HINT #16` | Error Synchronization Barrier (FEAT_RAS) | | `PSB CSYNC` | `HINT #17` | Profiling Synchronization Barrier (FEAT_SPE) | | `TSB CSYNC` | `HINT #18` | Trace Synchronization Barrier (FEAT_TRF) | | `GCSB DSYNC` | `HINT #19` | Guarded Control Stack Data Synchronization Barrier (FEAT_GCS) — ARMv9.4-A | | `CLRBHB` | `HINT #22` | Clear Branch History Buffer (FEAT_CLRBHB) — Spectre-BHB mitigation; mandatory from ARMv8.9-A | | `CHKFEAT X16`| `HINT #40` | Check Feature Status (FEAT_CHK) — implicitly reads and writes X16 (the only HINT-space mnemonic with an operand; assemblers emit it as `hint #40` on pre-ARMv8.9-A targets for backward compatibility) | | `BTI` | `HINT #32` | Branch Target Identification, **bare form** (`targets = 00` in the encoding). Distinct encoding from `BTI c/j/jc`; not interchangeable with them. 
See §32 for the full PSTATE.BTYPE compatibility table — in a guarded page the bare form's landing-pad semantics are deliberately strict, so use `BTI c`, `BTI j`, or `BTI jc` for any actual indirect-branch target unless you specifically want bare BTI's behavior. | | `BTI c` | `HINT #34` | Landing pad compatible with indirect **calls** (BLR Xn, or BR via X16/X17) | | `BTI j` | `HINT #36` | Landing pad compatible with indirect **jumps** (BR Xn for n ∉ {X16, X17}) — and also with BR X16/X17 | | `BTI jc` | `HINT #38` | Landing pad compatible with both call-style and jump-style indirect branches — most permissive | | `PACIA1716` | `HINT #8` | PAC X17 with X16 as context (key A) | | `PACIB1716` | `HINT #10` | Same with key B | | `AUTIA1716` | `HINT #12` | Authenticate X17 with X16 (key A) | | `AUTIB1716` | `HINT #14` | Authenticate X17 with X16 (key B) | | `PACIAZ` | `HINT #24` | PAC LR (X30) with zero context (key A) | | `PACIASP` | `HINT #25` | PAC LR (X30) with SP as context (key A) | | `PACIBZ` | `HINT #26` | PAC LR with zero context (key B) | | `PACIBSP` | `HINT #27` | PAC LR with SP as context (key B) | | `AUTIAZ` | `HINT #28` | Authenticate LR, zero context (key A) | | `AUTIASP` | `HINT #29` | Authenticate LR, SP context (key A) | | `AUTIBZ` | `HINT #30` | Authenticate LR, zero context (key B) | | `AUTIBSP` | `HINT #31` | Authenticate LR, SP context (key B) | | `XPACLRI` | `HINT #7` | Strip PAC from LR (X30); used for backtrace tools | **Form relationships for the table above**: - The alias takes **no operands** (all the operand info is baked into the fixed `imm7`). - The underlying `HINT #imm7` takes one operand: the 7-bit immediate. - For the PAC/BTI mnemonics, the alias also *implicitly* fixes which registers the operation uses — `PACIASP` always uses X30 as the pointer register and SP as the modifier, whereas the non-HINT `PACIA Xd, Xn` accepts arbitrary Xd/Xn. 
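The alias-to-immediate relationship above can be checked mechanically. The C sketch below builds the 32-bit instruction word for `HINT #imm7`, assuming the A64 layout in which imm7 occupies bits [11:5] of the base word 0xD503201F (the `NOP` encoding). The helper name `hint_encode` is hypothetical and is not part of any toolchain.

```c
#include <stdint.h>

/* Hypothetical helper: encode A64 `HINT #imm7`.
 * Assumes imm7 sits in bits [11:5] of the NOP word 0xD503201F.
 * Returns 0 for an out-of-range immediate; 0 can never be a valid HINT
 * word because 0x00000000 decodes as UDF #0. */
uint32_t hint_encode(unsigned imm7)
{
    if (imm7 > 127u)
        return 0u;
    return 0xD503201Fu | ((uint32_t)imm7 << 5);
}
```

For example, `hint_encode(25)` yields 0xD503233F, the word a disassembler prints as `PACIASP`, and `hint_encode(34)` yields 0xD503245F, which is `BTI c`. Scanning a function prologue for these two constants is a quick way to tell whether a binary was built with return-address signing or BTI landing pads.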
**Why HINT encoding?** Older CPUs that don't support a feature (like PAC or BTI) execute the HINT as a NOP — the program still runs, just without the security benefit. This provides backward compatibility: a PAC-enabled binary runs safely on old hardware (no crashes, just no protection). ### 19.5 SYS / SYSL — System Instructions `SYS` and `SYSL` are the generic system instruction encodings that all cache, TLB, and address translation operations are aliases for. You rarely write `SYS` directly — you write the friendly alias (like `DC ZVA`), and the assembler encodes it as `SYS`. ``` SYS #op1, Cn, Cm, #op2{, Xt|XZR} // System instruction with optional input register SYSL Xt|XZR, #op1, Cn, Cm, #op2 // System instruction with output to Xt ``` ### 19.6 Cache Maintenance Operations These are all aliases for `SYS` instructions. Cache maintenance is needed when writing self-modifying code (JIT compilers), setting up DMA transfers, or when the instruction and data caches see different views of memory. **Data Cache (DC) operations:** ```asm DC ZVA, Xt|XZR // Zero a DC ZVA block (Xt = address). Fastest way to zero memory. // Zeroes a naturally aligned block of N bytes, where N is indicated by // DCZID_EL0.BS (log2 of N in words). On typical Cortex-A this equals // the cache line size (often 64 bytes) but this is NOT architecturally // guaranteed — DC ZVA block size and cache line size are separate concepts. // The hardware aligns the address DOWN to the block boundary automatically; // there is no alignment restriction on the address within the block. // Always read DCZID_EL0 at runtime to get the actual block size. 
DC CVAC, Xt|XZR // Clean to Point of Coherency (write dirty data back to main memory) DC CVAU, Xt|XZR // Clean to Point of Unification (for instruction fetch coherency) DC CIVAC, Xt|XZR // Clean and Invalidate to Point of Coherency DC IVAC, Xt|XZR // Invalidate (discard data, EL1+ only — dangerous, can lose dirty data) ``` **Instruction Cache (IC) operations:** ```asm IC IALLU // Invalidate all instruction caches (EL1+) IC IVAU, Xt|XZR // Invalidate instruction cache by address to Point of Unification ``` **Why you need DC+IC together for JIT**: When you write machine code to memory (via stores), it goes through the data cache. But the CPU fetches instructions from the instruction cache, which is separate. To make the CPU see your new code, you must: (1) clean the data cache line to the point of unification (`DC CVAU`), so the data reaches a level visible to the I-cache; (2) invalidate the instruction cache (`IC IVAU`), so the I-cache re-fetches from the cleaned data; (3) insert barriers (`DSB ISH` + `ISB`) to ensure ordering. ```asm // After writing code to [X0]: DC CVAU, X0 // Clean data cache to Point of Unification DSB ISH // Wait for clean to complete IC IVAU, X0 // Invalidate instruction cache DSB ISH // Wait for invalidate to complete ISB // Flush pipeline, fetch new instructions ``` ### 19.7 Address Translation (AT) Translate a virtual address using the page tables, without actually accessing memory. The result goes into `PAR_EL1` (Physical Address Register). Useful for debugging page table issues in kernel code. ```asm AT S1E1R, X0 // Stage 1, EL1, Read: translate X0 as if EL1 read AT S1E1W, X0 // Stage 1, EL1, Write AT S1E0R, X0 // Stage 1, EL0, Read: translate as user-mode read MRS X1, PAR_EL1 // Read result (physical address + attributes, or fault info) ``` ### 19.8 TLB Invalidation (TLBI) The TLB (Translation Lookaside Buffer) is a cache of page table entries. 
When the OS modifies page tables (changing permissions, unmapping pages, switching address spaces), it must invalidate stale TLB entries so the CPU re-reads the updated page tables. ```asm TLBI VMALLE1 // Invalidate ALL stage 1 EL1&0 TLB entries for the current VMID (this core only) TLBI VAE1, X0 // Invalidate TLB entry for the VA described by X0 (VA[55:12] in bits [43:0], ASID in bits [63:48]) TLBI ASIDE1, X0 // Invalidate all entries matching the ASID in X0 bits [63:48] TLBI VALE1, X0 // Invalidate by VA, last level only (more targeted, faster) // Add the IS suffix (e.g., TLBI VAE1IS) to broadcast to every core in the Inner Shareable domain DSB ISH // Wait for invalidation to complete ISB // Ensure subsequent instruction fetches use new translations ``` **Why TLBI needs DSB+ISB**: A TLBI is not guaranteed to have completed when the next instruction executes. `DSB ISH` waits until the invalidation has finished; for the broadcast `IS` forms that means on every core in the Inner Shareable domain, whereas the plain forms shown above only affect the executing core. `ISB` then flushes the pipeline so subsequent instructions are fetched and executed with the new translations. The page-table stores themselves must also be visible to the table walker before the TLBI is issued, which is why kernels place a `DSB ISHST` between the page-table write and the TLBI. --- ## 20. Overflow, Underflow & Carry **Why overflow detection matters**: Integer arithmetic silently wraps on overflow — `UINT64_MAX + 1 = 0`. In most code this is harmless (or intentional). But for security-critical code (buffer size calculations, array index bounds), undetected overflow causes vulnerabilities. ARM doesn't trap on overflow (unlike some architectures) — you must explicitly check using the flag-setting instructions (`ADDS`, `SUBS`) and conditional branches. This section shows how. For the basics of how the N/Z/C/V flags are set in the first place, see **§3 (The S Suffix & Condition Flags)** — this chapter builds directly on it. ### 20.1 Unsigned Overflow (Carry) For unsigned arithmetic, "overflow" means the result didn't fit in 64 (or 32) bits. The carry flag (C) indicates this. 
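As a cross-check on the flag rules used throughout this chapter, here is a minimal C model of the NZCV result of a 64-bit `ADDS` and `SUBS`. It is a sketch written from the flag definitions in §3, not a transcription of the architectural pseudocode, and the type and function names (`nzcv_t`, `adds64`, `subs64`) are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } nzcv_t;   /* illustrative flag record */

/* Model of ADDS Xd, Xn, Xm: returns the wrapped sum and fills *f. */
uint64_t adds64(uint64_t a, uint64_t b, nzcv_t *f)
{
    uint64_t r = a + b;                       /* wraps modulo 2^64 */
    f->n = (r >> 63) != 0;
    f->z = (r == 0);
    f->c = (r < a);                           /* carry out = unsigned overflow */
    f->v = (((a ^ r) & (b ^ r)) >> 63) != 0;  /* inputs share a sign, result differs */
    return r;
}

/* Model of SUBS Xd, Xn, Xm: note that C means "no borrow". */
uint64_t subs64(uint64_t a, uint64_t b, nzcv_t *f)
{
    uint64_t r = a - b;                       /* wraps modulo 2^64 */
    f->n = (r >> 63) != 0;
    f->z = (r == 0);
    f->c = (a >= b);                          /* C=1: no borrow; C=0: borrow */
    f->v = (((a ^ b) & (a ^ r)) >> 63) != 0;  /* inputs differ in sign, result sign != a */
    return r;
}
```

The model makes the asymmetry explicit: after an add, C=1 signals unsigned overflow, but after a subtract it is C=0 that signals the borrow. The assembly forms of both checks are: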
```asm ADDS X0, X1, X2 // Unsigned: if result < X1, carry occurred B.CS overflow // CS = Carry Set = unsigned overflow SUBS X0, X1, X2 // Unsigned: if X1 < X2, borrow occurred B.CC underflow // CC = Carry Clear = unsigned underflow (borrow) ``` **Remember ARM's inverted carry for subtraction:** After SUBS, C=1 means NO borrow (X1 >= X2 unsigned), C=0 means borrow occurred (X1 < X2 unsigned). ### 20.2 Signed Overflow (V flag) Signed overflow occurs when the result of an operation doesn't fit in the signed range. The V flag indicates this. ```asm ADDS X0, X1, X2 B.VS signed_overflow // V=1 means signed overflow // Signed overflow in addition: positive + positive = negative, or negative + negative = positive // Signed overflow in subtraction: positive - negative = negative, or negative - positive = positive ``` ### 20.3 Detecting Overflow in Practice **Unsigned multiply overflow:** ```asm // Check if X0 * X1 overflows unsigned 64-bit: UMULH X2, X0, X1 // High 64 bits of product MUL X3, X0, X1 // Low 64 bits (the result we want) CBNZ X2, overflow // If high bits non-zero, overflow ``` **Signed multiply overflow:** ```asm // Check if X0 * X1 overflows signed 64-bit: SMULH X2, X0, X1 // High 64 bits (signed) MUL X3, X0, X1 // Low 64 bits // Overflow if X2 != sign-extension of X3 ASR X4, X3, #63 // X4 = all zeros or all ones (sign of X3) CMP X2, X4 B.NE overflow ``` **Multi-word addition with carry propagation:** ```asm // 128-bit: (X1:X0) + (X3:X2) → (X5:X4) ADDS X4, X0, X2 // Low 64, set carry ADCS X5, X1, X3 // High 64 + carry, set carry B.CS overflow_128 // Carry out of 128-bit B.VS signed_overflow_128 // Signed overflow of 128-bit ``` ### 20.4 Saturating Arithmetic AArch64 scalar doesn't have saturating add/sub (unlike NEON) — the NEON hardware saturating instructions (`SQADD`/`UQADD`/`SQSUB`/`UQSUB`/…) are in **§23.13**. 
In scalar code you must build it yourself: ```asm // Unsigned saturating add: X0 = min(X1 + X2, UINT64_MAX) MOV X3, #-1 // X3 = UINT64_MAX ADDS X0, X1, X2 CSEL X0, X3, X0, CS // If carry (overflow), use UINT64_MAX; else keep result // Signed saturating add is more complex — need to handle both directions: ADDS X0, X1, X2 // On signed overflow, saturate to INT64_MAX or INT64_MIN depending on direction MOV X3, #0x7FFFFFFFFFFFFFFF // INT64_MAX (valid bitmask immediate) // The trick: ASR #63 fills the entire register with copies of the sign bit. // If X1 >= 0: X1 ASR 63 = 0x0000000000000000, so EOR with INT64_MAX = INT64_MAX // If X1 < 0: X1 ASR 63 = 0xFFFFFFFFFFFFFFFF, so EOR with INT64_MAX = 0x8000000000000000 = INT64_MIN // This selects the correct saturation direction: positive overflow → MAX, negative overflow → MIN EOR X4, X3, X1, ASR #63 // X4 = INT64_MAX if X1 positive, INT64_MIN if negative CSEL X0, X4, X0, VS // If signed overflow, use saturated value ``` ### 20.5 FEAT_FlagM / FlagM2 — Direct Flag Manipulation FEAT_FlagM (mandatory from ARMv8.4-A) adds instructions that manipulate the NZCV flags directly without going through an arithmetic operation. Rare in hand-written code but useful for compilers porting x86 flag-dependent idioms and for implementing multi-word arithmetic without a full SUBS. FEAT_FlagM2 (mandatory from ARMv8.5-A) adds FP-flag format conversion — rewriting NZCV between Arm's internal FP-compare encoding and an IEEE-754-style "external" layout — useful for JITs translating FP-compare idioms from other ISAs. ```asm CFINV // C ← !C. Invert the carry flag (nothing else changes). // Useful for converting between ARM's "C=no-borrow after subtract" // and x86's "CF=borrow after subtract" conventions in ported code. RMIF Xn|XZR, #shift, #mask // Rotate Mask Insert Flags: extract 4 bits from (Xn ROR #shift) // at positions selected by #mask (4-bit), insert into NZCV. 
// shift: 0–63, mask: 0–15 (bit i of mask → update flag i of NZCV: // bit 3=N, bit 2=Z, bit 1=C, bit 0=V). SETF8 Wn|WZR // Set NZV flags as if the low 8 bits of Wn were a signed byte result. // Z = (Wn[7:0] == 0), N = Wn[7], V = (Wn[8] XOR Wn[7]). // Useful after a 32-bit op that produced an 8-bit value: // it lets you use signed comparisons (B.GT etc.) without // re-doing the operation at byte width. SETF16 Wn|WZR // Same for the low 16 bits (V = Wn[16] XOR Wn[15]). ``` **FEAT_FlagM2 — FP flag-format conversion (Arm ↔ external IEEE 754-style):** ```asm AXFLAG // Convert FP condition flags from Arm format to external format. // Arm FCMP produces: EQ(N=0,Z=1,C=1,V=0), LT(N=1,Z=0,C=0,V=0), // GT(N=0,Z=0,C=1,V=0), UN(N=0,Z=0,C=1,V=1). // After AXFLAG, NZCV encodes the same predicate in a different // bit layout that's cheaper for some algorithms (notably IEEE // totalOrder and certain JIT fast paths) to test with simple masks. XAFLAG // Inverse of AXFLAG: convert external format back to Arm format. // Use when NZCV was set by code (e.g., a JIT's open-coded path) // using external conventions and you want to follow it with // a standard Arm conditional branch. ``` **Why these exist**: the "external" format is the flag layout assumed by some portable FP algorithms and by other ISAs' FP-compare semantics. Without AXFLAG/XAFLAG, translating a flag-test idiom between the two formats took 3–5 instructions of masking and logic; these single-instruction conversions let JITs (especially JavaScript JITs) produce shorter code when mapping source-ISA FP-compare-and-branch sequences to AArch64. Note the trap in §35.6 (LT/LE/HI/NE unintentionally fire on NaN in Arm format) — AXFLAG is one way to rewrite those flags into a layout where a single-condition test cleanly distinguishes ordered-vs-unordered without an extra VS check. --- ## 21. Exceptions, Interrupts & Exception Levels ARM has a privilege system called Exception Levels (EL0–EL3). 
If you're writing user-space code, you only interact with exceptions via `SVC` (system calls). If you're writing a kernel, hypervisor, or firmware, you need to understand the full exception model. Even for RE, understanding EL helps you identify what privilege level code runs at. ### 21.1 Exception Levels (EL) | Level | Typical use | Can access | |---|---|---| | EL0 | User applications | User registers, EL0 system regs | | EL1 | OS kernel | EL0 + EL1 system regs, page tables | | EL2 | Hypervisor | EL0 + EL1 + EL2 system regs | | EL3 | Secure Monitor / firmware | Everything | Higher EL = more privilege. Exceptions go UP (or stay same level), returns go DOWN. You **cannot** take an exception to a lower EL. ### 21.2 Exception Types 1. **Synchronous exceptions** (caused by current instruction): - **SVC/HVC/SMC**: System calls - **Instruction abort**: Bad instruction fetch (e.g., page fault, permission fault) - **Data abort**: Bad data access (e.g., page fault, alignment, permission) - **Undefined instruction**: Unrecognized encoding - **Debug exceptions**: BRK, watchpoint, breakpoint, single-step - **SP/PC alignment fault** 2. **Asynchronous exceptions** (not caused by current instruction): - **IRQ**: Normal hardware interrupt — an external device (timer, network card, keyboard) signals the CPU that it needs attention. The CPU pauses its current code, saves state, and jumps to the interrupt handler. This happens asynchronously (at any point during program execution). - **FIQ**: Fast interrupt — same concept as IRQ but with a separate, higher-priority path. Used for latency-critical handlers (e.g., secure world interrupts). - **SError**: System error (asynchronous abort, e.g., uncorrectable memory error from a previous write that was buffered) ### 21.3 Exception Handling Mechanism When an exception occurs to ELx: 1. `PSTATE` is saved to `SPSR_ELx` (Saved Program Status Register) 2. Return address is saved to `ELR_ELx` (Exception Link Register) 3. 
Exception Syndrome info is saved to `ESR_ELx` (tells you WHY: instruction class, fault details) 4. If it's an abort, the faulting address is in `FAR_ELx` (Fault Address Register) 5. PSTATE is modified (interrupts masked, EL set, etc.) 6. PC jumps to the exception vector **ESR_ELx decoding**: Bits [31:26] are the **Exception Class (EC)** — the top-level reason for the exception. Common EC values: | EC (hex) | Meaning | |---|---| | 0x15 | SVC from AArch64 (system call) | | 0x16 | HVC from AArch64 | | 0x17 | SMC from AArch64 | | 0x18 | MSR/MRS/System-instruction trap (system register access from lower EL) | | 0x1C | PAC authentication failure (FEAT_FPAC) | | 0x20 | Instruction abort from lower EL (page fault on instruction fetch) | | 0x21 | Instruction abort taken without a change in EL | | 0x22 | PC alignment fault | | 0x24 | Data abort from lower EL (page fault on data access) | | 0x25 | Data abort taken without a change in EL | | 0x26 | SP alignment fault | | 0x27 | Memory Operation exception (FEAT_MOPS P/M/E sequence violation) | | 0x2C | Trapped FP exception from AArch64 | | 0x2F | SError | | 0x30 | Breakpoint exception from lower EL | | 0x31 | Breakpoint exception from same EL | | 0x32 | Software step from lower EL | | 0x33 | Software step from same EL | | 0x34 | Watchpoint from lower EL | | 0x35 | Watchpoint from same EL | | 0x3C | BRK instruction execution in AArch64 state | Bits [24:0] are the **ISS (Instruction Specific Syndrome)** — details specific to each EC. For SVC, the ISS contains the 16-bit immediate from the SVC instruction. For data aborts, the ISS tells you whether it was a read or write, the access size, and the fault type (translation, permission, alignment, etc.). ### 21.4 Exception Vector Table (VBAR_ELx) Each EL has a vector base address register (`VBAR_EL1`, `VBAR_EL2`, `VBAR_EL3`). The vector table has 16 entries, each 128 bytes (32 instructions). 
The CPU picks which entry to jump to based on three things: where the exception came from, which stack pointer was active, and what type of exception it is. The four groups: **"Current EL with SP_EL0"** means the exception happened at the same EL that's handling it, and the code was using the user-mode stack pointer (unusual — most kernels switch to SP_ELx immediately). **"Current EL with SP_ELx"** is the normal case for kernel exceptions. **"Lower EL, AArch64/AArch32"** means the exception came from a less-privileged level (e.g., a user-mode `SVC` arriving at the kernel). | Offset | Source | Type | |---|---|---| | 0x000 | Current EL with SP_EL0 | Synchronous | | 0x080 | Current EL with SP_EL0 | IRQ | | 0x100 | Current EL with SP_EL0 | FIQ | | 0x180 | Current EL with SP_EL0 | SError | | 0x200 | Current EL with SP_ELx | Synchronous | | 0x280 | Current EL with SP_ELx | IRQ | | 0x300 | Current EL with SP_ELx | FIQ | | 0x380 | Current EL with SP_ELx | SError | | 0x400 | Lower EL, AArch64 | Synchronous | | 0x480 | Lower EL, AArch64 | IRQ | | 0x500 | Lower EL, AArch64 | FIQ | | 0x580 | Lower EL, AArch64 | SError | | 0x600 | Lower EL, AArch32 | Synchronous | | 0x680 | Lower EL, AArch32 | IRQ | | 0x700 | Lower EL, AArch32 | FIQ | | 0x780 | Lower EL, AArch32 | SError | **Return from exception:** ``` ERET // PC = ELR_ELx, PSTATE = SPSR_ELx, EL drops as appropriate ``` **Debug-mode exception entry/return (halt mode — rarely seen outside kernel debuggers):** ```asm // DCPS (Debug Change PE State) — can ONLY be executed while the PE is in Debug state // (i.e., the external debugger has halted the core via EDSCR.HDE and the PE is servicing // the debugger). In normal operation these trap as UNDEFINED. They let the debugger // change the current Exception Level without a conventional exception. DCPS1 {#imm16} // Switch to EL1 in Debug state. imm16 optional — stored in ESR_EL1. DCPS2 {#imm16} // Switch to EL2 in Debug state (requires EL2 present). 
DCPS3 {#imm16} // Switch to EL3 in Debug state (requires EL3 present). // DRPS (Debug Restore PE State) — the Debug-state counterpart of ERET. Restores // PSTATE from the current EL's SPSR_ELx while the PE remains in Debug state; the PC is // not loaded. (DLR_EL0 / DSPSR_EL0 are used later, when the PE actually exits Debug state.) // Also UNDEFINED outside Debug state. DRPS // PSTATE = SPSR_ELx. No operands. ``` These are invisible to normal software; the only reason to recognize them in disassembly is that they sit next to SVC/HVC/SMC in the encoding space and a mis-decoded byte stream can display them where other instructions were intended. Also useful when reverse-engineering firmware blobs that include a dedicated debug-halt handler. **Practical example — minimal SVC handler (EL1 kernel handling user SVC):** ```asm // This code would be at VBAR_EL1 + 0x400 (Lower EL AArch64, Synchronous) // Simplified: a real kernel saves the full user register file before touching any register. el0_sync_handler: STP X29, X30, [SP, #-16]! // Save frame (using kernel SP) MRS X9, ESR_EL1 // Read exception syndrome into scratch (X0-X7 must stay intact: they hold the arguments) UBFX X10, X9, #26, #6 // Extract EC (bits [31:26]) CMP X10, #0x15 // EC=0x15 = SVC from AArch64? B.NE not_svc // If not SVC, handle other exception // It's a syscall — X8 has the syscall number (set by user before SVC) // X0-X7 have the arguments (set by user) MRS X11, ELR_EL1 // Save return address (instruction after SVC) // ... dispatch to syscall handler based on X8 ... // ... handler puts return value in X0 ... MSR ELR_EL1, X11 // Restore return address LDP X29, X30, [SP], #16 ERET // Return to user: PC=ELR, PSTATE=SPSR, drop to EL0 ``` ### 21.5 Masking Interrupts ```asm MSR DAIFSet, #0xF // Mask all: Debug, SError (A), IRQ (I), FIQ (F) MSR DAIFClr, #0xF // Unmask all MSR DAIFSet, #0x2 // Mask IRQ only (bit 1) MSR DAIFClr, #0x2 // Unmask IRQ only ``` The bits: D=bit3, A=bit2, I=bit1, F=bit0. A set bit means **masked** (disabled). --- ## 22. Floating Point (SIMD/FP) Floating-point instructions operate on the S (32-bit single) and D (64-bit double) register views of the SIMD/FP register file. 
They are separate from integer instructions and use a separate set of condition flag semantics for comparisons (particularly around NaN — "Not a Number" — which is a special float value representing undefined results like 0/0). ### 22.1 Basic FP Instructions These mirror integer arithmetic but for floating-point values. `FADD` adds, `FSUB` subtracts, `FMUL` multiplies, `FDIV` divides. Unlike integer division, `FDIV` can produce fractional results. `FSQRT` computes the square root. ```asm FADD Sd, Sn, Sm // Single-precision add FADD Dd, Dn, Dm // Double-precision add FSUB Sd, Sn, Sm // Single-precision subtract FSUB Dd, Dn, Dm // Double-precision subtract FMUL Sd, Sn, Sm // Single-precision multiply FMUL Dd, Dn, Dm // Double-precision multiply FNMUL Sd, Sn, Sm // Negated multiply: Sd = -(Sn * Sm) — one rounding, same as -FMUL FNMUL Dd, Dn, Dm // Double FDIV Sd, Sn, Sm // Single-precision divide FDIV Dd, Dn, Dm // Double-precision divide FNEG Sd, Sn // Negate single (flip sign bit) FNEG Dd, Dn // Negate double FABS Sd, Sn // Absolute value single (clear sign bit) FABS Dd, Dn // Absolute value double FABD Sd, Sn, Sm // FP absolute difference: Sd = |Sn - Sm| (one rounding) FABD Dd, Dn, Dm // Double FSQRT Sd, Sn // Square root single FSQRT Dd, Dn // Square root double // Reciprocal & reciprocal-square-root — the "estimate + Newton-Raphson" pattern. // FRECPE/FRSQRTE give a fast, low-precision initial estimate (one cycle); FRECPS/FRSQRTS // give the refinement step that, multiplied back in, doubles the number of good bits // per iteration. Two iterations reach single-precision accuracy; three reach double. 
FRECPE Hd, Hn // Reciprocal estimate (~8-bit accurate) (FEAT_FP16) FRECPE Sd, Sn // Reciprocal estimate (1/Sn, ~8-bit accurate) FRECPE Dd, Dn // Double — same semantics FRECPS Hd, Hn, Hm // Reciprocal NR step (FEAT_FP16): Hd = 2 - (Hn * Hm) FRECPS Sd, Sn, Sm // Reciprocal Newton-Raphson step: Sd = 2 - (Sn * Sm) FRECPS Dd, Dn, Dm // Double FRSQRTE Hd, Hn // Reciprocal-sqrt estimate (FEAT_FP16) FRSQRTE Sd, Sn // Reciprocal square-root estimate (~8-bit accurate) FRSQRTE Dd, Dn FRSQRTS Hd, Hn, Hm // Reciprocal-sqrt NR step (FEAT_FP16) FRSQRTS Sd, Sn, Sm // Reciprocal-sqrt Newton-Raphson step: Sd = (3 - Sn * Sm) / 2 FRSQRTS Dd, Dn, Dm FRINT32Z Sd, Sn // Round to 32-bit integer value (toward zero), stay in FP format FRINT32Z Dd, Dn // Double version FRINT32X Sd, Sn // Round to 32-bit integer value (current rounding mode) FRINT32X Dd, Dn // Double version FRINT64Z Sd, Sn // Round to 64-bit integer value (toward zero) FRINT64Z Dd, Dn // Double version FRINT64X Sd, Sn // Round to 64-bit integer value (current rounding mode) FRINT64X Dd, Dn // Double version // All FRINT32/64 REQUIRE FEAT_FRINTTS (optional from Armv8.4-A, mandatory from Armv8.5-A) FRINTN Sd, Sn // Round to nearest integer (stay in FP: 3.7 → 4.0, not integer 4) FRINTN Dd, Dn // Double FRINTM Sd, Sn // Round toward -infinity (floor), stay in FP FRINTM Dd, Dn // Double FRINTP Sd, Sn // Round toward +infinity (ceil), stay in FP FRINTP Dd, Dn // Double FRINTZ Sd, Sn // Round toward zero (truncate), stay in FP FRINTZ Dd, Dn // Double FRINTA Sd, Sn // Round to nearest, ties away from zero, stay in FP FRINTA Dd, Dn // Double FRINTX Sd, Sn // Round using FPCR mode, signal inexact FRINTX Dd, Dn // Double FRINTI Sd, Sn // Round using FPCR mode FRINTI Dd, Dn // Double // FRINTN/M/P/Z are baseline ARMv8.0. FRINT32Z/32X/64Z/64X require FEAT_FRINTTS. 
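
// --- Worked sketch (not part of the instruction listing): S1 ≈ 1/S0 using the
// estimate + Newton-Raphson pattern described earlier in this block.
// Assumes S0 is a normal, non-zero value; S1/S2 as scratch registers are an arbitrary choice.
FRECPE  S1, S0                // x0 = rough estimate of 1/S0
FRECPS  S2, S0, S1            // S2 = 2 - S0*x0
FMUL    S1, S1, S2            // x1 = x0 * (2 - S0*x0)   (first refinement)
FRECPS  S2, S0, S1            // S2 = 2 - S0*x1
FMUL    S1, S1, S2            // x2 = x1 * (2 - S0*x1)   (second refinement)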
``` **What these REALLY do — traced with values:** ```asm // If S0 = 3.0 and S1 = 1.5: FADD S2, S0, S1 // S2 = 3.0 + 1.5 = 4.5 FSUB S2, S0, S1 // S2 = 3.0 - 1.5 = 1.5 FMUL S2, S0, S1 // S2 = 3.0 × 1.5 = 4.5 FDIV S2, S0, S1 // S2 = 3.0 ÷ 1.5 = 2.0 // Special cases the hardware handles: FDIV S2, S0, S1 // If S1 = 0.0: S2 = +infinity (not an exception!) FDIV S2, S0, S1 // If S0 = 0.0 and S1 = 0.0: S2 = NaN (0/0 is undefined) FSQRT S2, S0 // If S0 = -1.0: S2 = NaN (square root of negative) FSQRT S2, S0 // If S0 = 4.0: S2 = 2.0 ``` **Why FP doesn't trap on errors by default**: Unlike integer division (which returns 0 for divide-by-zero on ARM), FP operations produce IEEE 754 special values (infinity, NaN) instead of faulting. This lets algorithms handle edge cases without branch-heavy error checking. If you need to detect errors, check `FPSR` exception flags after the computation. ### 22.2 FP Multiply-Accumulate Fused multiply-accumulate (FMA) computes `a + (b × c)` or `a - (b × c)` with only **one** rounding step at the end, making it more accurate than separate FMUL + FADD. This is the single most important instruction for numerical performance — matrix multiply, convolution, polynomial evaluation, and physics simulations all reduce to FMA loops. ```asm FMADD Sd, Sn, Sm, Sa // Sd = Sa + (Sn * Sm), fused (single rounding) [single] FMADD Dd, Dn, Dm, Da // Dd = Da + (Dn * Dm) [double] FMSUB Sd, Sn, Sm, Sa // Sd = Sa - (Sn * Sm) [single] FMSUB Dd, Dn, Dm, Da // Dd = Da - (Dn * Dm) [double] FNMADD Sd, Sn, Sm, Sa // Sd = -Sa - (Sn * Sm) = -(Sa + Sn*Sm) [single] FNMADD Dd, Dn, Dm, Da // Dd = -Da - (Dn * Dm) [double] FNMSUB Sd, Sn, Sm, Sa // Sd = -Sa + (Sn * Sm) = Sn*Sm - Sa [single] FNMSUB Dd, Dn, Dm, Da // Dd = -Da + (Dn * Dm) = Dn*Dm - Da [double] ``` `FNMADD` and `FNMSUB` are the negated versions — they negate the entire result. `FNMADD` negates the fused multiply-add (useful for computing `-(a×b + c)`). 
`FNMSUB` computes `a×b - c` (the multiply result minus the accumulator).

**Fused**: Only one rounding at the end, not after multiply then again after add. This is more accurate than separate FMUL + FADD.

**FMADD for polynomial evaluation** (Horner's method): writing `FMADD(p, q, r)` for `p*q + r`, compilers evaluate `a*x^2 + b*x + c` as `FMADD(FMADD(a, x, b), x, c)` — two fused multiply-adds instead of separate multiply and add chains.

### 22.3 FP Conditional Select & Moves

```asm
FCSEL Sd, Sn, Sm, cond       // Sd = cond ? Sn : Sm (based on integer NZCV flags) [single]
FCSEL Dd, Dn, Dm, cond       // Dd = cond ? Dn : Dm [double]
FCSEL Hd, Hn, Hm, cond       // Hd = cond ? Hn : Hm (FEAT_FP16) [half]
```

**What FCSEL REALLY does**: It's the FP equivalent of CSEL. The condition is tested against the integer NZCV flags (typically set by a prior FCMP or CMP), and one of the two FP registers is selected. This enables branchless FP min/max:

```asm
// FP min — no-NaN version (assumes ordered inputs):
FCMP  S0, S1
FCSEL S0, S0, S1, LE         // LE includes unordered — wrong if either operand is NaN

// Ordered select (NOT a true min — picks S1 when unordered, even if S1 is NaN):
FCMP  S0, S1
FCSEL S0, S0, S1, MI         // S0 only when S0 < S1 ordered; else S1

// TRUE NaN-safe min/max — use FMINNM/FMAXNM:
FMINNM Sd, Sn, Sm            // Min, returns numeric value when one operand is quiet NaN [single]
FMINNM Dd, Dn, Dm            // [double]
FMAXNM Sd, Sn, Sm            // Max, returns numeric value when one operand is quiet NaN [single]
FMAXNM Dd, Dn, Dm            // [double]
FMIN   Sd, Sn, Sm            // Min, propagates NaN (the IEEE 754-2008 minNum behavior is FMINNM) [single]
FMIN   Dd, Dn, Dm            // [double]
FMAX   Sd, Sn, Sm            // Max, propagates NaN (the IEEE 754-2008 maxNum behavior is FMAXNM) [single]
FMAX   Dd, Dn, Dm            // [double]

// FP abs:
FABS S0, S0                  // Just use FABS — no FCSEL needed
```

**FP register-to-register moves** (no conversion, just copy):

```asm
FMOV Sd, Sn                  // Copy single-precision register
FMOV Dd, Dn                  // Copy double-precision register
```

These copy the value between FP registers without touching GPRs.
Unlike `MOV` (which is always an alias), `FMOV` between same-width FP registers is a real instruction. ### 22.4 FP Comparison `FCMP` compares two floating-point values and sets the NZCV flags. Unlike integer comparisons, floats have a special case: if either operand is NaN (Not a Number), the result is "unordered" — the operands cannot be compared. After `FCMP`, you can use the same condition codes as after integer `CMP`, plus `B.VS` to detect the NaN/unordered case. ```asm FCMP Sn, Sm // Compare single, set NZCV flags FCMP Dn, Dm // Compare double FCMP Sn, #0.0 // Compare single with zero FCMP Dn, #0.0 // Compare double with zero FCMPE Sn, Sm // Signaling compare single: signals Invalid Operation for ANY NaN FCMPE Dn, Dm // Signaling compare double // (FCMP only signals for signaling NaNs, not quiet NaNs) FCMPE Sn, #0.0 // Signaling compare single against zero FCMPE Dn, #0.0 // Signaling compare double against zero FCMP Hn, Hm // FEAT_FP16: Compare half-precision FCMP Hn, #0.0 // FEAT_FP16: Compare half against zero FCMPE Hn, Hm // FEAT_FP16: Signaling compare half FCMPE Hn, #0.0 // FEAT_FP16: Signaling compare half against zero ``` **Conditional FP compare — FCCMP / FCCMPE**: The FP analog of integer `CCMP`/`CCMN`. If the condition holds, do an FP compare and update flags; if not, write the literal `#nzcv` immediate into the flags. This is how the compiler chains FP tests like `if (a == b && c < d)` without a branch. ```asm FCCMP Sn, Sm, #nzcv, cond // If cond: flags ← FCMP(Sn, Sm). Else: flags ← #nzcv (4-bit N,Z,C,V). [single] FCCMP Dn, Dm, #nzcv, cond // [double] FCCMP Hn, Hm, #nzcv, cond // [half, FEAT_FP16] FCCMPE Sn, Sm, #nzcv, cond // Signaling variant: raises Invalid Op on ANY NaN when the compare runs FCCMPE Dn, Dm, #nzcv, cond // [double] FCCMPE Hn, Hm, #nzcv, cond // [half, FEAT_FP16] // #nzcv is 4 bits packed as N<<3 | Z<<2 | C<<1 | V. Example: #0b0100 = Z set only → "equal". // There is no FCCMP-with-immediate-zero form; compare against Sm/Dm/Hm only. 
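
// --- Sketch: the `if (a == b && c < d)` chain mentioned in the text before this block.
// Assumes a=S0, b=S1, c=S2, d=S3; `both_true` is a hypothetical label.
FCMP    S0, S1                // Z=1 only when a == b (ordered)
FCCMP   S2, S3, #0, EQ        // If EQ: flags ← compare(c, d). Else: flags ← 0000
B.MI    both_true             // N=1 only when the second compare ran and c < d (ordered)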
```

Typical use — compiling `if (a < b && c < d)` with no branch in the middle:

```asm
FCMP   S0, S1                // test a < b: an ordered less-than result sets N=1
FCCMP  S2, S3, #0, MI        // if MI (N==1, i.e. a<b): compare c,d. Else: flags = 0000 (N=0, so the final MI fails)
B.MI   both_less_than        // taken only if BOTH comparisons were LT (NaN-safe: an unordered first compare leaves N=0, so the second compare is skipped)
```

After FCMP (full NZCV per ARM ARM FPCompare pseudocode):

- Ordered and equal: N=0, Z=1, C=1, V=0
- Ordered and less than: N=1, Z=0, C=0, V=0 (so N!=V → LT; C=0 is why HI doesn't fire)
- Ordered and greater than: N=0, Z=0, C=1, V=0 (so HI and GT both fire)
- Unordered (NaN involved): N=0, Z=0, C=1, V=1

Use `B.VS` to check for NaN after FCMP.

**FACGT / FACGE — FP absolute-value compare** (base NEON, scalar and vector forms):

```asm
// Scalar (produces an all-ones mask in the destination FP register when the comparison is true):
FACGT Hd, Hn, Hm             // Hd = (|Hn| > |Hm|) ? -1 : 0 (FEAT_FP16)
FACGT Sd, Sn, Sm             // Sd = (|Sn| > |Sm|) ? -1 : 0
FACGT Dd, Dn, Dm             // Dd = (|Dn| > |Dm|) ? -1 : 0
FACGE Hd, Hn, Hm             // Hd = (|Hn| >= |Hm|) ? -1 : 0 (FEAT_FP16)
FACGE Sd, Sn, Sm             // Sd = (|Sn| >= |Sm|) ? -1 : 0
FACGE Dd, Dn, Dm             // Dd = (|Dn| >= |Dm|) ? -1 : 0

// Vector (per-lane mask — all 5 FP arrangements; FP16 forms require FEAT_FP16):
FACGT V0.4H, V1.4H, V2.4H    // FEAT_FP16
FACGT V0.8H, V1.8H, V2.8H    // FEAT_FP16
FACGT V0.2S, V1.2S, V2.2S
FACGT V0.4S, V1.4S, V2.4S
FACGT V0.2D, V1.2D, V2.2D
FACGE V0.4H, V1.4H, V2.4H    // FEAT_FP16 — same arrangement set as FACGT
FACGE V0.8H, V1.8H, V2.8H    // FEAT_FP16
FACGE V0.2S, V1.2S, V2.2S
FACGE V0.4S, V1.4S, V2.4S
FACGE V0.2D, V1.2D, V2.2D
```

**Why absolute-value compare**: magnitude checks like `fabs(x) < epsilon` need a single `FACGT` (with the operands swapped) instead of an `FABS` on each operand followed by a compare. The result is a mask in an FP/SIMD register, not NZCV flags, so scalar code that wants to branch must first move the mask to a GPR (for example with `FMOV`); vector code typically feeds the mask to a bit-select such as `BSL`. Also used in clipping/saturation: `FACGT` tells you whether a value has exceeded a symmetric bound without caring about its sign.
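
A minimal sketch of the tolerance check described above, assuming `x` is in S0 and a non-negative `epsilon` is in S1. The register choices and the label `within_tolerance` are illustrative only. Scalar `FACGT` writes a mask to an FP register and does not set NZCV, so the sketch copies the mask to a GPR before branching.

```asm
FACGT  S2, S1, S0             // S2 = (|epsilon| > |x|) ? 0xFFFFFFFF : 0
FMOV   W8, S2                 // Copy the mask bits to a GPR (no conversion)
CBNZ   W8, within_tolerance   // Non-zero mask means fabs(x) < epsilon
```

If either input is NaN the comparison is false, so the mask is zero and the branch is not taken.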
**Critical NaN gotcha**: NaN is not equal to **anything**, including itself. `FCMP S0, S0` where S0=NaN sets V=1 (unordered), Z=0 (not equal). This means `B.EQ` after comparing NaN to itself is **NOT taken**. This is the standard IEEE 754 behavior and is how `isnan(x)` works: `x != x` is true only if x is NaN. In ARM assembly: `FCMP S0, S0; B.VS is_nan` (check the V flag directly).

**Traced example:**

```asm
// If S0 = 3.14 and S1 = 2.71:
FCMP S0, S1                  // 3.14 > 2.71 → flags: N=0,Z=0,C=1,V=0
B.GT greater_label           // GT = (Z==0 && N==V) = (0==0 && 0==0) = true → taken ✓
B.HI greater_label           // HI = (C==1 && Z==0) = true → works here, but CAUTION:
                             // B.HI also triggers on NaN! Use B.GT for FP greater (NaN-safe).

// If S0 = NaN:
FCMP S0, S1                  // NaN involved → flags: N=0,Z=0,C=1,V=1 (unordered)
B.VS nan_label               // VS = (V==1) = true → branch to NaN handler ✓
B.GT greater_label           // GT = (Z==0 && N==V) = (0==0 && 0==1) = false → NOT taken ✓
                             // (NaN is not greater than anything — GT excludes NaN)
B.LT less_label              // LT = (N!=V) = (0!=1) = TRUE → TAKEN!
                             // CAREFUL: B.LT IS taken for NaN! Use B.MI for "less than, not NaN"
```

**Which conditions are NaN-safe after FCMP?** This matters because NaN sets N=0,Z=0,C=1,V=1. Any condition that evaluates to true with these flags will fire on NaN:

| FP comparison you want | NaN-safe (excludes NaN) | NaN-UNSAFE (includes NaN) |
|---|---|---|
| Greater than | `B.GT` | `B.HI` |
| Greater or equal | `B.GE` | `B.HS` / `B.CS` |
| Less than | `B.MI` | `B.LT` |
| Less or equal | `B.LS` | `B.LE` |
| Equal | `B.EQ` | — |
| Not equal | — | `B.NE` (includes NaN — usually what you want) |
| Unordered (is NaN?) | `B.VS` | — |

**Why this works**: FCMP never sets N=1 and V=1 simultaneously — there's no "signed overflow" in FP comparison. So `B.MI` (N==1) only fires for ordered-less-than, never for NaN (which sets N=0). Similarly, `B.GT` (Z == 0 && N == V) excludes NaN because NaN sets V=1 but N=0 (so N≠V).
The unsigned conditions (`HI`, `HS`) are unsafe because NaN sets C=1, which is the same as "unsigned higher."

### 22.5 FP ↔ Integer Conversion

These convert between integer and floating-point representations. The value is mathematically converted (not just bit-reinterpreted). For example, `SCVTF Sd, Wn` takes the signed integer in Wn and produces the nearest float in Sd. The reverse (`FCVTZS`) converts a float to an integer, rounding toward zero (truncating the fractional part, like a C cast `(int)f`). Other rounding modes are also available.

**What these REALLY do:**

```asm
// Integer → Float:
// If W0 = 42 (integer):
SCVTF S1, W0                 // S1 = 42.0 (float representation of the integer 42)
UCVTF S1, W0                 // Same result for positive numbers

// If W0 = -7 (signed integer):
SCVTF S1, W0                 // S1 = -7.0 (signed conversion, preserves negative)
UCVTF S1, W0                 // S1 = 4294967296.0 (unsigned! -7 as unsigned 32-bit is 4294967289,
                             // which rounds to 2^32 because single precision has only 24 significand bits)

// Float → Integer:
// If S0 = 3.7:
FCVTZS W1, S0                // W1 = 3 (truncate toward zero — drops the .7)
FCVTNS W1, S0                // W1 = 4 (round to nearest, .7 rounds up)
FCVTMS W1, S0                // W1 = 3 (floor — round toward minus infinity)
FCVTPS W1, S0                // W1 = 4 (ceiling — round toward plus infinity)

// If S0 = -3.7:
FCVTZS W1, S0                // W1 = -3 (truncate toward zero — NOT -4!)
FCVTMS W1, S0                // W1 = -4 (floor — round toward minus infinity)

// Edge cases — all FP→int converts saturate on overflow and return 0 for NaN:
// If S0 = NaN:               FCVTZS W1, S0 → W1 = 0 (all FCVTxx variants)
// If S0 = +infinity:         FCVTZS W1, S0 → W1 = INT32_MAX = 0x7FFFFFFF
// If S0 = -infinity:         FCVTZS W1, S0 → W1 = INT32_MIN = 0x80000000
// If S0 = 3.0e10 (too big):  FCVTZS W1, S0 → W1 = INT32_MAX (saturates, no wrap)
// If S0 = -3.0e10 (too neg): FCVTZS W1, S0 → W1 = INT32_MIN (saturates)
// For FCVTZU: NaN → 0, +inf/too-big → UINT_MAX, -inf/negative → 0
// This saturation is hardware-guaranteed; FPSR.IOC (Invalid Op) is set on NaN/overflow.
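
// --- Sketch: using FPSR.IOC to tell a saturated or NaN conversion from a clean one.
// Assumes the input is in S0 and that Invalid Operation trapping is left disabled;
// `conversion_invalid` is a hypothetical label. Clearing FPSR discards all earlier sticky flags.
MSR     FPSR, XZR                   // Clear the cumulative flags first
FCVTZS  W1, S0                      // Saturating convert (NaN gives 0)
MRS     X2, FPSR                    // Read the flags back
TBNZ    W2, #0, conversion_invalid  // IOC (bit 0) set: input was NaN or out of range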
``` ```asm // Float → Signed integer (round toward zero): FCVTZS Wd|WZR, Sn // Single → signed 32-bit FCVTZS Xd|XZR, Sn // Single → signed 64-bit FCVTZS Wd|WZR, Dn // Double → signed 32-bit FCVTZS Xd|XZR, Dn // Double → signed 64-bit FCVTZS Wd|WZR, Hn // Half → signed 32-bit (FEAT_FP16) FCVTZS Xd|XZR, Hn // Half → signed 64-bit (FEAT_FP16) // Float → Unsigned integer (round toward zero): FCVTZU Wd|WZR, Sn // Single → unsigned 32-bit FCVTZU Xd|XZR, Sn // Single → unsigned 64-bit FCVTZU Wd|WZR, Dn // Double → unsigned 32-bit FCVTZU Xd|XZR, Dn // Double → unsigned 64-bit FCVTZU Wd|WZR, Hn // Half → unsigned 32-bit (FEAT_FP16) FCVTZU Xd|XZR, Hn // Half → unsigned 64-bit (FEAT_FP16) // FCVT with OTHER rounding modes — these exist as independent mnemonics (NOT aliases). // The letter after FCVT selects the rounding mode; the trailing S/U selects sign: // // A = Away from zero on ties (IEEE "round to nearest ties-away") // M = Minus infinity (floor) // N = Nearest ties-even (IEEE default — "round to nearest ties-even") // P = Plus infinity (ceiling) // Z = Zero (truncate) — shown above as FCVTZS/FCVTZU // // Full signed set: FCVTAS, FCVTMS, FCVTNS, FCVTPS, FCVTZS. // Full unsigned set: FCVTAU, FCVTMU, FCVTNU, FCVTPU, FCVTZU. // Every mnemonic has 4 scalar forms (Wd/Xd × Sn/Dn) plus FEAT_FP16 adds Hn-input variants // (Wd/Xd × Hn). Shown explicitly for FCVTAS; the other rounding modes follow the identical pattern. 
FCVTAS Wd|WZR, Sn            // single → signed 32-bit, round to nearest ties-away
FCVTAS Xd|XZR, Sn            // single → signed 64-bit
FCVTAS Wd|WZR, Dn            // double → signed 32-bit
FCVTAS Xd|XZR, Dn            // double → signed 64-bit
FCVTAS Wd|WZR, Hn            // half → signed 32-bit (FEAT_FP16)
FCVTAS Xd|XZR, Hn            // half → signed 64-bit (FEAT_FP16)

// FCVTMS (floor), FCVTNS (round-nearest-ties-even), FCVTPS (ceiling) accept the same six forms —
// just substitute the mnemonic and the rounding-mode semantic:
FCVTMS Wd|WZR, Sn | FCVTMS Xd|XZR, Sn | FCVTMS Wd|WZR, Dn | FCVTMS Xd|XZR, Dn
FCVTMS Wd|WZR, Hn | FCVTMS Xd|XZR, Hn   // FEAT_FP16
FCVTNS Wd|WZR, Sn | FCVTNS Xd|XZR, Sn | FCVTNS Wd|WZR, Dn | FCVTNS Xd|XZR, Dn
FCVTNS Wd|WZR, Hn | FCVTNS Xd|XZR, Hn   // FEAT_FP16
FCVTPS Wd|WZR, Sn | FCVTPS Xd|XZR, Sn | FCVTPS Wd|WZR, Dn | FCVTPS Xd|XZR, Dn
FCVTPS Wd|WZR, Hn | FCVTPS Xd|XZR, Hn   // FEAT_FP16

// Unsigned variants (FCVTAU/FCVTMU/FCVTNU/FCVTPU) — identical operand shape to the signed set above:
FCVTAU Wd|WZR, Sn | FCVTAU Xd|XZR, Sn | FCVTAU Wd|WZR, Dn | FCVTAU Xd|XZR, Dn
FCVTAU Wd|WZR, Hn | FCVTAU Xd|XZR, Hn   // FEAT_FP16
FCVTMU Wd|WZR, Sn | FCVTMU Xd|XZR, Sn | FCVTMU Wd|WZR, Dn | FCVTMU Xd|XZR, Dn
FCVTMU Wd|WZR, Hn | FCVTMU Xd|XZR, Hn   // FEAT_FP16
FCVTNU Wd|WZR, Sn | FCVTNU Xd|XZR, Sn | FCVTNU Wd|WZR, Dn | FCVTNU Xd|XZR, Dn
FCVTNU Wd|WZR, Hn | FCVTNU Xd|XZR, Hn   // FEAT_FP16
FCVTPU Wd|WZR, Sn | FCVTPU Xd|XZR, Sn | FCVTPU Wd|WZR, Dn | FCVTPU Xd|XZR, Dn
FCVTPU Wd|WZR, Hn | FCVTPU Xd|XZR, Hn   // FEAT_FP16

// Saturation and NaN behavior is identical to FCVTZS/FCVTZU — NaN → 0,
// out-of-range → INT_MAX/MIN (signed) or UINT_MAX/0 (unsigned).
// NEON vector forms also exist (FCVTAS V0.4S, V1.4S, etc.) with the same rounding-mode letters
// and arrangements {.2S/.4S/.2D, and .4H/.8H with FEAT_FP16}; scalar SIMD forms Sd/Dd (and Hd with FEAT_FP16) also exist.
// Signed integer → Float: SCVTF Sd, Wn|WZR // Signed 32-bit → single SCVTF Sd, Xn|XZR // Signed 64-bit → single (may lose precision) SCVTF Dd, Wn|WZR // Signed 32-bit → double (lossless) SCVTF Dd, Xn|XZR // Signed 64-bit → double SCVTF Hd, Wn|WZR // Signed 32-bit → half (FEAT_FP16) SCVTF Hd, Xn|XZR // Signed 64-bit → half (FEAT_FP16) // Unsigned integer → Float: UCVTF Sd, Wn|WZR // Unsigned 32-bit → single UCVTF Sd, Xn|XZR // Unsigned 64-bit → single (may lose precision) UCVTF Dd, Wn|WZR // Unsigned 32-bit → double (lossless) UCVTF Dd, Xn|XZR // Unsigned 64-bit → double UCVTF Hd, Wn|WZR // Unsigned 32-bit → half (FEAT_FP16) UCVTF Hd, Xn|XZR // Unsigned 64-bit → half (FEAT_FP16) // FIXED-POINT CONVERSION (Q-format / DSP) — same mnemonics with a #fbits parameter. // #fbits is the number of fractional bits (1..W-size): 1–32 for Wn/Wd, 1–64 for Xn/Xd. // Integer-to-float: value is treated as int << fbits (i.e., divide by 2^fbits on the way out): SCVTF Sd, Wn|WZR, #fbits // fbits = 1..32. Sd = (float)Wn / 2^fbits SCVTF Sd, Xn|XZR, #fbits // fbits = 1..64 SCVTF Dd, Wn|WZR, #fbits SCVTF Dd, Xn|XZR, #fbits SCVTF Hd, Wn|WZR, #fbits // FEAT_FP16 fixed-point → half SCVTF Hd, Xn|XZR, #fbits UCVTF Sd, Wn|WZR, #fbits // same with unsigned input UCVTF Sd, Xn|XZR, #fbits UCVTF Dd, Wn|WZR, #fbits UCVTF Dd, Xn|XZR, #fbits UCVTF Hd, Wn|WZR, #fbits // FEAT_FP16 UCVTF Hd, Xn|XZR, #fbits // Float-to-fixed-point: multiply by 2^fbits then truncate toward zero: FCVTZS Wd|WZR, Sn, #fbits // Wd = trunc(Sn * 2^fbits), signed FCVTZS Xd|XZR, Sn, #fbits FCVTZS Wd|WZR, Dn, #fbits FCVTZS Xd|XZR, Dn, #fbits FCVTZS Wd|WZR, Hn, #fbits // FEAT_FP16 half → fixed-point (32-bit result) FCVTZS Xd|XZR, Hn, #fbits // FEAT_FP16 half → fixed-point (64-bit result) FCVTZU Wd|WZR, Sn, #fbits // same with unsigned output FCVTZU Xd|XZR, Sn, #fbits FCVTZU Wd|WZR, Dn, #fbits FCVTZU Xd|XZR, Dn, #fbits FCVTZU Wd|WZR, Hn, #fbits // FEAT_FP16 FCVTZU Xd|XZR, Hn, #fbits // FEAT_FP16 // Example: Q16.16 fixed-point 
encoding of 3.5: // MOV W0, #0x00038000 ; Q16.16 value for 3.5 // SCVTF S0, W0, #16 ; S0 = 0x00038000 / 2^16 = 3.5 // FCVTZS W0, S0, #16 ; round trip: W0 = 0x00038000 again // FCVTZS / FCVTZU — truncate-toward-zero FP → integer. FP16 source variants (FEAT_FP16): FCVTZS Wd|WZR, Hn // Half → signed 32-bit FCVTZS Xd|XZR, Hn // Half → signed 64-bit FCVTZU Wd|WZR, Hn // Half → unsigned 32-bit FCVTZU Xd|XZR, Hn // Half → unsigned 64-bit // Other rounding modes — each has ALL width combinations. Rounding-mode letters: // A = nearest, ties Away from zero M = Minus-infinity (floor) // N = Nearest, ties to eveN P = Plus-infinity (ceiling) // Z = toward Zero (truncate — this is just FCVTZS / FCVTZU shown earlier) // Width combinations (each of the 4 signed + 4 unsigned mnemonics below accepts all of these): // Wd|WZR, Sn Xd|XZR, Sn Wd|WZR, Dn Xd|XZR, Dn Wd|WZR, Hn Xd|XZR, Hn (Hn variants need FEAT_FP16) // Signed: FCVTAS Wd|WZR, Sn // Round to nearest, ties away from zero FCVTAS Xd|XZR, Sn FCVTAS Wd|WZR, Dn FCVTAS Xd|XZR, Dn FCVTAS Wd|WZR, Hn // FEAT_FP16 FCVTAS Xd|XZR, Hn // FEAT_FP16 FCVTNS Wd|WZR, Sn // Round to nearest, ties to even — same 6 width combos as FCVTAS FCVTNS Xd|XZR, Sn FCVTNS Wd|WZR, Dn FCVTNS Xd|XZR, Dn FCVTNS Wd|WZR, Hn // FEAT_FP16 FCVTNS Xd|XZR, Hn // FEAT_FP16 FCVTMS Wd|WZR, Sn // Round toward −∞ (floor) — same 6 width combos FCVTMS Xd|XZR, Sn FCVTMS Wd|WZR, Dn FCVTMS Xd|XZR, Dn FCVTMS Wd|WZR, Hn // FEAT_FP16 FCVTMS Xd|XZR, Hn // FEAT_FP16 FCVTPS Wd|WZR, Sn // Round toward +∞ (ceiling) — same 6 width combos FCVTPS Xd|XZR, Sn FCVTPS Wd|WZR, Dn FCVTPS Xd|XZR, Dn FCVTPS Wd|WZR, Hn // FEAT_FP16 FCVTPS Xd|XZR, Hn // FEAT_FP16 // Unsigned (same 6 width combos for each): FCVTAU Wd|WZR, Sn // Round to nearest, ties away (unsigned) FCVTAU Xd|XZR, Sn FCVTAU Wd|WZR, Dn FCVTAU Xd|XZR, Dn FCVTAU Wd|WZR, Hn // FEAT_FP16 FCVTAU Xd|XZR, Hn // FEAT_FP16 FCVTNU Wd|WZR, Sn // Round to nearest, ties to even (unsigned) FCVTNU Xd|XZR, Sn FCVTNU Wd|WZR, Dn FCVTNU Xd|XZR, Dn 
FCVTNU Wd|WZR, Hn // FEAT_FP16 FCVTNU Xd|XZR, Hn // FEAT_FP16 FCVTMU Wd|WZR, Sn // Round toward −∞ (unsigned) FCVTMU Xd|XZR, Sn FCVTMU Wd|WZR, Dn FCVTMU Xd|XZR, Dn FCVTMU Wd|WZR, Hn // FEAT_FP16 FCVTMU Xd|XZR, Hn // FEAT_FP16 FCVTPU Wd|WZR, Sn // Round toward +∞ (unsigned) FCVTPU Xd|XZR, Sn FCVTPU Wd|WZR, Dn FCVTPU Xd|XZR, Dn FCVTPU Wd|WZR, Hn // FEAT_FP16 FCVTPU Xd|XZR, Hn // FEAT_FP16 ``` **FJCVTZS — FP → signed 32-bit with JavaScript semantics** (FEAT_JSCVT, mandatory from ARMv8.3-A): ```asm FJCVTZS Wd|WZR, Dn // Double → signed 32-bit using ECMAScript's ToInt32 rules. // Round toward zero, then modulo-wrap into the int32 range. // Also updates the Z flag: Z=1 iff the input was exactly representable // (no fractional part AND within int32 range). // Z=0 tells the JIT "this wasn't a clean conversion" so it can fall // back to a slow path. ``` **Why this instruction exists**: JavaScript's `|0` idiom (and the spec's `ToInt32`) requires converting a double to a signed 32-bit integer with specific behavior for out-of-range values, NaN, and infinities — all map to 0, 2^32-aliased, etc. Without `FJCVTZS`, a JIT needed 5+ instructions to replicate the semantics. With it: one instruction, and the Z flag tells the JIT whether a fast-path assumption (integer-clean value) held. Used heavily by V8, JavaScriptCore, and SpiderMonkey on ARM64. ### 22.6 FP ↔ GPR Moves (no conversion) `FMOV` copies raw bits between a general-purpose register and a floating-point register **without any conversion**. The bit pattern is preserved exactly. This is different from `SCVTF`/`FCVTZS` which mathematically convert the value. `FMOV Sd, #fimm` loads a floating-point constant directly, but only a limited set of 256 values are encodable. 
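
Because `FMOV` between a GPR and an FP register preserves the bit pattern, ordinary integer bit-field instructions can take a float apart. The following is a minimal sketch, assuming S0 holds a normal single-precision value; the destination registers are arbitrary.

```asm
FMOV  W0, S0                 // W0 = raw IEEE 754 bits of S0
LSR   W1, W0, #31            // W1 = sign bit (bit 31)
UBFX  W2, W0, #23, #8        // W2 = biased exponent (bits [30:23])
AND   W3, W0, #0x7FFFFF      // W3 = fraction field (bits [22:0])
SUB   W2, W2, #127           // W2 = unbiased exponent (meaningful for normal numbers only)
```

For S0 = 3.0 (bit pattern 0x40400000) this yields sign 0, unbiased exponent 1, and fraction 0x400000, i.e. 1.5 × 2^1.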
**What FMOV REALLY does vs SCVTF — critical difference:** ```asm // If W0 = 0x40400000 (which happens to be the IEEE 754 encoding of 3.0): FMOV S1, W0 // S1 = 3.0 (raw bit copy — 0x40400000 IS 3.0 in float) SCVTF S2, W0 // S2 = 1077936128.0 (treats W0 as integer 0x40400000 = 1077936128, // converts that integer to float) // These give COMPLETELY different results! FMOV preserves bits, SCVTF converts values. // Going the other direction: // If S0 = 3.0 (bit pattern 0x40400000): FMOV W1, S0 // W1 = 0x40400000 (raw bits of the float) FCVTZS W2, S0 // W2 = 3 (mathematical conversion: 3.0 → 3) ``` **When to use FMOV vs SCVTF**: Use `SCVTF` when converting between number types (int→float). Use `FMOV` when you need to manipulate the raw bits of a float (e.g., extracting the exponent, comparing float bit patterns, or passing floats through integer registers in a calling convention). ```asm FMOV Sd, Wn|WZR // Copy bits: GPR → single FP (no conversion) FMOV Wd|WZR, Sn // Copy bits: single FP → GPR FMOV Dd, Xn|XZR // Copy bits: GPR → double FP FMOV Xd|XZR, Dn // Copy bits: double FP → GPR FMOV Vd.D[1], Xn|XZR // Copy bits: GPR → upper 64 bits of 128-bit V register FMOV Xd|XZR, Vn.D[1] // Copy bits: upper 64 bits of V register → GPR FMOV Sd, #fimm // Load FP immediate (limited set of 256 values) FMOV Dd, #fimm // Double-precision immediate (same 256 values) FMOV Hd, #fimm // Half-precision immediate (FEAT_FP16, same 256 values) ``` The FP immediate (`#fimm`) can encode values of the form: `±(1 + m/16) × 2^(n)` where 0 ≤ m ≤ 15 and -3 ≤ n ≤ 4. This gives 256 possible values. NOT arbitrary floats. 
Some examples: ``` // m=0, n=0: ±(1 + 0/16) × 2^0 = ±1.0 // m=0, n=1: ±(1 + 0/16) × 2^1 = ±2.0 // m=0, n=-1: ±(1 + 0/16) × 2^-1 = ±0.5 // m=8, n=0: ±(1 + 8/16) × 2^0 = ±1.5 // m=0, n=4: ±(1 + 0/16) × 2^4 = ±16.0 // m=15, n=4: ±(1 + 15/16) × 2^4 = ±31.0 // m=0, n=-3: ±(1 + 0/16) × 2^-3 = ±0.125 // You CANNOT encode 0.0, 0.1, 0.3, or π with FMOV immediate // To load 0.0: use FMOV Sd, WZR (copies all-zero bits = IEEE 754 +0.0) // To load other non-encodable constants: use LDR from a literal pool ``` ### 22.7 FP Precision Conversion These convert between different FP widths (half ↔ single ↔ double). Widening conversions (half→single, single→double) are lossless. Narrowing conversions (double→single, single→half) may lose precision and round. ```asm FCVT Dd, Sn // Single → Double (lossless, no precision lost) FCVT Sd, Dn // Double → Single (may lose precision, rounds) FCVT Hd, Sn // Single → Half (may lose precision) FCVT Sd, Hn // Half → Single (lossless) FCVT Hd, Dn // Double → Half FCVT Dd, Hn // Half → Double (lossless) ``` ### 22.8 Half-Precision (FP16) Operations **FEAT_FP16** (ARMv8.2-A and later) adds native arithmetic on 16-bit floats. Without this feature, half-precision registers (H0–H31) can only be used as a storage format — you convert to single/double to compute, then convert back. 
With FEAT_FP16, you get direct arithmetic: ```asm // Half-precision arithmetic (FEAT_FP16): FADD Hd, Hn, Hm // 16-bit float add FSUB Hd, Hn, Hm FMUL Hd, Hn, Hm FDIV Hd, Hn, Hm FSQRT Hd, Hn FMADD Hd, Hn, Hm, Ha // Fused multiply-add FMSUB Hd, Hn, Hm, Ha // Fused multiply-subtract FNMADD Hd, Hn, Hm, Ha // Negated fused multiply-add: -(Ha + Hn*Hm) FNMSUB Hd, Hn, Hm, Ha // Negated fused multiply-subtract: Hn*Hm - Ha FABS Hd, Hn FNEG Hd, Hn FMIN Hd, Hn, Hm // IEEE 754 minimum (propagates NaN) FMAX Hd, Hn, Hm // IEEE 754 maximum (propagates NaN) FMINNM Hd, Hn, Hm // Min, returns number when one operand is quiet NaN FMAXNM Hd, Hn, Hm // Max, returns number when one operand is quiet NaN FMOV Hd, Hn // Register-to-register copy (no conversion) FRINTA Hd, Hn // Round to nearest, ties away from zero FRINTN Hd, Hn // Round to nearest, ties to even FRINTM Hd, Hn // Round toward −∞ (floor) FRINTP Hd, Hn // Round toward +∞ (ceil) FRINTZ Hd, Hn // Round toward zero (truncate) FRINTX Hd, Hn // Round using FPCR mode, signal inexact FRINTI Hd, Hn // Round using FPCR mode FCMP Hn, Hm FCVTZS Wd|WZR, Hn // FP16 → signed 32-bit int FCVTZS Xd|XZR, Hn // FP16 → signed 64-bit int FCVTZU Wd|WZR, Hn // FP16 → unsigned 32-bit int FCVTZU Xd|XZR, Hn // FP16 → unsigned 64-bit int SCVTF Hd, Wn|WZR // Signed 32-bit int → FP16 SCVTF Hd, Xn|XZR // Signed 64-bit int → FP16 UCVTF Hd, Wn|WZR // Unsigned 32-bit int → FP16 UCVTF Hd, Xn|XZR // Unsigned 64-bit int → FP16 FMOV Hd, Wn|WZR // Copy raw bits GPR → FP16 (no conversion) FMOV Wd|WZR, Hn // Copy raw bits FP16 → GPR ``` **Why FP16 matters**: Machine learning inference uses FP16 (and even smaller formats) because neural network weights don't need full precision. FP16 gives 2× the throughput of FP32 at half the memory bandwidth, which is often the bottleneck. ARM also supports BFloat16 (BF16, via FEAT_BF16), which has the same 8-bit exponent as FP32 but only 7 mantissa bits — it trades precision for range, which works well for training. 
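
On cores without FEAT_FP16, half-precision data is still usable as a storage format by widening, computing in single precision, and narrowing again. A minimal sketch, assuming X0 points to two consecutive half-precision inputs followed by space for one result (a hypothetical layout chosen for illustration):

```asm
LDR   H0, [X0]               // Load first half-precision input
LDR   H1, [X0, #2]           // Load second half-precision input
FCVT  S0, H0                 // Widen half → single (lossless)
FCVT  S1, H1
FADD  S0, S0, S1             // Compute in single precision
FCVT  H0, S0                 // Narrow single → half (rounds; may overflow to infinity)
STR   H0, [X0, #4]           // Store the half-precision result
```

With FEAT_FP16 the three conversions and the single-precision add reduce to one `FADD H0, H0, H1`.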
**FP16 format**: 1 sign bit, 5 exponent bits, 10 mantissa bits. Range: ±65504, smallest normal: ~6.1e-5. The limited range means overflow to infinity is common — this is acceptable in ML but dangerous in general-purpose code.

### 22.9 FP Rounding Modes (FPCR)

The `FPCR` (Floating-Point Control Register) controls the rounding mode via bits [23:22]:

| FPCR.RMode | Meaning |
|---|---|
| 00 | Round to Nearest, ties to Even (default — IEEE 754) |
| 01 | Round toward Plus Infinity (ceiling) |
| 10 | Round toward Minus Infinity (floor) |
| 11 | Round toward Zero (truncation) |

```asm
MRS X0, FPCR                      // Read current FP control
AND X0, X0, #~(0b11 << 22)        // Clear rounding mode bits [23:22] (AND with inverted mask)
ORR X0, X0, #(0b01 << 22)         // Set round-toward-plus-infinity
MSR FPCR, X0                      // Write back
// Note: ORR alone is NOT enough — it only sets bits. If the previous mode was
// 10 (floor), ORR with 01 gives 11 (truncate), not 01. Clear first, then set.
```

Most code uses the default (Round to Nearest, ties to Even) and never touches FPCR. The `FCVTZS`/`FCVTZU` instructions always round toward zero regardless of FPCR — the "Z" in their name stands for "Zero" (the rounding mode, not the zero register).

**Flush-to-Zero (FZ bit)**: FPCR bit [24]. When set, **denormalized** (subnormal) float results are flushed to zero instead of being represented as tiny non-zero values. Denormals are numbers smaller than the smallest normal float (e.g., below ~1.18e-38 for single-precision). Processing denormals is slow on many CPUs (up to 100x slower in the worst case) because the hardware handles them on a slower path than normal values. Setting FZ=1 avoids this penalty at the cost of losing precision near zero. Most games and media applications set FZ=1; scientific code leaves it at 0 for accuracy.
```asm
// Enable flush-to-zero:
MRS X0, FPCR
ORR X0, X0, #(1 << 24)       // Set FZ bit
MSR FPCR, X0
```

**FPSR (FP Status Register)**: Records **cumulative** exception flags from FP operations — these flags are "sticky" (once set, they stay set until you clear them). Check FPSR after a sequence of FP operations to see if anything unusual happened:

| FPSR bit | Flag | Meaning |
|---|---|---|
| [0] | IOC | Invalid Operation (0/0, sqrt of negative, NaN input) |
| [1] | DZC | Division by Zero (finite ÷ 0 → ±infinity) |
| [2] | OFC | Overflow (result too large for the format) |
| [3] | UFC | Underflow (result too small, became denormal or zero) |
| [4] | IXC | Inexact (result was rounded — extremely common, almost always set) |
| [7] | IDC | Input Denormal (a denormal input was consumed) |

```asm
MRS X0, FPSR                 // Read cumulative FP exception flags
TST X0, #1                   // Check IOC (Invalid Operation)
B.NE had_invalid_op          // Branch if any FP operation was invalid
MSR FPSR, XZR                // Clear all flags
```

---

## 23. NEON / Advanced SIMD Overview

NEON (also called Advanced SIMD) processes multiple data elements in parallel using a single instruction — this is SIMD (Single Instruction, Multiple Data). A 128-bit V register can hold, for example, four 32-bit integers or sixteen 8-bit bytes. One NEON `ADD V0.4S, V1.4S, V2.4S` adds four pairs of 32-bit integers simultaneously.

**Why SIMD matters**: Scalar code processes one value per instruction. If you need to add 1000 pairs of 32-bit numbers, that's 1000 ADD instructions. With NEON `.4S`, it's 250 ADD instructions: the same work in a quarter of the instructions. For byte-level operations (image processing, string scanning), `.16B` processes 16 elements per instruction. This is why compilers auto-vectorize loops and why hand-written NEON dominates in codecs, crypto, and ML inference.

**How lanes work**: Each V register is divided into **lanes** (also called elements). `V0.4S` means V0 is viewed as 4 lanes of 32-bit (S) values.
An operation like `ADD V0.4S, V1.4S, V2.4S` adds lane 0 of V1 to lane 0 of V2 into lane 0 of V0, lane 1 to lane 1, etc. — all independently, in parallel. There is no carry or overflow between lanes.

**64-bit (D) vs 128-bit (Q) operations**: The 64-bit arrangements (`.8B`, `.4H`, `.2S`, `.1D`) operate on the lower 64 bits of the register only — the upper 64 bits of the destination are **zeroed**. The 128-bit arrangements (`.16B`, `.8H`, `.4S`, `.2D`) use all 128 bits. The 64-bit forms are useful when you only have 64 bits of data to process. They do not preserve the upper half of the destination; they clear it.

**Common NEON housekeeping:**

```asm
// Zero a vector register (two ways):
MOVI V0.4S, #0                  // Set all lanes to zero (preferred — single instruction, no source register)
EOR  V0.16B, V0.16B, V0.16B     // XOR with self = zero (also works; whether a core treats it as a
                                // dependency-free zeroing idiom is implementation-specific)

// Set all lanes to a constant:
MOVI V0.4S, #0xFF               // All 32-bit lanes = 0x000000FF (only certain immediates encodable)
MOVI V0.16B, #0x55              // All bytes = 0x55

// Broadcast a GPR value to all lanes:
DUP V0.4S, Wn|WZR               // Fill all 4 lanes with low 32 bits of Wn
DUP V0.2D, Xn|XZR               // Fill both 64-bit lanes with Xn

// Broadcast one lane to all lanes:
DUP V0.4S, V1.S[i]              // Fill all 4 lanes with lane i of V1
```

### 23.1 Vector Arrangement Specifiers

The suffix like `.4S` or `.8B` tells the CPU how to interpret the register: how many elements and what size each element is.

| Specifier (64-bit / 128-bit) | Element size | Elements per 64-bit D | Elements per 128-bit Q |
|---|---|---|---|
| `.8B` / `.16B` | 8-bit | 8 | 16 |
| `.4H` / `.8H` | 16-bit | 4 | 8 |
| `.2S` / `.4S` | 32-bit | 2 | 4 |
| `.1D` / `.2D` | 64-bit | 1 | 2 |

### 23.2 Vector Immediate Instructions — MOVI / MVNI / ORR / BIC (immediate)

NEON lets you load an 8-bit immediate into every lane of a vector — optionally shifted, optionally inverted, optionally OR'd or AND-NOT'd with the destination.
These four mnemonics (`MOVI`, `MVNI`, `ORR` (immediate), `BIC` (immediate)) share the same 8-bit-immediate encoding and the same `cmode`/`op` bit field; each `cmode` selects an arrangement + shift combination. Understanding the rules lets you predict which compile-time constants land in one instruction vs. needing a literal-pool load. ```asm // === MOVI — Move Immediate to vector === // Byte element (no shift — 8-bit immediate directly fills each byte): MOVI V0.8B, #imm8 // imm8 ∈ 0..255; duplicated to 8 byte lanes MOVI V0.16B, #imm8 // 16 byte lanes // Halfword element (imm8 shifted by LSL #0 or #8 — i.e. placed in low or high byte of each 16-bit lane): MOVI V0.4H, #imm8{, LSL #0|#8} // T = 4H/8H; shift in {0, 8} MOVI V0.8H, #imm8{, LSL #0|#8} // Word element (imm8 shifted into one of four byte positions of each 32-bit lane): MOVI V0.2S, #imm8{, LSL #0|#8|#16|#24} // T = 2S/4S; shift in {0, 8, 16, 24} MOVI V0.4S, #imm8{, LSL #0|#8|#16|#24} // Word element with MSL (Masked Shift Left): imm8 is shifted left by 8 or 16, and the // low bits BELOW the shift are filled with ONES instead of zeros. Useful for building // values like 0x0000_00FF, 0x0000_FFFF directly. MOVI V0.2S, #imm8, MSL #8 // Each 32-bit lane = (imm8 << 8) | 0xFF MOVI V0.2S, #imm8, MSL #16 // Each 32-bit lane = (imm8 << 16) | 0xFFFF MOVI V0.4S, #imm8, MSL #8 MOVI V0.4S, #imm8, MSL #16 // Doubleword element — SPECIAL encoding. The 8-bit immediate is EXPANDED, not broadcast: // each bit of imm8 becomes all-ones-or-all-zeros for the corresponding BYTE of the 64-bit // lane. So imm8 = 0b10101010 → 0xFF00FF00_FF00FF00. This is the only way to load an // arbitrary-per-byte-mask constant in a single instruction. MOVI Dd, #imm // Scalar 64-bit (imm built from 8-bit pattern as above) MOVI V0.2D, #imm // Two 64-bit lanes, same expanded pattern in each // === MVNI — Move Inverted Immediate === // Same encoding as MOVI, but the final value is BITWISE-INVERTED before broadcasting. 
// Arrangements are halfword and word ONLY (no byte form, no 2D: for those two, the inverse of
// every encodable pattern is itself encodable with MOVI, so an MVNI form would be redundant).
MVNI V0.4H, #imm8{, LSL #0|#8}
MVNI V0.8H, #imm8{, LSL #0|#8}
MVNI V0.2S, #imm8{, LSL #0|#8|#16|#24}
MVNI V0.4S, #imm8{, LSL #0|#8|#16|#24}
MVNI V0.2S, #imm8, MSL #8|#16    // MSL variants also exist (low bits filled with 0s after inversion — i.e. ones before)
MVNI V0.4S, #imm8, MSL #8|#16

// === ORR (immediate) vector — bitwise-OR destination with broadcast immediate ===
// DESTRUCTIVE: Vd is both source and destination. No byte or doubleword form.
ORR V0.4H, #imm8{, LSL #0|#8}            // Each 16-bit lane |= (imm8 << shift)
ORR V0.8H, #imm8{, LSL #0|#8}
ORR V0.2S, #imm8{, LSL #0|#8|#16|#24}    // Each 32-bit lane |= (imm8 << shift)
ORR V0.4S, #imm8{, LSL #0|#8|#16|#24}

// === BIC (immediate) vector — bitwise AND-NOT with broadcast immediate ===
// DESTRUCTIVE. Used to clear specific bits in every lane in a single instruction.
BIC V0.4H, #imm8{, LSL #0|#8}            // Each 16-bit lane &= ~(imm8 << shift)
BIC V0.8H, #imm8{, LSL #0|#8}
BIC V0.2S, #imm8{, LSL #0|#8|#16|#24}
BIC V0.4S, #imm8{, LSL #0|#8|#16|#24}
```

**Key rules:**
- Arrangement determines which shift amounts are legal: `.4H/.8H` → `LSL #{0,8}`; `.2S/.4S` → `LSL #{0,8,16,24}`.
- `MSL` is **only** valid for `.2S/.4S` and only with amounts `#8` or `#16`.
- `.8B/.16B` takes **only** the unshifted imm8.
- `.2D` (and the `Dd` scalar) takes the **expanded 8-bit → 64-bit byte-mask** encoding — not a broadcast.
- `ORR` and `BIC` immediate are **destructive** (no separate Vn source — Vd is read and written).
- `MVNI` has no `.8B/.16B` form and no `.2D` form — byte/doubleword inversion uses `MOVI` with an inverted literal, or a register-form `MVN`.
- Compilers typically try the MOVI encoding first; if the constant doesn't fit, they try MVNI, then a short sequence such as MOVI followed by ORR/BIC (immediate), and otherwise fall back to loading the constant from a literal pool (usually `ADRP`+`LDR`, or a PC-relative `LDR` literal). The exact order is a toolchain heuristic, not an architectural rule.
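
To make the key rules concrete, here is a minimal sketch that builds a few constants in one or two instructions each. The register choices are arbitrary and the label `build_constants` is a hypothetical name used only for illustration; the lane values in the comments follow directly from the shift, MSL, and byte-mask rules above.

```asm
    .text
    .global build_constants          // hypothetical label, not part of the reference
build_constants:
    MOVI V0.4S, #0x12, LSL #8        // each 32-bit lane = 0x00001200
    ORR  V0.4S, #0x34                // lane |= 0x34 (LSL #0 implied) -> 0x00001234
    MOVI V1.4S, #0x12, MSL #8        // each lane = (0x12 << 8) | 0xFF = 0x000012FF
    MVNI V2.4S, #0x0F                // each lane = ~0x0000000F = 0xFFFFFFF0
    MOVI V3.2D, #0xFF00FF00FF00FF00  // byte-mask form: imm8 = 0b10101010 expanded per byte
    BIC  V5.4S, #0xFF, LSL #24       // clear the top byte of every 32-bit lane already in V5
    RET
```

The `MOVI`+`ORR` pair shows how a 16-bit pattern that no single `MOVI` can encode still fits in two instructions, which is the case a compiler checks before resorting to a literal-pool load.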
### 23.3 Element-wise Arithmetic — Add/Sub, FP Min/Max, Unary, Reciprocal, Rounding Per-lane add and subtract, plus the floating-point element-wise family: min/max, `FABS`/`FNEG`/`FSQRT`, reciprocal and reciprocal-square-root estimates, and round-to-integral. Every FP form here uses the same arrangement set as `FADD`. ```asm // Vector add/sub ADD V0.8B, V1.8B, V2.8B // 8× 8-bit integer add ADD V0.16B, V1.16B, V2.16B // 16× 8-bit ADD V0.4H, V1.4H, V2.4H // 4× 16-bit ADD V0.8H, V1.8H, V2.8H // 8× 16-bit ADD V0.2S, V1.2S, V2.2S // 2× 32-bit ADD V0.4S, V1.4S, V2.4S // 4× 32-bit integer add ADD V0.2D, V1.2D, V2.2D // 2× 64-bit integer add ADD Dd, Dn, Dm // Scalar 64-bit integer add (SIMD scalar encoding) SUB V0.8B, V1.8B, V2.8B // Same arrangement set as ADD SUB V0.16B, V1.16B, V2.16B SUB V0.4H, V1.4H, V2.4H SUB V0.8H, V1.8H, V2.8H SUB V0.2S, V1.2S, V2.2S SUB V0.4S, V1.4S, V2.4S SUB V0.2D, V1.2D, V2.2D SUB Dd, Dn, Dm // Scalar 64-bit integer subtract FADD V0.2S, V1.2S, V2.2S // FP vector add (baseline .2S/.4S/.2D) FADD V0.4S, V1.4S, V2.4S FADD V0.2D, V1.2D, V2.2D FADD V0.4H, V1.4H, V2.4H // FEAT_FP16 FADD V0.8H, V1.8H, V2.8H // FEAT_FP16 FSUB V0.2S, V1.2S, V2.2S // FP vector subtract — same arrangement set as FADD FSUB V0.4S, V1.4S, V2.4S FSUB V0.2D, V1.2D, V2.2D FSUB V0.4H, V1.4H, V2.4H // FEAT_FP16 FSUB V0.8H, V1.8H, V2.8H // FEAT_FP16 FDIV V0.2S, V1.2S, V2.2S // FP vector divide — all FADD arrangements FDIV V0.4S, V1.4S, V2.4S FDIV V0.2D, V1.2D, V2.2D FDIV V0.4H, V1.4H, V2.4H // FEAT_FP16 FDIV V0.8H, V1.8H, V2.8H // FEAT_FP16 FABD V0.2S, V1.2S, V2.2S // FP absolute difference (|V1[i] - V2[i]| per lane) FABD V0.4S, V1.4S, V2.4S FABD V0.2D, V1.2D, V2.2D FABD V0.4H, V1.4H, V2.4H // FEAT_FP16 FABD V0.8H, V1.8H, V2.8H // FEAT_FP16 FABD Sd, Sn, Sm // Scalar FABD FABD Dd, Dn, Dm FABD Hd, Hn, Hm // FEAT_FP16 // FP per-lane min/max. Four mnemonics × same arrangement set as FADD. 
// IEEE-754 vs NumericMax/NumericMin distinction mirrors the scalar FMAX/FMAXNM rule:
//   FMAX / FMIN     — if either operand is NaN, result is NaN (this matches the IEEE 754-2019
//                     maximum/minimum operations; IEEE 754-2008 did not define them).
//   FMAXNM / FMINNM — if one operand is a quiet NaN and the other is a number, return the number
//                     (IEEE 754-2008 maxNum/minNum). This is what you want for reductions that
//                     should ignore NaNs in the input.
FMAX   V0.2S, V1.2S, V2.2S    // Per-lane FP max (IEEE: NaN propagates)
FMAX   V0.4S, V1.4S, V2.4S
FMAX   V0.2D, V1.2D, V2.2D
FMAX   V0.4H, V1.4H, V2.4H    // FEAT_FP16
FMAX   V0.8H, V1.8H, V2.8H    // FEAT_FP16
FMIN   V0.2S, V1.2S, V2.2S    // Per-lane FP min (IEEE: NaN propagates) — same arrangements
FMIN   V0.4S, V1.4S, V2.4S
FMIN   V0.2D, V1.2D, V2.2D
FMIN   V0.4H, V1.4H, V2.4H    // FEAT_FP16
FMIN   V0.8H, V1.8H, V2.8H    // FEAT_FP16
FMAXNM V0.2S, V1.2S, V2.2S    // Per-lane NaN-suppressing max (returns number when one operand is QNaN)
FMAXNM V0.4S, V1.4S, V2.4S
FMAXNM V0.2D, V1.2D, V2.2D
FMAXNM V0.4H, V1.4H, V2.4H    // FEAT_FP16
FMAXNM V0.8H, V1.8H, V2.8H    // FEAT_FP16
FMINNM V0.2S, V1.2S, V2.2S    // Per-lane NaN-suppressing min
FMINNM V0.4S, V1.4S, V2.4S
FMINNM V0.2D, V1.2D, V2.2D
FMINNM V0.4H, V1.4H, V2.4H    // FEAT_FP16
FMINNM V0.8H, V1.8H, V2.8H    // FEAT_FP16

// Absolute-max / Absolute-min (FEAT_FAMINMAX, an optional extension available from Armv9.2-A;
// detect it at runtime rather than inferring it from the architecture version).
// Semantics: FAMAX(a, b) = FMAX(|a|, |b|); FAMIN(a, b) = FMIN(|a|, |b|). NaN/Inf propagation
// matches FMAX/FMIN (not FMAXNM/FMINNM), so a quiet NaN on either input gives a NaN output.
// **ISA-level truth**: this is a distinct encoding, not an FABS+FMAX fusion pseudo-op — the
// compiler emits it when recognizing the pattern AND the target has +faminmax enabled.
FAMAX V0.4H, V1.4H, V2.4H     // FP16 — FEAT_FP16 + FEAT_FAMINMAX
FAMAX V0.8H, V1.8H, V2.8H
FAMAX V0.2S, V1.2S, V2.2S     // FP32
FAMAX V0.4S, V1.4S, V2.4S
FAMAX V0.2D, V1.2D, V2.2D     // FP64 (128-bit Q only; no 1-element .1D form)
FAMIN V0.4H, V1.4H, V2.4H     // Same arrangement set as FAMAX
FAMIN V0.8H, V1.8H, V2.8H
FAMIN V0.2S, V1.2S, V2.2S
FAMIN V0.4S, V1.4S, V2.4S
FAMIN V0.2D, V1.2D, V2.2D

// FP vector unary (FABS / FNEG / FSQRT) — every arrangement FADD supports:
FABS  V0.2S, V1.2S
FABS  V0.4S, V1.4S
FABS  V0.2D, V1.2D
FABS  V0.4H, V1.4H            // FEAT_FP16
FABS  V0.8H, V1.8H            // FEAT_FP16
FNEG  V0.2S, V1.2S            // Same arrangements as FABS
FNEG  V0.4S, V1.4S
FNEG  V0.2D, V1.2D
FNEG  V0.4H, V1.4H            // FEAT_FP16
FNEG  V0.8H, V1.8H            // FEAT_FP16
FSQRT V0.2S, V1.2S            // Same arrangements as FABS
FSQRT V0.4S, V1.4S
FSQRT V0.2D, V1.2D
FSQRT V0.4H, V1.4H            // FEAT_FP16
FSQRT V0.8H, V1.8H            // FEAT_FP16

// FP reciprocal / reciprocal-sqrt estimate + Newton-Raphson step (vector forms):
FRECPE  V0.2S, V1.2S          // Reciprocal estimate (per-lane, ~8-bit precision)
FRECPE  V0.4S, V1.4S
FRECPE  V0.2D, V1.2D
FRECPE  V0.4H, V1.4H          // FEAT_FP16
FRECPE  V0.8H, V1.8H          // FEAT_FP16
FRECPS  V0.2S, V1.2S, V2.2S   // Reciprocal NR step: V0 = 2 − (V1 * V2)
FRECPS  V0.4S, V1.4S, V2.4S
FRECPS  V0.2D, V1.2D, V2.2D
FRECPS  V0.4H, V1.4H, V2.4H   // FEAT_FP16
FRECPS  V0.8H, V1.8H, V2.8H   // FEAT_FP16
FRSQRTE V0.2S, V1.2S          // Reciprocal-sqrt estimate
FRSQRTE V0.4S, V1.4S
FRSQRTE V0.2D, V1.2D
FRSQRTE V0.4H, V1.4H          // FEAT_FP16
FRSQRTE V0.8H, V1.8H          // FEAT_FP16
FRSQRTS V0.2S, V1.2S, V2.2S   // Reciprocal-sqrt NR step: V0 = (3 − V1 * V2) / 2
FRSQRTS V0.4S, V1.4S, V2.4S
FRSQRTS V0.2D, V1.2D, V2.2D
FRSQRTS V0.4H, V1.4H, V2.4H   // FEAT_FP16
FRSQRTS V0.8H, V1.8H, V2.8H   // FEAT_FP16
FRECPX  Sd, Sn                // Scalar reciprocal eXponent (the X is for exponent, not "exact"):
                              // keeps the sign, inverts the exponent field, zeros the fraction
FRECPX  Dd, Dn
FRECPX  Hd, Hn                // FEAT_FP16

// FP round-to-integral (no conversion to integer register — result stays in FP register).
// Rounding-mode letter meanings: // A = nearest, ties Away from zero M = Minus-infinity (floor) // I = current FPCR rounding mode N = Nearest, ties to eveN // P = Plus-infinity (ceiling) X = current FPCR mode, signal inexact // Z = toward Zero (truncate) FRINTA V0.2S, V1.2S // Same arrangement set as FABS (all: .2S .4S .2D .4H .8H) FRINTA V0.4S, V1.4S FRINTA V0.2D, V1.2D FRINTA V0.4H, V1.4H // FEAT_FP16 FRINTA V0.8H, V1.8H // FEAT_FP16 FRINTI V0.2S, V1.2S // Same arrangements as FRINTA FRINTI V0.4S, V1.4S FRINTI V0.2D, V1.2D FRINTI V0.4H, V1.4H // FEAT_FP16 FRINTI V0.8H, V1.8H // FEAT_FP16 FRINTM V0.2S, V1.2S // Floor (round toward −∞) FRINTM V0.4S, V1.4S FRINTM V0.2D, V1.2D FRINTM V0.4H, V1.4H // FEAT_FP16 FRINTM V0.8H, V1.8H // FEAT_FP16 FRINTN V0.2S, V1.2S // Round to nearest, ties to even FRINTN V0.4S, V1.4S FRINTN V0.2D, V1.2D FRINTN V0.4H, V1.4H // FEAT_FP16 FRINTN V0.8H, V1.8H // FEAT_FP16 FRINTP V0.2S, V1.2S // Ceiling (round toward +∞) FRINTP V0.4S, V1.4S FRINTP V0.2D, V1.2D FRINTP V0.4H, V1.4H // FEAT_FP16 FRINTP V0.8H, V1.8H // FEAT_FP16 FRINTX V0.2S, V1.2S // Current FPCR mode, signal inexact FRINTX V0.4S, V1.4S FRINTX V0.2D, V1.2D FRINTX V0.4H, V1.4H // FEAT_FP16 FRINTX V0.8H, V1.8H // FEAT_FP16 FRINTZ V0.2S, V1.2S // Toward zero (truncate) FRINTZ V0.4S, V1.4S FRINTZ V0.2D, V1.2D FRINTZ V0.4H, V1.4H // FEAT_FP16 FRINTZ V0.8H, V1.8H // FEAT_FP16 // Scalar forms: FRINTA Sd, Sn | FRINTA Dd, Dn | FRINTA Hd, Hn (FEAT_FP16) — same for I/M/N/P/X/Z. ``` ### 23.4 Multiply & Multiply-Accumulate (including by-element) Per-lane multiply and multiply-accumulate. The *by-element* forms broadcast one lane of `Vm` across the whole vector — the backbone of matrix and filter kernels. 
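
As a quick illustration of why the by-element forms matter, the sketch below multiplies a 4×4 single-precision matrix by a 4-element vector using one `FMUL` and three `FMLA` by-element instructions. It is a minimal sketch under stated assumptions: the label `mat4_mul_vec4`, the argument order (matrix in `X0`, vector in `X1`, output in `X2`), and the column-major matrix layout are all hypothetical choices made for this example.

```asm
    .text
    .global mat4_mul_vec4                      // hypothetical label
// void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
// Assumption: m is stored column-major (four consecutive floats per column).
mat4_mul_vec4:
    LD1  {V0.4S, V1.4S, V2.4S, V3.4S}, [X0]    // V0..V3 = columns 0..3 of the matrix
    LDR  Q4, [X1]                              // V4 = {v0, v1, v2, v3}
    FMUL V5.4S, V0.4S, V4.S[0]                 // acc  = column0 * v0 (lane 0 broadcast)
    FMLA V5.4S, V1.4S, V4.S[1]                 // acc += column1 * v1
    FMLA V5.4S, V2.4S, V4.S[2]                 // acc += column2 * v2
    FMLA V5.4S, V3.4S, V4.S[3]                 // acc += column3 * v3
    STR  Q5, [X2]                              // store the 4-element result
    RET
```

No separate `DUP` is needed to broadcast each vector element; the lane index on the last operand does that work, which is the main saving in matrix and filter kernels.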
```asm // Vector multiply MUL V0.8B, V1.8B, V2.8B // 8× 8-bit integer multiply MUL V0.16B, V1.16B, V2.16B // 16× 8-bit MUL V0.4H, V1.4H, V2.4H // 4× 16-bit MUL V0.8H, V1.8H, V2.8H // 8× 16-bit MUL V0.2S, V1.2S, V2.2S // 2× 32-bit MUL V0.4S, V1.4S, V2.4S // 4× 32-bit integer multiply (low 32 bits of each product) // No .1D or .2D — baseline NEON has no 64-bit integer multiply; use SMULL/UMULL widening instead FMUL V0.2S, V1.2S, V2.2S // 2× single-precision FP multiply (64-bit vector form) FMUL V0.4S, V1.4S, V2.4S // 4× single-precision FP multiply FMUL V0.2D, V1.2D, V2.2D // 2× double-precision FP multiply FMUL V0.4H, V1.4H, V2.4H // 4× half-precision (FEAT_FP16) FMUL V0.8H, V1.8H, V2.8H // 8× half-precision (FEAT_FP16) FMULX V0.2S, V1.2S, V2.2S // Multiply eXtended: 0×∞ → ±2.0 instead of NaN FMULX V0.4S, V1.4S, V2.4S FMULX V0.2D, V1.2D, V2.2D FMULX V0.4H, V1.4H, V2.4H // FEAT_FP16 FMULX V0.8H, V1.8H, V2.8H // FEAT_FP16 FMULX Hd, Hn, Hm // Scalar FP16 (FEAT_FP16) FMULX Sd, Sn, Sm // Scalar single FMULX Dd, Dn, Dm // Scalar double // Vector fused multiply-accumulate (CRITICAL for performance — used everywhere by compilers): FMLA V0.2S, V1.2S, V2.2S // V0 += V1 * V2 (per-lane, fused) FMLA V0.4S, V1.4S, V2.4S // 4× single-precision FMLA V0.2D, V1.2D, V2.2D // 2× double-precision FMLA V0.4H, V1.4H, V2.4H // 4× half-precision (FEAT_FP16) FMLA V0.8H, V1.8H, V2.8H // 8× half-precision (FEAT_FP16) FMLS V0.2S, V1.2S, V2.2S // V0 -= V1 * V2 (per-lane, fused subtract) — all arrangements as FMLA FMLS V0.4S, V1.4S, V2.4S FMLS V0.2D, V1.2D, V2.2D FMLS V0.4H, V1.4H, V2.4H // FEAT_FP16 FMLS V0.8H, V1.8H, V2.8H // FEAT_FP16 // FMLA is the single most important NEON instruction for numerical code — matrix multiply, // convolution, FIR filters, and physics all reduce to FMLA loops. // BY-ELEMENT form — broadcast one lane of Vm across the multiplication. Essential for matmul // because you can multiply an entire row by a single scalar in one instruction. 
// IMPORTANT — Vm register encoding constraint for .H by-element:
// For .H element-type by-element (FP16 and integer halfword), Vm uses only 4 bits of
// the encoding (Rm<3:0>), so Vm is restricted to V0..V15. The bit that would otherwise
// be Rm<4> (the M field) becomes part of the lane index instead (index = H:L:M).
// For .S and .D by-element, Vm uses the full 5-bit field, so Vm can be V0..V31.
// Assemblers enforce this.
FMLA  V0.2S, V1.2S, V2.S[i]   // i ∈ 0..3; Vm can be V0..V31
FMLA  V0.4S, V1.4S, V2.S[i]
FMLA  V0.2D, V1.2D, V2.D[i]   // i ∈ 0..1; Vm can be V0..V31
FMLA  V0.4H, V1.4H, V2.H[i]   // i ∈ 0..7; Vm restricted to V0..V15 (FEAT_FP16)
FMLA  V0.8H, V1.8H, V2.H[i]   // i ∈ 0..7; Vm restricted to V0..V15 (FEAT_FP16)
FMLS  V0.2S, V1.2S, V2.S[i]   // Same arrangement set as FMLA
FMLS  V0.4S, V1.4S, V2.S[i]
FMLS  V0.2D, V1.2D, V2.D[i]
FMLS  V0.4H, V1.4H, V2.H[i]   // Vm ∈ V0..V15 (FEAT_FP16)
FMLS  V0.8H, V1.8H, V2.H[i]   // Vm ∈ V0..V15 (FEAT_FP16)
FMUL  V0.2S, V1.2S, V2.S[i]
FMUL  V0.4S, V1.4S, V2.S[i]
FMUL  V0.2D, V1.2D, V2.D[i]
FMUL  V0.4H, V1.4H, V2.H[i]   // Vm ∈ V0..V15 (FEAT_FP16)
FMUL  V0.8H, V1.8H, V2.H[i]   // Vm ∈ V0..V15 (FEAT_FP16)
FMULX V0.2S, V1.2S, V2.S[i]   // FMULX by-element — same arrangement set as FMUL/FMLA
FMULX V0.4S, V1.4S, V2.S[i]
FMULX V0.2D, V1.2D, V2.D[i]
FMULX V0.4H, V1.4H, V2.H[i]   // Vm ∈ V0..V15 (FEAT_FP16)
FMULX V0.8H, V1.8H, V2.H[i]   // Vm ∈ V0..V15 (FEAT_FP16)

// Integer by-element forms (MUL/MLA/MLS) — also subject to Vm V0..V15 for .H:
MUL V0.4H, V1.4H, V2.H[i]     // i ∈ 0..7; Vm ∈ V0..V15
MUL V0.8H, V1.8H, V2.H[i]     // Vm ∈ V0..V15
MUL V0.2S, V1.2S, V2.S[i]     // i ∈ 0..3; Vm ∈ V0..V31
MUL V0.4S, V1.4S, V2.S[i]

// 3-register MLA/MLS (per-lane multiply-accumulate / multiply-subtract).
// **ISA-level truth** — arrangements: .8B/.16B/.4H/.8H/.2S/.4S (NO .2D, same as MUL).
MLA V0.8B, V1.8B, V2.8B // V0 += V1 * V2 (per-lane, low bits) MLA V0.16B, V1.16B, V2.16B MLA V0.4H, V1.4H, V2.4H MLA V0.8H, V1.8H, V2.8H MLA V0.2S, V1.2S, V2.2S MLA V0.4S, V1.4S, V2.4S MLS V0.8B, V1.8B, V2.8B // V0 -= V1 * V2 (per-lane, low bits) — same arrangement set MLS V0.16B, V1.16B, V2.16B MLS V0.4H, V1.4H, V2.4H MLS V0.8H, V1.8H, V2.8H MLS V0.2S, V1.2S, V2.2S MLS V0.4S, V1.4S, V2.4S // By-element MLA / MLS — broadcast one lane of Vm. Same halfword/word arrangements as by-element MUL: MLA V0.4H, V1.4H, V2.H[i] // V0 += V1 * broadcast(V2[i]); Vm ∈ V0..V15 MLA V0.8H, V1.8H, V2.H[i] MLA V0.2S, V1.2S, V2.S[i] MLA V0.4S, V1.4S, V2.S[i] MLS V0.4H, V1.4H, V2.H[i] // V0 -= V1 * broadcast(V2[i]); Vm ∈ V0..V15 MLS V0.8H, V1.8H, V2.H[i] MLS V0.2S, V1.2S, V2.S[i] MLS V0.4S, V1.4S, V2.S[i] // SCALAR by-element forms — multiply a scalar FP register by a single lane of a vector. // These are "advanced SIMD scalar" encodings; they produce a scalar result but pick one // lane of the vector as the second operand. Same Vm V0..V15 restriction for Hd form. FMLA Sd, Sn, Vm.S[i] // Sd += Sn * Vm[i] (scalar by-element); Vm can be V0..V31 FMLA Dd, Dn, Vm.D[i] // Dd += Dn * Vm[i] (i ∈ {0,1}); Vm V0..V31 FMLA Hd, Hn, Vm.H[i] // FEAT_FP16; Vm restricted to V0..V15 FMLS Sd, Sn, Vm.S[i] FMLS Dd, Dn, Vm.D[i] FMLS Hd, Hn, Vm.H[i] // FEAT_FP16; Vm ∈ V0..V15 FMUL Sd, Sn, Vm.S[i] FMUL Dd, Dn, Vm.D[i] FMUL Hd, Hn, Vm.H[i] // FEAT_FP16; Vm ∈ V0..V15 FMULX Sd, Sn, Vm.S[i] // FMULX scalar by-element FMULX Dd, Dn, Vm.D[i] FMULX Hd, Hn, Vm.H[i] // FEAT_FP16; Vm ∈ V0..V15 // The scalar result zeros the upper lanes of Vd (scalar-mnemonic rule from §1.3). ``` ### 23.5 Widening, Narrowing & Long/Wide Operations Operations that change element width: widening (narrow → wide, overflow-free), wide (wide + narrow), narrowing (wide → narrow, dropping or saturating the result), and the saturating-doubling long-multiply MAC used in Q-format DSP. 
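
A short sketch of the widening idea before the full listing: a dot product of two 16-byte unsigned arrays, where each step widens just enough that no intermediate value can overflow. The label `dot_u8x16` and the argument registers (pointers in `X0` and `X1`, result in `W0`) are assumptions made for this example. It also borrows `UADDLP`/`UADALP` and `ADDV`, which are pairwise and reduction instructions rather than part of this section.

```asm
    .text
    .global dot_u8x16                   // hypothetical label
// uint32_t dot_u8x16(const uint8_t a[16], const uint8_t b[16])
dot_u8x16:
    LDR    Q0, [X0]                     // V0 = a[0..15]
    LDR    Q1, [X1]                     // V1 = b[0..15]
    UMULL  V2.8H, V0.8B,  V1.8B         // products of bytes 0..7  (255*255 = 65025 fits in 16 bits)
    UMULL2 V3.8H, V0.16B, V1.16B        // products of bytes 8..15 (upper halves of V0 and V1)
    UADDLP V2.4S, V2.8H                 // widen to 32 bits while adding adjacent halfword pairs
    UADALP V2.4S, V3.8H                 // accumulate the upper-half products the same way
    ADDV   S2, V2.4S                    // horizontal sum of the four 32-bit partial sums
    FMOV   W0, S2                       // move the result to the integer return register
    RET
```

Using `UMLAL2` to accumulate the second set of products directly into the 16-bit lanes would be shorter, but two products can sum to more than 65535, so this sketch widens to 32 bits before accumulating.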
```asm
// Widening operations (narrow inputs → wider outputs, no overflow possible)
// The "L" stands for "Long" — the result is longer than the inputs.
SMULL  V0.8H, V1.8B,  V2.8B    // 8× signed 8→16 multiply
SMULL2 V0.8H, V1.16B, V2.16B   // Upper half of inputs (16B → use bytes 8..15)
SMULL  V0.4S, V1.4H,  V2.4H    // 4× signed 16→32 multiply (lower 4 lanes of input)
SMULL2 V0.4S, V1.8H,  V2.8H    // Same but upper 4 lanes of input
SMULL  V0.2D, V1.2S,  V2.2S    // 2× signed 32→64 multiply
SMULL2 V0.2D, V1.4S,  V2.4S    // (the "2" suffix means "use the upper half of the source registers")
UMULL  V0.8H, V1.8B,  V2.8B    // Unsigned widening multiply — same arrangement set as SMULL
UMULL2 V0.8H, V1.16B, V2.16B
UMULL  V0.4S, V1.4H,  V2.4H
UMULL2 V0.4S, V1.8H,  V2.8H
UMULL  V0.2D, V1.2S,  V2.2S
UMULL2 V0.2D, V1.4S,  V2.4S

// Widening multiply + accumulate/subtract (products are exact, accumulator is wide; integer
// arithmetic involves no rounding, so this is only loosely the integer analog of FMA):
SMLAL  V0.8H, V1.8B,  V2.8B    // V0 += sign_extend(V1) * sign_extend(V2) (lanewise widening MAC)
SMLAL2 V0.8H, V1.16B, V2.16B
SMLAL  V0.4S, V1.4H,  V2.4H
SMLAL2 V0.4S, V1.8H,  V2.8H
SMLAL  V0.2D, V1.2S,  V2.2S
SMLAL2 V0.2D, V1.4S,  V2.4S
UMLAL  V0.8H, V1.8B,  V2.8B    // Unsigned widening multiply-accumulate — same set
UMLAL2 V0.8H, V1.16B, V2.16B
UMLAL  V0.4S, V1.4H,  V2.4H
UMLAL2 V0.4S, V1.8H,  V2.8H
UMLAL  V0.2D, V1.2S,  V2.2S
UMLAL2 V0.2D, V1.4S,  V2.4S
SMLSL  V0.8H, V1.8B,  V2.8B    // V0 -= sign_extend(V1) * sign_extend(V2) — same arrangement set
SMLSL2 V0.8H, V1.16B, V2.16B
SMLSL  V0.4S, V1.4H,  V2.4H
SMLSL2 V0.4S, V1.8H,  V2.8H
SMLSL  V0.2D, V1.2S,  V2.2S
SMLSL2 V0.2D, V1.4S,  V2.4S
UMLSL  V0.8H, V1.8B,  V2.8B
UMLSL2 V0.8H, V1.16B, V2.16B
UMLSL  V0.4S, V1.4H,  V2.4H
UMLSL2 V0.4S, V1.8H,  V2.8H
UMLSL  V0.2D, V1.2S,  V2.2S
UMLSL2 V0.2D, V1.4S,  V2.4S

// By-element forms of widening MUL/MLA/MLS also exist (H→S and S→D only — no B→H by-element
// because the by-element encodings define no byte element size).
// Same Vm V0..V15 restriction when Vm element is .H: SMULL V0.4S, V1.4H, V2.H[i] // i ∈ 0..7; Vm restricted to V0..V15 SMULL2 V0.4S, V1.8H, V2.H[i] SMULL V0.2D, V1.2S, V2.S[i] // i ∈ 0..3; Vm can be V0..V31 SMULL2 V0.2D, V1.4S, V2.S[i] UMULL V0.4S, V1.4H, V2.H[i] // Vm ∈ V0..V15 UMULL2 V0.4S, V1.8H, V2.H[i] UMULL V0.2D, V1.2S, V2.S[i] UMULL2 V0.2D, V1.4S, V2.S[i] SMLAL V0.4S, V1.4H, V2.H[i] // Widening multiply-accumulate by one lane SMLAL2 V0.4S, V1.8H, V2.H[i] SMLAL V0.2D, V1.2S, V2.S[i] SMLAL2 V0.2D, V1.4S, V2.S[i] SMLSL V0.4S, V1.4H, V2.H[i] // Widening multiply-subtract by one lane SMLSL2 V0.4S, V1.8H, V2.H[i] SMLSL V0.2D, V1.2S, V2.S[i] SMLSL2 V0.2D, V1.4S, V2.S[i] UMLAL V0.4S, V1.4H, V2.H[i] UMLAL2 V0.4S, V1.8H, V2.H[i] UMLAL V0.2D, V1.2S, V2.S[i] UMLAL2 V0.2D, V1.4S, V2.S[i] UMLSL V0.4S, V1.4H, V2.H[i] UMLSL2 V0.4S, V1.8H, V2.H[i] UMLSL V0.2D, V1.2S, V2.S[i] UMLSL2 V0.2D, V1.4S, V2.S[i] // Saturating Doubling widening multiply (SQDMULL, SQDMLAL, SQDMLSL) — Q-format MAC with // widening: narrow signed inputs, doubled and saturated into wider lanes. Used in Q15/Q31 // DSP where you want the wide accumulator to avoid early saturation. 
SQDMULL V0.4S, V1.4H, V2.4H // Q15 × Q15 → Q31 (per-lane, doubled, saturated) SQDMULL2 V0.4S, V1.8H, V2.8H SQDMULL V0.2D, V1.2S, V2.2S // Q31 × Q31 → Q63 (saturating) SQDMULL2 V0.2D, V1.4S, V2.4S SQDMULL V0.4S, V1.4H, V2.H[i] // By-element; Vm ∈ V0..V15 for .H SQDMULL2 V0.4S, V1.8H, V2.H[i] SQDMULL V0.2D, V1.2S, V2.S[i] // Vm can be V0..V31 for .S SQDMULL2 V0.2D, V1.4S, V2.S[i] SQDMLAL V0.4S, V1.4H, V2.4H // Widening saturating doubling MAC SQDMLAL2 V0.4S, V1.8H, V2.8H SQDMLAL V0.2D, V1.2S, V2.2S SQDMLAL2 V0.2D, V1.4S, V2.4S SQDMLAL V0.4S, V1.4H, V2.H[i] // By-element MAC SQDMLAL2 V0.4S, V1.8H, V2.H[i] SQDMLAL V0.2D, V1.2S, V2.S[i] SQDMLAL2 V0.2D, V1.4S, V2.S[i] SQDMLSL V0.4S, V1.4H, V2.4H // Widening saturating doubling multiply-subtract SQDMLSL2 V0.4S, V1.8H, V2.8H SQDMLSL V0.2D, V1.2S, V2.2S SQDMLSL2 V0.2D, V1.4S, V2.4S SQDMLSL V0.4S, V1.4H, V2.H[i] SQDMLSL2 V0.4S, V1.8H, V2.H[i] SQDMLSL V0.2D, V1.2S, V2.S[i] SQDMLSL2 V0.2D, V1.4S, V2.S[i] // Scalar widening saturating Q-multiply family — two widening pairs (.H→.S, .S→.D): SQDMULL Sd, Hn, Hm // Scalar: Q15 × Q15 → Q31 (Sd result, doubled, saturated) SQDMULL Dd, Sn, Sm // Scalar: Q31 × Q31 → Q63 SQDMULL Sd, Hn, Vm.H[i] // Scalar by-element; Vm ∈ V0..V15 for .H (i ∈ 0..7) SQDMULL Dd, Sn, Vm.S[i] // Scalar by-element; Vm ∈ V0..V31 for .S (i ∈ 0..3) SQDMLAL Sd, Hn, Hm // Scalar widening saturating MAC: Sd += 2 × (Hn × Hm), saturated SQDMLAL Dd, Sn, Sm SQDMLAL Sd, Hn, Vm.H[i] // Scalar by-element MAC SQDMLAL Dd, Sn, Vm.S[i] SQDMLSL Sd, Hn, Hm // Scalar widening saturating multiply-subtract SQDMLSL Dd, Sn, Sm SQDMLSL Sd, Hn, Vm.H[i] // Scalar by-element multiply-subtract SQDMLSL Dd, Sn, Vm.S[i] // Widening add/sub (narrow + narrow → wide). **ISA-level truth** — three widening arrangement pairs // for each: .8H←.8B/.8B (8→16), .4S←.4H/.4H (16→32), .2D←.2S/.2S (32→64). Plus "2"-suffixed // variants that read the upper half of 128-bit input vectors. 
SADDL  V0.8H, V1.8B,  V2.8B     // 8→16 signed widening add
SADDL  V0.4S, V1.4H,  V2.4H     // 16→32
SADDL  V0.2D, V1.2S,  V2.2S     // 32→64
SADDL2 V0.8H, V1.16B, V2.16B    // Upper-half variants
SADDL2 V0.4S, V1.8H,  V2.8H
SADDL2 V0.2D, V1.4S,  V2.4S
UADDL  V0.8H, V1.8B,  V2.8B     // Unsigned widening add — same arrangement set
UADDL  V0.4S, V1.4H,  V2.4H
UADDL  V0.2D, V1.2S,  V2.2S
UADDL2 V0.8H, V1.16B, V2.16B
UADDL2 V0.4S, V1.8H,  V2.8H
UADDL2 V0.2D, V1.4S,  V2.4S
SSUBL  V0.8H, V1.8B,  V2.8B     // Signed widening subtract — same arrangement set
SSUBL  V0.4S, V1.4H,  V2.4H
SSUBL  V0.2D, V1.2S,  V2.2S
SSUBL2 V0.8H, V1.16B, V2.16B
SSUBL2 V0.4S, V1.8H,  V2.8H
SSUBL2 V0.2D, V1.4S,  V2.4S
USUBL  V0.8H, V1.8B,  V2.8B     // Unsigned widening subtract — same arrangement set
USUBL  V0.4S, V1.4H,  V2.4H
USUBL  V0.2D, V1.2S,  V2.2S
USUBL2 V0.8H, V1.16B, V2.16B
USUBL2 V0.4S, V1.8H,  V2.8H
USUBL2 V0.2D, V1.4S,  V2.4S

// Wide add/sub (wide + narrow → wide) — extends Rm before adding to wide Rn.
// Three arrangement triples: wide = {.8H, .4S, .2D}, Rm narrow = {.8B/.16B, .4H/.8H, .2S/.4S}.
SADDW  V0.8H, V1.8H, V2.8B      // V0 = V1 + sign_extend(V2) (byte narrow source)
SADDW  V0.4S, V1.4S, V2.4H      // halfword narrow source
SADDW  V0.2D, V1.2D, V2.2S      // word narrow source
SADDW2 V0.8H, V1.8H, V2.16B     // "2" — upper half of narrow source
SADDW2 V0.4S, V1.4S, V2.8H
SADDW2 V0.2D, V1.2D, V2.4S
UADDW  V0.8H, V1.8H, V2.8B      // Unsigned wide add — same triple
UADDW  V0.4S, V1.4S, V2.4H
UADDW  V0.2D, V1.2D, V2.2S
UADDW2 V0.8H, V1.8H, V2.16B
UADDW2 V0.4S, V1.4S, V2.8H
UADDW2 V0.2D, V1.2D, V2.4S
SSUBW  V0.8H, V1.8H, V2.8B      // Signed wide subtract — same triple
SSUBW  V0.4S, V1.4S, V2.4H
SSUBW  V0.2D, V1.2D, V2.2S
SSUBW2 V0.8H, V1.8H, V2.16B
SSUBW2 V0.4S, V1.4S, V2.8H
SSUBW2 V0.2D, V1.2D, V2.4S
USUBW  V0.8H, V1.8H, V2.8B      // Unsigned wide subtract — same triple
USUBW  V0.4S, V1.4S, V2.4H
USUBW  V0.2D, V1.2D, V2.2S
USUBW2 V0.8H, V1.8H, V2.16B
USUBW2 V0.4S, V1.4S, V2.8H
USUBW2 V0.2D, V1.2D, V2.4S

// Narrowing add/sub (wide + wide → narrow, takes the HIGH half — i.e. drops low bits).
// **ISA-level truth** — three arrangement pairs: .8B←.8H/.8H, .4H←.4S/.4S, .2S←.2D/.2D, plus
// "2"-variants writing the upper half of a 128-bit destination.
// Useful when only the most significant half of each sum matters, e.g. fixed-point values
// whose low half is fractional precision you are discarding. (For per-lane averages such as
// (a + b) >> 1, the halving adds UHADD/URHADD are the usual tool, not ADDHN.)
ADDHN   V0.8B,  V1.8H, V2.8H    // V0[i] = high_half(V1[i] + V2[i]) (byte narrow from halfword sum)
ADDHN   V0.4H,  V1.4S, V2.4S    // halfword narrow from word sum
ADDHN   V0.2S,  V1.2D, V2.2D    // word narrow from doubleword sum
ADDHN2  V0.16B, V1.8H, V2.8H    // "2" — writes into upper 64 bits of V0, preserves lower half
ADDHN2  V0.8H,  V1.4S, V2.4S
ADDHN2  V0.4S,  V1.2D, V2.2D
SUBHN   V0.8B,  V1.8H, V2.8H    // V0[i] = high_half(V1[i] - V2[i]) — same arrangement set
SUBHN   V0.4H,  V1.4S, V2.4S
SUBHN   V0.2S,  V1.2D, V2.2D
SUBHN2  V0.16B, V1.8H, V2.8H
SUBHN2  V0.8H,  V1.4S, V2.4S
SUBHN2  V0.4S,  V1.2D, V2.2D
RADDHN  V0.8B,  V1.8H, V2.8H    // Rounding variant: adds 1 << (narrow_bits − 1) before dropping the low bits
RADDHN  V0.4H,  V1.4S, V2.4S
RADDHN  V0.2S,  V1.2D, V2.2D
RADDHN2 V0.16B, V1.8H, V2.8H
RADDHN2 V0.8H,  V1.4S, V2.4S
RADDHN2 V0.4S,  V1.2D, V2.2D
RSUBHN  V0.8B,  V1.8H, V2.8H    // Rounding narrowing subtract
RSUBHN  V0.4H,  V1.4S, V2.4S
RSUBHN  V0.2S,  V1.2D, V2.2D
RSUBHN2 V0.16B, V1.8H, V2.8H
RSUBHN2 V0.8H,  V1.4S, V2.4S
RSUBHN2 V0.4S,  V1.2D, V2.2D

// Sign/zero-extend narrow lanes into wider lanes — essential for fixed-point pipelines.
// **ISA-level truth**: SXTL/UXTL are aliases for SSHLL/USHLL with shift amount = 0.
// Alias form: SXTL Vd.Twide, Vn.Tnarrow — accepts exactly the widening pairs
// (.8H,.8B) / (.4S,.4H) / (.2D,.2S). SXTL2 uses (.8H,.16B) / (.4S,.8H) / (.2D,.4S)
// taking the upper half of Vn. No shift operand — it's fixed at 0.
// Underlying: SSHLL Vd.Twide, Vn.Tnarrow, #shift // Underlying form: SSHLL accepts the same arrangement pairs, plus a shift in the range // 0 ≤ shift ≤ (element_size_of_Tnarrow − 1): // .8B → shift 0..7 ; .4H → shift 0..15 ; .2S → shift 0..31 // So SSHLL covers strictly more encodings than SXTL: every SXTL is an SSHLL (with shift=0), // but non-zero-shift SSHLLs have no SXTL spelling. SXTL V0.8H, V1.8B // Sign-extend 8 bytes → 8 halfwords (lower 8 lanes of input) SXTL V0.4S, V1.4H // Sign-extend 4 halfwords → 4 words SXTL V0.2D, V1.2S // Sign-extend 2 words → 2 doublewords SXTL2 V0.8H, V1.16B // Same but uses upper 8 bytes of V1 UXTL V0.8H, V1.8B // Zero-extend (unsigned) narrow → wide UXTL2 V0.8H, V1.16B // Upper-half zero-extend // The general SSHLL/USHLL also allow a non-zero shift (there is no alias for the non-zero case): SSHLL V0.8H, V1.8B, #3 // Sign-extend then left-shift by 3 (shift: 0..7 for .8B source) USHLL V0.4S, V1.4H, #5 // Zero-extend then left-shift by 5 (shift: 0..15 for .4H source) // Narrowing (wider inputs → narrow outputs, may lose data) // The "N" stands for "Narrow" — the result is narrower than the inputs. // **ISA-level truth** — three narrow/wide arrangement pairs: .8B←.8H, .4H←.4S, .2S←.2D. // The "2"-suffixed variants (XTN2/SQXTN2/UQXTN2/SQXTUN2) write the result into the UPPER 64 bits // of a 128-bit V register without disturbing the lower 64 bits. 
XTN V0.8B, V1.8H // Extract narrow: lower 8 bits of each halfword (truncate) XTN V0.4H, V1.4S // lower 16 bits of each word XTN V0.2S, V1.2D // lower 32 bits of each doubleword XTN2 V0.16B, V1.8H // Upper-half variant: writes into V0 bits [127:64] XTN2 V0.8H, V1.4S XTN2 V0.4S, V1.2D SQXTN V0.8B, V1.8H // Saturating narrow (signed): clamp each halfword to INT8 range SQXTN V0.4H, V1.4S // clamp word to INT16 SQXTN V0.2S, V1.2D // clamp doubleword to INT32 SQXTN2 V0.16B, V1.8H // Upper-half SQXTN2 V0.8H, V1.4S SQXTN2 V0.4S, V1.2D SQXTN Bd, Hn // Scalar: clamp Hn (16-bit signed) to 8-bit signed SQXTN Hd, Sn // Scalar: clamp Sn (32-bit signed) to 16-bit signed SQXTN Sd, Dn // Scalar: clamp Dn (64-bit signed) to 32-bit signed UQXTN V0.8B, V1.8H // Saturating narrow (unsigned): clamp to [0, UINT8_MAX] UQXTN V0.4H, V1.4S UQXTN V0.2S, V1.2D UQXTN2 V0.16B, V1.8H UQXTN2 V0.8H, V1.4S UQXTN2 V0.4S, V1.2D UQXTN Bd, Hn | Hd, Sn | Sd, Dn // Scalar forms (3 widths) SQXTUN V0.8B, V1.8H // Saturating narrow, signed→unsigned: clamp SIGNED input to [0, UINT8_MAX] SQXTUN V0.4H, V1.4S // clamp signed 32-bit to [0, UINT16_MAX] — useful for YCbCr→RGB SQXTUN V0.2S, V1.2D SQXTUN2 V0.16B, V1.8H SQXTUN2 V0.8H, V1.4S SQXTUN2 V0.4S, V1.2D SQXTUN Bd, Hn | Hd, Sn | Sd, Dn // Scalar forms ``` ### 23.6 Absolute Difference, Integer Unary & Per-Byte Counts Per-lane absolute difference (the core of SAD / motion estimation), saturating integer unary operations, and per-byte population/count helpers. 
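
To show how these pieces combine, here is a minimal sketch of a sum of absolute differences (SAD) over two 16-byte blocks, the inner step of many motion-estimation loops. The label `sad_u8x16` and the argument registers (pointers in `X0` and `X1`, result in `W0`) are assumptions for this example. The final reduction uses `UADDLV`, the widening across-vector add, which is a reduction instruction rather than part of this section.

```asm
    .text
    .global sad_u8x16                   // hypothetical label
// uint32_t sad_u8x16(const uint8_t a[16], const uint8_t b[16])
sad_u8x16:
    LDR    Q0, [X0]                     // V0 = a[0..15]
    LDR    Q1, [X1]                     // V1 = b[0..15]
    UABDL  V2.8H, V0.8B,  V1.8B         // |a[i] - b[i]| for bytes 0..7, widened to 16 bits
    UABAL2 V2.8H, V0.16B, V1.16B        // += |a[i] - b[i]| for bytes 8..15 (at most 510 per lane)
    UADDLV S2, V2.8H                    // widening sum of the 8 halfwords into one 32-bit value
    FMOV   W0, S2                       // move the result to the integer return register
    RET
```

Starting with the widening `UABDL` rather than `UABD` means the accumulator lanes are already 16 bits wide, so several more blocks could be accumulated with `UABAL`/`UABAL2` before a reduction is needed.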
```asm // Absolute-difference (per-lane |Vn - Vm|), central to image-processing SAD: // Same-width form — accepts .8B/.16B/.4H/.8H/.2S/.4S: SABD V0.8B, V1.8B, V2.8B // Signed absolute difference SABD V0.16B, V1.16B, V2.16B SABD V0.4H, V1.4H, V2.4H SABD V0.8H, V1.8H, V2.8H SABD V0.2S, V1.2S, V2.2S SABD V0.4S, V1.4S, V2.4S UABD V0.8B, V1.8B, V2.8B // Unsigned — same arrangement set UABD V0.16B, V1.16B, V2.16B UABD V0.4H, V1.4H, V2.4H UABD V0.8H, V1.8H, V2.8H UABD V0.2S, V1.2S, V2.2S UABD V0.4S, V1.4S, V2.4S // Widening form (no overflow since result is wider): SABDL V0.8H, V1.8B, V2.8B // 8→16 signed SABDL V0.4S, V1.4H, V2.4H // 16→32 SABDL V0.2D, V1.2S, V2.2S // 32→64 SABDL2 V0.8H, V1.16B, V2.16B // Upper-half sources SABDL2 V0.4S, V1.8H, V2.8H SABDL2 V0.2D, V1.4S, V2.4S UABDL V0.8H, V1.8B, V2.8B // Unsigned widening — same arrangement set UABDL V0.4S, V1.4H, V2.4H UABDL V0.2D, V1.2S, V2.2S UABDL2 V0.8H, V1.16B, V2.16B UABDL2 V0.4S, V1.8H, V2.8H UABDL2 V0.2D, V1.4S, V2.4S // Absolute-difference and accumulate (Vd += |Vn - Vm|): SABA V0.8B, V1.8B, V2.8B // Signed abs-diff accumulate (same arrangements as SABD) SABA V0.16B, V1.16B, V2.16B SABA V0.4H, V1.4H, V2.4H SABA V0.8H, V1.8H, V2.8H SABA V0.2S, V1.2S, V2.2S SABA V0.4S, V1.4S, V2.4S UABA V0.8B, V1.8B, V2.8B // Unsigned abs-diff accumulate UABA V0.16B, V1.16B, V2.16B UABA V0.4H, V1.4H, V2.4H UABA V0.8H, V1.8H, V2.8H UABA V0.2S, V1.2S, V2.2S UABA V0.4S, V1.4S, V2.4S SABAL V0.8H, V1.8B, V2.8B // Widening abs-diff accumulate SABAL V0.4S, V1.4H, V2.4H SABAL V0.2D, V1.2S, V2.2S SABAL2 V0.8H, V1.16B, V2.16B SABAL2 V0.4S, V1.8H, V2.8H SABAL2 V0.2D, V1.4S, V2.4S UABAL V0.8H, V1.8B, V2.8B UABAL V0.4S, V1.4H, V2.4H UABAL V0.2D, V1.2S, V2.2S UABAL2 V0.8H, V1.16B, V2.16B UABAL2 V0.4S, V1.8H, V2.8H UABAL2 V0.2D, V1.4S, V2.4S // Integer vector unary (non-FP abs/neg, saturating variants): ABS V0.8B, V1.8B // Per-lane absolute value (two's complement → non-negative; INT_MIN→INT_MIN, wraps) ABS V0.16B, V1.16B ABS V0.4H, V1.4H ABS 
V0.8H, V1.8H ABS V0.2S, V1.2S ABS V0.4S, V1.4S ABS V0.2D, V1.2D // 2× 64-bit ABS Dd, Dn // Scalar 64-bit ABS NEG V0.8B, V1.8B // Per-lane negate (same arrangements as ABS) NEG V0.16B, V1.16B NEG V0.4H, V1.4H NEG V0.8H, V1.8H NEG V0.2S, V1.2S NEG V0.4S, V1.4S NEG V0.2D, V1.2D NEG Dd, Dn // Scalar 64-bit NEG SQABS V0.8B, V1.8B // Saturating ABS (INT_MIN → INT_MAX instead of wrapping) SQABS V0.16B, V1.16B SQABS V0.4H, V1.4H SQABS V0.8H, V1.8H SQABS V0.2S, V1.2S SQABS V0.4S, V1.4S SQABS V0.2D, V1.2D SQABS Bd, Bn // Scalar saturating ABS — byte SQABS Hd, Hn // halfword SQABS Sd, Sn // word SQABS Dd, Dn // doubleword SQNEG V0.8B, V1.8B // Saturating NEG — same arrangements as SQABS SQNEG V0.16B, V1.16B SQNEG V0.4H, V1.4H SQNEG V0.8H, V1.8H SQNEG V0.2S, V1.2S SQNEG V0.4S, V1.4S SQNEG V0.2D, V1.2D SQNEG Bd, Bn // Scalar saturating NEG SQNEG Hd, Hn SQNEG Sd, Sn SQNEG Dd, Dn // Per-byte count operations (useful for popcount / leading-zero lookup): CLS V0.8B, V1.8B // Count Leading Sign bits per lane. Byte/halfword/word (.8B/.16B/.4H/.8H/.2S/.4S) CLS V0.16B, V1.16B CLS V0.4H, V1.4H CLS V0.8H, V1.8H CLS V0.2S, V1.2S CLS V0.4S, V1.4S // No .2D form for vector CLS/CLZ CLZ V0.8B, V1.8B // Count Leading Zeros per lane — same arrangement set as CLS CLZ V0.16B, V1.16B CLZ V0.4H, V1.4H CLZ V0.8H, V1.8H CLZ V0.2S, V1.2S CLZ V0.4S, V1.4S CNT V0.8B, V1.8B // Population count per byte (only .8B and .16B accepted) CNT V0.16B, V1.16B RBIT V0.8B, V1.8B // Reverse bits within each byte (only .8B and .16B accepted) RBIT V0.16B, V1.16B ``` ### 23.7 Pairwise, Reductions & Min/Max Horizontal operations that combine lanes: pairwise (widening) add, across-vector reductions such as `ADDV`, and per-lane and pairwise min/max. 
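
A minimal sketch of the three horizontal-sum shapes this section covers, assuming the input vector is in V0 (integer) or V2 (FP). The register choices are illustrative only.

```asm
// Integer, 4 × 32-bit: a single across-vector reduction.
ADDV S1, V0.4S // S1 = V0[0] + V0[1] + V0[2] + V0[3] (wraps modulo 2^32)
FMOV W0, S1 // move the sum to a GPR

// Integer, 2 × 64-bit: ADDV has no .2D form, so use the scalar pairwise add.
ADDP D1, V0.2D // D1 = V0[0] + V0[1]

// FP, 4 × single: Advanced SIMD has no FADDV, so reduce in two pairwise steps.
FADDP V1.4S, V2.4S, V2.4S // V1 = [a0+a1, a2+a3, a0+a1, a2+a3] where V2 = [a0, a1, a2, a3]
FADDP S1, V1.2S // S1 = (a0+a1) + (a2+a3)
```

Note that the FP version fixes the association order as (a0+a1) + (a2+a3), which can differ in the last bit from a left-to-right scalar loop.
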
```asm
// Pairwise widening add (widen each element, then add adjacent lanes — result has half the lane count):
SADDLP V0.4H, V1.8B // Sign-extend and pairwise add: 8 bytes → 4 halves (V0[i] = V1[2i] + V1[2i+1] sign-extended)
SADDLP V0.8H, V1.16B
SADDLP V0.2S, V1.4H
SADDLP V0.4S, V1.8H
SADDLP V0.1D, V1.2S
SADDLP V0.2D, V1.4S
UADDLP V0.4H, V1.8B // Zero-extend and pairwise add — same arrangement set
UADDLP V0.8H, V1.16B
UADDLP V0.2S, V1.4H
UADDLP V0.4S, V1.8H
UADDLP V0.1D, V1.2S
UADDLP V0.2D, V1.4S
SADALP V0.4H, V1.8B // Pairwise widen-add and ACCUMULATE into Vd
SADALP V0.8H, V1.16B
SADALP V0.2S, V1.4H
SADALP V0.4S, V1.8H
SADALP V0.1D, V1.2S
SADALP V0.2D, V1.4S
UADALP V0.4H, V1.8B // Unsigned version — same set
UADALP V0.8H, V1.16B
UADALP V0.2S, V1.4H
UADALP V0.4S, V1.8H
UADALP V0.1D, V1.2S
UADALP V0.2D, V1.4S
// Reduction (collapse all lanes into a single scalar).
// **ISA-level truth** — ADDV arrangement/destination set:
// ADDV Bd, Vn.8B | Bd, Vn.16B | Hd, Vn.4H | Hd, Vn.8H | Sd, Vn.4S
// (NO .2S, NO .2D, NO .1D — ADDV requires at least 4 lanes; 64-bit reduction uses ADDP Dd,Vn.2D)
ADDV Bd, V0.8B // Sum 8 bytes → 1 byte (low byte of result register)
ADDV Bd, V0.16B // Sum 16 bytes → 1 byte
ADDV Hd, V0.4H // Sum 4 halfwords → 1 halfword
ADDV Hd, V0.8H // Sum 8 halfwords → 1 halfword
ADDV Sd, V0.4S // Sum 4 words → 1 word
// 2-lane integer reductions: only .2D has a scalar form (ADDP Dd, V0.2D). For .2S, use the vector
// form ADDP V0.2S, V1.2S, V1.2S and read lane 0. The .2H arrangement exists only for the FP16
// scalar pairwise instructions (FADDP Hd, Vn.2H and friends), not for integer ADDP.
// Widening sum reductions — result is one size wider than input elements, preventing overflow.
// SADDLV/UADDLV arrangement set (Vd is one size wider): // Hd ← Vn.8B / Vn.16B (byte sum → halfword accumulator) // Sd ← Vn.4H / Vn.8H (halfword sum → word accumulator) // Dd ← Vn.4S (word sum → doubleword accumulator; only .4S — .2S not valid) SADDLV Hd, V0.8B // Widening signed sum: 8 bytes → 1 halfword SADDLV Hd, V0.16B SADDLV Sd, V0.4H // Widening: 4 halfwords → 1 word SADDLV Sd, V0.8H SADDLV Dd, V0.4S // Widening: 4 words → 1 doubleword (prevents overflow) UADDLV Hd, V0.8B // Unsigned widening — same arrangement set UADDLV Hd, V0.16B UADDLV Sd, V0.4H UADDLV Sd, V0.8H UADDLV Dd, V0.4S // Integer min/max reductions — same arrangement/destination set as ADDV (no .2D, no .2S, no .1D). SMAXV Bd, V0.8B // Signed max across lanes SMAXV Bd, V0.16B SMAXV Hd, V0.4H SMAXV Hd, V0.8H SMAXV Sd, V0.4S SMINV Bd, V0.8B // Signed min — same arrangement set SMINV Bd, V0.16B SMINV Hd, V0.4H SMINV Hd, V0.8H SMINV Sd, V0.4S UMAXV Bd, V0.8B // Unsigned max — same arrangement set UMAXV Bd, V0.16B UMAXV Hd, V0.4H UMAXV Hd, V0.8H UMAXV Sd, V0.4S UMINV Bd, V0.8B // Unsigned min — same arrangement set UMINV Bd, V0.16B UMINV Hd, V0.4H UMINV Hd, V0.8H UMINV Sd, V0.4S // FP across-vector reductions. NOTE: there is NO FADDV — FP addition reduction must // be built from pairwise FADDP (e.g. two FADDPs on a .4S sums 4 lanes in 2 instructions). // Only max/min reduce across the vector as single instructions. // **ISA-level truth** — arrangement/destination set: // Sd ← Vn.4S (the only 32-bit FP reduction) // Hd ← Vn.4H / Vn.8H (FEAT_FP16) // NO .2S, .2D, .1D — those are handled by FMAXP/FMINP scalar. 
FMAXV Hd, V0.4H // FEAT_FP16 — IEEE max across 4 half-precision lanes (propagates NaN)
FMAXV Hd, V0.8H // FEAT_FP16 — across 8 lanes
FMAXV Sd, V0.4S // IEEE max across 4 single-precision lanes (propagates NaN)
FMINV Hd, V0.4H // FEAT_FP16
FMINV Hd, V0.8H // FEAT_FP16
FMINV Sd, V0.4S
FMAXNMV Hd, V0.4H // FEAT_FP16 — NaN-suppressing max
FMAXNMV Hd, V0.8H // FEAT_FP16
FMAXNMV Sd, V0.4S
FMINNMV Hd, V0.4H // FEAT_FP16 — NaN-suppressing min
FMINNMV Hd, V0.8H // FEAT_FP16
FMINNMV Sd, V0.4S
// Per-lane integer min/max (baseline NEON, distinct from the FEAT_CSSC SMAX/SMIN/UMAX/UMIN on GPRs).
// Arrangements: .8B/.16B, .4H/.8H, .2S/.4S — NO .1D/.2D (integer min/max is not defined for 64-bit lanes).
// For 64-bit-lane min/max, build a mask with CMGT (signed) or CMHI (unsigned) and select with BSL/BIT/BIF,
// or derive the mask from the sign bit of an SQSUB result.
SMAX V0.8B, V1.8B, V2.8B // Signed per-lane max (byte)
SMAX V0.16B, V1.16B, V2.16B
SMAX V0.4H, V1.4H, V2.4H // Signed per-lane max (halfword)
SMAX V0.8H, V1.8H, V2.8H
SMAX V0.2S, V1.2S, V2.2S // Signed per-lane max (word)
SMAX V0.4S, V1.4S, V2.4S
SMIN V0.8B, V1.8B, V2.8B // Signed per-lane min — same 6 arrangements
SMIN V0.16B, V1.16B, V2.16B
SMIN V0.4H, V1.4H, V2.4H
SMIN V0.8H, V1.8H, V2.8H
SMIN V0.2S, V1.2S, V2.2S
SMIN V0.4S, V1.4S, V2.4S
UMAX V0.8B, V1.8B, V2.8B // Unsigned per-lane max — same 6 arrangements
UMAX V0.16B, V1.16B, V2.16B
UMAX V0.4H, V1.4H, V2.4H
UMAX V0.8H, V1.8H, V2.8H
UMAX V0.2S, V1.2S, V2.2S
UMAX V0.4S, V1.4S, V2.4S
UMIN V0.8B, V1.8B, V2.8B // Unsigned per-lane min — same 6 arrangements
UMIN V0.16B, V1.16B, V2.16B
UMIN V0.4H, V1.4H, V2.4H
UMIN V0.8H, V1.8H, V2.8H
UMIN V0.2S, V1.2S, V2.2S
UMIN V0.4S, V1.4S, V2.4S
// Pairwise operations: combine adjacent lanes. Useful for tree-reduction and per-pair ops.
// V0[i] = op(V1[2i], V1[2i+1]) for the lower half of V0; the upper half gets the pairs from V2.
ADDP V0.4S, V1.4S, V2.4S // Pairwise add: V0 = [V1[0]+V1[1], V1[2]+V1[3], V2[0]+V2[1], V2[2]+V2[3]]
SMAXP V0.4S, V1.4S, V2.4S // Signed pairwise max (integer): same pair-and-concatenate layout as ADDP
SMINP V0.4S, V1.4S, V2.4S // Signed pairwise min
UMAXP V0.4S, V1.4S, V2.4S // Unsigned pairwise max
UMINP V0.4S, V1.4S, V2.4S // Unsigned pairwise min
// SMAXP/SMINP/UMAXP/UMINP accept arrangements .8B/.16B, .4H/.8H, .2S/.4S — NO .1D/.2D
// (integer pairwise min/max is not defined for 64-bit elements, same restriction as SMAX/UMAX).
FADDP V0.2S, V1.2S, V2.2S // FP pairwise add — all 5 FP arrangements
FADDP V0.4S, V1.4S, V2.4S
FADDP V0.2D, V1.2D, V2.2D
FADDP V0.4H, V1.4H, V2.4H // FEAT_FP16
FADDP V0.8H, V1.8H, V2.8H // FEAT_FP16
FMAXP V0.2S, V1.2S, V2.2S // FP pairwise max (IEEE — propagates NaN) — same arrangement set
FMAXP V0.4S, V1.4S, V2.4S
FMAXP V0.2D, V1.2D, V2.2D
FMAXP V0.4H, V1.4H, V2.4H // FEAT_FP16
FMAXP V0.8H, V1.8H, V2.8H // FEAT_FP16
FMINP V0.2S, V1.2S, V2.2S // FP pairwise min — same arrangement set
FMINP V0.4S, V1.4S, V2.4S
FMINP V0.2D, V1.2D, V2.2D
FMINP V0.4H, V1.4H, V2.4H // FEAT_FP16
FMINP V0.8H, V1.8H, V2.8H // FEAT_FP16
FMAXNMP V0.2S, V1.2S, V2.2S // FP pairwise max, NaN-suppressing — same arrangement set
FMAXNMP V0.4S, V1.4S, V2.4S
FMAXNMP V0.2D, V1.2D, V2.2D
FMAXNMP V0.4H, V1.4H, V2.4H // FEAT_FP16
FMAXNMP V0.8H, V1.8H, V2.8H // FEAT_FP16
FMINNMP V0.2S, V1.2S, V2.2S // FP pairwise min, NaN-suppressing — same arrangement set
FMINNMP V0.4S, V1.4S, V2.4S
FMINNMP V0.2D, V1.2D, V2.2D
FMINNMP V0.4H, V1.4H, V2.4H // FEAT_FP16
FMINNMP V0.8H, V1.8H, V2.8H // FEAT_FP16
// Scalar pairwise forms: reduce the 2 lanes of a .2S/.2D (or .2H with FEAT_FP16) source into one scalar.
// Integer ADDP has only the .2D scalar form; the FP instructions cover .2H/.2S/.2D.
ADDP Dd, V0.2D // Dd = V0[0] + V0[1] (integer; the 64-bit horizontal sum that ADDV does not provide)
FADDP Hd, V0.2H // FEAT_FP16: Hd = V0[0] + V0[1]
FADDP Sd, V0.2S // Sd = V0[0] + V0[1] (FP pairwise scalar)
FADDP Dd, V0.2D // Dd = V0[0] + V0[1] (FP pairwise scalar)
FMAXP Hd, V0.2H // FEAT_FP16
FMAXP Sd, V0.2S
FMAXP Dd, V0.2D
FMINP Hd, V0.2H // FEAT_FP16
FMINP Sd, V0.2S
FMINP Dd, V0.2D
FMAXNMP Hd, V0.2H // FEAT_FP16 NaN-suppressing pairwise max scalar
FMAXNMP Sd, V0.2S
FMAXNMP Dd, V0.2D
FMINNMP Hd, V0.2H // FEAT_FP16
FMINNMP Sd, V0.2S FMINNMP Dd, V0.2D ``` ### 23.8 FP Precision Conversion (vector) Vector floating-point precision conversion (widen/narrow between half, single, and double), including the round-to-odd `FCVTXN` that avoids double-rounding when narrowing. ```asm // Vector FP precision conversion (widen/narrow). Distinct from FCVTZS/FCVTNS // (int conversion) and from FCVT scalar (precision conversion on a single value). FCVTN V0.4H, V1.4S // Narrow 4× single → 4× half (writes lower 4 halves of V0, // zeroes upper half per scalar-SIMD rule above) FCVTN2 V0.8H, V1.4S // Narrow 4× single → upper 4 halves of V0 (preserves lower half) FCVTN V0.2S, V1.2D // Narrow 2× double → 2× single (writes lower 2 singles) FCVTN2 V0.4S, V1.2D // Narrow 2× double → upper 2 singles (preserves lower 2) FCVTL V0.4S, V1.4H // Widen 4× half → 4× single (reads lower 4 halves of V1) FCVTL2 V0.4S, V1.8H // Widen upper 4 halves → 4× single FCVTL V0.2D, V1.2S // Widen 2× single → 2× double (reads lower 2) FCVTL2 V0.2D, V1.4S // Widen upper 2 singles → 2× double // FCVTXN / FCVTXN2 — narrow double → single with ROUND-TO-ODD (aka "jamming" rounding). // **ISA-level truth** — this is a distinct mnemonic from FCVTN because FCVTN uses the current // FPCR rounding mode, whereas FCVTXN always rounds to odd regardless of FPCR. Round-to-odd // preserves the property that a subsequent round-to-nearest gives the same result as directly // rounding the original double to half-precision — essential for correctly-rounded // double → half conversion in multi-step paths. Only double→single supported (no FP16 variants). FCVTXN V0.2S, V1.2D // Narrow 2× double → 2× single, round-to-odd (writes lower 2 singles) FCVTXN2 V0.4S, V1.2D // Narrow 2× double → upper 2 singles (preserves lower half) FCVTXN Sd, Dn // Scalar: narrow Dn to Sd with round-to-odd ``` ### 23.9 Vector Compare Per-lane compares. 
Unlike scalar `CMP`, these do NOT set NZCV — each lane produces an all-ones (true) or all-zeros (false) bitmask meant to feed a bitwise select. ```asm // Compare (result is a bitmask: all-ones if true, all-zeros if false). // **ISA-level truth** — integer compare family has register-register form and against-zero form. // Arrangement set: all 7 integer arrangements (.8B/.16B/.4H/.8H/.2S/.4S/.2D) plus scalar Dd // (operates on low 64 bits of V registers). CMLT/CMLE only exist in the against-zero form; // for register-register "less-than", swap operands and use CMGT/CMGE (or CMHI/CMHS for unsigned). // Register-register form — signed compare CMGT/CMGE, equality CMEQ, unsigned CMHI/CMHS, test CMTST: CMEQ V0.8B, V1.8B, V2.8B // Per-lane equality CMEQ V0.16B, V1.16B, V2.16B CMEQ V0.4H, V1.4H, V2.4H CMEQ V0.8H, V1.8H, V2.8H CMEQ V0.2S, V1.2S, V2.2S CMEQ V0.4S, V1.4S, V2.4S CMEQ V0.2D, V1.2D, V2.2D CMEQ Dd, Dn, Dm // Scalar 64-bit: all 64 bits of Dd set to 1 if Dn == Dm, else 0 CMGT V0.8B, V1.8B, V2.8B // Signed Greater Than — same 7 arrangements CMGT V0.16B, V1.16B, V2.16B CMGT V0.4H, V1.4H, V2.4H CMGT V0.8H, V1.8H, V2.8H CMGT V0.2S, V1.2S, V2.2S CMGT V0.4S, V1.4S, V2.4S CMGT V0.2D, V1.2D, V2.2D CMGT Dd, Dn, Dm CMGE V0.8B, V1.8B, V2.8B // Signed Greater or Equal — same 7 arrangements + scalar Dd CMGE V0.16B, V1.16B, V2.16B CMGE V0.4H, V1.4H, V2.4H CMGE V0.8H, V1.8H, V2.8H CMGE V0.2S, V1.2S, V2.2S CMGE V0.4S, V1.4S, V2.4S CMGE V0.2D, V1.2D, V2.2D CMGE Dd, Dn, Dm CMHI V0.8B, V1.8B, V2.8B // Unsigned Higher — same 7 arrangements + scalar Dd CMHI V0.16B, V1.16B, V2.16B CMHI V0.4H, V1.4H, V2.4H CMHI V0.8H, V1.8H, V2.8H CMHI V0.2S, V1.2S, V2.2S CMHI V0.4S, V1.4S, V2.4S CMHI V0.2D, V1.2D, V2.2D CMHI Dd, Dn, Dm CMHS V0.8B, V1.8B, V2.8B // Unsigned Higher or Same — same 7 arrangements + scalar Dd CMHS V0.16B, V1.16B, V2.16B CMHS V0.4H, V1.4H, V2.4H CMHS V0.8H, V1.8H, V2.8H CMHS V0.2S, V1.2S, V2.2S CMHS V0.4S, V1.4S, V2.4S CMHS V0.2D, V1.2D, V2.2D CMHS Dd, Dn, Dm CMTST V0.8B, 
V1.8B, V2.8B // Test: all-ones if (V1[i] & V2[i]) != 0 — same 7 arrangements + Dd CMTST V0.16B, V1.16B, V2.16B CMTST V0.4H, V1.4H, V2.4H CMTST V0.8H, V1.8H, V2.8H CMTST V0.2S, V1.2S, V2.2S CMTST V0.4S, V1.4S, V2.4S CMTST V0.2D, V1.2D, V2.2D CMTST Dd, Dn, Dm // Against-zero form — CMEQ/CMGT/CMGE/CMLE/CMLT against the #0 immediate. Same 7 arrangements + Dd. // (The #0 immediate is the ONLY allowed value; assemblers reject #1 or any non-zero immediate.) CMEQ V0.8B, V1.8B, #0 // all-ones if lane == 0 CMEQ V0.16B, V1.16B, #0 CMEQ V0.4H, V1.4H, #0 CMEQ V0.8H, V1.8H, #0 CMEQ V0.2S, V1.2S, #0 CMEQ V0.4S, V1.4S, #0 CMEQ V0.2D, V1.2D, #0 CMEQ Dd, Dn, #0 CMGT V0.8B, V1.8B, #0 // all-ones if lane > 0 (signed) CMGT V0.16B, V1.16B, #0 CMGT V0.4H, V1.4H, #0 CMGT V0.8H, V1.8H, #0 CMGT V0.2S, V1.2S, #0 CMGT V0.4S, V1.4S, #0 CMGT V0.2D, V1.2D, #0 CMGT Dd, Dn, #0 CMGE V0.8B, V1.8B, #0 // all-ones if lane >= 0 (signed) CMGE V0.16B, V1.16B, #0 CMGE V0.4H, V1.4H, #0 CMGE V0.8H, V1.8H, #0 CMGE V0.2S, V1.2S, #0 CMGE V0.4S, V1.4S, #0 CMGE V0.2D, V1.2D, #0 CMGE Dd, Dn, #0 CMLE V0.8B, V1.8B, #0 // all-ones if lane <= 0 (signed) — #0 immediate ONLY, no reg-reg form CMLE V0.16B, V1.16B, #0 CMLE V0.4H, V1.4H, #0 CMLE V0.8H, V1.8H, #0 CMLE V0.2S, V1.2S, #0 CMLE V0.4S, V1.4S, #0 CMLE V0.2D, V1.2D, #0 CMLE Dd, Dn, #0 CMLT V0.8B, V1.8B, #0 // all-ones if lane < 0 (signed) — #0 immediate ONLY, no reg-reg form CMLT V0.16B, V1.16B, #0 CMLT V0.4H, V1.4H, #0 CMLT V0.8H, V1.8H, #0 CMLT V0.2S, V1.2S, #0 CMLT V0.4S, V1.4S, #0 CMLT V0.2D, V1.2D, #0 CMLT Dd, Dn, #0 // FP vector compare — per-lane, produces all-ones/all-zeros bitmasks (NOT a flag-setting compare; // FCMP is the flag-setting one and is scalar-only). Unordered (NaN) always yields zero (FALSE) // in the result lane, so these compares are NaN-safe for ordered predicates. // **ISA-level truth** — all 5 FP vector arrangements (.2S/.4S/.2D, and .4H/.8H with FEAT_FP16) // plus scalar Hd/Sd/Dd (Hd FEAT_FP16). 
Both register-register and against-#0.0 forms available // for FCMEQ/FCMGT/FCMGE. FCMLT/FCMLE only exist as against-#0.0 form (no two-register form). FCMEQ V0.4H, V1.4H, V2.4H // FEAT_FP16 — per-lane equality (ordered) FCMEQ V0.8H, V1.8H, V2.8H // FEAT_FP16 FCMEQ V0.2S, V1.2S, V2.2S FCMEQ V0.4S, V1.4S, V2.4S FCMEQ V0.2D, V1.2D, V2.2D FCMEQ Hd, Hn, Hm // Scalar half (FEAT_FP16) FCMEQ Sd, Sn, Sm // Scalar single FCMEQ Dd, Dn, Dm // Scalar double FCMGT V0.4H, V1.4H, V2.4H // FEAT_FP16 — per-lane greater-than FCMGT V0.8H, V1.8H, V2.8H // FEAT_FP16 FCMGT V0.2S, V1.2S, V2.2S FCMGT V0.4S, V1.4S, V2.4S FCMGT V0.2D, V1.2D, V2.2D FCMGT Hd, Hn, Hm // FEAT_FP16 FCMGT Sd, Sn, Sm FCMGT Dd, Dn, Dm FCMGE V0.4H, V1.4H, V2.4H // FEAT_FP16 — per-lane greater-or-equal FCMGE V0.8H, V1.8H, V2.8H // FEAT_FP16 FCMGE V0.2S, V1.2S, V2.2S FCMGE V0.4S, V1.4S, V2.4S FCMGE V0.2D, V1.2D, V2.2D FCMGE Hd, Hn, Hm // FEAT_FP16 FCMGE Sd, Sn, Sm FCMGE Dd, Dn, Dm // Against #0.0 (the ONLY allowed FP immediate — assemblers reject any other value): FCMEQ V0.4H, V1.4H, #0.0 // FEAT_FP16 — same arrangement set + scalar Hd/Sd/Dd FCMEQ V0.8H, V1.8H, #0.0 // FEAT_FP16 FCMEQ V0.2S, V1.2S, #0.0 FCMEQ V0.4S, V1.4S, #0.0 FCMEQ V0.2D, V1.2D, #0.0 FCMEQ Hd, Hn, #0.0 | Sd, Sn, #0.0 | Dd, Dn, #0.0 // scalar FCMGT V0.4H, V1.4H, #0.0 | V0.8H, V1.8H, #0.0 // FEAT_FP16 FCMGT V0.2S, V1.2S, #0.0 | V0.4S, V1.4S, #0.0 | V0.2D, V1.2D, #0.0 FCMGT Hd, Hn, #0.0 | Sd, Sn, #0.0 | Dd, Dn, #0.0 FCMGE V0.4H, V1.4H, #0.0 | V0.8H, V1.8H, #0.0 // FEAT_FP16 FCMGE V0.2S, V1.2S, #0.0 | V0.4S, V1.4S, #0.0 | V0.2D, V1.2D, #0.0 FCMGE Hd, Hn, #0.0 | Sd, Sn, #0.0 | Dd, Dn, #0.0 FCMLT V0.4H, V1.4H, #0.0 | V0.8H, V1.8H, #0.0 // FEAT_FP16 — #0.0 ONLY, no two-register form FCMLT V0.2S, V1.2S, #0.0 | V0.4S, V1.4S, #0.0 | V0.2D, V1.2D, #0.0 FCMLT Hd, Hn, #0.0 | Sd, Sn, #0.0 | Dd, Dn, #0.0 FCMLE V0.4H, V1.4H, #0.0 | V0.8H, V1.8H, #0.0 // FEAT_FP16 — #0.0 ONLY, no two-register form FCMLE V0.2S, V1.2S, #0.0 | V0.4S, V1.4S, #0.0 | V0.2D, V1.2D, #0.0 
FCMLE Hd, Hn, #0.0 | Sd, Sn, #0.0 | Dd, Dn, #0.0
// For vector FCMLT/FCMLE against two registers, swap operands and use FCMGT/FCMGE.
```

### 23.10 Permute, Shuffle & Element Move

Moving data between lanes and registers: table lookup (`TBL`/`TBX`), zip/unzip, transpose, `EXT`, element insert/extract, and the `DUP`/`MOV` broadcast and copy forms.

```asm
// Table lookup (byte-level permutation — like x86 PSHUFB)
// Table can be 1, 2, 3, or 4 consecutive V registers → 16, 32, 48, or 64-byte lookup tables.
// Out-of-range indices return 0 for TBL; for TBX the destination lane is preserved.
TBL V0.16B, {V1.16B}, V2.16B // 16-byte table: V0[i] = V1[V2[i]], or 0 if V2[i] ≥ 16
TBL V0.16B, {V1.16B, V2.16B}, V3.16B // 32-byte table (2 consecutive registers)
TBL V0.16B, {V1.16B, V2.16B, V3.16B}, V4.16B // 48-byte table (3 consecutive)
TBL V0.16B, {V1.16B, V2.16B, V3.16B, V4.16B}, V5.16B // 64-byte table (4 consecutive — max)
TBX V0.16B, {V1.16B}, V2.16B // Like TBL but out-of-range indices leave V0[i] unchanged
// NOTE: the table registers MUST be consecutive and the list wraps V31→V0. The assembler enforces this.
// .8B forms also exist: the destination and the index register become .8B (8 result bytes),
// but the table registers are always written .16B. Otherwise the behavior is identical.
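// Sketch (register roles are assumptions, not a convention of this document): bytes → ASCII hex digits with TBL.
// Assumes V1.16B was preloaded with the 16 bytes "0123456789abcdef" and V3.16B holds the 16 input bytes.
USHR V4.16B, V3.16B, #4 // V4[i] = high nibble of input byte i (0..15)
MOVI V5.16B, #0x0F // V5 = 0x0F in every byte lane
AND V5.16B, V3.16B, V5.16B // V5[i] = low nibble of input byte i (0..15)
TBL V4.16B, {V1.16B}, V4.16B // V4[i] = hex digit for the high nibble (indices are always in range)
TBL V5.16B, {V1.16B}, V5.16B // V5[i] = hex digit for the low nibble
ZIP1 V6.16B, V4.16B, V5.16B // V6 = digits for input bytes 0..7, high digit first
ZIP2 V7.16B, V4.16B, V5.16B // V7 = digits for input bytes 8..15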
// Zip/unzip/transpose (interleave/deinterleave) — all permute ops accept the full arrangement set:
// .8B, .16B, .4H, .8H, .2S, .4S, .2D
ZIP1 V0.8B, V1.8B, V2.8B // Interleave elements from the lower halves: [V1[0], V2[0], V1[1], V2[1], ...]
ZIP1 V0.16B, V1.16B, V2.16B
ZIP1 V0.4H, V1.4H, V2.4H
ZIP1 V0.8H, V1.8H, V2.8H
ZIP1 V0.2S, V1.2S, V2.2S
ZIP1 V0.4S, V1.4S, V2.4S
ZIP1 V0.2D, V1.2D, V2.2D
ZIP2 V0.8B, V1.8B, V2.8B // Interleave elements from the upper halves — same 7 arrangements as ZIP1
ZIP2 V0.16B, V1.16B, V2.16B
ZIP2 V0.4H, V1.4H, V2.4H
ZIP2 V0.8H, V1.8H, V2.8H
ZIP2 V0.2S, V1.2S, V2.2S
ZIP2 V0.4S, V1.4S, V2.4S
ZIP2 V0.2D, V1.2D, V2.2D
UZP1 V0.8B, V1.8B, V2.8B // Even-indexed elements of V1, then of V2 — same 7 arrangements
UZP1 V0.16B, V1.16B, V2.16B
UZP1 V0.4H, V1.4H, V2.4H
UZP1 V0.8H, V1.8H, V2.8H
UZP1 V0.2S, V1.2S, V2.2S
UZP1 V0.4S, V1.4S, V2.4S
UZP1 V0.2D, V1.2D, V2.2D
UZP2 V0.8B, V1.8B, V2.8B // Odd-indexed elements of V1, then of V2 — same 7 arrangements
UZP2 V0.16B, V1.16B, V2.16B
UZP2 V0.4H, V1.4H, V2.4H
UZP2 V0.8H, V1.8H, V2.8H
UZP2 V0.2S, V1.2S, V2.2S
UZP2 V0.4S, V1.4S, V2.4S
UZP2 V0.2D, V1.2D, V2.2D
// Transpose (reorganize rows/columns — used in matrix operations):
TRN1 V0.8B, V1.8B, V2.8B // Transpose even-indexed elements: [V1[0], V2[0], V1[2], V2[2], ...] — same 7 arrangements
TRN1 V0.16B, V1.16B, V2.16B
TRN1 V0.4H, V1.4H, V2.4H
TRN1 V0.8H, V1.8H, V2.8H
TRN1 V0.2S, V1.2S, V2.2S
TRN1 V0.4S, V1.4S, V2.4S
TRN1 V0.2D, V1.2D, V2.2D
TRN2 V0.8B, V1.8B, V2.8B // Transpose odd-indexed elements: [V1[1], V2[1], V1[3], V2[3], ...] — same 7 arrangements
TRN2 V0.16B, V1.16B, V2.16B
TRN2 V0.4H, V1.4H, V2.4H
TRN2 V0.8H, V1.8H, V2.8H
TRN2 V0.2S, V1.2S, V2.2S
TRN2 V0.4S, V1.4S, V2.4S
TRN2 V0.2D, V1.2D, V2.2D
// Extract (byte-level sliding window across two registers): the result is taken from the
// concatenation V2:V1, starting at byte #imm of V1. Two arrangements:
// .8B form: #imm ∈ 0..7 (slides within 8 bytes total)
// .16B form: #imm ∈ 0..15 (slides within 16 bytes total)
EXT V0.8B, V1.8B, V2.8B, #imm // imm 0..7
EXT V0.16B, V1.16B, V2.16B, #imm // imm 0..15
// Insert/extract element (move between scalar and lane)
INS V0.S[i], Wn|WZR // Insert 32-bit GPR value into lane i of V0 (INS from general)
INS V0.D[i], Xn|XZR // Insert 64-bit GPR into lane i (i ∈ 0..1)
INS V0.B[i], Wn|WZR // Byte lane from GPR low 8 bits (i ∈ 0..15)
INS V0.H[i], Wn|WZR // Halfword lane (i ∈ 0..7)
INS V0.B[i], V1.B[j] // Element-to-element: insert lane j of V1 into lane i of V0 (byte)
INS V0.H[i], V1.H[j] // Halfword element copy (i, j ∈ 0..7)
INS V0.S[i], V1.S[j] // Word element copy (i, j ∈ 0..3)
INS V0.D[i], V1.D[j] // Doubleword element copy (i, j ∈ 0..1)
// **ISA-level truth**: `MOV Vd.T[i], Vn.T[j]` is a preferred-disassembly
// alias for INS — the two mnemonics assemble to the same encoding.
// INS (element) copies a single lane between V registers in one instruction
// without disturbing the other lanes; the alternative is a UMOV + INS round-trip through a GPR.
// MOV (vector) — full-register bitwise copy. **ISA-level truth**: this is an alias for
// ORR (vector, register) with Rm == Rn, and the underlying ORR only accepts .8B or .16B
// arrangement specifiers. So MOV V0.16B, V1.16B ≡ ORR V0.16B, V1.16B, V1.16B at the
// encoding level. The architectural syntax defines only the .8B and .16B spellings; whether
// another spelling such as `MOV V0.4S, V1.4S` is accepted depends on the assembler, and where
// it is accepted the resulting machine code is the same ORR V0.16B encoding.
// Per ARM ARM, the only preferred MOV (vector) forms are:
MOV V0.16B, V1.16B // V0 = V1 (entire 128 bits) — encodes ORR V0.16B, V1.16B, V1.16B
MOV V0.8B, V1.8B // V0 = V1 (lower 64 bits; upper half of V0 zeroed per §1.3 rule)
// **MOV aliases of INS** (ARM ARM preferred-disassembly forms) — assemble to the same encoding
// as the corresponding INS. Using MOV vs INS is a style choice for assembly authors; disassemblers
// prefer MOV because it reads more naturally for a move-like operation.
MOV V0.S[i], Wn|WZR // Alias of: INS V0.S[i], Wn|WZR (INS general)
MOV V0.D[i], Xn|XZR // Alias of: INS V0.D[i], Xn|XZR
MOV V0.B[i], Wn|WZR // Alias of: INS V0.B[i], Wn|WZR
MOV V0.H[i], Wn|WZR // Alias of: INS V0.H[i], Wn|WZR
MOV V0.S[i], V1.S[j] // Alias of: INS V0.S[i], V1.S[j] (INS element)
MOV V0.D[i], V1.D[j] // Alias of: INS V0.D[i], V1.D[j] (all widths same pattern)
UMOV Wd|WZR, V0.B[i] // Extract byte lane i, zero-extend into 32-bit GPR (i ∈ 0..15)
UMOV Wd|WZR, V0.H[i] // Extract halfword, zero-extend → 32-bit (i ∈ 0..7)
UMOV Wd|WZR, V0.S[i] // Extract word → 32-bit (i ∈ 0..3)
UMOV Xd|XZR, V0.D[i] // Extract doubleword → 64-bit (i ∈ 0..1)
// **MOV aliases of UMOV** — ONLY for the two cases where no actual extension occurs
// (.S into W is same width, .D into X is same width). The byte/halfword UMOV variants
// genuinely zero-extend, so they have no MOV alias spelling in ARM ARM.
MOV Wd|WZR, V0.S[i] // Alias of: UMOV Wd|WZR, V0.S[i] (.S only — no extension)
MOV Xd|XZR, V0.D[i] // Alias of: UMOV Xd|XZR, V0.D[i] (.D only — no extension)
SMOV Wd|WZR, V0.B[i] // Sign-extend byte → 32-bit
SMOV Xd|XZR, V0.B[i] // Sign-extend byte → 64-bit
SMOV Wd|WZR, V0.H[i] // Sign-extend halfword → 32-bit
SMOV Xd|XZR, V0.H[i] // Sign-extend halfword → 64-bit
SMOV Xd|XZR, V0.S[i] // Sign-extend word → 64-bit
// (SMOV always sign-extends; UMOV always zero-extends.
// SMOV has no .D form — would be a redundant 64→64 move.
// UMOV has no 32-bit-destination .D form because a 64-bit lane does not fit in a W register.
// SMOV has NO MOV alias — ARM ARM does not permit it, because
// the sign-extension semantic must remain visible in the mnemonic.)
// DUP (element → scalar) — extract one lane of a V register into a scalar FP/SIMD register,
// preserving the element's natural width. **ISA-level truth**: this has a MOV alias spelling.
DUP Bd, Vn.B[i] // Scalar byte from lane i (i ∈ 0..15)
DUP Hd, Vn.H[i] // Scalar halfword (i ∈ 0..7)
DUP Sd, Vn.S[i] // Scalar single (i ∈ 0..3)
DUP Dd, Vn.D[i] // Scalar double (i ∈ 0..1)
MOV Bd, Vn.B[i] // Alias of: DUP Bd, Vn.B[i]
MOV Hd, Vn.H[i] // Alias of: DUP Hd, Vn.H[i]
MOV Sd, Vn.S[i] // Alias of: DUP Sd, Vn.S[i]
MOV Dd, Vn.D[i] // Alias of: DUP Dd, Vn.D[i]
// Note: `DUP Dd, Vn.D[0]` (alias `MOV Dd, Vn.D[0]`), `FMOV Dd, Dn` (scalar FP move) and
// `MOV Vd.8B, Vn.8B` (ORR alias) all produce the same result bit-for-bit (low 64 bits copied,
// upper half of the destination zeroed) but use DIFFERENT encodings. Three machine-code paths for
// "copy a D register", each with its own disassembly. There is no `FMOV Dd, Vn.D[0]` form: the only
// lane-indexed FMOV forms are `FMOV Xd, Vn.D[1]` and `FMOV Vd.D[1], Xn`. All three encodings are
// 4 bytes, so compilers pick whichever performs best on the target core.
// DUP (scalar to vector) — broadcast a GPR value into every lane. GPR source uses |XZR.
DUP V0.8B, Wn|WZR // Broadcast low 8 bits of Wn to 8 byte lanes
DUP V0.16B, Wn|WZR // Broadcast low 8 bits to 16 byte lanes
DUP V0.4H, Wn|WZR // Broadcast low 16 bits to 4 halfword lanes
DUP V0.8H, Wn|WZR // Broadcast low 16 bits to 8 halfword lanes
DUP V0.2S, Wn|WZR // Broadcast Wn to 2 word lanes
DUP V0.4S, Wn|WZR // Broadcast Wn to 4 word lanes
DUP V0.2D, Xn|XZR // Broadcast Xn to 2 doubleword lanes
// DUP from lane (element-to-all-lanes):
DUP V0.8B, V1.B[i] // Broadcast V1's byte lane i to 8 byte lanes (64-bit vector)
DUP V0.16B, V1.B[i] // All 16 byte lanes get V1's byte lane i
DUP V0.4H, V1.H[i] // Broadcast V1's halfword lane i to 4 halfword lanes
DUP V0.8H, V1.H[i] // All 8 halfword lanes get V1's halfword lane i
DUP V0.2S, V1.S[i] // Broadcast V1's word lane i to 2 word lanes
DUP V0.4S, V1.S[i] // All 4 word lanes get V1's word lane i
DUP V0.2D, V1.D[i] // Both doubleword lanes get V1's doubleword lane i (i ∈ 0..1)
// NO `DUP V0.1D, V1.D[i]` — that would be a degenerate 1-lane "broadcast" to 64-bit; use DUP Dd, V1.D[i] (scalar).
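// Sketch (register roles are assumptions): in-register 4×4 transpose of 32-bit elements using TRN1/TRN2.
// Assumed input rows: V0 = [a0 a1 a2 a3], V1 = [b0 b1 b2 b3], V2 = [c0 c1 c2 c3], V3 = [d0 d1 d2 d3].
TRN1 V4.4S, V0.4S, V1.4S // V4 = [a0 b0 a2 b2]
TRN2 V5.4S, V0.4S, V1.4S // V5 = [a1 b1 a3 b3]
TRN1 V6.4S, V2.4S, V3.4S // V6 = [c0 d0 c2 d2]
TRN2 V7.4S, V2.4S, V3.4S // V7 = [c1 d1 c3 d3]
TRN1 V0.2D, V4.2D, V6.2D // V0 = [a0 b0 c0 d0] (column 0)
TRN1 V1.2D, V5.2D, V7.2D // V1 = [a1 b1 c1 d1] (column 1)
TRN2 V2.2D, V4.2D, V6.2D // V2 = [a2 b2 c2 d2] (column 2)
TRN2 V3.2D, V5.2D, V7.2D // V3 = [a3 b3 c3 d3] (column 3)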
``` ### 23.11 Vector Shifts Per-lane shifts by immediate and by register-variable amount, plus the widening, narrowing, saturating, rounding, shift-insert, and shift-accumulate variants. ```asm // Shift (per-lane) — immediate-amount variants. // SHIFT IMMEDIATE RANGES depend on element size (ELEMENT_BITS = 8/16/32/64 for B/H/S/D): // Left shifts (SHL, SLI, SQSHL (imm), SQSHLU, UQSHL (imm)): #shift ∈ 0 .. (ELEMENT_BITS − 1) // Right shifts (USHR, SSHR, SRSHR, URSHR, SRI): #shift ∈ 1 .. ELEMENT_BITS // Narrowing right-shifts (SHRN, RSHRN, SQSHRN, UQSHRN, SQSHRUN): // #shift ∈ 1 .. (dest_element_bits) // So .8B/.16B: 0..7 left, 1..8 right; .4H/.8H: 0..15 / 1..16; .2S/.4S: 0..31 / 1..32; // .2D: 0..63 left, 1..64 right (no .1D arrangement — D scalar form uses Dd/Dn/Dm instead). SHL V0.8B, V1.8B, #imm // Vector shift left (imm range 0..7). imm == 0 is legal (alias-like no-op). SHL V0.16B, V1.16B, #imm SHL V0.4H, V1.4H, #imm // imm 0..15 SHL V0.8H, V1.8H, #imm SHL V0.2S, V1.2S, #imm // imm 0..31 SHL V0.4S, V1.4S, #imm SHL V0.2D, V1.2D, #imm // imm 0..63 SHL Dd, Dn, #imm // Scalar shift left (64-bit); imm 0..63 USHR V0.8B, V1.8B, #imm // Unsigned shift right. imm 1..8 (cannot be 0 on right shifts) USHR V0.16B, V1.16B, #imm USHR V0.4H, V1.4H, #imm // imm 1..16 USHR V0.8H, V1.8H, #imm USHR V0.2S, V1.2S, #imm // imm 1..32 USHR V0.4S, V1.4S, #imm USHR V0.2D, V1.2D, #imm // imm 1..64 USHR Dd, Dn, #imm // Scalar USHR; imm 1..64 SSHR V0.8B, V1.8B, #imm // Signed (arithmetic) shift right — same arrangement set and range as USHR. SSHR V0.16B, V1.16B, #imm SSHR V0.4H, V1.4H, #imm SSHR V0.8H, V1.8H, #imm SSHR V0.2S, V1.2S, #imm SSHR V0.4S, V1.4S, #imm SSHR V0.2D, V1.2D, #imm SSHR Dd, Dn, #imm // Scalar SSHR SRSHR V0.4S, V1.4S, #imm // Signed rounding shift right — all SHL/SSHR/USHR arrangements supported URSHR V0.4S, V1.4S, #imm // Unsigned rounding shift right — same arrangement set // Saturating shifts by immediate (clamp on overflow instead of dropping bits). 
// **ISA-level truth** — arrangement set matches SHL: all 7 integer arrangements (.8B/.16B/.4H/.8H/ // .2S/.4S/.2D) plus scalar Bd/Hd/Sd/Dd. Same imm range as SHL (0..(elsize-1)). SQSHL V0.8B, V1.8B, #imm // Signed saturating left shift SQSHL V0.16B, V1.16B, #imm SQSHL V0.4H, V1.4H, #imm SQSHL V0.8H, V1.8H, #imm SQSHL V0.2S, V1.2S, #imm SQSHL V0.4S, V1.4S, #imm SQSHL V0.2D, V1.2D, #imm SQSHL Bd, Bn, #imm | Hd, Hn, #imm | Sd, Sn, #imm | Dd, Dn, #imm // Scalar (4 element sizes) UQSHL V0.8B, V1.8B, #imm // Unsigned saturating left shift — same arrangement set UQSHL V0.16B, V1.16B, #imm UQSHL V0.4H, V1.4H, #imm UQSHL V0.8H, V1.8H, #imm UQSHL V0.2S, V1.2S, #imm UQSHL V0.4S, V1.4S, #imm UQSHL V0.2D, V1.2D, #imm UQSHL Bd, Bn, #imm | Hd, Hn, #imm | Sd, Sn, #imm | Dd, Dn, #imm // Scalar SQSHLU V0.8B, V1.8B, #imm // Signed → unsigned saturating left shift (interpret result as unsigned) SQSHLU V0.16B, V1.16B, #imm SQSHLU V0.4H, V1.4H, #imm SQSHLU V0.8H, V1.8H, #imm SQSHLU V0.2S, V1.2S, #imm SQSHLU V0.4S, V1.4S, #imm SQSHLU V0.2D, V1.2D, #imm SQSHLU Bd, Bn, #imm | Hd, Hn, #imm | Sd, Sn, #imm | Dd, Dn, #imm // Scalar // Shift + narrow (wide input, narrow output, with truncation/rounding/saturation). // Destination element is half the source element width; shift range is 1 .. dest_element_bits. // So for dest .8B (from .8H source): imm 1..8; for dest .4H (from .4S source): imm 1..16; etc. 
SHRN V0.8B, V1.8H, #imm // imm 1..8 (keeps bits [imm+7:imm] of each halfword; the rest is dropped)
SHRN V0.4H, V1.4S, #imm // imm 1..16
SHRN V0.2S, V1.2D, #imm // imm 1..32
SHRN2 V0.16B, V1.8H, #imm // "2" = write the UPPER half of the destination
SHRN2 V0.8H, V1.4S, #imm
SHRN2 V0.4S, V1.2D, #imm
RSHRN V0.8B, V1.8H, #imm // Rounding shift-right-narrow — same arrangement set as SHRN
RSHRN V0.4H, V1.4S, #imm
RSHRN V0.2S, V1.2D, #imm
RSHRN2 V0.16B, V1.8H, #imm
RSHRN2 V0.8H, V1.4S, #imm
RSHRN2 V0.4S, V1.2D, #imm
SQSHRN V0.8B, V1.8H, #imm // Signed saturating shift-right-narrow
SQSHRN V0.4H, V1.4S, #imm
SQSHRN V0.2S, V1.2D, #imm
SQSHRN2 V0.16B, V1.8H, #imm
SQSHRN2 V0.8H, V1.4S, #imm
SQSHRN2 V0.4S, V1.2D, #imm
UQSHRN V0.8B, V1.8H, #imm // Unsigned saturating shift-right-narrow
UQSHRN V0.4H, V1.4S, #imm
UQSHRN V0.2S, V1.2D, #imm
UQSHRN2 V0.16B, V1.8H, #imm
UQSHRN2 V0.8H, V1.4S, #imm
UQSHRN2 V0.4S, V1.2D, #imm
SQSHRUN V0.8B, V1.8H, #imm // Signed → unsigned saturating shift-right-narrow
SQSHRUN V0.4H, V1.4S, #imm
SQSHRUN V0.2S, V1.2D, #imm
SQSHRUN2 V0.16B, V1.8H, #imm
SQSHRUN2 V0.8H, V1.4S, #imm
SQSHRUN2 V0.4S, V1.2D, #imm
SQRSHRN V0.8B, V1.8H, #imm // Rounding variants of the above:
SQRSHRN V0.4H, V1.4S, #imm
SQRSHRN V0.2S, V1.2D, #imm
SQRSHRN2 V0.16B, V1.8H, #imm
SQRSHRN2 V0.8H, V1.4S, #imm
SQRSHRN2 V0.4S, V1.2D, #imm
UQRSHRN V0.8B, V1.8H, #imm
UQRSHRN V0.4H, V1.4S, #imm
UQRSHRN V0.2S, V1.2D, #imm
UQRSHRN2 V0.16B, V1.8H, #imm
UQRSHRN2 V0.8H, V1.4S, #imm
UQRSHRN2 V0.4S, V1.2D, #imm
SQRSHRUN V0.8B, V1.8H, #imm
SQRSHRUN V0.4H, V1.4S, #imm
SQRSHRUN V0.2S, V1.2D, #imm
SQRSHRUN2 V0.16B, V1.8H, #imm
SQRSHRUN2 V0.8H, V1.4S, #imm
SQRSHRUN2 V0.4S, V1.2D, #imm
// Shift + widen (narrow input, wide output — fill zeroes/sign bits):
// SXTL/UXTL are the preferred-disassembly aliases of SSHLL/USHLL with imm = 0 (see §23 widening).
// imm range: 0 .. (source_element_bits − 1) for SSHLL/USHLL. A shift by exactly source_element_bits
// is not encodable in SSHLL/USHLL; that single case is the separate SHLL instruction listed below.
SSHLL V0.8H, V1.8B, #imm // Signed shift-left and widen; imm 0..7 (input = .8B / .4H / .2S)
SSHLL V0.4S, V1.4H, #imm // imm 0..15
SSHLL V0.2D, V1.2S, #imm // imm 0..31
SSHLL2 V0.8H, V1.16B, #imm // "2" = read upper half of source
SSHLL2 V0.4S, V1.8H, #imm
SSHLL2 V0.2D, V1.4S, #imm
USHLL V0.8H, V1.8B, #imm // Unsigned shift-left and widen (fills with zeros) — same set
USHLL V0.4S, V1.4H, #imm
USHLL V0.2D, V1.2S, #imm
USHLL2 V0.8H, V1.16B, #imm
USHLL2 V0.4S, V1.8H, #imm
USHLL2 V0.2D, V1.4S, #imm
// SHLL / SHLL2 — shift-left-and-widen by EXACTLY the source element size.
// **ISA-level truth** — this exists as a distinct mnemonic from (U)SHLL because USHLL's shift
// field is 0..(src_element_size − 1), so shift-by-exactly-src_element_size isn't encodable in
// USHLL. SHLL encodes that one specific case: each source element gets left-shifted by its full
// width and zero-filled, so Vd[i] = Vn[i] << element_size.
SHLL V0.8H, V1.8B, #8 // 8-bit source: shift by 8 → halfword destination
SHLL V0.4S, V1.4H, #16 // halfword source: shift by 16 → word destination
SHLL V0.2D, V1.2S, #32 // word source: shift by 32 → doubleword destination
SHLL2 V0.8H, V1.16B, #8 // "2" variant — reads upper half of source
SHLL2 V0.4S, V1.8H, #16
SHLL2 V0.2D, V1.4S, #32
// The shift amount is REQUIRED in the syntax and MUST equal the source element size. Any other value
// cannot be encoded and is rejected by the assembler (the encoding has no shift field; the amount
// is implied by the element size).
// Shift-insert: shift Vn and merge the result into Vd, keeping the bits of Vd that the shifted
// value does not cover. Used for bit-packing:
// SLI writes (Vn << imm) into the upper (element_bits − imm) bits of each lane and keeps the low imm bits of Vd;
// SRI writes (Vn >> imm) into the lower (element_bits − imm) bits and keeps the high imm bits of Vd.
// **ISA-level truth** — arrangements: all 7 integer (.8B/.16B/.4H/.8H/.2S/.4S/.2D) + scalar Dd.
// SLI imm range: 0..(element_bits − 1). SRI imm range: 1..element_bits.
SLI V0.8B, V1.8B, #imm  // Shift-left-and-insert (byte; imm 0..7)
SLI V0.16B, V1.16B, #imm
SLI V0.4H, V1.4H, #imm  // halfword (imm 0..15)
SLI V0.8H, V1.8H, #imm
SLI V0.2S, V1.2S, #imm  // word (imm 0..31)
SLI V0.4S, V1.4S, #imm
SLI V0.2D, V1.2D, #imm  // doubleword (imm 0..63)
SLI Dd, Dn, #imm  // Scalar 64-bit (imm 0..63)
SRI V0.8B, V1.8B, #imm  // Shift-right-and-insert — same arrangement set (imm 1..element_bits)
SRI V0.16B, V1.16B, #imm
SRI V0.4H, V1.4H, #imm
SRI V0.8H, V1.8H, #imm
SRI V0.2S, V1.2S, #imm
SRI V0.4S, V1.4S, #imm
SRI V0.2D, V1.2D, #imm
SRI Dd, Dn, #imm

// Shift-right-and-ACCUMULATE — Vd += (Vn >> imm), optionally with rounding.
// **ISA-level truth** — all 7 integer arrangements + scalar Dd, imm range 1..element_bits.
// Heavily used in DSP IIR filters and fixed-point leaky-integrator averages.
SSRA V0.8B, V1.8B, #imm  // Signed shift-right-accumulate (Vd += (Vn >>_s imm))
SSRA V0.16B, V1.16B, #imm
SSRA V0.4H, V1.4H, #imm
SSRA V0.8H, V1.8H, #imm
SSRA V0.2S, V1.2S, #imm
SSRA V0.4S, V1.4S, #imm
SSRA V0.2D, V1.2D, #imm
SSRA Dd, Dn, #imm  // Scalar 64-bit
USRA V0.8B, V1.8B, #imm  // Unsigned shift-right-accumulate — same arrangements
USRA V0.16B, V1.16B, #imm
USRA V0.4H, V1.4H, #imm
USRA V0.8H, V1.8H, #imm
USRA V0.2S, V1.2S, #imm
USRA V0.4S, V1.4S, #imm
USRA V0.2D, V1.2D, #imm
USRA Dd, Dn, #imm
SRSRA V0.8B, V1.8B, #imm  // Signed ROUNDING shift-right-accumulate (round-half-up)
SRSRA V0.16B, V1.16B, #imm
SRSRA V0.4H, V1.4H, #imm
SRSRA V0.8H, V1.8H, #imm
SRSRA V0.2S, V1.2S, #imm
SRSRA V0.4S, V1.4S, #imm
SRSRA V0.2D, V1.2D, #imm
SRSRA Dd, Dn, #imm
URSRA V0.8B, V1.8B, #imm  // Unsigned ROUNDING shift-right-accumulate — same arrangements
URSRA V0.16B, V1.16B, #imm
URSRA V0.4H, V1.4H, #imm
URSRA V0.8H, V1.8H, #imm
URSRA V0.2S, V1.2S, #imm
URSRA V0.4S, V1.4S, #imm
URSRA V0.2D, V1.2D, #imm
URSRA Dd, Dn, #imm

// Rounding shift-right by immediate (no accumulate) — SRSHR / URSHR:
SRSHR V0.8B, V1.8B, #imm  // Signed ROUNDING shift right (Vd = (Vn >>_s imm) with round-half-up)
SRSHR V0.16B, V1.16B, #imm
SRSHR V0.4H, V1.4H, #imm
SRSHR V0.8H, V1.8H, #imm
SRSHR V0.2S, V1.2S, #imm
SRSHR V0.4S, V1.4S, #imm
SRSHR V0.2D, V1.2D, #imm
SRSHR Dd, Dn, #imm
URSHR V0.8B, V1.8B, #imm  // Unsigned ROUNDING shift right — same arrangements
URSHR V0.16B, V1.16B, #imm
URSHR V0.4H, V1.4H, #imm
URSHR V0.8H, V1.8H, #imm
URSHR V0.2S, V1.2S, #imm
URSHR V0.4S, V1.4S, #imm
URSHR V0.2D, V1.2D, #imm
URSHR Dd, Dn, #imm

// Register-variable shift amount (signed amount in each lane of Vm — negative means right-shift).
// **ISA-level truth** — arrangements: all 7 integer (.8B/.16B/.4H/.8H/.2S/.4S/.2D) + scalar Dd.
SSHL V0.8B, V1.8B, V2.8B  // Signed shift left (by signed amount, per lane)
SSHL V0.16B, V1.16B, V2.16B
SSHL V0.4H, V1.4H, V2.4H
SSHL V0.8H, V1.8H, V2.8H
SSHL V0.2S, V1.2S, V2.2S
SSHL V0.4S, V1.4S, V2.4S
SSHL V0.2D, V1.2D, V2.2D
SSHL Dd, Dn, Dm  // Scalar 64-bit
USHL V0.8B, V1.8B, V2.8B  // Unsigned shift left — same arrangements
USHL V0.16B, V1.16B, V2.16B
USHL V0.4H, V1.4H, V2.4H
USHL V0.8H, V1.8H, V2.8H
USHL V0.2S, V1.2S, V2.2S
USHL V0.4S, V1.4S, V2.4S
USHL V0.2D, V1.2D, V2.2D
USHL Dd, Dn, Dm
SQSHL V0.8B, V1.8B, V2.8B  // Signed saturating shift (register amount) — all 7 + scalar Dd
SQSHL V0.16B, V1.16B, V2.16B
SQSHL V0.4H, V1.4H, V2.4H
SQSHL V0.8H, V1.8H, V2.8H
SQSHL V0.2S, V1.2S, V2.2S
SQSHL V0.4S, V1.4S, V2.4S
SQSHL V0.2D, V1.2D, V2.2D
SQSHL Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm  // Scalar (register variant supports all 4 sizes)
UQSHL V0.8B, V1.8B, V2.8B  // Unsigned saturating shift (register) — same arrangements
UQSHL V0.16B, V1.16B, V2.16B
UQSHL V0.4H, V1.4H, V2.4H
UQSHL V0.8H, V1.8H, V2.8H
UQSHL V0.2S, V1.2S, V2.2S
UQSHL V0.4S, V1.4S, V2.4S
UQSHL V0.2D, V1.2D, V2.2D
UQSHL Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm
SRSHL V0.8B, V1.8B, V2.8B  // Signed ROUNDING shift — same arrangements
SRSHL V0.16B, V1.16B, V2.16B
SRSHL V0.4H, V1.4H, V2.4H
SRSHL V0.8H, V1.8H, V2.8H
SRSHL V0.2S, V1.2S, V2.2S
SRSHL V0.4S, V1.4S, V2.4S
SRSHL V0.2D, V1.2D, V2.2D
SRSHL Dd, Dn, Dm
URSHL V0.8B, V1.8B, V2.8B  // Unsigned ROUNDING shift — same arrangements
URSHL V0.16B, V1.16B, V2.16B
URSHL V0.4H, V1.4H, V2.4H
URSHL V0.8H, V1.8H, V2.8H
URSHL V0.2S, V1.2S, V2.2S
URSHL V0.4S, V1.4S, V2.4S
URSHL V0.2D, V1.2D, V2.2D
URSHL Dd, Dn, Dm
SQRSHL V0.8B, V1.8B, V2.8B  // Signed saturating rounding shift — same arrangements
SQRSHL V0.16B, V1.16B, V2.16B
SQRSHL V0.4H, V1.4H, V2.4H
SQRSHL V0.8H, V1.8H, V2.8H
SQRSHL V0.2S, V1.2S, V2.2S
SQRSHL V0.4S, V1.4S, V2.4S
SQRSHL V0.2D, V1.2D, V2.2D
SQRSHL Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm
UQRSHL V0.8B, V1.8B, V2.8B  // Unsigned saturating rounding shift — same arrangements
UQRSHL V0.16B, V1.16B, V2.16B
UQRSHL V0.4H, V1.4H, V2.4H
UQRSHL V0.8H, V1.8H, V2.8H
UQRSHL V0.2S, V1.2S, V2.2S
UQRSHL V0.4S, V1.4S, V2.4S
UQRSHL V0.2D, V1.2D, V2.2D
UQRSHL Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm
```

### 23.12 Reverse, Bitwise Logical & Select

Bit and byte reversal, the vector bitwise-logical ops (`AND`/`ORR`/`EOR`/`BIC`/`ORN`/`NOT`), plain negate/abs, and the `BSL`/`BIT`/`BIF` bitwise-select family — NEON's branchless equivalent of `CSEL`.

```asm
// Byte/bit reversal on vectors (useful for endian swaps, bit-reverse FFT indexing).
// **ISA-level truth** — arrangement sets match ARM ARM pseudocode:
//   REV16: reverses bytes within each 16-bit container. Arrangements: {.8B, .16B}.
//   REV32: reverses elements within each 32-bit container. Element can be byte or halfword.
//          Arrangements: {.8B, .16B, .4H, .8H}.
//   REV64: reverses elements within each 64-bit container. Element can be byte, halfword, or word.
//          Arrangements: {.8B, .16B, .4H, .8H, .2S, .4S}. (.1D/.2D are INVALID — "reverse within
//          a 64-bit container" requires the element to be smaller than 64 bits.)
REV16 V0.8B, V1.8B  // Reverse byte order within each 16-bit halfword (64-bit vector)
REV16 V0.16B, V1.16B  // Same, 128-bit vector
REV32 V0.8B, V1.8B  // Reverse byte order within each 32-bit word
REV32 V0.16B, V1.16B
REV32 V0.4H, V1.4H  // Reverse halfword order within each 32-bit word
REV32 V0.8H, V1.8H
REV64 V0.8B, V1.8B  // Reverse byte order within each 64-bit doubleword
REV64 V0.16B, V1.16B
REV64 V0.4H, V1.4H  // Reverse halfword order within each 64-bit doubleword
REV64 V0.8H, V1.8H
REV64 V0.2S, V1.2S  // Reverse word order within each 64-bit doubleword
REV64 V0.4S, V1.4S
RBIT V0.8B, V1.8B  // Reverse bit order within each byte (64-bit vector)
RBIT V0.16B, V1.16B  // Same, 128-bit vector
```

```asm
// Bitwise logical vector ops — arrangement is always .8B (64-bit vector) or .16B (128-bit);
// the element-size doesn't matter because the operation is bit-by-bit.
// `AND V0.8B, ...` and `AND V0.16B, ...` are the same instruction with the Q bit = 0 or 1.
AND V0.8B, V1.8B, V2.8B  // Bitwise AND (64-bit vector; upper 64 bits of Vd zeroed)
AND V0.16B, V1.16B, V2.16B  // Bitwise AND (128-bit vector)
ORR V0.8B, V1.8B, V2.8B  // Bitwise OR (all logicals accept .8B and .16B only)
ORR V0.16B, V1.16B, V2.16B
EOR V0.8B, V1.8B, V2.8B  // Bitwise XOR
EOR V0.16B, V1.16B, V2.16B
NOT V0.8B, V1.8B  // Bitwise NOT (invert all bits). Mnemonic alias: MVN.
NOT V0.16B, V1.16B  // The ARM ARM names the instruction NOT; MVN is its alias and the spelling disassemblers normally print.
BIC V0.8B, V1.8B, V2.8B  // Bitwise AND NOT: V0 = V1 & ~V2 (clear bits)
BIC V0.16B, V1.16B, V2.16B
ORN V0.8B, V1.8B, V2.8B  // Bitwise OR NOT: V0 = V1 | ~V2
ORN V0.16B, V1.16B, V2.16B

// Element-wise arithmetic negation/absolute value (signed integer lanes):
NEG V0.4S, V1.4S  // Per-lane two's-complement negate (signed)
ABS V0.4S, V1.4S  // Per-lane absolute value (signed; note: ABS(INT_MIN) = INT_MIN)

// Bitwise select family — the SIMD equivalent of CSEL:
// These use a mask to select bits from two sources. Combined with CMEQ/CMGT
// (which produce all-ones/all-zeros masks), they give branchless per-lane selection.
BSL V0.8B, V1.8B, V2.8B  // Bitwise select: where V0 has 1, take from V1; where 0, take from V2 (64-bit)
BSL V0.16B, V1.16B, V2.16B  // Same, 128-bit
BIT V0.8B, V1.8B, V2.8B  // Bitwise insert if true: where V2 has 1, take from V1 into V0
BIT V0.16B, V1.16B, V2.16B  // V0 = (V1 & V2) | (V0_original & ~V2)
BIF V0.8B, V1.8B, V2.8B  // Bitwise insert if false: where V2 has 0, take from V1 into V0
BIF V0.16B, V1.16B, V2.16B  // V0 = (V0_original & V2) | (V1 & ~V2)
```

**Why compare results are all-ones / all-zeros** (not 1/0): The result is a bitmask meant to be used directly with bitwise select. `CMEQ` + `BSL` gives you a branchless per-lane conditional select — the all-ones mask selects from V1, all-zeros selects from V2. This is the SIMD equivalent of CSEL.

### 23.13 Saturating Arithmetic

NEON has saturating versions of most arithmetic — when a result overflows, it clamps to the maximum (or minimum) representable value instead of wrapping. The general-purpose (W/X register) instructions do NOT have this (you must build it from CMP+CSEL — see **§20.4**), which is why NEON saturating ops are so valuable.

```asm
// Signed and unsigned saturating add/sub. **ISA-level truth** — arrangement set is all 7 integer
// arrangements (including .2D), plus scalar forms for all 4 element sizes. Unlike the halving
// adds and integer MUL, these DO support .2D.
SQADD V0.8B, V1.8B, V2.8B  // Signed saturating add
SQADD V0.16B, V1.16B, V2.16B
SQADD V0.4H, V1.4H, V2.4H
SQADD V0.8H, V1.8H, V2.8H
SQADD V0.2S, V1.2S, V2.2S
SQADD V0.4S, V1.4S, V2.4S
SQADD V0.2D, V1.2D, V2.2D
SQADD Bd, Bn, Bm  // Scalar byte
SQADD Hd, Hn, Hm  // Scalar halfword
SQADD Sd, Sn, Sm  // Scalar word
SQADD Dd, Dn, Dm  // Scalar doubleword
UQADD V0.8B, V1.8B, V2.8B  // Unsigned saturating add — same arrangements as SQADD
UQADD V0.16B, V1.16B, V2.16B
UQADD V0.4H, V1.4H, V2.4H
UQADD V0.8H, V1.8H, V2.8H
UQADD V0.2S, V1.2S, V2.2S
UQADD V0.4S, V1.4S, V2.4S
UQADD V0.2D, V1.2D, V2.2D
UQADD Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm  // Scalar forms (4 element sizes)
SQSUB V0.8B, V1.8B, V2.8B  // Signed saturating subtract — same arrangement set
SQSUB V0.16B, V1.16B, V2.16B
SQSUB V0.4H, V1.4H, V2.4H
SQSUB V0.8H, V1.8H, V2.8H
SQSUB V0.2S, V1.2S, V2.2S
SQSUB V0.4S, V1.4S, V2.4S
SQSUB V0.2D, V1.2D, V2.2D
SQSUB Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm  // Scalar forms
UQSUB V0.8B, V1.8B, V2.8B  // Unsigned saturating subtract — same arrangement set
UQSUB V0.16B, V1.16B, V2.16B
UQSUB V0.4H, V1.4H, V2.4H
UQSUB V0.8H, V1.8H, V2.8H
UQSUB V0.2S, V1.2S, V2.2S
UQSUB V0.4S, V1.4S, V2.4S
UQSUB V0.2D, V1.2D, V2.2D
UQSUB Bd, Bn, Bm | Hd, Hn, Hm | Sd, Sn, Sm | Dd, Dn, Dm  // Scalar forms
```

`SQ` prefix = signed saturating, `UQ` prefix = unsigned saturating. These work with all element sizes (.8B, .4H, .2S, etc.). There are also saturating versions of shifts (`SQSHL`, `UQSHL`), narrowing operations (`SQXTN` — saturating narrow: each element is clamped to the target range before truncating), and multiplies (`SQRDMULH` — saturating rounding doubling multiply returning high half, used heavily in fixed-point DSP).

**Q15/Q31 fixed-point — SQRDMULH, SQRDMLAH, SQRDMLSH**: these are the workhorse instructions of Q-format DSP (Q15 = 16-bit samples, Q31 = 32-bit samples; format is "sign bit + N fractional bits, interpreted as value ∈ [-1, 1)").
`SQRDMULH` implements `(2 × a × b) >> W` with round-to-nearest and saturation, where W is the element width (16 for Q15, 32 for Q31); because W = N + 1, this equals `(a × b) >> N`, which is the Q-format multiply. `SQRDMLAH`/`SQRDMLSH` (FEAT_RDM, **mandatory from ARMv8.1-A**) extend this to multiply-accumulate and multiply-subtract.

```asm
// SQDMULH — Signed saturating Doubling Multiply returning High half (baseline NEON,
// WITHOUT rounding). Truncates instead of rounding.
// Use SQRDMULH instead of SQDMULH whenever rounding is acceptable (usually everywhere) —
// it typically costs the same on modern cores and has better numerical properties for DSP.
// **ISA-level truth** — arrangements are halfword and word only (.4H/.8H/.2S/.4S).
// No byte or doubleword forms exist — "doubling" at byte width would saturate on almost every
// input, and the encoding space isn't allocated for 64-bit-lane SQDMULH.
SQDMULH V0.4H, V1.4H, V2.4H  // Truncating doubling multiply, halfword lanes
SQDMULH V0.8H, V1.8H, V2.8H
SQDMULH V0.2S, V1.2S, V2.2S  // Word lanes
SQDMULH V0.4S, V1.4S, V2.4S
SQDMULH V0.4H, V1.4H, V2.H[i]  // By-element vector; Vm ∈ V0..V15 for .H (i ∈ 0..7)
SQDMULH V0.8H, V1.8H, V2.H[i]  // Vm ∈ V0..V15 for .H
SQDMULH V0.2S, V1.2S, V2.S[i]  // Vm ∈ V0..V31 for .S (i ∈ 0..3)
SQDMULH V0.4S, V1.4S, V2.S[i]
SQDMULH Hd, Hn, Hm  // Scalar halfword
SQDMULH Sd, Sn, Sm  // Scalar word
SQDMULH Hd, Hn, Vm.H[i]  // Scalar by-element; Vm ∈ V0..V15 for .H
SQDMULH Sd, Sn, Vm.S[i]  // Scalar by-element; Vm ∈ V0..V31 for .S

// Saturating Rounding Doubling Multiply Returning High Half — baseline NEON (ARMv8.0-A).
// Result is ((a × b) << 1) >> W (W = element width), rounded and saturated to the signed
// range of the element. Used for Q15 × Q15 → Q15 and Q31 × Q31 → Q31 multiplies where the
// doubling compensates for the implicit scaling from the fixed-point format.
// Same arrangement set as SQDMULH.
SQRDMULH V0.4H, V1.4H, V2.4H  // Q15 × Q15 → Q15 (per-lane)
SQRDMULH V0.8H, V1.8H, V2.8H
SQRDMULH V0.2S, V1.2S, V2.2S  // Q31 × Q31 → Q31
SQRDMULH V0.4S, V1.4S, V2.4S
SQRDMULH V0.4H, V1.4H, V2.H[i]  // By-element vector: Vm ∈ V0..V15 for .H
SQRDMULH V0.8H, V1.8H, V2.H[i]
SQRDMULH V0.2S, V1.2S, V2.S[i]  // Vm ∈ V0..V31 for .S
SQRDMULH V0.4S, V1.4S, V2.S[i]
SQRDMULH Hd, Hn, Hm  // Scalar halfword
SQRDMULH Sd, Sn, Sm  // Scalar word
SQRDMULH Hd, Hn, Vm.H[i]  // Scalar by-element; Vm ∈ V0..V15 for .H
SQRDMULH Sd, Sn, Vm.S[i]  // Scalar by-element; Vm ∈ V0..V31 for .S

// Saturating Rounding Doubling Multiply-Accumulate returning High half (FEAT_RDM).
// Computes: Vd = Vd + ((Vn × Vm) << 1) >> W, with saturation.
// The full Q-format MAC — one instruction instead of SQRDMULH + SQADD.
SQRDMLAH V0.4H, V1.4H, V2.4H  // Vd += Q-multiply(Vn, Vm), saturated
SQRDMLAH V0.8H, V1.8H, V2.8H
SQRDMLAH V0.2S, V1.2S, V2.2S
SQRDMLAH V0.4S, V1.4S, V2.4S
SQRDMLAH V0.4H, V1.4H, V2.H[i]  // By-element MAC; Vm ∈ V0..V15 for .H
SQRDMLAH V0.8H, V1.8H, V2.H[i]
SQRDMLAH V0.2S, V1.2S, V2.S[i]  // Vm ∈ V0..V31 for .S
SQRDMLAH V0.4S, V1.4S, V2.S[i]
SQRDMLAH Hd, Hn, Hm  // Scalar halfword
SQRDMLAH Sd, Sn, Sm  // Scalar word
SQRDMLAH Hd, Hn, Vm.H[i]  // Scalar by-element
SQRDMLAH Sd, Sn, Vm.S[i]

// Saturating Rounding Doubling Multiply-Subtract returning High half (FEAT_RDM).
SQRDMLSH V0.4H, V1.4H, V2.4H  // Vd -= Q-multiply(Vn, Vm), saturated
SQRDMLSH V0.8H, V1.8H, V2.8H
SQRDMLSH V0.2S, V1.2S, V2.2S
SQRDMLSH V0.4S, V1.4S, V2.4S
SQRDMLSH V0.4H, V1.4H, V2.H[i]
SQRDMLSH V0.8H, V1.8H, V2.H[i]
SQRDMLSH V0.2S, V1.2S, V2.S[i]
SQRDMLSH V0.4S, V1.4S, V2.S[i]
SQRDMLSH Hd, Hn, Hm  // Scalar
SQRDMLSH Sd, Sn, Sm
SQRDMLSH Hd, Hn, Vm.H[i]  // Scalar by-element
SQRDMLSH Sd, Sn, Vm.S[i]
```

**Why FEAT_RDM matters**: before RDM, a Q-format FIR filter tap cost two instructions (`SQRDMULH` + `SQADD`). `SQRDMLAH` collapses it to one, which can come close to doubling throughput in multiply-accumulate-bound audio and radio DSP inner loops.
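
To make the Q-format arithmetic concrete, here is a minimal scalar sketch in portable C of what one halfword lane of `SQRDMULH` and `SQRDMLAH` computes. It follows the formulas given in the comments above; the function names (`sqrdmulh_q15`, `sqrdmlah_q15`, `sat16`, `asr16`) are hypothetical helpers invented for this illustration, not intrinsics or library calls, and the accumulate form assumes the accumulator is widened before the single rounding step.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 2^16: an arithmetic shift right by 16, written portably. */
static int64_t asr16(int64_t v) {
    return v >= 0 ? v / 65536 : -((-v + 65535) / 65536);
}

/* Clamp to the int16_t range: the "SQ" (signed saturating) part. */
static int16_t sat16(int64_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of one .H lane of SQRDMULH: (2*a*b + 2^15) >> 16, saturated. */
int16_t sqrdmulh_q15(int16_t a, int16_t b) {
    int64_t p = 2 * (int64_t)a * b + 32768;
    return sat16(asr16(p));
}

/* Model of one .H lane of SQRDMLAH:
 * ((acc << 16) + 2*a*b + 2^15) >> 16, saturated once at the end. */
int16_t sqrdmlah_q15(int16_t acc, int16_t a, int16_t b) {
    int64_t p = (int64_t)acc * 65536 + 2 * (int64_t)a * b + 32768;
    return sat16(asr16(p));
}

/* Usage: 0.5 * 0.5 = 0.25 in Q15, and (-1) * (-1) saturates to just under +1. */
void q15_self_check(void) {
    assert(sqrdmulh_q15(16384, 16384) == 8192);
    assert(sqrdmulh_q15(INT16_MIN, INT16_MIN) == INT16_MAX);
}
```

In this model the fused form is not always bit-identical to a separate multiply followed by a saturating add: with `acc = -1` and both inputs at `INT16_MIN`, `sqrdmlah_q15` yields 32767, whereas saturating the product first and then adding would give 32766.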
GCC/LLVM can auto-vectorize recognizable Q15/Q31 MAC patterns to `SQRDMLAH` when the target includes `+rdm` (implied by `armv8.1-a` or later); whether a given loop is matched depends on the compiler version and how the source is written.

**Halving add/sub — SHADD/UHADD/SRHADD/URHADD/SHSUB/UHSUB** (baseline NEON): compute `(a ± b) / 2` *without intermediate overflow*, per lane. Semantically the same as widening to (N+1) bits, adding/subtracting, shifting right by 1, narrowing back. Used as the building block for fast image/audio averaging and for lossless accumulation of unsigned samples in a signed accumulator. The rounded variants (`SRHADD`/`URHADD`) add 1 before the right-shift so the result rounds to nearest rather than truncating toward −∞.

```asm
SHADD V0.16B, V1.16B, V2.16B  // Signed halving add: each lane = (V1 + V2) >> 1, no overflow
UHADD V0.16B, V1.16B, V2.16B  // Unsigned halving add: each lane = (V1 + V2) >> 1, no overflow
SRHADD V0.16B, V1.16B, V2.16B  // Signed rounding halving add: (V1 + V2 + 1) >> 1
URHADD V0.16B, V1.16B, V2.16B  // Unsigned rounding halving add: (V1 + V2 + 1) >> 1
SHSUB V0.16B, V1.16B, V2.16B  // Signed halving subtract: (V1 - V2) >> 1
UHSUB V0.16B, V1.16B, V2.16B  // Unsigned halving subtract: (V1 - V2) >> 1
// All six accept arrangements .8B/.16B, .4H/.8H, .2S/.4S — NO .1D/.2D (unlike SQADD/UQADD,
// which DO support the doubleword arrangements; halving ops are limited to byte/halfword/word).
// No scalar (Bd/Hd/Sd/Dd) forms — halving is vector-only.
// No rounding halving subtract exists: SRHSUB and URHSUB are NOT in the ISA.
```

**Cross-signed/unsigned saturating accumulate — SUQADD / USQADD** (baseline NEON): accumulate one type into the other with the *destination's* saturation rules. Useful when combining a signed delta with an unsigned base (or vice versa) without first converting.

```asm
// SUQADD: Vd (signed) += Vn (unsigned), saturates to SIGNED range.
// USQADD: Vd (unsigned) += Vn (signed), saturates to UNSIGNED range.
// Two-operand destructive — Vd is both source accumulator and destination.
// **ISA-level truth** — all 7 integer arrangements + scalar for all 4 element sizes.
SUQADD V0.8B, V1.8B  // Signed-destination cross-sign saturating add
SUQADD V0.16B, V1.16B
SUQADD V0.4H, V1.4H
SUQADD V0.8H, V1.8H
SUQADD V0.2S, V1.2S
SUQADD V0.4S, V1.4S
SUQADD V0.2D, V1.2D
SUQADD Bd, Bn | Hd, Hn | Sd, Sn | Dd, Dn  // Scalar (4 element sizes)
USQADD V0.8B, V1.8B  // Unsigned-destination cross-sign saturating add
USQADD V0.16B, V1.16B
USQADD V0.4H, V1.4H
USQADD V0.8H, V1.8H
USQADD V0.2S, V1.2S
USQADD V0.4S, V1.4S
USQADD V0.2D, V1.2D
USQADD Bd, Bn | Hd, Hn | Sd, Sn | Dd, Dn  // Scalar (4 element sizes)
```

**Why saturating arithmetic**: Audio/image processing needs it constantly. If you add two pixel values (0-255) and the result is 300, you want 255 (clamp), not 44 (wrap). Without saturation, every pixel operation would need a clamp sequence. `UQADD` does it in one instruction for 16 bytes at once.

### 23.14 NEON Load/Store

```asm
LD1 {V0.4S}, [Xn|SP]  // Load 1 register (16 bytes)
LD1 {V0.4S, V1.4S}, [Xn|SP]  // Load 2 registers (32 bytes, consecutive in memory)
LD1 {V0.4S, V1.4S, V2.4S}, [Xn|SP]  // Load 3
LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [Xn|SP]  // Load 4

// Structure loads (automatic deinterleaving):
LD2 {V0.4S, V1.4S}, [Xn|SP]  // Load 8 words, deinterleave: V0={w0,w2,w4,w6}, V1={w1,w3,w5,w7}
LD3 {V0.4S, V1.4S, V2.4S}, [Xn|SP]  // Deinterleave 3 streams (e.g., RGB pixels)
LD4 {V0.4S, V1.4S, V2.4S, V3.4S}, [Xn|SP]  // Deinterleave 4 streams (e.g., RGBA pixels)

// Structure stores (mirror of LD2/LD3/LD4 — interleave on store):
ST1 {V0.4S}, [Xn|SP]  // Store 1 register
ST1 {V0.4S, V1.4S}, [Xn|SP]  // Store 2 registers (consecutive)
ST2 {V0.4S, V1.4S}, [Xn|SP]  // Store 8 words, interleaved
ST3 {V0.4S, V1.4S, V2.4S}, [Xn|SP]
ST4 {V0.4S, V1.4S, V2.4S, V3.4S}, [Xn|SP]

// REPLICATE loads — load ONE element of size <T>.element_size and broadcast it to every lane of each listed register.
// **ISA-level truth** — the full arrangement set is {.8B, .16B, .4H, .8H, .2S, .4S, .1D, .2D} for ALL four (LD1R..LD4R).
// The .1D form is unusual — "1 doubleword lane" means the register holds a single 64-bit value (only D-half of V used);
// it assembles successfully but degenerates to a plain 64-bit load, so compilers rarely emit it.
LD1R {V0.8B}, [Xn|SP]  // Load one byte, broadcast to 8 byte lanes (64-bit V-register view)
LD1R {V0.16B}, [Xn|SP]  // Load one byte, broadcast to 16 byte lanes (full 128-bit)
LD1R {V0.4H}, [Xn|SP]  // Load one halfword, broadcast to 4 halfword lanes
LD1R {V0.8H}, [Xn|SP]  // Load one halfword, broadcast to 8 halfword lanes
LD1R {V0.2S}, [Xn|SP]  // Load one 32-bit word, broadcast to 2 word lanes
LD1R {V0.4S}, [Xn|SP]  // Load one 32-bit word, broadcast to 4 word lanes
LD1R {V0.1D}, [Xn|SP]  // Load one doubleword into 1-lane view (degenerate — same bits as LDR Dt)
LD1R {V0.2D}, [Xn|SP]  // Load one doubleword, broadcast to 2 doubleword lanes
// LD2R/LD3R/LD4R accept the same eight arrangement specifiers — examples with .4S shown:
LD2R {V0.4S, V1.4S}, [Xn|SP]  // Load a pair of 32-bit words (consecutive in memory) and broadcast each
LD3R {V0.4S, V1.4S, V2.4S}, [Xn|SP]  // Three-way replicate
LD4R {V0.4S, V1.4S, V2.4S, V3.4S}, [Xn|SP]  // Four-way replicate
// Post-index forms (update Xn after the load by access-size or register-amount):
LD1R {V0.4S}, [Xn|SP], #4  // Xn += 4 (= element_size); immediate post-index is fixed to element size
LD1R {V0.4S}, [Xn|SP], Xm  // Xn += Xm (register post-index; any X register)
LD2R {V0.4S, V1.4S}, [Xn|SP], #8  // Xn += 2 × element_size = 8
LD3R {V0.4S, V1.4S, V2.4S}, [Xn|SP], #12  // Xn += 3 × element_size = 12
LD4R {V0.4S, V1.4S, V2.4S, V3.4S}, [Xn|SP], #16  // Xn += 4 × element_size = 16
// LDxR is the cheap way to turn a scalar-in-memory into an all-lanes-broadcast operand
// for a vector op — compilers emit it when multiplying a vector by a loaded scalar.
// Single-lane LD/ST — load or store ONE lane of a vector register; other lanes unchanged.
// Base register accepts SP (this is a load/store, base operand works like other load/stores).
// Element size in the operand (.B/.H/.S/.D) picks the lane width; [i] picks which lane.
// Lane index range depends on element size: .B → 0..15, .H → 0..7, .S → 0..3, .D → 0..1.
LD1 {V0.B}[i], [Xn|SP]  // Load one byte into lane i of V0 (i ∈ 0..15)
LD1 {V0.H}[i], [Xn|SP]  // Load one halfword into lane i (i ∈ 0..7)
LD1 {V0.S}[i], [Xn|SP]  // Load one 32-bit word into lane i (i ∈ 0..3)
LD1 {V0.D}[i], [Xn|SP]  // Load one 64-bit doubleword into lane i (i ∈ 0..1)
ST1 {V0.B}[i], [Xn|SP]  // Store one byte from lane i of V0
ST1 {V0.H}[i], [Xn|SP]  // Store one halfword
ST1 {V0.S}[i], [Xn|SP]  // Store one 32-bit word
ST1 {V0.D}[i], [Xn|SP]  // Store one 64-bit doubleword

// LD2/LD3/LD4 single-lane: load N consecutive memory elements into lane i of N consecutive
// V registers. Deinterleaves at lane granularity. Same .B/.H/.S/.D variants exist for all.
// Example shown for .S; same pattern applies to .B/.H/.D with corresponding lane-index ranges.
LD2 {V0.S, V1.S}[i], [Xn|SP]  // 8 bytes read (2 × 4 bytes), split into lane i of V0,V1
LD3 {V0.S, V1.S, V2.S}[i], [Xn|SP]  // 12 bytes, 3-way split
LD4 {V0.S, V1.S, V2.S, V3.S}[i], [Xn|SP]  // 16 bytes, 4-way split
ST2 {V0.S, V1.S}[i], [Xn|SP]  // Mirror of LD2 single-lane
ST3 {V0.S, V1.S, V2.S}[i], [Xn|SP]
ST4 {V0.S, V1.S, V2.S, V3.S}[i], [Xn|SP]
// Register list must be CONSECUTIVE (V31→V0 wrap allowed). Register names must match .T.

// Post-index (advance pointer after load):
LD1 {V0.4S}, [Xn|SP], #16  // Load, then Xn += 16 (16 = sizeof(V0.4S))
LD1 {V0.4S}, [Xn|SP], Xm  // Load, then Xn += Xm
                          // Xm must be one of X0..X30 — NOT XZR and NOT SP.
                          // Register number 31 is reserved as the encoding sentinel
                          // that selects the #imm post-index form; if you write
                          // `LD1 ..., [Xn], XZR` the assembler will either reject it
                          // or silently encode it as the #imm form. Either way it's
                          // not a register post-increment by zero.
// All of LD1/LD2/LD3/LD4/LDxR/STx have post-index variants with #imm or Xm.
// The #imm is fixed to the transfer size (can't be arbitrary); Xm is a free register increment.
```

**Why LD2/LD3/LD4 exist**: Real-world data is often interleaved — RGB pixels are stored as R,G,B,R,G,B,... in memory. Without LD3, you'd load all the data, then spend many instructions shuffling R values into one register, G into another, B into another. LD3 does this deinterleaving in hardware during the load, which is dramatically faster. ST2/ST3/ST4 do the reverse (interleave on store).

### 23.15 Practical NEON Examples

**Sum an array of 32-bit integers:**

```asm
// X0 = array pointer, X1 = count (multiple of 4 for simplicity)
    MOVI V0.4S, #0  // Accumulator = {0, 0, 0, 0}
loop:
    LD1 {V1.4S}, [X0], #16  // Load 4 ints, advance pointer
    ADD V0.4S, V0.4S, V1.4S  // Add to accumulator (4 adds in parallel)
    SUBS X1, X1, #4  // count -= 4
    B.GT loop
    // Horizontal reduction: sum the 4 lanes
    ADDV S0, V0.4S  // S0 = V0[0] + V0[1] + V0[2] + V0[3]
    UMOV W0, V0.S[0]  // Move scalar result to GPR
```

**Byte-level: count occurrences of a byte in a buffer:**

```asm
// X0 = buffer, X1 = length (multiple of 16), W2 = byte to find
// Result in W0
    DUP V1.16B, W2  // Broadcast search byte to all 16 lanes
    MOVI V2.16B, #0  // Accumulator (byte lanes, max 255 iterations before overflow)
loop:
    LD1 {V0.16B}, [X0], #16  // Load 16 bytes, advance pointer
    CMEQ V3.16B, V0.16B, V1.16B  // Compare: 0xFF where match, 0x00 where not
    // 0xFF = -1 signed. Subtracting -1 from accumulator = adding 1:
    SUB V2.16B, V2.16B, V3.16B  // Accumulator += 1 for each matching byte
    SUBS X1, X1, #16
    B.GT loop
    // Horizontal sum: add all 16 byte lanes into one scalar
    UADDLV H0, V2.16B  // Widening sum: 16 bytes → one 16-bit result
    UMOV W0, V0.H[0]  // Move to GPR
```

Note: the byte accumulator overflows after 255 matching bytes per lane. For large buffers, periodically drain with UADDLV into a wider accumulator, or use 16-bit lanes from the start.

**NEON memcpy (64 bytes per iteration):**

```asm
// X0 = dst, X1 = src, X2 = byte count (multiple of 64)
loop:
    LDP Q0, Q1, [X1]  // Load 32 bytes
    LDP Q2, Q3, [X1, #32]  // Load next 32 bytes
    STP Q0, Q1, [X0]  // Store 32 bytes
    STP Q2, Q3, [X0, #32]  // Store next 32 bytes
    ADD X0, X0, #64
    ADD X1, X1, #64
    SUBS X2, X2, #64
    B.GT loop
```

This copies 64 bytes per iteration using LDP/STP with Q (128-bit) registers. Optimized `memcpy` implementations on AArch64 build on the same idea, adding alignment handling, tail copies, and size-specific paths.

---

## 24. Atomic & Synchronization Instructions

In a multi-core system, two CPU cores might try to modify the same memory location at the same time. Atomic instructions guarantee that a read-modify-write sequence happens as one indivisible operation — no other core can see a half-finished update. These are the building blocks for locks, lock-free data structures, and reference counting.

### 24.1 ARMv8.1 Atomics (LSE — Large System Extensions)

LSE adds single-instruction atomics that are faster than the older LDXR/STXR loop approach. Each instruction reads the old value, performs an operation (add, OR, swap, etc.), and writes the new value — all atomically. The suffix `A` means acquire ordering, `L` means release ordering, `AL` means both.
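
For readers who think in C11 terms, the four ordering flavours line up with the `memory_order` argument of the `<stdatomic.h>` read-modify-write functions. The sketch below is only an analogy: the helper names are hypothetical, and which instruction a compiler actually emits (for example `LDADD`, `LDADDA`, `LDADDL`, `LDADDAL`, or an LDXR/STXR loop when LSE is not enabled) depends on the target flags and compiler, which is an assumption here rather than a guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Each helper returns the OLD value, like the Xt result of LDADD. */
uint64_t add_relaxed(_Atomic uint64_t *p, uint64_t v) {
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed);  /* no suffix */
}

uint64_t add_acquire(_Atomic uint64_t *p, uint64_t v) {
    return atomic_fetch_add_explicit(p, v, memory_order_acquire);  /* "A" */
}

uint64_t add_release(_Atomic uint64_t *p, uint64_t v) {
    return atomic_fetch_add_explicit(p, v, memory_order_release);  /* "L" */
}

uint64_t add_acq_rel(_Atomic uint64_t *p, uint64_t v) {
    return atomic_fetch_add_explicit(p, v, memory_order_acq_rel);  /* "AL" */
}

/* Unconditional exchange: the analogue of the swap family. */
uint64_t swap_acq_rel(_Atomic uint64_t *p, uint64_t v) {
    return atomic_exchange_explicit(p, v, memory_order_acq_rel);
}

/* Usage: the return value is what was in memory BEFORE the update. */
void fetch_add_self_check(void) {
    _Atomic uint64_t x = 5;
    assert(add_relaxed(&x, 3) == 5);
    assert(atomic_load(&x) == 8);
}
```

The ordering argument changes what other threads may observe around the operation, not the value returned, so a single-threaded check like the one above behaves identically for all four helpers.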
**Important — XZR/WZR destination on acquire-returning LSE atomics (stronger rule than LDAR)**: For acquire-flavored LSE ops (`LDADDA`, `LDCLRA`, `LDSETA`, `LDEORA`, `LDSMAXA`, `LDSMINA`, `LDUMAXA`, `LDUMINA`, `SWPA`, and their `AL` variants), the architectural rule per ARM ARM is: "If the destination register is not one of WZR or XZR, LDADDA/LDADDAL load from memory with acquire semantics." When the destination **is** WZR/XZR, the acquire semantic is **actually lost** (not just unobservable) — the instruction is equivalent to the corresponding `ST<op>` alias (`STADD`, `STCLR`, `STSET`, etc.) which has **no** ordering. For `LDADDAL`-style (acquire+release), using XZR downgrades it to release-only (equivalent to `STADDL`). This is different from the LDAR case above: for LSE atomics, the hardware genuinely weakens the ordering when the result is discarded. The ARM ARM states this condition for the `LD<op>` and `SWP` families; its description of `CASA`/`CASAL` carries no WZR/XZR condition. If you need "atomic RMW with release ordering and don't care about the result," `STADDL` etc. are what you want (and the assembler may pick these aliases). If you need acquire or acquire-release ordering, **always use a real destination register** — even a throwaway scratch register — and never use XZR/WZR for those flavors.

**Why LSE is faster than LDXR/STXR**: The exclusive loop must retry if another core touches the cache line. Under high contention (many cores competing for the same lock), retries waste cycles. LSE atomics are handled by the cache coherency hardware itself — the cache controller performs the read-modify-write without the retry loop, reducing bus traffic and latency.
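
The retry structure that LSE removes can be sketched in portable C11. C11 has no atomic fetch-max, so an atomic unsigned maximum has to be written as a compare-exchange loop, which has the same shape as an LDXR/STXR sequence: read, compute, attempt the store, start over if someone else got there first. The function name `fetch_umax_loop` is hypothetical, and the comparison with a single `LDUMAXAL` is an analogy about structure, not a statement about what any particular compiler generates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomic unsigned max built from a compare-exchange retry loop.
 * Returns the OLD value, as the LSE LDUMAX family does. */
uint32_t fetch_umax_loop(_Atomic uint32_t *p, uint32_t v) {
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    for (;;) {
        uint32_t desired = old > v ? old : v;
        /* The weak form may fail spuriously, much like a failed STXR. */
        if (atomic_compare_exchange_weak_explicit(p, &old, desired,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
            return old;
        }
        /* On failure `old` now holds the current memory value; retry. */
    }
}

/* Usage: the stored value only ever grows; the return is the prior value. */
void fetch_umax_self_check(void) {
    _Atomic uint32_t hi_water = 5;
    assert(fetch_umax_loop(&hi_water, 9) == 5);
    assert(atomic_load(&hi_water) == 9);
}
```

Every failed iteration of this loop is work that contending cores repeat; a single-instruction atomic has no loop to repeat, which is the point the paragraph above makes.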
The LSE instructions below require FEAT_LSE (check `ID_AA64ISAR0_EL1.Atomic`):

```asm
// Atomic add: mem[Xn] += Xs, old value returned in Xt
LDADD Xs|XZR, Xt|XZR, [Xn|SP]  // Load old, add, store new (relaxed — no ordering)
LDADDA Xs|XZR, Xt, [Xn|SP]  // Acquire semantics (ordered after this load)
LDADDL Xs|XZR, Xt|XZR, [Xn|SP]  // Release semantics (ordered before this store)
LDADDAL Xs|XZR, Xt, [Xn|SP]  // Acquire + Release (full barrier for this operation)

// LDCLR — Atomic AND-NOT: mem[Xn] &= ~Xs (clear bits), old value in Xt
LDCLR Xs|XZR, Xt|XZR, [Xn|SP]  // Relaxed
LDCLRA Xs|XZR, Xt, [Xn|SP]  // + acquire
LDCLRL Xs|XZR, Xt|XZR, [Xn|SP]  // + release
LDCLRAL Xs|XZR, Xt, [Xn|SP]  // + acquire+release

// LDSET — Atomic OR: mem[Xn] |= Xs (set bits), old value in Xt
LDSET Xs|XZR, Xt|XZR, [Xn|SP]  // Relaxed
LDSETA Xs|XZR, Xt, [Xn|SP]  // + acquire
LDSETL Xs|XZR, Xt|XZR, [Xn|SP]  // + release
LDSETAL Xs|XZR, Xt, [Xn|SP]  // + acquire+release

// LDEOR — Atomic XOR: mem[Xn] ^= Xs (toggle bits), old value in Xt
LDEOR Xs|XZR, Xt|XZR, [Xn|SP]  // Relaxed
LDEORA Xs|XZR, Xt, [Xn|SP]  // + acquire
LDEORL Xs|XZR, Xt|XZR, [Xn|SP]  // + release
LDEORAL Xs|XZR, Xt, [Xn|SP]  // + acquire+release

// All of the above also have 32-bit forms:
LDADD Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit atomic add (relaxed)
LDADDA Ws|WZR, Wt, [Xn|SP]  // + acquire
LDADDL Ws|WZR, Wt|WZR, [Xn|SP]  // + release
LDADDAL Ws|WZR, Wt, [Xn|SP]  // + acquire+release
LDCLR Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit atomic AND-NOT (relaxed)
LDCLRA Ws|WZR, Wt, [Xn|SP]  // + acquire
LDCLRL Ws|WZR, Wt|WZR, [Xn|SP]  // + release
LDCLRAL Ws|WZR, Wt, [Xn|SP]  // + acquire+release
LDSET Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit atomic OR (relaxed)
LDSETA Ws|WZR, Wt, [Xn|SP]  // + acquire
LDSETL Ws|WZR, Wt|WZR, [Xn|SP]  // + release
LDSETAL Ws|WZR, Wt, [Xn|SP]  // + acquire+release
LDEOR Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit atomic XOR (relaxed)
LDEORA Ws|WZR, Wt, [Xn|SP]  // + acquire
LDEORL Ws|WZR, Wt|WZR, [Xn|SP]  // + release
LDEORAL Ws|WZR, Wt, [Xn|SP]  // + acquire+release
// All return the OLD value in Xt. Xs is the operand, [Xn|SP] is the memory address.

// LDSMAX / LDSMIN / LDUMAX / LDUMIN (signed/unsigned max/min atomics) follow the
// IDENTICAL pattern to LDADD/LDCLR/LDSET/LDEOR. Result register gets OLD value.
// 64-bit forms:
LDSMAX Xs|XZR, Xt|XZR, [Xn|SP]  // *[Xn] = signed_max(*[Xn], Xs); Xt ← old value (relaxed)
LDSMAXA Xs|XZR, Xt, [Xn|SP]  // + acquire
LDSMAXL Xs|XZR, Xt|XZR, [Xn|SP]  // + release
LDSMAXAL Xs|XZR, Xt, [Xn|SP]  // + acquire+release
LDSMIN Xs|XZR, Xt|XZR, [Xn|SP]  // signed_min
LDSMINA Xs|XZR, Xt, [Xn|SP]
LDSMINL Xs|XZR, Xt|XZR, [Xn|SP]
LDSMINAL Xs|XZR, Xt, [Xn|SP]
LDUMAX Xs|XZR, Xt|XZR, [Xn|SP]  // unsigned_max
LDUMAXA Xs|XZR, Xt, [Xn|SP]
LDUMAXL Xs|XZR, Xt|XZR, [Xn|SP]
LDUMAXAL Xs|XZR, Xt, [Xn|SP]
LDUMIN Xs|XZR, Xt|XZR, [Xn|SP]  // unsigned_min
LDUMINA Xs|XZR, Xt, [Xn|SP]
LDUMINL Xs|XZR, Xt|XZR, [Xn|SP]
LDUMINAL Xs|XZR, Xt, [Xn|SP]
// 32-bit (W), byte (B-suffix), halfword (H-suffix) forms follow the same pattern with the
// width suffix appended to each mnemonic: LDSMAXB/LDSMAXAB/LDSMAXLB/LDSMAXALB, etc.
// The XZR/WZR-destination caveat from the note at the top of this section (§24.1) applies to all A/AL variants.
```

**Store-only atomics** — these are ALIASES for the corresponding `LD<op>` with `Rt = XZR/WZR`. At the ISA encoding level there is no separate "store atomic" instruction: `STADD Xs, [Xn]` and `LDADD Xs, XZR, [Xn]` are the exact same machine word. The assembler just prefers the `ST<op>` spelling when the destination is discarded. Per ARM ARM: "when Rt is X31/W31, a load from memory **may not** be performed" — so the implementation is *permitted* to skip the load and save a bus transaction, but is not required to. Any speedup is implementation-defined, not architectural.
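
A typical place where the result is discarded on one path and needed on another is reference counting. The sketch below uses C11 atomics with a hypothetical `object_t` type and hypothetical `object_get`/`object_put` helpers. It is an illustration of the source-level pattern only: whether a compiler lowers the discarded-result increment to an `ST<op>`-style form is toolchain-dependent and assumed here, not guaranteed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reference-counted object, for illustration only. */
typedef struct {
    _Atomic uint32_t refs;
} object_t;

/* Taking a reference: the old count is not needed, and relaxed ordering is
 * enough, so this is the "fire-and-forget" case (result discarded). */
void object_get(object_t *o) {
    (void)atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Dropping a reference: the old count IS needed to detect the last owner,
 * so the value-returning form is required. Release ordering publishes this
 * thread's writes; the acquire fence runs only on the final drop. */
bool object_put(object_t *o) {
    uint32_t old = atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release);
    if (old == 1) {
        atomic_thread_fence(memory_order_acquire);
        return true;  /* caller may now free the object */
    }
    return false;
}

/* Usage: one extra reference taken, then both dropped. */
void refcount_self_check(void) {
    object_t o;
    atomic_init(&o.refs, 1);
    object_get(&o);
    assert(!object_put(&o));
    assert(object_put(&o));
}
```

The split mirrors the rule stated above: discard the result only where no ordering stronger than release is needed, and keep a real destination wherever the old value or acquire ordering matters.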
```asm
STADD Xs|XZR, [Xn|SP]  // Alias: LDADD Xs|XZR, XZR, [Xn|SP] — atomic add, no return (fire-and-forget)
STADDL Xs|XZR, [Xn|SP]  // Alias: LDADDL Xs|XZR, XZR, [Xn|SP] — + release ordering
STSET Xs|XZR, [Xn|SP]  // Alias: LDSET Xs|XZR, XZR, [Xn|SP] — atomic OR
STSETL Xs|XZR, [Xn|SP]  // Alias: LDSETL Xs|XZR, XZR, [Xn|SP]
STCLR Xs|XZR, [Xn|SP]  // Alias: LDCLR Xs|XZR, XZR, [Xn|SP] — atomic AND-NOT
STCLRL Xs|XZR, [Xn|SP]  // Alias: LDCLRL Xs|XZR, XZR, [Xn|SP]
STEOR Xs|XZR, [Xn|SP]  // Alias: LDEOR Xs|XZR, XZR, [Xn|SP] — atomic XOR
STEORL Xs|XZR, [Xn|SP]  // Alias: LDEORL Xs|XZR, XZR, [Xn|SP]
STSMAX Xs|XZR, [Xn|SP]  // Alias: LDSMAX Xs|XZR, XZR, [Xn|SP]
STSMAXL Xs|XZR, [Xn|SP]  // Alias: LDSMAXL Xs|XZR, XZR, [Xn|SP]
STSMIN Xs|XZR, [Xn|SP]  // Alias: LDSMIN Xs|XZR, XZR, [Xn|SP]
STSMINL Xs|XZR, [Xn|SP]  // Alias: LDSMINL Xs|XZR, XZR, [Xn|SP]
STUMAX Xs|XZR, [Xn|SP]  // Alias: LDUMAX Xs|XZR, XZR, [Xn|SP]
STUMAXL Xs|XZR, [Xn|SP]  // Alias: LDUMAXL Xs|XZR, XZR, [Xn|SP]
STUMIN Xs|XZR, [Xn|SP]  // Alias: LDUMIN Xs|XZR, XZR, [Xn|SP]
STUMINL Xs|XZR, [Xn|SP]  // Alias: LDUMINL Xs|XZR, XZR, [Xn|SP]
// ST variants also have 32-bit forms (aliasing 32-bit LDADD etc. with WZR destination):
STADD Ws|WZR, [Xn|SP]  // Alias: LDADD Ws|WZR, WZR, [Xn|SP]
STADDL Ws|WZR, [Xn|SP]  // Alias: LDADDL Ws|WZR, WZR, [Xn|SP]
STSET Ws|WZR, [Xn|SP]  // Alias: LDSET Ws|WZR, WZR, [Xn|SP]
STSETL Ws|WZR, [Xn|SP]  // Alias: LDSETL Ws|WZR, WZR, [Xn|SP]
STCLR Ws|WZR, [Xn|SP]  // Alias: LDCLR Ws|WZR, WZR, [Xn|SP]
STCLRL Ws|WZR, [Xn|SP]  // Alias: LDCLRL Ws|WZR, WZR, [Xn|SP]
STEOR Ws|WZR, [Xn|SP]  // Alias: LDEOR Ws|WZR, WZR, [Xn|SP]
STEORL Ws|WZR, [Xn|SP]  // Alias: LDEORL Ws|WZR, WZR, [Xn|SP]
// The full underlying LD<op> also has A (acquire) and AL (acquire+release) variants,
// but there is NO ST<op>A or ST<op>AL alias — because "discard the loaded value" breaks
// the acquire semantic (see the XZR-on-acquire-LSE-atomic caveat in §24.2). If you want
// release ordering with no return value, use ST<op>L. If you want acquire, you MUST use
// the LD<op>A/LD<op>AL form with a real (non-XZR) destination.
// Byte (B) and halfword (H) width variants also exist for all of the above, following
// the same ST<op>{L}{B,H} naming and aliasing the corresponding LD<op>{L}{B,H}.

// Compare-and-swap:
CAS Xs|XZR, Xt|XZR, [Xn|SP]  // If [Xn]==Xs, store Xt; Xs = old value either way
CASA Xs, Xt|XZR, [Xn|SP]  // + acquire
CASL Xs|XZR, Xt|XZR, [Xn|SP]  // + release
CASAL Xs, Xt|XZR, [Xn|SP]  // + acquire+release
CAS Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit CAS
CASA Ws, Wt|WZR, [Xn|SP]  // 32-bit + acquire
CASL Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit + release
CASAL Ws, Wt|WZR, [Xn|SP]  // 32-bit + acquire+release

// Swap:
SWP Xs|XZR, Xt|XZR, [Xn|SP]  // Xt = old [Xn], [Xn] = Xs (unconditional swap)
SWPA Xs|XZR, Xt, [Xn|SP]  // + acquire
SWPL Xs|XZR, Xt|XZR, [Xn|SP]  // + release
SWPAL Xs|XZR, Xt, [Xn|SP]  // + acquire+release
SWP Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit swap
SWPA Ws|WZR, Wt, [Xn|SP]  // 32-bit + acquire
SWPL Ws|WZR, Wt|WZR, [Xn|SP]  // 32-bit + release
SWPAL Ws|WZR, Wt, [Xn|SP]  // 32-bit + acquire+release

// Compare-and-swap pair (128-bit atomic):
// NOTE: The encoding for CASP requires both Rs and Rt to be EVEN-numbered — per ARM ARM pseudocode:
//   "if Rs<0> == '1' then UNDEFINED; if Rt<0> == '1' then UNDEFINED"
// The pair is (Rs, Rs+1) for expected value and (Rt, Rt+1) for desired value.
// Since register 31 is odd, it cannot be used as Rs or Rt. But Rs=30 IS valid (30 is even):
// in that case X(s+1) resolves to register 31, which in data-processing context is XZR
// (reads as zero, writes discarded) — so the "upper half" of the pair is effectively the zero constant.
CASP Xs, X(s+1), Xt, X(t+1), [Xn|SP]  // 128-bit CAS (Xs:X(s+1) = expected, Xt:X(t+1) = desired)
CASPA Xs, X(s+1), Xt, X(t+1), [Xn|SP]  // + acquire
CASPL Xs, X(s+1), Xt, X(t+1), [Xn|SP]  // + release
CASPAL Xs, X(s+1), Xt, X(t+1), [Xn|SP]  // + acquire+release
// Valid Xs,Xt ∈ {X0, X2, X4, ..., X28, X30}. When Xs=X30, X(s+1)=XZR.
// CASP 32-bit pair forms (64-bit atomic CAS using two 32-bit registers): CASP Ws, W(s+1), Wt, W(t+1), [Xn|SP] // 64-bit CAS (Ws:W(s+1) = expected, Wt:W(t+1) = desired) CASPA Ws, W(s+1), Wt, W(t+1), [Xn|SP] // + acquire CASPL Ws, W(s+1), Wt, W(t+1), [Xn|SP] // + release CASPAL Ws, W(s+1), Wt, W(t+1), [Xn|SP] // + acquire+release // Ws,Wt ∈ {W0,W2,W4,...,W28,W30}. Same even-register constraint as the Xs forms. ``` **Alignment**: All LSE atomics require natural alignment for the access size. That means: 1-byte (any address) for the `B`-suffix variants (LDADDB, STADDB, SWPB, CASB, etc.), 2-byte alignment for the `H`-suffix variants (LDADDH, CASH, etc.), 4-byte for W operations (on Ws/Wt registers), 8-byte for X operations (on Xs/Xt registers), 8-byte for CASP Ws:W(s+1) pairs (64-bit total), and 16-byte for CASP Xs:X(s+1) pairs (128-bit total). Misalignment for any of these raises an Alignment fault (synchronous data abort) — there is no "unaligned atomic" mode for LSE. **Byte and halfword atomics**: Every LSE instruction also has B (byte) and H (halfword) forms: ```asm // Byte atomics (B suffix): operate on 8-bit values LDADDB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte add (relaxed) LDADDAB Ws|WZR, Wt, [Xn|SP] // + acquire LDADDLB Ws|WZR, Wt|WZR, [Xn|SP] // + release LDADDALB Ws|WZR, Wt, [Xn|SP] // + acquire+release LDCLRB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte AND-NOT (relaxed) LDCLRAB Ws|WZR, Wt, [Xn|SP] // + acquire LDCLRLB Ws|WZR, Wt|WZR, [Xn|SP] // + release LDCLRALB Ws|WZR, Wt, [Xn|SP] // + acquire+release LDSETB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte OR (relaxed) LDSETAB Ws|WZR, Wt, [Xn|SP] // + acquire LDSETLB Ws|WZR, Wt|WZR, [Xn|SP] // + release LDSETALB Ws|WZR, Wt, [Xn|SP] // + acquire+release LDEORB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte XOR (relaxed) LDEORAB Ws|WZR, Wt, [Xn|SP] // + acquire LDEORLB Ws|WZR, Wt|WZR, [Xn|SP] // + release LDEORALB Ws|WZR, Wt, [Xn|SP] // + acquire+release SWPB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte swap (relaxed) SWPAB Ws|WZR, Wt, 
[Xn|SP] // + acquire SWPLB Ws|WZR, Wt|WZR, [Xn|SP] // + release SWPALB Ws|WZR, Wt, [Xn|SP] // + acquire+release CASB Ws|WZR, Wt|WZR, [Xn|SP] // Byte compare-and-swap (relaxed) CASAB Ws, Wt|WZR, [Xn|SP] // + acquire CASLB Ws|WZR, Wt|WZR, [Xn|SP] // + release CASALB Ws, Wt|WZR, [Xn|SP] // + acquire+release STADDB Ws|WZR, [Xn|SP] // Byte add, no return (relaxed) STADDLB Ws|WZR, [Xn|SP] // + release STSETB Ws|WZR, [Xn|SP] // Byte OR, no return (relaxed) STSETLB Ws|WZR, [Xn|SP] // + release STCLRB Ws|WZR, [Xn|SP] // Byte AND-NOT, no return (relaxed) STCLRLB Ws|WZR, [Xn|SP] // + release STEORB Ws|WZR, [Xn|SP] // Byte XOR, no return (relaxed) STEORLB Ws|WZR, [Xn|SP] // + release ``` **Halfword atomics** (H suffix — operate on 16-bit values): ```asm LDADDH Ws|WZR, Wt|WZR, [Xn|SP] // Atomic halfword add (relaxed) LDADDAH Ws|WZR, Wt, [Xn|SP] // + acquire LDADDLH Ws|WZR, Wt|WZR, [Xn|SP] // + release LDADDALH Ws|WZR, Wt, [Xn|SP] // + acquire+release LDCLRH Ws|WZR, Wt|WZR, [Xn|SP] // Atomic halfword AND-NOT (relaxed) LDCLRAH Ws|WZR, Wt, [Xn|SP] // + acquire LDCLRLH Ws|WZR, Wt|WZR, [Xn|SP] // + release LDCLRALH Ws|WZR, Wt, [Xn|SP] // + acquire+release LDSETH Ws|WZR, Wt|WZR, [Xn|SP] // Atomic halfword OR (relaxed) LDSETAH Ws|WZR, Wt, [Xn|SP] // + acquire LDSETLH Ws|WZR, Wt|WZR, [Xn|SP] // + release LDSETALH Ws|WZR, Wt, [Xn|SP] // + acquire+release LDEORH Ws|WZR, Wt|WZR, [Xn|SP] // Atomic halfword XOR (relaxed) LDEORAH Ws|WZR, Wt, [Xn|SP] // + acquire LDEORLH Ws|WZR, Wt|WZR, [Xn|SP] // + release LDEORALH Ws|WZR, Wt, [Xn|SP] // + acquire+release SWPH Ws|WZR, Wt|WZR, [Xn|SP] // Atomic halfword swap (relaxed) SWPAH Ws|WZR, Wt, [Xn|SP] // + acquire SWPLH Ws|WZR, Wt|WZR, [Xn|SP] // + release SWPALH Ws|WZR, Wt, [Xn|SP] // + acquire+release CASH Ws|WZR, Wt|WZR, [Xn|SP] // Halfword compare-and-swap (relaxed) CASAH Ws, Wt|WZR, [Xn|SP] // + acquire CASLH Ws|WZR, Wt|WZR, [Xn|SP] // + release CASALH Ws, Wt|WZR, [Xn|SP] // + acquire+release STADDH Ws|WZR, [Xn|SP] // Halfword add, no return 
(relaxed) STADDLH Ws|WZR, [Xn|SP] // + release STSETH Ws|WZR, [Xn|SP] // Halfword OR, no return (relaxed) STSETLH Ws|WZR, [Xn|SP] // + release STCLRH Ws|WZR, [Xn|SP] // Halfword AND-NOT, no return (relaxed) STCLRLH Ws|WZR, [Xn|SP] // + release STEORH Ws|WZR, [Xn|SP] // Halfword XOR, no return (relaxed) STEORLH Ws|WZR, [Xn|SP] // + release ``` These are essential for lock bytes, flag bytes, and any sub-word atomic field. **128-bit atomic RMW (FEAT_LSE128 — optional from Armv9.3-A):** two general-purpose X registers carry the 128-bit operand as a low-half/high-half pair, with four memory-ordering flavors each. **ISA-level truth — register constraint is NOT CASP-style.** LSE128's encoding has two independent 5-bit register fields (Rt and Rt2), not CASP's "even + even+1" single-field-plus-implicit. The only encoding constraints per ARM ARM pseudocode are: (1) Rt ≠ 31, (2) Rt2 ≠ 31, (3) Rt ≠ Rt2. Any other combination works — `SWPP X5, X12, [X0]` is perfectly legal. Attempting to encode XZR is UNDEFINED at decode time — this is stricter than LSE1 (where XZR in the destination works but silently weakens the acquire semantic). ```asm SWPP Xt1, Xt2, [Xn|SP] // 128-bit atomic swap: (Xt1:Xt2) ↔ *[Xn] (old memory → Xt1:Xt2) SWPPA Xt1, Xt2, [Xn|SP] // + acquire SWPPL Xt1, Xt2, [Xn|SP] // + release SWPPAL Xt1, Xt2, [Xn|SP] // + acquire + release LDCLRP Xt1, Xt2, [Xn|SP] // 128-bit atomic AND-NOT: *[Xn] = *[Xn] AND NOT(Xt1:Xt2); old → Xt1:Xt2 LDCLRPA Xt1, Xt2, [Xn|SP] // + A / L / AL as above LDCLRPL Xt1, Xt2, [Xn|SP] LDCLRPAL Xt1, Xt2, [Xn|SP] LDSETP Xt1, Xt2, [Xn|SP] // 128-bit atomic OR: *[Xn] |= (Xt1:Xt2); old → Xt1:Xt2 LDSETPA Xt1, Xt2, [Xn|SP] LDSETPL Xt1, Xt2, [Xn|SP] LDSETPAL Xt1, Xt2, [Xn|SP] // Pair bit-ordering: Xt1 holds the LOW 64 bits of the 128-bit value; Xt2 holds the HIGH 64 bits. // Xt1/Xt2 must be distinct real registers (X0..X30, any two — NO even-pair constraint, NO XZR). 
// FEAT_LSE128 gives you single-copy-atomic 128-bit RMW without the LDXP/STXP retry loop; useful
// for 128-bit sequence counters, RCU-style atomic pointer+counter updates, and lock-free queues.
// Detection: ID_AA64ISAR0_EL1.Atomic == 0b0011 (0b0010 = FEAT_LSE, 0b0011 = FEAT_LSE128).
// FEAT_LSE2 is not encoded in this field; it is reported by ID_AA64MMFR2_EL1.AT.
// Atomicity requires 16-byte-aligned [Xn]; misalignment produces an Alignment fault.
```

### 24.2 Load-Acquire / Store-Release

In a multi-core system, memory operations can appear to happen in a different order than you wrote them (due to CPU reordering for performance). `LDAR` (Load-Acquire) and `STLR` (Store-Release) enforce ordering.

**Why CPUs reorder memory**: Modern CPUs have store buffers, write-combining buffers, and out-of-order pipelines. A store might sit in a buffer while later loads execute. This is invisible to single-threaded code but breaks multi-threaded algorithms that depend on the order of writes being visible to other cores. Acquire/release ordering is how you tell the CPU "this ordering matters."

```asm
LDAR  Xt, [Xn|SP]              // Load-acquire 64-bit (Xt shown as a real register — see acquire/ZR note below, D24800)
LDAR  Wt, [Xn|SP]              // Load-acquire 32-bit
STLR  Xt|XZR, [Xn|SP]          // Store-release 64-bit (XZR source = store 0 with release; legitimate)
STLR  Wt|WZR, [Xn|SP]          // Store-release 32-bit

// Byte and halfword variants (for lock bytes, flag bytes, etc.):
LDARB Wt, [Xn|SP]              // Load-acquire byte
LDARH Wt, [Xn|SP]              // Load-acquire halfword
STLRB Wt|WZR, [Xn|SP]          // Store-release byte
STLRH Wt|WZR, [Xn|SP]          // Store-release halfword

LDAXR Xt, [Xn|SP]              // Load-acquire exclusive 64-bit (LDAR + LDXR combined)
LDAXR Wt, [Xn|SP]              // Load-acquire exclusive 32-bit
STLXR Ws|WZR, Xt|XZR, [Xn|SP]  // Store-release exclusive 64-bit (Ws = result flag; Xt = value)
STLXR Ws|WZR, Wt|WZR, [Xn|SP]  // Store-release exclusive 32-bit

// Byte and halfword exclusive:
LDAXRB Wt, [Xn|SP]             // Load-acquire exclusive byte
LDAXRH Wt, [Xn|SP]             // Load-acquire exclusive halfword
STLXRB Ws|WZR, Wt|WZR, [Xn|SP] //
Store-release exclusive byte STLXRH Ws|WZR, Wt|WZR, [Xn|SP] // Store-release exclusive halfword ``` **Alignment**: `LDAR`/`STLR` require natural alignment, same as LDXR/STXR. **XZR/WZR as destination on acquire loads (corrected per DDI 0487 M.b erratum D24800)**: The syntax lines above show the destination as a real register `Xt`/`Wt` for `LDAR`, `LDARB`, `LDARH`, `LDAXR`, `LDAXRB`, `LDAXRH`, and `LDAXP`. Older ARM ARM text said that with an XZR/WZR destination "it is impossible for software to observe the presence of the acquire semantic," and the decode pseudocode gated acquire on `t != 31` — i.e. XZR silently *dropped* the acquire ordering. **D24800 reverses this for plain load-acquire instructions**: `LDAR`/`LDARB`/`LDARH`, `LDAXR`/`LDAXRB`/`LDAXRH`, `LDAXP`, the RCpc `LDAPR`/`LDAPRB`/`LDAPRH`/`LDAPUR…` family, `LDLAR…`, and `LDIAPP` now load **with Acquire (or AcquirePC) semantics unconditionally — the ordering holds even when the destination is XZR/WZR**. The decode becomes `acquire = TRUE`, and the per-ZR carve-outs in `ESR_ELx.AR` and `SCTLR_ELx.nAA` are deleted. The "XZR drops the ordering" rule now survives in exactly one place: **atomic** read-modify-write acquire-variants (the LSE `LD<op>A`/`SWPA`/`CAS*A` family and LSE128) with an XZR/WZR destination — this is the corrected "Barrier-ordered-before" definition, which now reads "an *atomic* instruction whose destination register is WZR or XZR" (note the added word *atomic*). See the LSE128 note above. Practically the doc still shows a real destination, because loading into XZR is pointless — there's no value for downstream code to depend on, and a pure fence is better expressed as `DMB ISH` (full) or `DMB ISHLD` (load). The D24800 change is about *architectural correctness* (a ZR-destination acquire load no longer silently loses its ordering), not about XZR-load-acquire becoming useful. 
Note the published M.b manual still prints the old "if the destination is not WZR or XZR…" wording in the LDAR/LDAPR/LDAXR pages; D24800 is on the Known-Issues list, i.e. a pending correction.

`LDAR` and `STLR` implement the C11/C++11 memory ordering model:
- `LDAR` ≈ `memory_order_acquire` load
- `STLR` ≈ `memory_order_release` store

On ARMv8, `STLR` has a stronger guarantee than plain release: all `STLR` stores are ordered before any subsequent `LDAR` loads, even to unrelated addresses. This specific STLR→LDAR ordering (called RCsc — Release Consistency with sequential consistency for special operations) is what allows compilers to map `memory_order_seq_cst` loads to `LDAR` and seq_cst stores to `STLR`. This works because of ARM's specific hardware guarantee, NOT because acquire + release equals seq_cst in general — in the abstract C++ memory model, they do not.

### 24.3 LDAPR — Load-Acquire RCpc (Weaker Acquire)

ARMv8.3-A adds `LDAPR` (Load-Acquire, Processor Consistent; FEAT_LRCPC), which is a **weaker** acquire than `LDAR`. Both keep the core acquire guarantee: loads and stores that follow the acquire load in program order stay after it. The difference is how each interacts with an *earlier* `STLR` on the same core. `LDAR` is RCsc (sequential consistency for special ops): it is never observed ahead of an earlier `STLR`, whatever address that `STLR` targets. `LDAPR` is RCpc (processor consistency): it may be observed ahead of an earlier `STLR` to a different address, and only an earlier `STLR` to the **same address** stays ordered before it.

```asm
LDAPR  Xt, [Xn|SP]   // Load-acquire RCpc 64-bit (weaker than LDAR; see acquire/ZR note in §24.2, D24800)
LDAPR  Wt, [Xn|SP]   // Load-acquire RCpc 32-bit
LDAPRB Wt, [Xn|SP]   // Byte version (no offset field — address is exactly Xn)
LDAPRH Wt, [Xn|SP]   // Halfword version (no offset field — address is exactly Xn)
```

**Why LDAPR exists**: `LDAR` is stronger than what C++ `memory_order_acquire` actually requires. C++ acquire only needs later accesses to stay after the load, and the load to synchronize with the release store it reads from; it does not need the load to be ordered after the same thread's earlier, unrelated release stores. `LDAPR` gives exactly this weaker guarantee, which is cheaper on hardware. Compilers targeting ARMv8.3+ can map `memory_order_acquire` to `LDAPR` instead of `LDAR`, improving performance.
`memory_order_seq_cst` still requires `LDAR`. **FEAT_LRCPC2 (ARMv8.4-A, optional from 8.2, mandatory from 8.4)**: Adds unscaled-offset variants of LDAPR and of the store-release, so compilers can fold a small immediate into an acquire load or release store without an extra `ADD`. The offset is a 9-bit signed immediate (−256 to +255 bytes, unscaled). Per D24800 (§24.2) these load variants retain AcquirePC semantics even with an XZR/WZR destination; a real register is still the sensible choice, since XZR yields no usable value. ```asm // Load-Acquire RCpc with unscaled offset (FEAT_LRCPC2): LDAPUR Xt, [Xn|SP{, #simm9}] // 64-bit load-acquire RCpc, offset −256 to +255 LDAPUR Wt, [Xn|SP{, #simm9}] // 32-bit LDAPURB Wt, [Xn|SP{, #simm9}] // Byte LDAPURH Wt, [Xn|SP{, #simm9}] // Halfword LDAPURSB Xt, [Xn|SP{, #simm9}] // Signed byte → 64-bit (sign-extend) LDAPURSB Wt, [Xn|SP{, #simm9}] // Signed byte → 32-bit LDAPURSH Xt, [Xn|SP{, #simm9}] // Signed halfword → 64-bit LDAPURSH Wt, [Xn|SP{, #simm9}] // Signed halfword → 32-bit LDAPURSW Xt, [Xn|SP{, #simm9}] // Signed word → 64-bit (loads 32 bits, sign-extends) // Store-Release with unscaled offset (FEAT_LRCPC2): STLUR Xt|XZR, [Xn|SP{, #simm9}] // 64-bit store-release (XZR source = store 0 with release) STLUR Wt|WZR, [Xn|SP{, #simm9}] // 32-bit STLURB Wt|WZR, [Xn|SP{, #simm9}] // Byte STLURH Wt|WZR, [Xn|SP{, #simm9}] // Halfword // Load-Acquire RCpc / Store-Release RCpc ORDERED PAIR (FEAT_LRCPC3 — optional from Armv8.2-A, no mandatory version; added to ARM ARM in the 2022 Armv8.9/v9.4 spec release): // Pair-wise analogs of LDAPUR/STLUR: load or store two contiguous 64-bit or 32-bit values atomically // with RCpc acquire or release semantics. The pair is guaranteed to be single-copy atomic relative to // another LDIAPP/STILP on the same address region — i.e., no other core can see a "torn" intermediate. // Useful for SeqLock-style readers and for reading 128-bit shared state without a lock. 
LDIAPP Xt1, Xt2, [Xn|SP]          // Load-acquire RCpc pair, 64-bit (destinations shown bare; per D24800 acquire is retained even with XZR, but a real register is the sensible choice)
LDIAPP Wt1, Wt2, [Xn|SP]          // 32-bit pair
STILP  Xt1|XZR, Xt2|XZR, [Xn|SP]  // Store-release RCpc pair, 64-bit (source allows XZR)
STILP  Wt1|WZR, Wt2|WZR, [Xn|SP]  // 32-bit pair
// Offset: the forms listed here take no offset, so the base must already point to the pair.
// FEAT_LRCPC3 also defines a post-indexed LDIAPP and a pre-indexed STILP whose step is fixed
// at the pair size (for example `LDIAPP Xt1, Xt2, [Xn|SP], #16` and `STILP Xt1, Xt2, [Xn|SP, #-16]!`);
// there is no general immediate-offset form.
// Alignment: 16-byte for 64-bit pair, 8-byte for 32-bit pair — misaligned is CONSTRAINED UNPREDICTABLE.
// Xt1 and Xt2 must be distinct (same rule as LDP).

// FEAT_LRCPC3 also adds single-lane SIMD&FP acquire/release — load or store ONE vector lane
// with RCpc ordering. Completes the set for atomic single-copy FP state sharing.
LDAP1 {Vt.D}[i], [Xn|SP]          // Load-acquire RCpc: one 64-bit lane of Vt ← [Xn]
                                  // i ∈ 0..1. Other lanes of Vt are preserved.
STL1  {Vt.D}[i], [Xn|SP]          // Store-release RCpc: [Xn] ← one 64-bit lane of Vt
                                  // i ∈ 0..1.
// These are 64-bit-element ONLY — ARM ARM does not define .B/.H/.S LDAP1/STL1 forms.
// Alignment requirement: 8-byte (as SCTLR_ELx.nAA rules for FEAT_LRCPC3 accesses state).
```

### 24.4 Mutex / Spinlock Patterns

**Simple spinlock (using LDAXR/STXR with proper acquire/release):**

```asm
// Lock: X0 = address of lock word (0 = unlocked, 1 = locked)
lock:
    MOV   W3, #1
spin:
    LDAXR W1, [X0]       // Load-acquire exclusive (see latest value + acquire ordering)
    CBNZ  W1, wait       // If locked, go to wait loop
    STXR  W2, W3, [X0]   // Try to store 1 (lock it)
    CBNZ  W2, spin       // If exclusive failed, retry from LDAXR
    RET                  // Lock acquired — acquire ordering ensures all reads/writes
                         // in the critical section see data from before the lock
wait:
    // Spin without exclusive — reduces bus traffic (no cache-line bouncing)
    LDR   W1, [X0]       // Plain load (no exclusive monitor overhead)
    CBNZ  W1, wait       // Still locked?
Keep waiting B spin // Unlocked — try to acquire // Unlock: just store 0 with release ordering unlock: STLR WZR, [X0] // Store-release: all critical section writes complete // before the lock appears unlocked to other cores RET ``` **Why LDAXR in the lock, STLR in the unlock**: The lock acquire needs acquire semantics so that everything read inside the critical section sees data published before the previous `STLR` unlock. The unlock needs release semantics so all writes inside the critical section are visible before the lock appears free. This is the classic acquire/release pair for mutual exclusion. **Why the WFE spin loop (from §19.2) is better**: The `wait` loop above burns CPU cycles. The version with `WFE` puts the core in a low-power state until another core sends an event (the unlock path should use `SEV` after `STLR` to wake waiters). **LSE-based lock (faster under contention):** ```asm lock_lse: MOV W1, #1 SWPA W1, W1, [X0] // Atomic swap with acquire: W1 = old value, [X0] = 1 CBNZ W1, lock_lse // If old value was 1 (locked), retry RET // Lock acquired unlock_lse: STLR WZR, [X0] // Store-release zero RET ``` --- ## 25. Memory Barriers & Ordering Memory barriers (also called fences) are instructions that enforce ordering of memory operations. They don't access memory themselves — they constrain the order in which surrounding loads and stores become visible. This matters on multi-core systems where each core has its own cache and memory operations can be reordered. ### 25.1 Barrier Instructions `DMB` (Data Memory Barrier): Ensures that all memory accesses before the barrier are visible before any memory accesses after it. Does NOT wait for them to complete — just orders them. `DSB` (Data Synchronization Barrier): Stronger than DMB — it waits for all preceding memory accesses to actually complete before any instruction after the barrier executes. 
`ISB` (Instruction Synchronization Barrier): Flushes the CPU pipeline, ensuring all subsequent instructions are fetched fresh. Needed after modifying page tables, writing self-modifying code, or changing system registers that affect instruction execution. ```asm DMB <option> // Data Memory Barrier (option from table below) DSB <option> // Data Synchronization Barrier ISB {SY} // Instruction Synchronization Barrier. Optional SY suffix; SY is the default and the only meaningful option. ``` The `<option>` specifies the shareability domain and access types. You must pick one from this table: | Option | Meaning | |---|---| | `OSHLD` | Outer Shareable, prior **loads** ordered before subsequent loads and stores | | `OSHST` | Outer Shareable, prior **stores** ordered before subsequent stores | | `OSH` | Outer Shareable, any-to-any | | `NSHLD` | Non-shareable, prior loads ordered before subsequent loads and stores | | `NSHST` | Non-shareable, prior stores ordered before subsequent stores | | `NSH` | Non-shareable, any-to-any | | `ISHLD` | Inner Shareable, prior loads ordered before subsequent loads and stores | | `ISHST` | Inner Shareable, prior stores ordered before subsequent stores | | `ISH` | Inner Shareable, any-to-any (most common) | | `LD` | Full system, prior loads ordered before subsequent loads and stores | | `ST` | Full system, prior stores ordered before subsequent stores | | `SY` | Full system, any-to-any (strongest) | **Important — LD/ST do NOT mean "only orders loads/stores":** > **Errata note (DDI 0487 M.b, R24234) — DMB shareability domains are being deprecated.** The option table above matches the published M.b manual and the conventional model. M.b's Known-Issues list reworks the **`DMB`** decode to drop the shareability *domain* entirely — only the `CRm<1:0>` access-type bits remain — and turns the domain-qualified spellings into deprecated aliases. 
Verbatim: `OSHLD`/`NSHLD`/`ISHLD` *"has the same behavior as LD and is deprecated"*, `OSHST`/`NSHST`/`ISHST` *"same behavior as ST and is deprecated"*, and `OSH`/`NSH`/`ISH` *"same behavior as SY and is deprecated"*. So for **`DMB`** the only architecturally-distinct options going forward are **`LD`, `ST`, `SY`**, and `DMB ISH` carries the same *ordering guarantee* as `DMB SY`. This is about the architectural guarantee, not performance — implementations may still execute `DMB ISH` more cheaply, and existing `DMB ISH` code (e.g. Linux `smp_mb()`) stays correct (it gets at-least-`SY` ordering). **`DSB` is different**: M.b keeps a shareability *maintenance scope* for `DSB` (it still governs ordering of cache/TLB maintenance), so `DSB ISH`/`DSB NSH`/`DSB SY` remain meaningful. R24234 is a pending erratum — the shipping M.b manual still prints the domain-qualified `DMB` options as distinct. - `DMB LD` orders **prior loads** with respect to **both subsequent loads AND subsequent stores** (load→load AND load→store). - `DMB ST` orders **prior stores** with respect to **subsequent stores only** (store→store only; does NOT order stores before subsequent loads). - This asymmetry matters: `DMB ST` is a weaker barrier than `DMB LD` because it provides no load/store ordering for prior stores. ```asm DMB ISH // All loads/stores before this are observed before any after (inner shareable) DMB ISHST // All stores before are observed before stores after DSB SY // Full system sync: nothing crosses this barrier, waits for completion ISB // Flush pipeline, re-fetch instructions (needed after modifying code/page tables) ``` **DMB vs DSB**: DMB only orders memory accesses relative to each other. DSB waits for all preceding memory accesses to actually complete before continuing. DSB is stronger and slower. ISB additionally flushes the pipeline. **Speculation barrier (FEAT_SB — ARMv8.5-A, also retrofitted to some ARMv8.0-A cores):** ```asm SB // Speculation Barrier — no operand. 
``` `SB` prevents speculative execution past this point based on any prediction or state derived from instructions preceding it. Unlike `DSB SY` (which waits for memory accesses to complete), `SB` specifically targets the speculation pipeline — it's a cheaper, more surgical fence for mitigating Spectre-v1 style bounds-check-bypass attacks. The canonical pattern is: validate an index, then `SB`, then use the index to load secret data. Without `SB`, the CPU may speculatively execute the load with an unvalidated index and leave cache residue an attacker can probe. `CSDB` (Consumption of Speculative Data Barrier, baseline ARMv8) is a weaker variant that only orders consumption of speculatively-loaded data; `SB` is strictly stronger and is preferred when available. Detection: `ID_AA64ISAR1_EL1.SB`. Linux surfaces as HWCAP `sb`. **Speculative Store Bypass Barriers — SSBB / PSSBB** (Spectre-v4 / CVE-2018-3639 mitigation): these block the specific speculative forwarding that allows a later load to bypass an earlier store to the same address. They are **encoded as** the `DSB` instruction with reserved option values, not standalone mnemonics at the encoding level — assemblers and disassemblers recognize the mnemonics for readability. ```asm SSBB // Speculative Store Bypass Barrier. Encoding: DSB #0 (option=0000). // Blocks store-bypass speculation across this instruction only for the virtual-address // aliases the current EL can observe directly (normally all of them for EL0 userland). PSSBB // Physical Speculative Store Bypass Barrier. Encoding: DSB #4 (option=0100). // Same, but for physical-address aliases — covers the case where two different virtual // addresses map to the same physical location and a hypervisor/OS cares about the // store-bypass forwarding across that alias. ``` Neither takes an operand. 
Both behave as NOPs on cores that don't need them (the architecture allows a benign implementation since the option values land inside the reserved-encoding space that older `DSB` decoders ignore). See Arm's "Cache Speculation Side-Channels" whitepaper for the threat model — these are legacy mitigations; newer cores use the `SSBS` (Speculative Store Bypass Safe) PSTATE bit or `SB` instead.

### 25.2 The ARM Memory Model

ARM uses a **weakly ordered** memory model. This means the CPU is allowed to reorder memory accesses for performance, as long as the reordering is invisible to the current core's own execution. Other cores, however, may see the reordered result.

**What reorderings ARM allows** (observable by other cores):
- **Load-Load**: A later load can complete before an earlier load. (Rare in practice on most ARMs, but architecturally allowed.)
- **Load-Store**: A later store can complete before an earlier load.
- **Store-Load**: A later load can complete before an earlier store. (This is the most common and impactful reordering.)
- **Store-Store**: A later store can become visible before an earlier store.

**What ARM does NOT reorder**:
- **Data-dependent loads**: If load B's address depends on the value loaded by load A, then B is ordered after A and cannot be satisfied before A has produced its value. This is called "address dependency ordering" and it's guaranteed by ARM hardware. Example: `LDR X1, [X0]; LDR X2, [X1]` — the second load always uses the value from the first, even without barriers.
- **Overlapping accesses**: Loads and stores to the same address from the same core always appear in program order to that core.

**Why this matters**: On x86, the memory model is much stronger (Total Store Order — stores are never reordered with each other). Code that works on x86 by accident may break on ARM because ARM's weaker model exposes more reorderings. This is why correct multi-threaded code must use acquire/release or barriers.
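To make the acquire/release requirement concrete from the C side, here is a minimal C11 sketch of a one-slot mailbox. The names `mailbox`, `mailbox_init`, `publish`, and `try_consume` are hypothetical, chosen only for this illustration. On AArch64 a compiler would typically lower the release store to `STLR` and the acquire load to `LDAR` (or `LDAPR` where available), but that mapping is the compiler's choice and is an assumption here, not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One-slot mailbox: `data` is plain memory, `flag` is the synchronization variable. */
struct mailbox {
    uint64_t   data;
    atomic_int flag;   /* 0 = empty, 1 = data is published */
};

void mailbox_init(struct mailbox *mb) {
    mb->data = 0;
    atomic_init(&mb->flag, 0);
}

/* Producer: write the payload, then release-store the flag so the payload
 * write cannot become visible after the flag. */
void publish(struct mailbox *mb, uint64_t value) {
    mb->data = value;
    atomic_store_explicit(&mb->flag, 1, memory_order_release);
}

/* Consumer: acquire-load the flag; only if it is set, read the payload.
 * The acquire keeps the payload read from being satisfied before the flag read. */
bool try_consume(struct mailbox *mb, uint64_t *out) {
    if (atomic_load_explicit(&mb->flag, memory_order_acquire) == 0) {
        return false;
    }
    *out = mb->data;
    return true;
}
```

In real use `publish` and `try_consume` run on different threads; replacing either explicit ordering with `memory_order_relaxed` would reintroduce the reorderings listed above.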
**The one-page summary**: Use `LDAR`/`STLR` for synchronization variables (locks, flags, message passing). Use `DMB ISH` when you need a full fence. Don't use barriers for ordinary single-threaded code on Normal memory — they're expensive and unnecessary. When in doubt, use C11 atomics and let the compiler figure it out.

**Concrete example — message passing race**:

```asm
// Core 1 (producer):
    // W7 = 1 (preloaded)
    STR  X1, [X3]     // Write data
    STR  W7, [X4]     // Set flag = 1

// Core 2 (consumer):
loop:
    LDR  W5, [X4]     // Read flag
    CBZ  W5, loop     // Wait for flag
    LDR  X6, [X3]     // Read data — MAY SEE STALE DATA!
```

The bug: ARM can reorder the two stores on Core 1, so Core 2 sees flag=1 before the data is written. Core 2 can also satisfy the data load before the flag load, because the branch between them does not order two loads. Fix: use `STLR` for the flag (release) and `LDAR` for reading it (acquire):

```asm
// Core 1 (fixed):
    STR  X1, [X3]     // Write data
    STLR W7, [X4]     // Release-store flag

// Core 2 (fixed):
loop:
    LDAR W5, [X4]     // Acquire-load flag
    CBZ  W5, loop
    LDR  X6, [X3]     // Guaranteed to see the data
```

The `STLR` ensures the data write is visible before the flag. The `LDAR` ensures the data read happens after the flag is seen.

---

## 26. Pseudo-instructions & Assembler Directives

Pseudo-instructions are things you write in assembly source that don't map to a single hardware instruction — the assembler translates them into one or more real instructions. Directives (starting with `.`) control the assembler itself — section placement, alignment, data emission — rather than generating instructions.

**Why pseudo-instructions exist**: The ISA has strict encoding constraints (fixed 32-bit instructions, limited immediate ranges). Pseudo-instructions like `LDR X0, =constant` and the `MOV X0, #imm` alias hide this complexity: you write what you mean, and the assembler picks a single-instruction encoding (MOVZ, MOVN, or bitmask ORR) when one exists, or, for `LDR =`, falls back to a literal pool. Without them, you'd need to manually decompose every large constant into MOVZ/MOVK sequences.
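The manual decomposition mentioned above can be sketched in a few lines of C. This is a deliberately naive planner, not what any real assembler does: it only uses MOVZ followed by MOVK, skips all-zero halfwords, and ignores the MOVN and bitmask-ORR encodings that can often do the job in fewer instructions. The names `mov_step` and `plan_mov_imm64` are hypothetical.

```c
#include <stdint.h>

/* One step of the plan: the first emitted step is MOVZ, later ones are MOVK. */
struct mov_step {
    int      is_movk;  /* 0 = MOVZ (clears the other halfwords), 1 = MOVK (keeps them) */
    uint16_t imm16;    /* the 16-bit chunk to write */
    unsigned shift;    /* LSL amount: 0, 16, 32, or 48 */
};

/* Fills `out` (capacity 4) and returns the number of instructions (1..4). */
int plan_mov_imm64(uint64_t value, struct mov_step out[4]) {
    int n = 0;
    for (unsigned shift = 0; shift < 64; shift += 16) {
        uint16_t chunk = (uint16_t)(value >> shift);
        if (chunk == 0) {
            continue;  /* zero halfwords need no instruction: MOVZ leaves them cleared */
        }
        out[n].is_movk = (n > 0);
        out[n].imm16   = chunk;
        out[n].shift   = shift;
        n++;
    }
    if (n == 0) {      /* value == 0 still needs one instruction: MOVZ #0 */
        out[0].is_movk = 0;
        out[0].imm16   = 0;
        out[0].shift   = 0;
        n = 1;
    }
    return n;
}
```

For `0x12345678` this plan has two steps, `MOVZ Xd, #0x5678` followed by `MOVK Xd, #0x1234, LSL #16`; a constant with four non-zero halfwords needs all four.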
### 26.1 Common Pseudo-instructions (GNU as) ```asm LDR X0, =0x12345678 // Load arbitrary constant (assembler picks best encoding or literal pool) ADR X0, label // (real instruction, but often used like a pseudo-instruction) ADRP X0, label // (real instruction) MOV X0, #large_const // Assembler picks MOVZ/MOVN/MOVK/ORR as needed ``` ### 26.2 GNU Assembler Directives ```asm .text // Code section .data // Data section .bss // Uninitialized data section .rodata // Read-only data .global main // Make symbol globally visible .type main, %function // Symbol is a function .size main, .-main // Size = current address minus start of main .align 4 // Align to 2^4 = 16 bytes .balign 16 // Align to 16 bytes (explicit) .p2align 4 // Align to 2^4 = 16 bytes (power of 2) .byte 0x42 // Emit 1 byte .hword 0x1234 // Emit 2 bytes (halfword) .word 0x12345678 // Emit 4 bytes .dword 0x123456789ABCDEF0 // Emit 8 bytes (AKA .quad or .xword) .ascii "Hello" // String without null terminator (just raw bytes) .asciz "Hello" // String WITH null terminator (a 0x00 byte at the end) // Also called "null-terminated" or "C string". Same as .string .equ BUFFER_SIZE, 1024 // Define constant .set MY_CONST, 42 // Same as .equ .macro my_push reg // Define macro STR \reg, [SP, #-16]! .endm .if CONDITION // Conditional assembly .else .endif .include "other.s" // Include file .section .note.GNU-stack,"",@progbits // Mark stack as non-executable ``` ### 26.3 Relocation Operators Used with ADRP/ADD/LDR to reference symbols: ```asm ADRP X0, symbol // Page address of symbol ADD X0, X0, :lo12:symbol // Low 12 bits (page offset) // GOT (Global Offset Table) access — used for shared library symbols whose // address isn't known until the dynamic linker resolves them at runtime: ADRP X0, :got:symbol LDR X0, [X0, :got_lo12:symbol] // Thread-local storage (TLS) — for variables that have a separate copy per thread // (like C's _Thread_local or __thread). 
The runtime provides a descriptor function: ADRP X0, :tlsdesc:symbol LDR X1, [X0, :tlsdesc_lo12:symbol] ADD X0, X0, :tlsdesc_lo12:symbol BLR X1 ``` ### 26.4 Practical Tools & Workflow **Assembling and linking:** ```bash # Assemble a .s file to object file: aarch64-linux-gnu-as -o program.o program.s # Link to executable: aarch64-linux-gnu-ld -o program program.o # Or combine with GCC (handles C runtime startup): aarch64-linux-gnu-gcc -o program program.s # Cross-compile C to assembly (to study compiler output): aarch64-linux-gnu-gcc -S -O2 -o output.s input.c ``` **Disassembly (reading compiled binaries):** ```bash # Disassemble an ELF (Executable and Linkable Format) binary: aarch64-linux-gnu-objdump -d program # With source interleaving (if compiled with -g): aarch64-linux-gnu-objdump -dS program # LLVM disassembler (often better formatting): llvm-objdump -d program # Disassemble a single function: aarch64-linux-gnu-objdump -d program | sed -n '/<my_function>:/,/^$/p' ``` **Testing on x86 with emulation:** ```bash # Run AArch64 binary on x86 using QEMU user-mode emulation: qemu-aarch64 ./program # Or with a specific library path: qemu-aarch64 -L /usr/aarch64-linux-gnu ./program ``` ### 26.5 Volatile and Compiler Barriers In C/C++, `volatile` tells the compiler "don't optimize away or reorder this memory access." In assembly, there's no `volatile` keyword — every load and store you write is exactly what the CPU executes. But when writing **inline assembly** in C, you need to understand how `volatile` maps: ```c // C volatile load → compiler emits a plain LDR (no optimization, no reordering by compiler) volatile int *ptr = ...; int val = *ptr; // Compiler MUST emit: LDR Wn, [Xptr] // It cannot cache the value, combine with other loads, or skip it. // For HARDWARE memory ordering (multi-core visibility), volatile is NOT enough. 
// You need atomic operations or explicit barriers: __atomic_load_n(ptr, __ATOMIC_ACQUIRE); // → LDAR __atomic_store_n(ptr, val, __ATOMIC_RELEASE); // → STLR ``` **Key distinction**: `volatile` prevents the **compiler** from reordering. Memory barriers (`DMB`, `LDAR`/`STLR`) prevent the **CPU** from reordering. For single-core memory-mapped I/O, `volatile` is sufficient. For multi-core synchronization, you need both. --- ## 27. Instruction Aliases — The Master Table This is the comprehensive list of "instructions" that are actually aliases for other instructions. Both 64-bit and 32-bit forms are shown for every alias. Register 31 alternatives (`|XZR`, `|WZR`, `|SP`, `|WSP`) are shown per the encoding rules — see the table in §1.2. | Alias | Real instruction | Notes | |---|---|---| | **Move** | | | | MOV Xd|XZR, Xm|XZR | ORR Xd|XZR, XZR, Xm|XZR | Reg-to-reg (shifted-reg encoding) | | MOV Wd|WZR, Wm|WZR | ORR Wd|WZR, WZR, Wm|WZR | 32-bit (zeroes upper 32) | | MOV Xd|SP, SP | ADD Xd|SP, SP, #0 | From SP (immediate encoding; reg 31 = SP in Rd) | | MOV SP, Xn|SP | ADD SP, Xn|SP, #0 | To SP (immediate encoding; reg 31 = SP in Rn) | | MOV Xd|XZR, #imm | MOVZ Xd|XZR, #imm{, LSL #s} | 16-bit imm fits (s=0/16/32/48) | | MOV Wd|WZR, #imm | MOVZ Wd|WZR, #imm{, LSL #s} | 16-bit imm fits (s=0/16 only) | | MOV Xd|XZR, #imm | MOVN Xd|XZR, #~imm{, LSL #s} | Inverted fits in 16 bits | | MOV Wd|WZR, #imm | MOVN Wd|WZR, #~imm{, LSL #s} | Inverted fits (32-bit NOT, s=0/16) | | MOV Xd|SP, #imm | ORR Xd|SP, XZR, #bitmask_imm | Bitmask immediate (Rd=SP!) | | MOV Wd|WSP, #imm | ORR Wd|WSP, WZR, #bitmask_imm | Bitmask imm (32-bit, Rd=WSP!) 
| | MVN Xd|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} | ORN Xd|XZR, XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} | Bitwise NOT (logical shifted-reg includes ROR) | | MVN Wd|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} | ORN Wd|WZR, WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} | 32-bit | | **Negate** | | | | NEG Xd|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | SUB Xd|XZR, XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | Negate (arithmetic shifted-reg, no ROR) | | NEG Wd|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | SUB Wd|WZR, WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | 32-bit | | NEGS Xd|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | SUBS Xd|XZR, XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | Negate + flags | | NEGS Wd|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | SUBS Wd|WZR, WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | 32-bit | | NGC Xd|XZR, Xm|XZR | SBC Xd|XZR, XZR, Xm|XZR | Negate with carry (no shift form) | | NGC Wd|WZR, Wm|WZR | SBC Wd|WZR, WZR, Wm|WZR | 32-bit | | NGCS Xd|XZR, Xm|XZR | SBCS Xd|XZR, XZR, Xm|XZR | Negate with carry + flags | | NGCS Wd|WZR, Wm|WZR | SBCS Wd|WZR, WZR, Wm|WZR | 32-bit | | **Compare / Test** | | | | CMP Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | SUBS XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | Compare (shifted-reg, no ROR) | | CMP Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | SUBS WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | 32-bit shifted-reg | | CMP Xn|SP, #imm12{, LSL #12} | SUBS XZR, Xn|SP, #imm12{, LSL #12} | Compare (immediate) | | CMP Wn|WSP, #imm12{, LSL #12} | SUBS WZR, Wn|WSP, #imm12{, LSL #12} | 32-bit immediate | | CMP Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} | SUBS XZR, Xn|SP, Wm|WZR, (extend) {#0-4} | Compare (extended-reg, Wm source for B/H/W extends) | | CMP Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} | SUBS XZR, Xn|SP, Xm|XZR, (extend) {#0-4} | Compare (extended-reg, Xm source for X extends) | 
| CMP Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} | SUBS WZR, Wn|WSP, Wm|WZR, (extend) {#0-4} | 32-bit extended-reg | | CMN Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | ADDS XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63} | Compare negative (shifted-reg, no ROR) | | CMN Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | ADDS WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31} | 32-bit shifted-reg | | CMN Xn|SP, #imm12{, LSL #12} | ADDS XZR, Xn|SP, #imm12{, LSL #12} | Compare negative (immediate) | | CMN Wn|WSP, #imm12{, LSL #12} | ADDS WZR, Wn|WSP, #imm12{, LSL #12} | 32-bit immediate | | CMN Xn|SP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} | ADDS XZR, Xn|SP, Wm|WZR, (extend) {#0-4} | Compare negative (extended-reg, Wm source for B/H/W extends) | | CMN Xn|SP, Xm|XZR, UXTX {#0-4}|SXTX {#0-4} | ADDS XZR, Xn|SP, Xm|XZR, (extend) {#0-4} | Compare negative (extended-reg, Xm source for X extends) | | CMN Wn|WSP, Wm|WZR, UXTB {#0-4}|UXTH {#0-4}|UXTW {#0-4}|SXTB {#0-4}|SXTH {#0-4}|SXTW {#0-4} | ADDS WZR, Wn|WSP, Wm|WZR, (extend) {#0-4} | 32-bit extended-reg | | TST Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} | ANDS XZR, Xn|XZR, Xm|XZR{, LSL #0-63|LSR #0-63|ASR #0-63|ROR #0-63} | Test bits (logical shifted-reg, includes ROR) | | TST Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} | ANDS WZR, Wn|WZR, Wm|WZR{, LSL #0-31|LSR #0-31|ASR #0-31|ROR #0-31} | 32-bit shifted-reg | | TST Xn|XZR, #bitmask_imm | ANDS XZR, Xn|XZR, #bitmask_imm | Test bits (immediate) | | TST Wn|WZR, #bitmask_imm | ANDS WZR, Wn|WZR, #bitmask_imm | 32-bit immediate | | **Multiply** | | | | MUL Xd|XZR, Xn|XZR, Xm|XZR | MADD Xd|XZR, Xn|XZR, Xm|XZR, XZR | Multiply | | MUL Wd|WZR, Wn|WZR, Wm|WZR | MADD Wd|WZR, Wn|WZR, Wm|WZR, WZR | 32-bit | | MNEG Xd|XZR, Xn|XZR, Xm|XZR | MSUB Xd|XZR, Xn|XZR, Xm|XZR, XZR | Multiply-negate | | MNEG Wd|WZR, Wn|WZR, Wm|WZR | MSUB Wd|WZR, Wn|WZR, Wm|WZR, WZR | 32-bit 
| | SMULL Xd|XZR, Wn|WZR, Wm|WZR | SMADDL Xd|XZR, Wn|WZR, Wm|WZR, XZR | Signed long multiply | | UMULL Xd|XZR, Wn|WZR, Wm|WZR | UMADDL Xd|XZR, Wn|WZR, Wm|WZR, XZR | Unsigned long multiply | | SMNEGL Xd|XZR, Wn|WZR, Wm|WZR | SMSUBL Xd|XZR, Wn|WZR, Wm|WZR, XZR | Signed long multiply-negate | | UMNEGL Xd|XZR, Wn|WZR, Wm|WZR | UMSUBL Xd|XZR, Wn|WZR, Wm|WZR, XZR | Unsigned long multiply-negate | | **Shifts (immediate)** | | | | LSL Xd|XZR, Xn|XZR, #0-63 | UBFM Xd|XZR, Xn|XZR, #(-s MOD 64), #(63-s) | Shift left (s is the LSL amount) | | LSL Wd|WZR, Wn|WZR, #0-31 | UBFM Wd|WZR, Wn|WZR, #(-s MOD 32), #(31-s) | 32-bit | | LSR Xd|XZR, Xn|XZR, #0-63 | UBFM Xd|XZR, Xn|XZR, #s, #63 | Shift right logical | | LSR Wd|WZR, Wn|WZR, #0-31 | UBFM Wd|WZR, Wn|WZR, #s, #31 | 32-bit | | ASR Xd|XZR, Xn|XZR, #0-63 | SBFM Xd|XZR, Xn|XZR, #s, #63 | Arith shift right | | ASR Wd|WZR, Wn|WZR, #0-31 | SBFM Wd|WZR, Wn|WZR, #s, #31 | 32-bit | | ROR Xd|XZR, Xn|XZR, #0-63 | EXTR Xd|XZR, Xn|XZR, Xn|XZR, #s | Rotate right (same Rn twice in EXTR) | | ROR Wd|WZR, Wn|WZR, #0-31 | EXTR Wd|WZR, Wn|WZR, Wn|WZR, #s | 32-bit | | **Shifts (register)** | | | | LSL Xd|XZR, Xn|XZR, Xm|XZR | LSLV Xd|XZR, Xn|XZR, Xm|XZR | Shift left (register) | | LSL Wd|WZR, Wn|WZR, Wm|WZR | LSLV Wd|WZR, Wn|WZR, Wm|WZR | 32-bit | | LSR Xd|XZR, Xn|XZR, Xm|XZR | LSRV Xd|XZR, Xn|XZR, Xm|XZR | Shift right (register) | | LSR Wd|WZR, Wn|WZR, Wm|WZR | LSRV Wd|WZR, Wn|WZR, Wm|WZR | 32-bit | | ASR Xd|XZR, Xn|XZR, Xm|XZR | ASRV Xd|XZR, Xn|XZR, Xm|XZR | Arith shift right (register) | | ASR Wd|WZR, Wn|WZR, Wm|WZR | ASRV Wd|WZR, Wn|WZR, Wm|WZR | 32-bit | | ROR Xd|XZR, Xn|XZR, Xm|XZR | RORV Xd|XZR, Xn|XZR, Xm|XZR | Rotate right (register) | | ROR Wd|WZR, Wn|WZR, Wm|WZR | RORV Wd|WZR, Wn|WZR, Wm|WZR | 32-bit | | **Extension** | | | | SXTB Xd|XZR, Wn|WZR | SBFM Xd|XZR, Xn|XZR, #0, #7 | Sign-extend byte → 64 | | SXTB Wd|WZR, Wn|WZR | SBFM Wd|WZR, Wn|WZR, #0, #7 | Sign-extend byte → 32 | | SXTH Xd|XZR, Wn|WZR | SBFM Xd|XZR, Xn|XZR, #0, #15 | 
Sign-extend halfword → 64 | | SXTH Wd|WZR, Wn|WZR | SBFM Wd|WZR, Wn|WZR, #0, #15 | Sign-extend halfword → 32 | | SXTW Xd|XZR, Wn|WZR | SBFM Xd|XZR, Xn|XZR, #0, #31 | Sign-extend word → 64 (no Wd form) | | UXTB Wd|WZR, Wn|WZR | UBFM Wd|WZR, Wn|WZR, #0, #7 | Zero-extend byte | | UXTH Wd|WZR, Wn|WZR | UBFM Wd|WZR, Wn|WZR, #0, #15 | Zero-extend halfword | | UXTW Xd|XZR, Wn|WZR | UBFM Xd|XZR, Xn|XZR, #0, #31 | Zero-extend word → 64 (rarely needed; W→X auto-zeros) | | **Bitfield** | | | | UBFX Xd|XZR, Xn|XZR, #l, #w | UBFM Xd|XZR, Xn|XZR, #l, #(l+w-1) | Unsigned BF extract | | UBFX Wd|WZR, Wn|WZR, #l, #w | UBFM Wd|WZR, Wn|WZR, #l, #(l+w-1) | 32-bit | | SBFX Xd|XZR, Xn|XZR, #l, #w | SBFM Xd|XZR, Xn|XZR, #l, #(l+w-1) | Signed BF extract | | SBFX Wd|WZR, Wn|WZR, #l, #w | SBFM Wd|WZR, Wn|WZR, #l, #(l+w-1) | 32-bit | | UBFIZ Xd|XZR, Xn|XZR, #l, #w | UBFM Xd|XZR, Xn|XZR, #(-l MOD 64), #(w-1) | Unsigned BF insert in zero | | UBFIZ Wd|WZR, Wn|WZR, #l, #w | UBFM Wd|WZR, Wn|WZR, #(-l MOD 32), #(w-1) | 32-bit | | SBFIZ Xd|XZR, Xn|XZR, #l, #w | SBFM Xd|XZR, Xn|XZR, #(-l MOD 64), #(w-1) | Signed BF insert in zero | | SBFIZ Wd|WZR, Wn|WZR, #l, #w | SBFM Wd|WZR, Wn|WZR, #(-l MOD 32), #(w-1) | 32-bit | | BFI Xd|XZR, Xn|XZR, #l, #w | BFM Xd|XZR, Xn|XZR, #(-l MOD 64), #(w-1) | Bitfield insert | | BFI Wd|WZR, Wn|WZR, #l, #w | BFM Wd|WZR, Wn|WZR, #(-l MOD 32), #(w-1) | 32-bit | | BFXIL Xd|XZR, Xn|XZR, #l, #w | BFM Xd|XZR, Xn|XZR, #l, #(l+w-1) | BF extract and insert low | | BFXIL Wd|WZR, Wn|WZR, #l, #w | BFM Wd|WZR, Wn|WZR, #l, #(l+w-1) | 32-bit | | **Conditional select aliases** | | | | CINC Xd|XZR, Xn|XZR, cond | CSINC Xd|XZR, Xn|XZR, Xn|XZR, inv(cond) | Conditional increment | | CINC Wd|WZR, Wn|WZR, cond | CSINC Wd|WZR, Wn|WZR, Wn|WZR, inv(cond) | 32-bit | | CSET Xd|XZR, cond | CSINC Xd|XZR, XZR, XZR, inv(cond) | Conditional set | | CSET Wd|WZR, cond | CSINC Wd|WZR, WZR, WZR, inv(cond) | 32-bit | | CINV Xd|XZR, Xn|XZR, cond | CSINV Xd|XZR, Xn|XZR, Xn|XZR, inv(cond) | Conditional invert | 
| CINV Wd|WZR, Wn|WZR, cond | CSINV Wd|WZR, Wn|WZR, Wn|WZR, inv(cond) | 32-bit | | CSETM Xd|XZR, cond | CSINV Xd|XZR, XZR, XZR, inv(cond) | Conditional set mask | | CSETM Wd|WZR, cond | CSINV Wd|WZR, WZR, WZR, inv(cond) | 32-bit | | CNEG Xd|XZR, Xn|XZR, cond | CSNEG Xd|XZR, Xn|XZR, Xn|XZR, inv(cond) | Conditional negate | | CNEG Wd|WZR, Wn|WZR, cond | CSNEG Wd|WZR, Wn|WZR, Wn|WZR, inv(cond) | 32-bit | | **System** | | | | `NOP` | `HINT #0` | No operation | | `YIELD` | `HINT #1` | Yield | | `WFE` | `HINT #2` | Wait for event | | `WFI` | `HINT #3` | Wait for interrupt | | `SEV` | `HINT #4` | Send event | | `SEVL` | `HINT #5` | Send event local | | `RET` | `RET X30` | Return (default LR) | | `PACIASP` | `HINT #25` | Sign LR with key A, SP as modifier (same effect as `PACIA X30, SP`, but encoded in the HINT space, so it executes as a NOP without FEAT_PAuth) | | `AUTIASP` | `HINT #29` | Authenticate LR with key A, SP as modifier (same effect as `AUTIA X30, SP`; also a HINT-space encoding) | | **System instruction aliases** | | | | AT S1E1R, Xt|XZR | SYS #0, C7, C8, #0, Xt|XZR | Address translate | | DC ZVA, Xt|XZR | SYS #3, C7, C4, #1, Xt|XZR | Data cache zero | | IC IVAU, Xt|XZR | SYS #3, C7, C5, #1, Xt|XZR | Instruction cache invalidate | | `TLBI ...` | Various `SYS` encodings | TLB invalidate | --- ## 28. AArch32 (ARM/Thumb) Key Differences ### 28.1 Conditional Execution In AArch32 (ARM state), **almost every instruction** can be conditional: ```asm // AArch32: CMP R0, #10 ADDGT R1, R1, #1 // Only executes if R0 > 10 (signed) MOVLE R1, #0 // Only executes if R0 <= 10 (signed) ``` AArch64 **removed** this. You must use `CSEL`/`B.cond`/etc. instead. ### 28.2 S Suffix in AArch32 In AArch32, the S suffix is optional on most data-processing instructions (like AArch64), and it can be combined with a condition code. In unified syntax (UAL) the S comes before the condition; the older pre-UAL syntax wrote it after the condition (`ADDGTS`): ```asm ADDSGT R1, R1, #1 // Conditionally add AND set flags ``` ### 28.3 Register Differences - AArch32: R0–R15, where R13=SP, R14=LR, R15=PC - PC is a general-purpose register! You can do `ADD PC, PC, R0` (computed branch). This doesn't exist in AArch64. - Writing to PC is a branch.
This is why there's no separate `RET` in AArch32 — you just do `BX LR` or `MOV PC, LR`. ### 28.4 Barrel Shifter Everywhere In AArch32, EVERY data-processing instruction's second operand can include a shift: ```asm ADD R0, R1, R2, LSL R3 // R0 = R1 + (R2 << R3) — register-controlled shift ``` AArch64 limits shifts to specific forms per instruction class, and **never** allows register-controlled shifts in the operand position (you need `LSLV` separately). ### 28.5 Thumb / Thumb-2 Thumb is a compressed 16-bit instruction set (subset of ARM). Thumb-2 adds 32-bit instructions to Thumb, making it nearly as capable as ARM state but more code-dense. Modern ARM Cortex-M processors only support Thumb. AArch64 **has no Thumb mode**. It is always 32-bit fixed-width A64 instructions. --- ## 29. Calling Convention (AAPCS64) The **AAPCS64** (Arm Architecture Procedure Call Standard for AArch64) defines how functions pass arguments, return values, and which registers they must preserve. This is essential for understanding compiled code and for writing assembly that interoperates with C. **Why these specific register assignments?** X0-X7 for arguments gives 8 register-passed arguments before spilling to the stack — enough for the vast majority of functions (most have ≤4 arguments). Having the return value in X0 (the same as the first argument) is common across architectures because many functions transform their first argument and return the result. The split between caller-saved (X9-X15: temporaries the callee can freely trash) and callee-saved (X19-X28: preserved across calls) is a balance — too many callee-saved means every small function wastes time saving/restoring; too few means callers waste time saving around every call. X29 as frame pointer enables debuggers and stack unwinders to walk the call stack. X30 as link register holds the return address from `BL`/`BLR`. 
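The register roles described above can be seen together in one small non-leaf function. This is a minimal sketch, not compiler output: `add_scaled` and `scale` are hypothetical names chosen for illustration, and the C signature in the comment is an assumption.

```asm
// Hypothetical C equivalent (names are illustrative only):
//   int64_t add_scaled(int64_t a, int64_t b) { return scale(a) + b; }
        .text
        .global add_scaled
        .type   add_scaled, %function
add_scaled:
        STP     X29, X30, [SP, #-32]!   // Save FP and LR; the BL below overwrites X30
        MOV     X29, SP                 // Establish the frame pointer
        STR     X19, [SP, #16]          // X19 is callee-saved: preserve the caller's value
        MOV     X19, X1                 // Keep b in a callee-saved register across the call
        BL      scale                   // a is already in X0; the result comes back in X0
        ADD     X0, X0, X19             // Return value: scale(a) + b
        LDR     X19, [SP, #16]          // Restore the caller's X19
        LDP     X29, X30, [SP], #32     // Restore FP and LR; release the 32-byte frame
        RET
        .size   add_scaled, .-add_scaled
```

Note that `b` could not simply stay in X1 across the call, because X1 is an argument register that `scale` is free to overwrite. Moving it into X19 costs one save/restore pair here but survives the call. The frame is 32 bytes so that SP stays 16-byte aligned.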
### 29.1 Parameter Passing

| Register | Usage |
|---|---|
| X0–X7 | Arguments and return values |
| X0 | First argument / return value |
| X1 | Second argument / second return value (for 128-bit returns) |
| X8 | Indirect result location (struct return pointer) |
| X9–X15 | Temporary (caller-saved) |
| X16–X17 | Intra-procedure scratch (PLT stubs, caller-saved) |
| X18 | Platform register — **reserved on Windows (TEB), Darwin/iOS, Android (SCS), Fuchsia, VxWorks**; usable as a temporary on Linux and bare-metal when the platform doesn't claim it. Check the platform ABI before using. |
| X19–X28 | Callee-saved |
| X29 | Frame pointer (callee-saved) |
| X30 | Link register (overwritten by BL/BLR — must be saved by callee if it makes calls) |
| SP | Stack pointer (16-byte aligned at public interfaces) |

### 29.2 SIMD/FP Parameter Passing

- V0–V7 (D0–D7 / S0–S7 / Q0–Q7): FP/SIMD arguments and return values
- V8–V15: Callee-saved (only the **lower 64 bits** D8–D15 are callee-saved; upper 64 bits are scratch)
- V16–V31: Temporary (caller-saved)

### 29.3 Stack Frame

The stack grows **downward** in memory — pushing data decreases SP, popping increases it. This is a convention shared with x86 and most other architectures. "Top of stack" means the lowest address (where SP points).

```
High address
┌──────────────────────┐
│ Caller's frame       │
├──────────────────────┤
│ Arguments (if >8)    │ ← Passed on stack
├──────────────────────┤
│ Return address (X30) │ ← Saved by callee
│ Old frame ptr (X29)  │ ← X29 points here
├──────────────────────┤
│ Callee-saved regs    │
├──────────────────────┤
│ Local variables      │
├──────────────────────┤
│ Outgoing args (if >8)│ ← For calls this function makes
└──────────────────────┘ ← SP (must be 16-byte aligned)
Low address
```

The diagram shows one possible layout. AAPCS64 does not fix where the frame record (saved X29/X30) sits within the frame; the prologue below, for instance, keeps it at the lowest address (at SP) with the saved X19–X22 above it.

**Standard prologue/epilogue:**

```asm
my_function:
    // Prologue
    STP X29, X30, [SP, #-64]!   // Save FP, LR; allocate 64 bytes
    MOV X29, SP                 // Set frame pointer
    STP X19, X20, [SP, #16]     // Save callee-saved regs
    STP X21, X22, [SP, #32]
    // ... save more if needed ...

    // Function body...

    // Epilogue
    LDP X21, X22, [SP, #32]
    LDP X19, X20, [SP, #16]
    LDP X29, X30, [SP], #64     // Restore FP, LR; deallocate
    RET
```

**No generic red zone on AArch64**: Unlike x86-64 System V (which has a 128-byte "red zone" below SP that leaf functions can use without adjusting SP), the AAPCS64 does **not** define any red zone. Signal handlers and interrupts can clobber memory below SP at any time. You MUST adjust SP before storing anything on the stack. Platform ABIs layer different policies on top:

- **Apple Darwin** defines a 128-byte red zone that leaf functions may use freely, comparable to the x86-64 System V red zone.
- **Windows** reserves 16 bytes below SP for analysis and dynamic-patching scenarios (profiler-inserted `stp` sequences), which is **not** the same as a general-purpose leaf-function red zone; normal code should not use it as scratch.
- **Linux**, most bare-metal, and most other platforms: no red zone at all.

Check your target platform's ABI document before relying on any space below SP.

**Stack canaries** (stack protector): Compilers insert a random value ("canary") between local variables and the saved frame pointer/return address. Before returning, the function checks if the canary was overwritten — if so, a buffer overflow occurred and the program aborts. In AArch64 assembly, you'll see a load of the guard value at the start of the function and a comparison against it before `RET`. On typical Linux user-space builds the guard is the global `__stack_chk_guard` symbol; some platforms and kernels keep it in thread-local or per-task storage instead.

---

## 30. Common Patterns & Idioms

This section shows how common C/C++ constructs translate to AArch64 assembly. Understanding these patterns is essential for reading compiler output and writing efficient assembly.

### 30.1 If/Else

Compilers translate `if/else` into either a **branching** version (using `B.cond`) or a **branchless** version (using `CSEL`). The branchless version avoids branch misprediction penalties and is preferred for simple value assignments. The branching version is better when the if/else bodies are complex (many instructions).

```asm
// C: if (x > 10) { a = 1; } else { a = 2; }
// X0 = x, result in W1

// Branching version:
    CMP X0, #10
    B.LE else_branch            // If x <= 10, skip to else
    MOV W1, #1                  // a = 1 (if body)
    B end_if
else_branch:
    MOV W1, #2                  // a = 2 (else body)
end_if:

// Branchless version (compiler usually prefers this for simple assignments):
    CMP X0, #10
    MOV W1, #1                  // Prepare "if" value
    MOV W2, #2                  // Prepare "else" value
    CSEL W1, W1, W2, GT         // W1 = (x > 10) ? 1 : 2
```

### 30.2 Loops

Compilers prefer **do-while** style loops (condition at the bottom) because they use one branch per iteration instead of two. A `for` loop is converted to: check if zero iterations needed (branch over), then do-while. `CBZ`/`CBNZ` are commonly used for zero-test loop exits because they combine the comparison and branch into one instruction.

```asm
// C: for (int i = 0; i < n; i++) { sum += array[i]; }
// X0 = array pointer, X1 = n, result in X2
// (the code treats i, n, and the elements as 64-bit values)
    MOV X2, #0                  // sum = 0
    MOV X3, #0                  // i = 0
loop:
    CMP X3, X1                  // i < n?
    B.GE loop_end               // If i >= n, exit loop
    LDR X4, [X0, X3, LSL #3]    // X4 = array[i] (8-byte elements, index scaled by 8)
    ADD X2, X2, X4              // sum += array[i]
    ADD X3, X3, #1              // i++
    B loop                      // Back to top
loop_end:

// While loop: while (x != 0) { x = x >> 1; count++; }
    MOV W1, #0                  // count = 0
while_loop:
    CBZ W0, while_end           // If x == 0, exit (CBZ = Compare and Branch if Zero)
    LSR W0, W0, #1              // x >>= 1
    ADD W1, W1, #1              // count++
    B while_loop
while_end:

// Do-while: more efficient because the branch is at the bottom (one branch per iteration).
// Assumes n > 0 and that X2 already holds the running sum:
    MOV X3, #0
do_loop:
    LDR X4, [X0, X3, LSL #3]
    ADD X2, X2, X4
    ADD X3, X3, #1
    CMP X3, X1
    B.LT do_loop                // Loop while i < n (branch at bottom = 1 branch/iter)
```

### 30.3 Array and Struct Access

Arrays use the shifted/extended register addressing modes — the index is scaled by the element size using `LSL #n`. Structs use immediate offsets from a base pointer — each field has a fixed offset known at compile time. Arrays of structs combine both: compute the struct pointer from the index, then use an immediate offset for the field.
```asm
// Array access: int64_t array[100]; val = array[i];
// X0 = array base, X1 = index i
    LDR X2, [X0, X1, LSL #3]    // X2 = array[i] (each element is 8 bytes, LSL #3 = ×8)

// Struct access:
// struct { int32_t x; int32_t y; int64_t z; } point;
// x at +0, y at +4, z at +8
// X0 = pointer to struct
    LDR W1, [X0]                // W1 = point.x (offset 0)
    LDR W2, [X0, #4]            // W2 = point.y (offset 4)
    LDR X3, [X0, #8]            // X3 = point.z (offset 8)

// Array of structs: points[i].y (struct size = 16 bytes, y at offset 4)
// X0 = array base, W1 = index i
    ADD X2, X0, W1, UXTW #4     // X2 = base + i*16 (UXTW #4 = zero-extend and shift left 4 = ×16)
    LDR W3, [X2, #4]            // W3 = points[i].y
```

### 30.4 Branchless Min/Max

```asm
// min(X0, X1) → X0 (signed)
    CMP X0, X1
    CSEL X0, X0, X1, LE

// max(X0, X1) → X0 (unsigned)
    CMP X0, X1
    CSEL X0, X0, X1, HI
```

### 30.5 Branchless Absolute Value

```asm
// abs(X0) → X0 (signed)
    CMP X0, #0
    CNEG X0, X0, LT
```

### 30.6 Division by Constant (Multiply by Reciprocal)

Compilers do this automatically, but understanding it helps when reading disassembly:

```asm
// X0 = X1 / 10 (unsigned)
// The compiler finds a "magic multiplier" M and shift s such that
// UMULH(n, M) >> s == n / d for all n in range.
// For d=10: M = 0xCCCCCCCCCCCCCCCD, s = 3
    MOV X2, #0xCCCD
    MOVK X2, #0xCCCC, LSL #16
    MOVK X2, #0xCCCC, LSL #32
    MOVK X2, #0xCCCC, LSL #48
    UMULH X0, X1, X2            // high 64 bits of X1 × magic
    LSR X0, X0, #3              // post-shift
```

### 30.7 Swap Two Registers

```asm
// Using EOR (no temp register needed, but three serially dependent instructions):
    EOR X0, X0, X1
    EOR X1, X0, X1
    EOR X0, X0, X1

// Usually better when a scratch register is free: plain moves
    MOV X2, X0
    MOV X0, X1
    MOV X1, X2
```

### 30.8 Test Power of Two

```asm
// Check if X0 is a power of 2 (and not zero):
// Power of 2 means exactly one bit set: X0 != 0 && (X0 & (X0-1)) == 0
    SUB X1, X0, #1              // X1 = X0 - 1
    TST X0, X1                  // X0 & (X0-1): sets Z=1 if zero (candidate)
    CCMP X0, #0, #4, EQ         // If Z=1: compare X0 vs 0 (sets Z=1 if X0==0, Z=0 if X0!=0)
                                // If Z=0: set flags to #4 (Z=1), so B.NE won't fire
    B.NE is_power_of_two        // Taken only if (X0 & (X0-1))==0 AND X0!=0
```

### 30.9 Align Address

```asm
// Align X0 down to 16-byte boundary:
    AND X0, X0, #~0xF           // Clear low 4 bits (bitmask immediate: 0xFFFFFFFFFFFFFFF0)

// Align X0 up to 16-byte boundary:
    ADD X0, X0, #15
    AND X0, X0, #~0xF
```

### 30.10 Position-Independent Hello World (Linux)

```asm
.global _start
.text
_start:
    // write(1, msg, len)
    MOV X8, #64                 // __NR_write
    MOV X0, #1                  // fd = stdout
    ADR X1, msg                 // buffer (PC-relative)
    MOV X2, #14                 // length (the trailing NUL from .asciz is not written)
    SVC #0

    // exit(0)
    MOV X8, #93                 // __NR_exit
    MOV X0, #0                  // status = 0
    SVC #0

.data
msg:
    .asciz "Hello, world!\n"
```

### 30.11 Jump Table (Switch Statement)

Compilers translate large `switch` statements into jump tables — an array of branch offsets indexed by the switch value. This is O(1) instead of a chain of comparisons.
```asm
// switch (X0) { case 0: ...; case 1: ...; case 2: ...; case 3: ...; }
// X0 = switch value (already range-checked to 0-3)
    ADR  X1, jump_table         // X1 = address of the jump table
    LDRH W2, [X1, X0, LSL #1]   // Load 16-bit offset for case X0 (each entry is 2 bytes)
    ADR  X3, case_base          // X3 = base address for offset computation
    ADD  X3, X3, W2, UXTH       // X3 = base + zero-extended offset
    BR   X3                     // Jump to the case handler

    .align 2
jump_table:
    .hword case0 - case_base    // 16-bit offset to case 0 handler
    .hword case1 - case_base    // 16-bit offset to case 1 handler
    .hword case2 - case_base
    .hword case3 - case_base
case_base:
case0:
    // ... handler for case 0 ...
case1:
    // ... handler for case 1 ...
case2:
    // ... handler for case 2 ...
case3:
    // ... handler for case 3 ...
```

**What REALLY happens**: The CPU loads a small offset from a table in memory (indexed by the switch value), adds it to a base address, and does an indirect branch. The `ADR` + table approach generates position-independent code. Compilers may also use `TBB`/`TBH` (AArch32 Thumb-2 only) or the `ADR`+`ADD`+`BR` pattern (AArch64). In disassembly, seeing `BR Xn` after a load (`LDRB`/`LDRH`/`LDR`) from a table-like structure is the tell-tale sign of a switch statement.

### 30.12 Atomic Reference Counting

Reference counting (used in `std::shared_ptr`, Python objects, Linux kernel `kref`) atomically increments/decrements a counter. When it reaches zero, the object is freed.

```asm
// Increment reference count (relaxed ordering is fine — no data dependency):
    LDADD X1, X2, [X0]          // Atomically: X2 = old [X0], [X0] += X1 (X1=1 for refcount++)
// Or without LSE: LDXR/ADD/STXR loop

// Decrement and check for zero (needs release ordering on the decrement,
// acquire ordering before freeing — to ensure all accesses to the object
// are visible before we free it):
    MOV X1, #-1                 // Decrement by 1 (add -1)
    LDADDAL X1, X2, [X0]        // Atomically: old=X2=[X0], [X0]+=-1
                                // Acquire+Release: ensures all prior accesses complete
                                // and the zero-check sees the final count
    CMP X2, #1                  // Was old value 1? (means new value is 0)
    B.EQ free_object            // If refcount hit zero, free the object
```

**Why release on decrement?** The release ordering ensures that all reads/writes to the object's data (done while holding a reference) are visible to whoever ends up freeing the object. Without release, the free path might not see all the modifications made by other threads that already dropped their references.

**Why acquire before free?** The acquire on the final decrement (via LDADDAL) ensures the freeing thread sees all modifications made by all other threads that previously decremented the count.

### 30.13 Byte/Halfword Atomics for Flags

Sometimes you only need a 1-byte or 2-byte atomic (e.g., a boolean flag, a status byte):

```asm
// Set a byte flag with release ordering:
    MOV W1, #1
    STLRB W1, [X0]              // Release-store a single byte

// Read a byte flag with acquire ordering:
    LDARB W1, [X0]              // Acquire-load a single byte
    CBZ W1, not_set

// Atomic byte swap (LSE):
    MOV W1, #1
    SWPAB W1, W2, [X0]          // Atomically swap byte, acquire semantics
                                // W2 = old byte value, [X0] = 1
```

### 30.14 Leaf Function Optimization

A **leaf function** is one that doesn't call any other functions. Since it never executes `BL` (which overwrites X30/LR), it doesn't need to save/restore LR. It also doesn't need to set up a frame pointer if it doesn't use the stack. This makes leaf functions very cheap:

```asm
// Non-leaf function (must save LR because it calls other functions):
my_func:
    STP X29, X30, [SP, #-16]!   // Save FP+LR (one 4-byte instruction plus a 16-byte store)
    MOV X29, SP
    BL other_func               // This overwrites X30
    LDP X29, X30, [SP], #16     // Restore
    RET

// Leaf function (no calls → no save/restore needed):
add_two:
    ADD X0, X0, X1              // Just do the work
    RET                         // X30 still has the return address from our caller
```

**Why this matters**: Most small helper functions (getters, simple math, comparisons) are leaf functions. The compiler skips the prologue/epilogue entirely, making them just 1-2 instructions.
When reading disassembly, a function with no `STP`/`LDP` at the start/end is usually a leaf function (or one that only tail-calls). ### 30.15 Tail Call Optimization When the last thing a function does is call another function and return its result, the compiler can replace `BL target; RET` with just `B target`. The callee then reuses the caller's return address and stack space, which spares the wrapper its own prologue/epilogue as well as one call/return pair. ```asm // Without tail call optimization: wrapper: STP X29, X30, [SP, #-16]! MOV X29, SP // ... setup arguments ... BL real_function // Call (writes the return address to X30) LDP X29, X30, [SP], #16 // Restore RET // Return to our caller // With tail call optimization: wrapper: // ... setup arguments ... B real_function // Jump directly — real_function will RET to OUR caller ``` **Why B instead of BL**: `BL` saves the return address in X30. But if we're about to return anyway, the correct return address is already in X30 (from our caller). `B` preserves X30, so `real_function`'s `RET` returns directly to our caller, skipping us entirely. **How to recognize a tail call in disassembly**: A function that ends with `B <other_function>` (unconditional branch to a different function) instead of `BL` + `RET` is a tail call. The function may restore callee-saved registers first (`LDP X29, X30, ...`), then `B` to the target. If you see a function with no `RET` at the end, look for a `B` — that's the tail call. Conditional tail calls look like `B.cond <other_function>` followed by a fallthrough to a different return path. ### 30.16 Random Number Generation (FEAT_RNG) ARMv8.5-A adds hardware random number generation: ```asm MRS X0, RNDR // X0 = hardware random number (the read also sets NZCV, so check the flags) // If successful: NZCV = 0000 (Z=0). If entropy unavailable: Z=1.
B.EQ retry // If Z=1, entropy pool depleted — retry or fall back MRS X0, RNDRRS // Reseeded random: forces a reseed before generating // Same Z-flag convention as RNDR ``` **Why RNDR exists**: Cryptographic applications need true random numbers for key generation, nonces, and ASLR. Before FEAT_RNG, ARM code had to call into kernel or firmware for randomness. RNDR provides user-space access to hardware entropy without syscall overhead. ### 30.17 Crypto & Dot Product Instructions (Brief) AArch64 has dedicated NEON instructions for common crypto and ML operations. These are optional features — check `ID_AA64ISAR0_EL1` and `ID_AA64ISAR1_EL1` (or the OS's feature flags, such as Linux HWCAPs) for availability: ```asm // AES (FEAT_AES): // AES round-function primitives (FEAT_AES — ARMv8.0-A optional, advertised by HWCAP_AES). // **ISA-level truth** — each instruction performs a specific subset of an AES round, not a full round. // A complete encryption round is AESE followed by AESMC (except the last round which skips MixColumns). AESE V0.16B, V1.16B // AES encrypt: Vd = ShiftRows(SubBytes(Vd XOR Vn)) — AddRoundKey + SubBytes + ShiftRows AESD V0.16B, V1.16B // AES decrypt: Vd = InvShiftRows(InvSubBytes(Vd XOR Vn)) AESMC V0.16B, V1.16B // AES MixColumns: Vd = MixColumns(Vn). Vd and Vn may be the same register. AESIMC V0.16B, V1.16B // AES InvMixColumns: Vd = InvMixColumns(Vn). Vd and Vn may be the same register. // Polynomial multiply (GF(2) carry-less multiply) — used heavily by AES-GCM for the GHASH // authenticator, by folding-based software CRC implementations (e.g. CRC-32 and CRC-32C), and by error-correcting // codes over binary extension fields. // Size 8-bit → 16-bit: baseline NEON (does NOT need FEAT_PMULL). Eight byte × byte multiplies per instruction. // Size 64-bit → 128-bit: FEAT_PMULL required (historically sometimes called "PMULL crypto"; ARM ARM // decodes as UNDEFINED when FEAT_PMULL is absent). // Element sizes 16-bit and 32-bit are UNDEFINED at the ISA level (the pseudocode rejects `size == 01` // and `size == 10` unconditionally).
Vd is always the widening destination. PMULL V0.8H, V1.8B, V2.8B // 8 × (byte × byte) → 8 × halfword result (baseline NEON) PMULL2 V0.8H, V1.16B, V2.16B // Upper-half variant: reads upper 8 bytes of V1/V2 PMULL V0.1Q, V1.1D, V2.1D // 1 × (64-bit poly × 64-bit poly) → 128-bit result (FEAT_PMULL) PMULL2 V0.1Q, V1.2D, V2.2D // Upper-half: reads lane [1] of V1/V2 as the 64-bit inputs // The .1Q arrangement on a SIMD register means "one 128-bit quadword lane" — the full Q-register. // AES-GCM's GHASH is computed by a long chain of PMULL + EOR reducing back into GF(2^128), so the // 64-bit → 128-bit form is the critical inner-loop instruction for authenticated encryption. // SHA (FEAT_SHA256): SHA256H Q0, Q1, V2.4S // SHA-256 hash update (part 1) SHA256H2 Q0, Q1, V2.4S // SHA-256 hash update (part 2) SHA256SU0 V0.4S, V1.4S // SHA-256 schedule update 0 SHA256SU1 V0.4S, V1.4S, V2.4S // SHA-256 schedule update 1 // SHA-1 (FEAT_SHA1 — sibling of SHA256; implementations typically ship both together): SHA1C Q0, S1, V2.4S // SHA-1 hash update: choose (rounds 0..19) SHA1P Q0, S1, V2.4S // SHA-1 hash update: parity (rounds 20..39 and 60..79) SHA1M Q0, S1, V2.4S // SHA-1 hash update: majority (rounds 40..59) SHA1H Sd, Sn // SHA-1 fixed rotate (left rotate of Sn by 30) SHA1SU0 V0.4S, V1.4S, V2.4S // SHA-1 schedule update 0 SHA1SU1 V0.4S, V1.4S // SHA-1 schedule update 1 // Dot product (FEAT_DotProd — optional from Armv8.1-A, mandatory from Armv8.4-A): // **ISA-level truth** — both 64-bit (.2S / .8B) and 128-bit (.4S / .16B) forms exist. 
UDOT V0.2S, V1.8B, V2.8B // 64-bit vector: 2 dot products of 4 bytes each UDOT V0.4S, V1.16B, V2.16B // 128-bit vector: 4 dot products of 4 bytes each SDOT V0.2S, V1.8B, V2.8B // Signed — same arrangement set SDOT V0.4S, V1.16B, V2.16B UDOT V0.2S, V1.8B, V2.4B[i] // By-element: i ∈ 0..3 selects one of four 4-byte groups from V2 UDOT V0.4S, V1.16B, V2.4B[i] SDOT V0.2S, V1.8B, V2.4B[i] SDOT V0.4S, V1.16B, V2.4B[i] // Each 32-bit destination lane accumulates 4 byte-multiplies. For .4S with .16B, that's // 16 multiply-accumulates per instruction — critical for ML inference. // Mixed-sign dot product (FEAT_I8MM — ARMv8.6-A): signed × unsigned 8-bit operand mixing. // Useful when one input is an activation (signed) and the other is a weight or quantized zero-point (unsigned). USDOT V0.2S, V1.8B, V2.8B // V1 unsigned bytes, V2 signed bytes (64-bit) USDOT V0.4S, V1.16B, V2.16B // 128-bit SUDOT V0.2S, V1.8B, V2.4B[i] // By-element ONLY (no 3-vector form): V1 signed, V2 unsigned SUDOT V0.4S, V1.16B, V2.4B[i] USDOT V0.2S, V1.8B, V2.4B[i] // By-element variant of USDOT USDOT V0.4S, V1.16B, V2.4B[i] // Integer matrix multiply (FEAT_I8MM — ARMv8.6-A): 8×8-bit → 32-bit accumulate, 2×2 matrix of dots. // Each instruction performs a 2×8 × 8×2 matrix multiply-accumulate into a 2×2 int32 tile. // Only .4S destination with .16B operands — this is the ONLY form (no 64-bit variants). SMMLA V0.4S, V1.16B, V2.16B // Signed 8-bit matrix multiply-accumulate UMMLA V0.4S, V1.16B, V2.16B // Unsigned USMMLA V0.4S, V1.16B, V2.16B // Mixed: V1 unsigned × V2 signed // BFloat16 (FEAT_BF16 — ARMv8.6-A): 16-bit "brain float" format (1 sign + 8 exponent + 7 mantissa), // same exponent range as IEEE single but half the bits. The de-facto ML training format. 
BFCVT Hd, Sn // Scalar: convert single-precision Sn to BFloat16 in Hd (with IEEE round-to-nearest) BFCVTN V0.4H, V1.4S // Narrow 4× single → 4× BFloat16 (writes lower 64 bits of V0) BFCVTN2 V0.8H, V1.4S // Narrow 4× single → upper 64 bits of V0 (pair with BFCVTN) BFDOT V0.2S, V1.4H, V2.4H // 64-bit BFloat16 dot: each .2S lane = sum of 2 BFloat16 multiplies BFDOT V0.4S, V1.8H, V2.8H // 128-bit BFloat16 dot product BFDOT V0.2S, V1.4H, V2.2H[i] // By-element BFDOT: i ∈ 0..3 selects one pair of BFloat16 values; Vm may be any of V0..V31 (the index addresses 32-bit pairs, so the V0..V15 rule for .H elements does not apply) BFDOT V0.4S, V1.8H, V2.2H[i] BFMLALB V0.4S, V1.8H, V2.8H // BFloat16 multiply-add Long to single — uses EVEN lanes of V1/V2 BFMLALT V0.4S, V1.8H, V2.8H // Same but uses ODD lanes (the "top" half of each pair) BFMLALB V0.4S, V1.8H, V2.H[i] // By-element (bottom lane); Vm V0..V15 restriction BFMLALT V0.4S, V1.8H, V2.H[i] // By-element (top lane) BFMMLA V0.4S, V1.8H, V2.8H // BFloat16 matrix multiply-accumulate (2×4 × 4×2 → 2×2 single) // FP16 widening multiply-add to single (FEAT_FHM — optional from Armv8.1-A, mandatory from Armv8.4-A). // Takes FP16 inputs, accumulates into FP32 — a fast path for FP16 training/inference. // **ISA-level truth** — both 64-bit (.2S / .2H) and 128-bit (.4S / .4H) forms exist. // FMLAL uses the BOTTOM (lower, contiguous) half of the source halfwords; FMLAL2 uses the TOP (upper) half. This is a low/high split, not the even/odd lane split used by BFMLALB/BFMLALT. FMLAL V0.2S, V1.2H, V2.2H // 64-bit: V0[i] += convert_to_f32(V1[i]) * convert_to_f32(V2[i]) for i=0,1 (bottom) FMLAL V0.4S, V1.4H, V2.4H // 128-bit: bottom half, i=0..3 FMLAL2 V0.2S, V1.2H, V2.2H // 64-bit: uses TOP half (halfword indices 2,3 of the 64-bit source) FMLAL2 V0.4S, V1.4H, V2.4H // 128-bit: uses top half (indices 4..7) FMLSL V0.2S, V1.2H, V2.2H // Subtract variants — same arrangement set FMLSL V0.4S, V1.4H, V2.4H FMLSL2 V0.2S, V1.2H, V2.2H FMLSL2 V0.4S, V1.4H, V2.4H // By-element: FMLAL V0.4S, V1.4H, V2.H[i] — Vm restricted to V0..V15 (same .H by-element rule).
FMLAL V0.2S, V1.2H, V2.H[i] FMLAL V0.4S, V1.4H, V2.H[i] FMLAL2 V0.2S, V1.2H, V2.H[i] FMLAL2 V0.4S, V1.4H, V2.H[i] FMLSL V0.2S, V1.2H, V2.H[i] FMLSL V0.4S, V1.4H, V2.H[i] FMLSL2 V0.2S, V1.2H, V2.H[i] FMLSL2 V0.4S, V1.4H, V2.H[i] ``` **Why hardware AES/SHA**: Software AES takes ~10 cycles/byte. Hardware `AESE`+`AESMC` does a full round in 2 instructions. For servers doing HTTPS, this is the difference between CPU-bound and I/O-bound TLS. **Why dot product / matrix multiply / BFloat16**: Neural network inference and training are dominated by matrix multiply, which decomposes into dot products. `UDOT`/`SDOT` process 16 byte-multiplies per instruction; `UMMLA`/`SMMLA` do a full 2×2 INT8 tile MMA per instruction; `BFMMLA` does the same in BFloat16 for 4× the arithmetic density of FP32. These give 4–16× speedup over scalar code for ML workloads. **SHA-3 and related bitwise ops (FEAT_SHA3 — optional from Armv8.1-A):** ```asm // Three-operand XOR — saves an instruction compared to EOR+EOR. EOR3 V0.16B, V1.16B, V2.16B, V3.16B // V0 = V1 XOR V2 XOR V3 // Bit-clear-and-XOR fused op (the Keccak chi step). BCAX V0.16B, V1.16B, V2.16B, V3.16B // V0 = V1 XOR (V2 AND NOT V3) // Rotate-and-XOR by 1 bit, 64-bit lanes — used in Keccak/SHA-3 theta step. RAX1 V0.2D, V1.2D, V2.2D // V0[i] = V1[i] XOR ROL(V2[i], 1) (i ∈ 0..1, 64-bit rotate) // XOR-then-rotate-right by immediate — fuses the Keccak theta XOR with the rho rotation. XAR V0.2D, V1.2D, V2.2D, #imm6 // V0[i] = ROR(V1[i] XOR V2[i], imm) (imm ∈ 0..63) // Although named "SHA3", EOR3 and BCAX are useful outside SHA-3 — three-input bitwise ops appear in // ChaCha20, BLAKE2/3, LEA, Gimli, and general bit-manipulation code. ``` **SHA-512 (FEAT_SHA512 — optional from Armv8.1-A):** ```asm // SHA-512 hash update — works on 128-bit Q-register pairs (2× 64-bit halves). // **ISA-level truth** — SHA512H/SHA512H2 take Qd (accumulator), Qn (state), Vm.2D (schedule). // Vd and Vn are always 128-bit Q-registers here; .2D is the source arrangement.
SHA512H Qd, Qn, Vm.2D // SHA-512 hash update part 1 (Maj/Ch round function) SHA512H2 Qd, Qn, Vm.2D // SHA-512 hash update part 2 (sigma mixing) SHA512SU0 Vd.2D, Vn.2D // Schedule update part 0 — sigma_0 SHA512SU1 Vd.2D, Vn.2D, Vm.2D // Schedule update part 1 — sigma_1, with prev schedule word ``` **SM3 / SM4 (Chinese national cipher standards — FEAT_SM3, FEAT_SM4, both optional from Armv8.1-A):** ```asm // SM3 — 256-bit hash function (GM/T 0004-2012). All 128-bit state ops on Vd.4S with Qd accumulator // for the main hash compression. SM3SS1 Vd.4S, Vn.4S, Vm.4S, Va.4S // SM3 SS1 round-intermediate computation SM3TT1A Vd.4S, Vn.4S, Vm.S[i] // SM3 round 0..15, function A (i ∈ 0..3) SM3TT1B Vd.4S, Vn.4S, Vm.S[i] // SM3 round 16..63, function B SM3TT2A Vd.4S, Vn.4S, Vm.S[i] // SM3 round 0..15, P_0 permutation SM3TT2B Vd.4S, Vn.4S, Vm.S[i] // SM3 round 16..63 SM3PARTW1 Vd.4S, Vn.4S, Vm.4S // SM3 schedule update W1 SM3PARTW2 Vd.4S, Vn.4S, Vm.4S // SM3 schedule update W2 // SM4 — 128-bit block cipher (GM/T 0002-2012). Operates on .4S state. SM4E Vd.4S, Vn.4S // SM4 encryption round (4 rounds per instruction) SM4EKEY Vd.4S, Vn.4S, Vm.4S // SM4 key-schedule round (also 4 rounds per instruction) // Together, one SM4 full encryption = 8× SM4E instructions (32 rounds / 4 per instruction). ``` **Complex number arithmetic (FEAT_FCMA — mandatory from ARMv8.3-A):** Each complex number is stored as an even/odd lane pair: `(real, imaginary)` at lanes `(2i, 2i+1)`. So a `.4S` vector holds 2 complex numbers; a `.2D` vector holds 1. ```asm // FCADD — complex add with Vm rotated by 90° or 270° before the add. // Per lane pair (re,im) of Vm: #90 → (-im, re); #270 → (im, -re). // Then: Vd = Vn + rot(Vm). Valid rotations: #90 and #270 only. // **ISA-level truth** — arrangements: {.4H, .8H (FEAT_FP16), .2S, .4S, .2D}. 
Because a complex // number occupies 2 lanes, the arrangement specifies TWICE the number of complex values: // .4H → 2 complex halfs, .8H → 4 complex halfs, .2S → 1 complex single, .4S → 2 complex singles, // .2D → 1 complex double. FCADD V0.4H, V1.4H, V2.4H, #90 // FEAT_FP16 — 2 complex halfs FCADD V0.8H, V1.8H, V2.8H, #90 // FEAT_FP16 — 4 complex halfs FCADD V0.2S, V1.2S, V2.2S, #90 // 1 complex single (64-bit vector) FCADD V0.4S, V1.4S, V2.4S, #90 // 2 complex singles FCADD V0.2D, V1.2D, V2.2D, #90 // 1 complex double FCADD V0.4H, V1.4H, V2.4H, #270 // Same arrangement set with rotation #270 FCADD V0.8H, V1.8H, V2.8H, #270 FCADD V0.2S, V1.2S, V2.2S, #270 FCADD V0.4S, V1.4S, V2.4S, #270 FCADD V0.2D, V1.2D, V2.2D, #270 // FCMLA — complex multiply-accumulate with Vm rotated by #rot. // For rot ∈ {0, 180}: multiply by Vn's REAL part (duplicated into both lanes of the pair). // For rot ∈ {90, 270}: multiply by Vn's IMAGINARY part (duplicated into both lanes). // Vm is rotated by #rot degrees counterclockwise in the complex plane, then FMA'd into Vd. // Same arrangement set as FCADD: {.4H, .8H (FEAT_FP16), .2S, .4S, .2D}. Rot ∈ {0, 90, 180, 270}. FCMLA V0.4H, V1.4H, V2.4H, #0 // FEAT_FP16 — Vd += Vn.re-dup * Vm FCMLA V0.8H, V1.8H, V2.8H, #0 // FEAT_FP16 FCMLA V0.2S, V1.2S, V2.2S, #0 FCMLA V0.4S, V1.4S, V2.4S, #0 // Vd += Vn.re-duplicated * Vm (Vm unchanged) FCMLA V0.2D, V1.2D, V2.2D, #0 FCMLA V0.4H, V1.4H, V2.4H, #90 // Same arrangement set with rotation #90 (plus #180, #270) FCMLA V0.8H, V1.8H, V2.8H, #90 FCMLA V0.2S, V1.2S, V2.2S, #90 FCMLA V0.4S, V1.4S, V2.4S, #90 // Vd += Vn.im-duplicated * rot(Vm, 90°) (Vm → (-im, re)) FCMLA V0.2D, V1.2D, V2.2D, #90 FCMLA V0.4S, V1.4S, V2.4S, #180 // Vd += Vn.re-duplicated * -Vm (Vm negated) FCMLA V0.4S, V1.4S, V2.4S, #270 // Vd += Vn.im-duplicated * rot(Vm, 270°) (Vm → (im, -re)) // A full complex MAC (Vd += Vn * Vm) is TWO FCMLAs: one with #0, then one with #90 — // together they compute (a+bi)(c+di) = (ac-bd) + (ad+bc)i. 
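// Worked sketch of that pairing (register choice is illustrative, not taken from a real codebase).
// Assume each lane pair of V1 holds (a, b), each lane pair of V2 holds (c, d), and V0 holds running sums:
FCMLA V0.4S, V1.4S, V2.4S, #0     // re += a*c        im += a*d
FCMLA V0.4S, V1.4S, V2.4S, #90    // re -= b*d        im += b*c
// Net effect per pair: re += ac - bd, im += ad + bc, i.e. V0 += V1 * V2.
// For a plain product rather than an accumulate, zero V0 first (e.g. MOVI V0.16B, #0).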
// FCMLA by-element — pick one complex value from Vm: index i selects the lane pair (2i, 2i+1). By-element arrangements are restricted: // .4H with Vm.H[i] (i ∈ 0..1 — pairs are at (0,1) and (2,3)), .8H with Vm.H[i] (i ∈ 0..3), // .4S with Vm.S[i] (i ∈ 0..1 — pairs are at (0,1) and (2,3)). No .2S or .2D by-element (only one pair, nothing to select). // Vm ∈ V0..V31 for both .H and .S: the index addresses complex pairs, so the register number keeps its full 5-bit M:Rm field (unlike the FMLA .H by-element rule). FCMLA V0.4H, V1.4H, V2.H[i], #0 // FEAT_FP16; i ∈ 0..1 FCMLA V0.8H, V1.8H, V2.H[i], #90 // FEAT_FP16; i ∈ 0..3 FCMLA V0.4S, V1.4S, V2.S[i], #180 // i ∈ 0..1 FCMLA V0.4S, V1.4S, V2.S[0], #90 // Concrete example — pick first complex from V2 ``` **Why FCMA**: one complex multiply used to take 6–8 regular NEON instructions (separate real/imag handling, sign flips, combines). A complex-MAC via two FCMLAs is the hot inner loop of any FFT, radar/sonar DSP pipeline, or code operating on arrays of C99 `complex float`/`complex double`. LLVM and GCC both auto-vectorize complex-arithmetic loops into FCMLA pairs when `-march` includes `+fcma` (or any `armv8.3-a` or later target). **CRC32 (FEAT_CRC32 — mandatory from ARMv8.1):** ```asm // Syntax (generic forms): CRC32B Wd|WZR, Wn|WZR, Wm|WZR // Wd = CRC32(Wn, byte-data(Wm)) poly 0x04C11DB7 (Ethernet/zlib) CRC32H Wd|WZR, Wn|WZR, Wm|WZR // Wd = CRC32(Wn, halfword-data(Wm)) CRC32W Wd|WZR, Wn|WZR, Wm|WZR // Wd = CRC32(Wn, word-data(Wm)) CRC32X Wd|WZR, Wn|WZR, Xm|XZR // Wd = CRC32(Wn, doubleword-data(Xm)) NOTE: Xm is 64-bit!
CRC32CB Wd|WZR, Wn|WZR, Wm|WZR // CRC-32C (Castagnoli) variant — poly 0x1EDC6F41 (iSCSI, ext4, btrfs) CRC32CH Wd|WZR, Wn|WZR, Wm|WZR CRC32CW Wd|WZR, Wn|WZR, Wm|WZR CRC32CX Wd|WZR, Wn|WZR, Xm|XZR // Concrete usage examples: CRC32B W0, W0, W1 // Update running CRC (W0) with 1 byte from W1 CRC32W W0, W0, W1 // Update with 4 bytes CRC32X W0, W0, X1 // Update with 8 bytes (note the X-register data input) CRC32CX W0, W0, X1 // Castagnoli variant with 8-byte data ``` **ISA-level truth**: CRC32 and CRC32C are two distinct polynomials, not aliases of each other. CRC32 uses the IEEE 802.3 (Ethernet / zlib / PNG / gzip) polynomial 0x04C11DB7; CRC32C uses the Castagnoli polynomial 0x1EDC6F41, which has better error-detection for short burst errors and is used by iSCSI, SCTP, ext4 metadata, btrfs, and Google Protobuf. Both variants work on bit-reflected data (reflected input, reflected output), the same convention the common software implementations use, so no bit-reversal is needed. The instructions themselves do **not** apply the initial or final XOR with all-ones: to match `crc32`/`crc32c` from zlib/Boost/Linux crypto API, invert the running CRC before the first instruction (start from `0xFFFFFFFF` for a fresh checksum) and invert the result after the last one. Each instruction replaces ~20 instructions of table-lookup CRC computation per data chunk. --- ## 31. Pointer Authentication (PAC) PAC (ARMv8.3-A) protects against **Return-Oriented Programming (ROP)** and **Jump-Oriented Programming (JOP)** attacks by cryptographically signing pointers. The idea: before using a pointer (like a return address), the CPU verifies a cryptographic signature embedded in the pointer's high bits. If an attacker overwrites the pointer, the signature won't match, and the CPU faults. **Why PAC exists**: Stack buffer overflows let attackers overwrite return addresses. Without PAC, the CPU blindly follows the corrupted return address. With PAC, the corrupted address has a wrong signature and the authentication instruction faults before the branch. ### 31.1 How PAC Works ARM64 pointers use a subset of their 64 bits for the actual virtual address.
The remaining high bits are available for the PAC signature. The exact number of PAC bits is **not fixed** — it depends on the configured virtual address size and on whether Top Byte Ignore (TBI) is enabled (which MTE requires). The Linux kernel documents the exact formula: **PAC width = 55 − VA_size** (when TBI is enabled). So with the typical 48-bit VA and TBI on, the PAC is **7 bits** (bits [54:48], with bit 55 reserved as the address-space selector). With TBI disabled you also get the upper byte [63:56], giving **15 bits** total. Smaller VA sizes yield more PAC bits — e.g., a 39-bit VA gives 16 PAC bits with TBI on, or 24 with TBI off. MTE keeps its tag in bits [59:56] and only checks tags on accesses where TBI is enabled, so when MTE is in use the top byte is not available to PAC at all and the TBI-on widths apply. PAC stores a cryptographic hash (the **Pointer Authentication Code**) in whatever bits remain. ```asm // Sign a pointer (add PAC): PACIA Xd|XZR, Xn|SP // Sign Xd (e.g. return address) using key A and Xn|SP as context // The PAC is computed from: the pointer, the context (SP), and a secret key PACIB Xd|XZR, Xn|SP // Same with key B PACDA Xd|XZR, Xn|SP // Sign data pointer with key A PACDB Xd|XZR, Xn|SP // Sign data pointer with key B // Zero-modifier forms (general Xd, context = 0) — distinct encodings from the HINT-space // X30 variants below (the encoding's Z bit, not Rn, selects the no-modifier variant). // Useful when the context is semantically zero for an arbitrary Xd. PACIZA Xd|XZR // Sign Xd with key A, zero context (PACIA-with-zero-modifier; single-operand) PACIZB Xd|XZR // Same with key B PACDZA Xd|XZR // Sign data ptr with key A, zero context PACDZB Xd|XZR // Same with key B // HINT-space aliases — these use NOP-encoding so they execute as NOP on pre-8.3 CPUs.
// All operate implicitly on X30 (or X16/X17 for the 1716 forms): PACIA1716 // Sign X17 using key A with X16 as context PACIB1716 // Sign X17 using key B with X16 as context PACIASP // Alias for PACIA X30, SP (sign LR with SP modifier — function prologue) PACIBSP // Alias for PACIB X30, SP PACIAZ // HINT-space; functionally equivalent to PACIZA X30 (sign LR with zero modifier). // NOT "PACIA X30, XZR" — in PACIA's encoding Rn=31 means SP, not XZR; the only way // to get the "LR + zero modifier" semantic is this HINT encoding or PACIZA X30. PACIBZ // HINT-space; functionally equivalent to PACIZB X30 // Authenticate (verify + strip PAC): AUTIA Xd|XZR, Xn|SP // Verify Xd's PAC against key A and Xn|SP; if valid, strip the PAC // If invalid: the upper bits are corrupted, causing a fault on use AUTIB Xd|XZR, Xn|SP // Same with key B AUTDA Xd|XZR, Xn|SP // Authenticate data pointer with key A AUTDB Xd|XZR, Xn|SP // Authenticate data pointer with key B // Zero-modifier authenticate forms: AUTIZA Xd|XZR // Authenticate Xd with key A, zero context AUTIZB Xd|XZR // Same with key B AUTDZA Xd|XZR // Authenticate data ptr with key A, zero context AUTDZB Xd|XZR // Same with key B // Generic PAC (PACGA) — uses the 5th key (APGAKey). Different from all other PAC* // instructions: computes a keyed 32-bit MAC over arbitrary 64-bit data (NOT a pointer). // Result: Xd[63:32] = 32-bit PAC, Xd[31:0] = 0 (lower half is zeroed). 
// Used for signing data structures, chained over a buffer for a rolling MAC: PACGA Xd|XZR, Xn|XZR, Xm|SP // Xd = PACGA(Xn, Xm) — 32-bit MAC in upper half of Xd // HINT-space authenticate aliases: AUTIA1716 // Authenticate X17 with key A + X16 context AUTIB1716 // Authenticate X17 with key B + X16 context AUTIASP // Alias for AUTIA X30, SP (authenticate LR with SP modifier — function epilogue) AUTIBSP // Alias for AUTIB X30, SP AUTIAZ // HINT-space; functionally equivalent to AUTIZA X30 (authenticate LR with zero modifier) AUTIBZ // HINT-space; functionally equivalent to AUTIZB X30 // Strip PAC without authenticating (base FEAT_PAuth — ARMv8.3-A): XPACI Xd|XZR // Strip PAC from instruction address (Xd is modified in-place) XPACD Xd|XZR // Strip PAC from data address XPACLRI // HINT alias — strip PAC from X30 (LR); NOP on pre-8.3 CPUs // Combined branch instructions: RETAA // Authenticate LR with key A + SP, then RET (AUTIA + RET) RETAB // Same with key B BRAA Xn|XZR, Xm|SP // Authenticate Xn with key A + Xm as context, then branch BRAB Xn|XZR, Xm|SP // Same with key B BLRAA Xn|XZR, Xm|SP // Authenticate + branch with link (key A) BLRAB Xn|XZR, Xm|SP // Same with key B BRAAZ Xn|XZR // Authenticate Xn with key A + zero context, then branch BRABZ Xn|XZR // Same with key B BLRAAZ Xn|XZR // Authenticate + branch with link, zero context (key A) BLRABZ Xn|XZR // Same with key B // Authenticated load (useful for protected vtable dispatch): LDRAA Xt|XZR, [Xn|SP{, #simm10}] // Auth-load (no writeback): authenticate [Xn] with key A + zero // context, then load from the authenticated address. simm10 is // a 10-bit signed multiple of 8 (−4096 to +4088 bytes). LDRAA Xt|XZR, [Xn|SP, #simm10]! // Same with pre-index writeback (Xn updated to Xn + simm10). LDRAB Xt|XZR, [Xn|SP{, #simm10}] // Same with key B (no writeback) LDRAB Xt|XZR, [Xn|SP, #simm10]! // Same with key B (pre-index writeback) // No post-index or register-offset forms exist. 
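// Illustrative sketch of authenticated dispatch (the table layout and register roles are
// assumptions for this example, not a real platform ABI):
//   X0 = pointer to a function table, signed with data key A and a zero modifier
//   slot at offset 8 = code pointer, signed with instruction key A and a zero modifier
LDRAA  X9, [X0, #8]               // Authenticate X0 (data key A, zero modifier), then load the slot at +8
BLRAAZ X9                         // Authenticate X9 (instruction key A, zero modifier), then call it
// If either pointer was tampered with, authentication fails and the load or the call faults.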
// Exception return with authentication (EL1+ only, for kernel use): ERETAA // Authenticate ELR_ELx with key A + current SP as modifier, then ERET ERETAB // Same with key B // Per ARM ARM: target = AuthIA(ELR_ELx, SP[], TRUE) — SP[] is the current // exception level's stack pointer per PSTATE.SPSel, not a specific SP_ELx. ``` **PAC keys** (five total, per the ARMv8.3-A Pointer Authentication specification): `APIAKey` and `APIBKey` for instruction-address signing (PACIA/AUTIA, PACIB/AUTIB), `APDAKey` and `APDBKey` for data-address signing (PACDA/AUTDA, PACDB/AUTDB), and `APGAKey` — the generic key used **only** by `PACGA` for computing a 32-bit MAC over arbitrary data. All five keys are managed by the kernel via system registers (`APIAKeyLo_EL1`/`APIAKeyHi_EL1`, etc.) and never exposed to user code. Linux assigns fresh random keys per process at `exec()`; they survive `fork()` unchanged. Presence is advertised via `HWCAP_PACA` (address authentication — IA/IB/DA/DB) and `HWCAP_PACG` (generic — GA), which are independent: a CPU can implement one without the other. ### 31.2 PAC in Practice Compilers emit PAC instructions in function prologues/epilogues: ```asm my_func: PACIASP // Sign X30 with key A, using SP as context STP X29, X30, [SP, #-16]! // Save signed LR MOV X29, SP // ... function body ... LDP X29, X30, [SP], #16 AUTIASP // Authenticate X30 — faults if tampered RET ``` `PACIASP` is an alias for `PACIA X30, SP`. On CPUs without PAC, these instructions execute as NOPs (they're HINT encodings), so PAC-enabled binaries run safely on older hardware — they just lack the protection. --- ## 32. Branch Target Identification (BTI) BTI (ARMv8.5-A) prevents **Jump-Oriented Programming (JOP)** by restricting which instructions can be the target of an indirect branch (`BR`, `BLR`). When BTI is enabled for a memory page (via page table attributes), an indirect branch that lands on an instruction that is not a valid landing pad causes a fault. 
Valid landing pads include `BTI` instructions and certain PAC instructions (`PACIASP`, `PACIBSP`) that have implicit BTI behavior — this is important because function prologues typically start with PACIASP, making them valid indirect-branch targets on guarded pages without needing a separate BTI instruction. Additionally, indirect branches via `X16` or `X17` (used by PLT stubs and veneers) have relaxed BTI requirements, as these registers are treated as intra-procedure-call scratch. **Why BTI exists**: Even with PAC protecting return addresses, an attacker might redirect an indirect call (function pointer, virtual method) to the middle of a function — skipping the prologue, landing on a "gadget" that does something useful to the attacker. BTI ensures indirect branches can only land at explicitly marked entry points. **How BTI actually works** — PSTATE.BTYPE (2 bits, PSTATE[11:10]) encodes which kind of indirect branch was just executed; the target instruction either accepts that BTYPE or triggers a Branch Target Exception. Per ARM ARM: | PSTATE.BTYPE | Set by | In a guarded page, next instruction must be | |---|---|---| | `00` | Direct branches, fall-through, RET, and non-branch instructions | *(no check — anything is fine)* | | `01` | `BR X16` or `BR X17` (PLT stubs, intra-procedure veneers) | `BTI c`, `BTI j`, `BTI jc`, or `PACIxSP` | | `10` | `BLR Xn` (indirect call) | `BTI c`, `BTI jc`, or `PACIxSP` | | `11` | `BR Xn` where n ∉ {16, 17} (indirect jump) | `BTI j` or `BTI jc` | Each BTI variant has its own 2-bit `targets` field in the encoding, and the ARM ARM predicate `BTypeCompatible_BTI(targets)` determines which BTYPE values it accepts. The four forms: ```asm BTI c // targets=01. Accepts PSTATE.BTYPE ∈ {01, 10} (BR X16/X17, or BLR) BTI j // targets=10. Accepts PSTATE.BTYPE ∈ {01, 11} (BR X16/X17, or BR Xn non-X16/X17) BTI jc // targets=11. Accepts PSTATE.BTYPE ∈ {01, 10, 11} — the most permissive form BTI // targets=00. 
A distinct encoding from the three above, with its OWN compatibility rule: // BTypeCompatible_BTI('00') is FALSE for every non-zero PSTATE.BTYPE. // Practical consequence in a guarded page: no indirect branch (BR or BLR of any kind) // may land on bare BTI without raising a Branch Target Exception. // This is not a more-permissive or more-restrictive landing pad — it is best thought of // as "this location is explicitly declared NOT to be an indirect branch target." // Use BTI c / BTI j / BTI jc for actual landing pads; use bare BTI only if that exact // never-a-target semantics is what you want. ``` The sets above list the PSTATE.BTYPE values that *require* a matching landing pad. PSTATE.BTYPE=`00` (meaning "the most recent branch was not one that arms the BTI check — e.g. a direct B/BL, or a RET") is always compatible with any instruction and triggers no check at all; it is implicitly accepted everywhere and omitted from the sets above because the BTYPE check only fires when BTYPE ≠ `00`. Functions that can be called via function pointers need `BTI c` or `BTI jc` at their entry. These are HINT instructions — on older CPUs without BTI, they execute as NOPs. **What BTI does NOT check** — `RET` is exempt: BTI restricts `BR`/`BLR` (indirect branches via register), but `RET` is **not** under BTI control. Per LLVM's BTI codegen pass: "RET instructions are not restricted by branch target identification, the reason for this is that return addresses can be protected more effectively using return address signing." That protection comes from **PAC** — specifically the `PACIASP`/`AUTIASP` pair (sign on function entry, authenticate on return). The SiPearl CFI white paper gives the design rationale: requiring a BTI landing pad after every call would inflate code size and create too many valid gadgets, undermining BTI's whole point. 
So the division of labor is clean: **PAC protects return targets, BTI protects call/jump targets.** Direct branches (`B`, `BL` with a PC-relative offset) are also not BTI-checked — an attacker can't alter them because code pages are read-only. If you build with `-mbranch-protection=standard`, you get both BTI and PAC-ret, and this division is how the full scheme works. --- ## 33. Scalable Vector Extension (SVE / SVE2) SVE (FEAT_SVE, optional from ARMv8.2-A; **still architecturally OPTIONAL in ARMv9.0-A**) is ARM's answer to future-proof SIMD. Unlike NEON's fixed 128-bit vectors, SVE supports **variable-length vectors** from 128 to 2048 bits (in 128-bit increments). Code written for SVE works on any SVE implementation without recompilation — the hardware determines the vector length at runtime. SVE2 (FEAT_SVE2, optional from ARMv9.0-A; requires FEAT_SVE) extends SVE with more operations (intended to make it a full NEON replacement). **Scope caveat on "Armv9 mandatory" claims**: ARM's own architecture reference (DDI 0487/0608) defines both FEAT_SVE and FEAT_SVE2 as **OPTIONAL** in Armv9.0-A. The ARM ARM contains a deployment *note* that "all Armv9-A systems supporting standard operating systems with rich application environments also provide SVE2" — but that is a market/deployment expectation, not an architectural requirement. A conforming Armv9-A implementation may omit SVE/SVE2 — and some do. Apple M4 (2024) and M5 (2025) are ARMv9.2-A implementations that explicitly omit SVE and SVE2; Apple's own LLVM source comment reads *"Technically apple-m4 is ARMv9.2a, but a quirk of LLVM defines v9.0 as requiring SVE, which is optional according to the Arm ARM and not supported by the core."* So Apple silicon from M4 onward is a concrete counterexample to "all shipping Armv9 cores have SVE2." 
Treat SVE2 as "common but not universal on Armv9, still optional architecturally" — always runtime-check `ID_AA64ZFR0_EL1` or the corresponding HWCAP before dispatching to an SVE2 code path. **Why SVE exists**: NEON vectors are fixed at 128 bits. If ARM makes a chip with 512-bit data paths, NEON can't use them — you'd need new instructions and recompilation. SVE's variable-length model means the same binary automatically uses wider vectors on more capable hardware. ### 33.1 Key SVE Concepts **Vector Length (VL)**: The number of bits in each Z register. Hardware-defined, read via `RDVL` (Read Vector Length). Always a multiple of 128. Your code must NOT assume a specific VL — it must work for any VL. The VL can be set by the OS (up to the hardware maximum) via `ZCR_EL1`. **Z registers**: `Z0`–`Z31`, each VL bits wide. The lower 128 bits of Zn overlap with the NEON Vn register. SVE uses these for all vector data. **P registers (predicates)**: `P0`–`P15`, each VL/8 bits wide (one bit per byte-lane). Predicates control which lanes are active — inactive lanes don't produce results and don't cause faults. This eliminates the need for "remainder loops" at the end of vectorized loops. **FFR (First Fault Register)**: Used for speculative memory access — lets you try to load a whole vector, and the hardware tells you which lanes faulted (instead of crashing). 
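
The concepts above can be sketched in a few instructions. This is a minimal illustration rather than production code: it assumes an SVE-capable core and an assembler with SVE enabled (for example `-march=armv8-a+sve`), and the register choices (`X0`–`X2`, `P0`, `P2`, `Z0`) are arbitrary.

```asm
// Query the vector length. Note RDVL counts BYTES, not bits.
RDVL    X0, #1                    // X0 = 1 × (VL in bytes), e.g. 32 on a 256-bit implementation
LSL     X0, X0, #3                // X0 = VL in bits
CNTW    X1                        // X1 = number of 32-bit elements per Z register (VL/32)

// Build a predicate and use it to govern a load.
PTRUE   P0.S                      // P0 = every 32-bit lane active

// First-fault load: X2 is assumed to hold a readable base address for lane 0.
SETFFR                            // FFR = all true before the speculative load
LDFF1W  {Z0.S}, P0/Z, [X2]        // Only a fault on the first active lane is taken; later faulting lanes clear FFR bits
RDFFR   P2.B, P0/Z                // P2 = lanes that actually loaded (FFR masked by P0)
```

After the `RDFFR`, code would typically continue with `P2` as the governing predicate so that only successfully loaded lanes are processed.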
### 33.2 SVE Programming Model ```asm // SVE loop: add two arrays, works for ANY vector length // X0 = dst, X1 = src_a, X2 = src_b, X3 = count MOV X4, #0 // i = 0 loop: WHILELT P0.S, X4, X3 // P0 = predicate: which lanes have i+lane < count B.NONE done // If no active lanes, we're done LD1W {Z0.S}, P0/Z, [X1, X4, LSL #2] // Load active elements from src_a LD1W {Z1.S}, P0/Z, [X2, X4, LSL #2] // Load active elements from src_b ADD Z0.S, Z0.S, Z1.S // Add all lanes (unpredicated — inactive lanes are zero from P0/Z loads) ST1W {Z0.S}, P0, [X0, X4, LSL #2] // Store active elements to dst INCW X4 // i += VL/32 (number of 32-bit elements per vector) B loop done: ``` **Why WHILELT and predicates matter**: In a traditional NEON loop, if your array has 1000 elements and vectors hold 4 elements, you do 250 iterations cleanly. But if it's 1001 elements, you need a separate scalar loop for the last 1. SVE predicates handle this automatically — the last iteration simply has a predicate that activates only 1 lane. ### 33.3 SVE2 SVE2 (FEAT_SVE2, **optional** from ARMv9.0-A; requires FEAT_SVE) adds operations from NEON that SVE was missing: byte-level permutations, polynomial multiply, complex number multiply-accumulate, histograms, and crypto (SM4, SHA3). SVE2 is intended to be a complete superset of NEON functionality, so eventually all NEON code can be replaced by SVE2 code that also benefits from wider vectors. Note that FEAT_SVE2 is **not architecturally mandatory** and adoption is not universal — Apple M4 and M5 are ARMv9.2-A implementations that explicitly omit SVE/SVE2 (Apple's SME/SME2 matrix-math support is present, but there is no non-streaming SVE execution). Always test for SVE2 at runtime (via `ID_AA64ZFR0_EL1` or HWCAP2_SVE2) before dispatching to SVE2 code paths — it is NOT safe to assume any given Armv9 core has it. --- ## 34. 
Memory Tagging Extension (MTE) MTE (ARMv8.5-A) detects **memory safety bugs** — use-after-free, buffer overflow, and similar errors — by associating a 4-bit **tag** with each 16-byte region of memory and each pointer. If a pointer's tag doesn't match the memory's tag, the CPU faults (or logs the mismatch, depending on configuration). This catches bugs that would otherwise be silent data corruption or security vulnerabilities. **Why MTE exists**: C/C++ memory bugs are the #1 source of security vulnerabilities. Tools like AddressSanitizer (ASan) detect them but with 2× memory overhead and 2× slowdown. MTE provides similar detection with roughly 3-8% overhead, making it usable in production. ### 34.1 How MTE Works Each pointer carries a 4-bit tag in bits [59:56] (the "logical tag"). Each 16-byte aligned block of memory has a corresponding 4-bit tag stored in a separate "tag memory" (managed by the hardware, not directly visible in the address space). When you access memory, the CPU compares the pointer's tag with the memory's tag — a mismatch indicates a bug. ```asm // Allocate tagged memory: IRG Xd|SP, Xn|SP{, Xm|XZR} // Insert Random tag into pointer: Xd = Xn with a random tag in bits [59:56] // Optional Xm excludes specific tags from the random selection STG Xt|SP, [Xn|SP{, #simm}] // Store Allocation Tag (offset: multiple of 16, −4096 to +4080; when omitted, offset is 0) // Covers 16 bytes starting at the aligned address ST2G Xt|SP, [Xn|SP{, #simm}] // Store tags 2× granules (multiple of 16, −4096 to +4080; when omitted, offset is 0) STZ2G Xt|SP, [Xn|SP{, #simm}] // Store tags+zero 2× granules (multiple of 16, −4096 to +4080; when omitted, offset is 0) STZG Xt|SP, [Xn|SP{, #simm}] // Store tag+zero granule (multiple of 16, −4096 to +4080; when omitted, offset is 0) // Store tag + pair — combines STG with an STP-style 2×64-bit store in one instruction. // Sets the allocation tag AND stores two 64-bit values to the tagged granule. 
// Heavily used by compilers for tagged-stack-slot prologues. STGP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // Signed offset (multiple of 16, −1024 to +1008; when omitted, offset is 0) STGP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! // Pre-index STGP Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index // Load allocation tag: LDG Xt|XZR, [Xn|SP{, #simm}] // Load memory tag into Xt (multiple of 16, −4096 to +4080; when omitted, offset is 0) // Add/subtract with tag manipulation: ADDG Xd|SP, Xn|SP, #uimm6, #uimm4 // Xd = Xn + uimm6 (must be multiple of 16, range 0–1008); tag = Xn_tag + uimm4 (0–15) // Encoding stores uimm6 ÷ 16 in a 6-bit field (same scaling trick as LDR offsets) SUBG Xd|SP, Xn|SP, #uimm6, #uimm4 // Xd = Xn - uimm6 (same constraints); tag = Xn_tag - uimm4 // Tag mask: GMI Xd|XZR, Xn|SP, Xm|XZR // Get tag mask: Xd = Xm | (1 << tag_of(Xn)) SUBP Xd|XZR, Xn|SP, Xm|SP // Subtract pointers, ignoring tags: Xd = Xn - Xm (tag bits stripped) SUBPS Xd|XZR, Xn|SP, Xm|SP // Same + set flags ``` ### 34.2 MTE Modes MTE can operate in three modes (configured per-thread via `SCTLR_EL1` / `PSTATE.TCO`): - **Synchronous**: Tag mismatch causes an immediate synchronous exception. Best for debugging — gives you the exact faulting instruction. - **Asynchronous**: Tag mismatches are accumulated and reported later (e.g., at the next system call). Lower overhead than synchronous, useful for production. - **Off**: Tags are ignored. Used to disable MTE for performance-critical code. **MTE in practice**: Memory allocators (like Android's Scudo) tag heap allocations with random tags. When you `free()` a block, the allocator changes the memory's tag. If code later accesses the freed block through a stale pointer, the pointer's old tag won't match the new memory tag → fault → bug caught. --- ## 35. Rules, Gotchas & Pitfalls This section collects every non-obvious rule and common mistake in one place. Each shows what goes wrong and why. 
### 35.1 The W-Register Zeroing Rule (and when it bites) **The rule**: Any instruction that writes to `Wd` zeroes bits [63:32] of `Xd`. Always. No exceptions. ```asm // This LOOKS like it only modifies the low 32 bits: LDR X0, =0xDEADBEEF12345678 // X0 = 0xDEADBEEF12345678 (literal-pool pseudo-instruction; no single MOV can encode this constant) ADD W0, W0, #1 // W0 = 0x12345679, BUT X0 = 0x0000000012345679 // The 0xDEADBEEF is GONE. Zeroed by the W-register write. // MOVK Wd also zeros upper 32 — this surprises people: MOV X0, #0xFFFFFFFF00000000 // X0 = 0xFFFFFFFF00000000 MOVK W0, #0x1234 // W0 low 16 = 0x1234 (keeps bits [31:16] of W0) // BUT upper 32 of X0 zeroed → X0 = 0x0000000000001234 // The 0xFFFFFFFF is GONE. // To modify only 16 bits while preserving the full 64-bit value, use: MOVK X0, #0x1234 // This truly keeps all other 48 bits ``` ### 35.2 Signed Extension: Wd vs Xd Gives Different Results ```asm // If W1 = 0x0000ABCD and you extract bits [11:4] = 0xBC (bit 7 = 1): SBFX W0, W1, #4, #8 // Sign-extend to 32 bits: W0 = 0xFFFFFFBC // Then W→X zeroing: X0 = 0x00000000FFFFFFBC // X0 is POSITIVE (as 64-bit signed)! SBFX X0, X1, #4, #8 // Sign-extend to 64 bits: X0 = 0xFFFFFFFFFFFFFFBC // X0 is NEGATIVE (as 64-bit signed)! // These give DIFFERENT mathematical values for the same input. // Use Xd when you need the signed value for 64-bit arithmetic. // Use Wd when you're staying in 32-bit land and the upper bits don't matter. ``` ### 35.3 CMP Wn vs CMP Xn: Different Flags, Different Branches ```asm // X0 = 0x00000001_80000000 (upper word = 1, lower word = 0x80000000) CMP W0, #0 // Compares 0x80000000 as 32-bit: this is INT32_MIN (negative!) B.LT negative_32 // TAKEN — 0x80000000 is negative as a 32-bit signed value CMP X0, #0 // Compares 0x0000000180000000 as 64-bit: this is +6442450944 (positive!) B.LT negative_64 // NOT taken — it's positive as a 64-bit signed value // The SAME register value gives OPPOSITE comparison results depending on W vs X. // Rule: match your CMP width to your data type.
// If the value is a 32-bit int, use CMP Wn.
```

### 35.4 ARM Carry Is Inverted for Subtraction

```asm
CMP X0, X1             // SUBS XZR, X0, X1
// After CMP, C=1 means X0 >= X1 (unsigned) — NO borrow
// After CMP, C=0 means X0 <  X1 (unsigned) — borrow occurred

// This is OPPOSITE to x86:
//   x86: CF=1 after CMP means a <  b (borrow)
//   ARM: C=1  after CMP means a >= b (no borrow)

// Consequence: if you're porting x86 code that checks CF after SUB,
// you need to invert the condition. x86's JC (Jump if Carry) = ARM's B.CC (not B.CS).
```

### 35.5 NEG / ABS of INT_MIN Wraps to Itself

```asm
// NEG X0, X0 when X0 = INT64_MIN = 0x8000000000000000:
//   -(-2^63) = +2^63, but that doesn't fit in signed 64-bit (max is 2^63-1)
//   Result: X0 = 0x8000000000000000 = INT64_MIN again!

// This means branchless abs() has an edge case:
//   CMP X0, #0; CNEG X0, X0, LT
//   If X0 = INT64_MIN → after CNEG, X0 is STILL INT64_MIN (negative!).
// abs(INT64_MIN) cannot be represented. This is a fundamental limitation of two's complement.
```

### 35.6 NaN Breaks FP Comparisons (and Some Conditions Include It)

```asm
// After FCMP with NaN, flags = N=0, Z=0, C=1, V=1 (unordered)
// This means some conditions are TRUE even though NaN is not really comparable:

// Conditions that EXCLUDE NaN (safe for ordered comparison):
//   B.EQ → not taken ✓  (NaN is not equal to anything)
//   B.GT → not taken ✓  (NaN is not greater than anything)
//   B.GE → not taken ✓  (NaN is not greater-or-equal)
//   B.MI → not taken ✓  (use instead of B.LT for "less than, excluding NaN")
//   B.LS → not taken ✓  (use instead of B.LE for "less-or-equal, excluding NaN")

// Conditions that INCLUDE NaN (will trigger on unordered!):
//   B.NE → TAKEN!  (NaN is "not equal" — be careful)
//   B.LT → TAKEN!  ← SURPRISE! LT means "less than OR unordered"
//   B.LE → TAKEN!  ← SURPRISE! LE means "less-or-equal OR unordered"
//   B.HI → TAKEN!  ← HI means "greater OR unordered"
//   B.VS → TAKEN   (this is the NaN detector)

// This is a common trap: if you write "FCMP S0, S1; B.LT less_than",
// the branch IS taken when either operand is NaN — even though NaN
// is not less than anything! Use B.MI instead for "less than, not NaN".

FCMP S0, S0            // Comparing a value to ITSELF
// If S0 is NaN: flags = unordered (V=1)
// B.EQ → NOT taken (NaN != NaN)

// To check if a value is NaN:
FCMP S0, S0            // Compare with self
B.VS is_nan            // VS = unordered = NaN (the only value that doesn't equal itself)

// Safe FP comparison pattern (handles NaN correctly):
FCMP S0, S1
B.VS handle_nan        // Check for NaN FIRST
B.MI is_less           // Then: ordered less-than (MI, not LT!)
B.GT is_greater        // Ordered greater-than (GT is safe)
// Fall through: equal
```

**Why this happens**: ARM's condition codes were designed for integer comparisons. When they are reused for FP, the "unordered" result (NaN) still has to be reported as some NZCV pattern, and FCMP reports it as N=0, Z=0, C=1, V=1. `LT` tests `N != V`, which is true for that pattern, so `B.LT` fires on NaN. The mapping is deliberate rather than an accident: every condition and its inverse split the unordered case between them. GT and LE are inverses (GT excludes NaN, LE includes it), and GE and LT are inverses (GE excludes NaN, LT includes it).

### 35.7 SP and XZR Share Encoding 31

```asm
// Register 31 means SP in some instructions and XZR in others.
// The instruction's opcode determines which. You CANNOT choose.

ADD X0, SP, #16        // Reg 31 = SP here (ADD immediate allows SP)
ADD X0, XZR, X1        // Reg 31 = XZR here (ADD shifted-register uses XZR for reg 31)

// SUBTLE: when the base is SP, the assembler uses the EXTENDED register form,
// where LSL is an alias for UXTX.
// So these ARE valid:
ADD X0, SP, X1, LSL #2    // VALID — assembler encodes as ADD (extended): SP + UXTX(X1, #2)
CMP SP, X0                // VALID — assembler encodes as CMP (extended): SUBS XZR, SP, X0, UXTX

// These are genuinely ILLEGAL (no encoding exists):
//   AND X0, SP, X1            ← shifted-register AND doesn't accept SP as source
//   ORR X0, SP, X1            ← same (but ORR IMMEDIATE can write to SP: ORR SP, X0, #imm)
//   ADDS X0, SP, X1, LSL #5   ← extended register shift max is #4, so #5 is out of range

// Rule of thumb:
//   SP is usable in:  ADD/SUB immediate, ADD/SUB extended register (including LSL alias),
//                     logical immediate (AND/ORR/EOR #bitmask as destination), LDR/STR addressing,
//                     CMP/CMN extended register
//   XZR is used in:   shifted-register forms, as the discard destination for CMP/TST/CMN
```

### 35.8 TST Clears C and V

```asm
// After TST (= ANDS XZR), C=0 and V=0 ALWAYS.
// This matters when TST is followed by CCMP:
TST  X0, #1               // Sets Z based on bit 0. But also: C=0, V=0!
CCMP X1, #5, #0, NE       // If NE (bit 0 was set): compare X1 vs 5
                          // If EQ (bit 0 was clear): flags = #0 (NZCV=0000)
// The C=0,V=0 from TST won't affect anything here because CCMP overwrites flags.

// But if you chain TST → B.HI (unsigned higher), remember HI needs C=1 && Z=0.
// TST always clears C, so B.HI after TST is ALWAYS not taken!
// (B.NE is what you want after TST — it checks Z, which TST does set correctly.)
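
// A minimal sketch of the safe patterns (the label bit7_set is hypothetical,
// chosen only for illustration): branch on Z after TST, or skip the flags entirely.
TST  X0, #0x80            // Z=1 if bit 7 is clear; C=0 and V=0 as always
B.NE bit7_set             // Z=0 means bit 7 was set

// For a single-bit test, TBNZ makes the same decision without touching NZCV
// (at the cost of the shorter ±32 KB branch range):
TBNZ X0, #7, bit7_set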
```

### 35.9 Divide By Zero Returns 0, Not an Exception

```asm
UDIV X0, X1, XZR       // X0 = X1 / 0 = 0 (no exception, no trap, no NaN — just 0)
SDIV X0, X1, XZR       // Same: 0

// This is DIFFERENT from:
//   - x86: divide by zero triggers a #DE exception
//   - FP:  FDIV S0, S1, S2 with S2=0 gives ±infinity for nonzero S1 (and NaN for 0/0),
//          per IEEE 754, not 0

// If you need to catch divide-by-zero, check before dividing:
CBZ  X2, div_by_zero_handler
UDIV X0, X1, X2
```

### 35.10 SDIV Overflow: INT_MIN / -1

```asm
// SDIV X0, X1, X2 where X1 = INT64_MIN, X2 = -1:
//   Mathematical result: +2^63, which overflows signed 64-bit (max is 2^63 - 1)
//   ARM returns: INT64_MIN (0x8000000000000000) — it wraps!
//   No exception, no flag, just a silently wrong result.

// Same issue for 32-bit: SDIV W0, W1, W2 with W1=INT32_MIN, W2=-1 → INT32_MIN
```

### 35.11 Branch Range Limits

```asm
// Each branch type has a different range. If your target is out of range,
// the assembler/linker reports an error. The exception is B/BL, where the linker
// may silently insert a trampoline (veneer) instead.
B    far_away          // ±128 MB — almost always enough
B.EQ far_away          // ±1 MB   — CAN fail for large functions or distant targets
CBZ  X0, far_away      // ±1 MB   — same range as B.cond
TBZ  X0, #3, far       // ±32 KB  — VERY limited! Easily exceeded in large functions

// Fix for out-of-range B.cond: invert and trampoline
// Instead of:  B.EQ far_away            (out of range)
// Write:       B.NE skip; B far_away; skip:
```

### 35.12 Extended Register Shift Is Only 0–4

```asm
// ADD X0, X1, W2, SXTW #5   ← ILLEGAL! Max shift is #4
// The #amount in extended register form is 0, 1, 2, 3, or 4.
// This covers element sizes 1, 2, 4, 8, 16 bytes — enough for any C data type.
// If you need a larger shift, use a separate LSL instruction first.
```

### 35.13 LDXR/STXR Rules

```asm
// Between LDXR and STXR, AVOID these (they may cause STXR to always fail):
//   1. Accessing other memory addresses (may clear the exclusive monitor on some CPUs)
//   2. Calling functions (they access memory and may trigger context switches)
//   3. Executing too many instructions (increases the window for monitor to be cleared)
// The ARM architecture PERMITS the monitor to be cleared by other memory accesses,
// so even if it works on your CPU today, it may fail on a different implementation.

// BAD (may cause infinite retry on some implementations):
LDXR X1, [X0]
LDR  X3, [X4]          // ← Other memory access — may clear the monitor
ADD  X1, X1, #1
STXR W2, X1, [X0]      // STXR may always fail → infinite retry loop

// GOOD (only register operations between LDXR and STXR):
LDXR X1, [X0]
ADD  X1, X1, #1        // Pure register operation — safe
STXR W2, X1, [X0]

// Addresses MUST be naturally aligned (this one IS absolute — not a guideline):
//   LDXR Xt → 8-byte aligned, LDXR Wt → 4-byte aligned
//   LDXP Xt → 16-byte aligned
//   Unaligned → alignment fault (always, regardless of SCTLR.A)
```

---

## 36. Quick Reference Cheat Sheet

### Instruction Format Summary

```
                    ┌─ Shifted Register ───────── ADD Xd|XZR, Xn|XZR, Xm|XZR, LSL #n
                    │                             ADD Wd|WZR, Wn|WZR, Wm|WZR, LSL #n
Data Processing ────┼─ Extended Register ──────── ADD Xd|SP, Xn|SP, Wm|WZR, SXTW #n
                    │                             ADD Wd|WSP, Wn|WSP, Wm|WZR, SXTW #n
                    ├─ Immediate ──────────────── ADD Xd|SP, Xn|SP, #imm12{, LSL #12}
                    │                             ADD Wd|WSP, Wn|WSP, #imm12{, LSL #12}
                    └─ Bitmask Immediate ──────── AND Xd|SP, Xn|XZR, #bitmask_imm
                                                  AND Wd|WSP, Wn|WZR, #bitmask_imm

Load/Store ─────────── LDR Xt|XZR, [Xn|SP, #imm]
                       STR Xt|XZR, [Xn|SP, #imm]
                       LDP Xt1|XZR, Xt2|XZR, [Xn|SP]
                       STP Xt1|XZR, Xt2|XZR, [Xn|SP]

Reg 31 rule ────────── Shifted register / most data-proc:  reg 31 = XZR
                       Immediate ADD/SUB, extended reg:    reg 31 = SP (Rd,Rn), XZR (Rm)
                       Logical immediate (non-S):          reg 31 = SP (Rd), XZR (Rn)
                       Load/store base:                    reg 31 = SP
```

### Flag-Setting Quick Ref

| Want flags? | Arithmetic | Logical |
|---|---|---|
| No flags | ADD/SUB | AND/ORR/EOR/BIC |
| Set flags | ADDS/SUBS | ANDS/BICS |
| Discard result | CMP (=SUBS XZR) / CMN (=ADDS XZR) | TST (=ANDS XZR) |

### Encoding Constraints Cheat Sheet

| Operand type | 64-bit (Xd) | 32-bit (Wd) |
|---|---|---|
| 12-bit immediate | 0–4095, optionally LSL #12 | Same |
| Bitmask immediate | Repeating rotated ones, element ≤64 | Element ≤32 (fewer valid patterns) |
| MOVZ/MOVK/MOVN | 16-bit value at LSL #0/16/32/48 | LSL #0/16 ONLY (2 slots) |
| Shifted register amount | 0–63 | 0–31 |
| BFM #immr, #imms | 0–63 each, MOD 64 | 0–31 each, MOD 32 |
| Branch offset (B) | ±128 MB (26-bit signed × 4) | — |
| Branch offset (B.cond) | ±1 MB (19-bit signed × 4) | — |
| Branch offset (TBZ) | ±32 KB (14-bit signed × 4), bit 0–63 | bit 0–31 for Wn form |
| LDR unsigned offset | #imm12 × element_size | Same |
| LDUR signed offset | −256 to +255 (9-bit signed) | Same |
| LDP signed offset | −512 to +504 (7-bit × 8) | −256 to +252 (7-bit × 4) |
| LDP Qt signed offset | −1024 to +1008 (7-bit × 16) | — |
| Extended register shift | #0–4 only (×1, ×2, ×4, ×8, ×16) | Same |

### Common Mnemonics Reference

```
Arithmetic:  ADD ADDS SUB SUBS ADC ADCS SBC SBCS MUL MADD MSUB
             SMULL UMULL SMULH UMULH UDIV SDIV
             ABS SMAX SMIN UMAX UMIN (FEAT_CSSC)
Logical:     AND ANDS ORR EOR BIC BICS ORN EON
Shift:       LSL LSR ASR ROR (aliases for UBFM/SBFM/EXTR and LSLV/LSRV/ASRV/RORV)
Move:        MOV MVN MOVZ MOVK MOVN
Compare:     CMP CMN TST CCMP CCMN
Bitfield:    SBFM UBFM BFM (base), BFI BFXIL SBFX UBFX SBFIZ UBFIZ (aliases)
Extension:   SXTB SXTH SXTW UXTB UXTH UXTW
Bit manip:   CLZ CLS RBIT REV REV16 REV32 EXTR
             CTZ CNT (FEAT_CSSC)
CondSelect:  CSEL CSINC CSINV CSNEG (base), CSET CSETM CINC CINV CNEG (aliases)
Load:        LDR LDRB LDRH LDRSW LDRSH LDRSB LDUR LDP LDXR LDAR LDAPR
Store:       STR STRB STRH STUR STP STXR STLR
Prefetch:    PRFM PRFUM (PLD/PLI/PST, L1/L2/L3, KEEP/STRM)
MOPS:        CPYFP/M/E CPYP/M/E SETP/M/E SETGP/M/E (hardware memcpy/memset, FEAT_MOPS)
Branch:      B BL BR BLR RET B.cond
             CBZ CBNZ TBZ TBNZ CB<cc>/CBH<cc>/CBB<cc> (FEAT_CMPBR)
System:      SVC HVC SMC BRK HLT UDF MRS MSR NOP WFE WFI ERET
Cache:       DC ZVA/CVAC/CVAU/CIVAC, IC IALLU/IVAU
FP:          FADD FSUB FMUL FDIV FSQRT FMADD FCMP FCVT SCVTF UCVTF FMOV
             FCVTAS/MS/NS/PS/ZS (+U variants, all rounding modes), FJCVTZS (JS ToInt32)
             FACGT FACGE (abs compare), AXFLAG XAFLAG (FlagM2 flag conv)
NEON:        LD1-4 ST1-4 ADD FADD MUL ZIP UZP TBL INS UMOV CNT ADDV
             SQADD UQADD SQSUB UQSUB SQRDMULH SQRDMLAH SQRDMLSH (saturating)
             FCMEQ FCMGT FCMGE (vector FP compare), FCMLA FCADD (complex, FEAT_FCMA)
             DUP SHL BSL BIT BIF SDOT UDOT (dot product)
SVE:         LD1W ST1W ADD MUL WHILELT INCW RDVL (predicated, VL-agnostic)
Atomic:      LDADD CAS SWP LDXR STXR LDAXR STLXR (LSE: +A/L/AL variants)
             STADD STSET STCLR (fire-and-forget atomics)
Barrier:     DMB DSB ISB
Security:    PACIA AUTIA PACIASP AUTIASP RETAA BTI (PAC + BTI)
MTE:         IRG STG ST2G STZ2G STZG STGP LDG GMI ADDG SUBG SUBP (memory tagging)
```

---

*This document covers AArch64 (ARMv8-A/ARMv9-A) with notes on AArch32 differences. For the full authoritative reference, see the "Arm Architecture Reference Manual for A-profile architecture" (DDI 0487).*