# The Comprehensive ARM Assembly Reference

> Targeting **AArch64 (ARMv8-A / ARMv9-A)** with AArch32 notes where relevant.
> A reference with explanations — every instruction tells you what it does and why.

---

## Table of Contents

1. [Registers](#1-registers)
2. [Instruction Encoding Basics](#2-instruction-encoding-basics)
3. [The S Suffix & Condition Flags](#3-the-s-suffix--condition-flags)
4. [Condition Codes](#4-condition-codes)
5. [Data Processing — Arithmetic](#5-data-processing--arithmetic)
6. [Data Processing — Logical](#6-data-processing--logical)
7. [Shift & Rotate Operations](#7-shift--rotate-operations)
8. [Shifted Register & Extended Register Forms](#8-shifted-register--extended-register-forms)
9. [Move Instructions & Aliases](#9-move-instructions--aliases)
10. [Comparison & Test Instructions](#10-comparison--test-instructions)
11. [Multiply & Divide](#11-multiply--divide)
12. [Sign Extension & Zero Extension](#12-sign-extension--zero-extension)
13. [Bitfield Operations (BFM family)](#13-bitfield-operations-bfm-family)
14. [Bit Manipulation Instructions](#14-bit-manipulation-instructions)
15. [Load & Store Instructions](#15-load--store-instructions)
16. [Load/Store Pair, Non-Temporal & Exclusive](#16-loadstore-pair-non-temporal--exclusive)
17. [Branching & Control Flow](#17-branching--control-flow)
18. [Conditional Select & Increment](#18-conditional-select--increment)
19. [System Registers & Special Instructions](#19-system-registers--special-instructions)
20. [Overflow, Underflow & Carry](#20-overflow-underflow--carry)
21. [Exceptions, Interrupts & Exception Levels](#21-exceptions-interrupts--exception-levels)
22. [Floating Point (SIMD/FP)](#22-floating-point-simdfp)
23. [NEON / Advanced SIMD Overview](#23-neon--advanced-simd-overview)
24. [Atomic & Synchronization Instructions](#24-atomic--synchronization-instructions)
25. [Memory Barriers & Ordering](#25-memory-barriers--ordering)
26. [Pseudo-instructions & Assembler Directives](#26-pseudo-instructions--assembler-directives)
27. [Instruction Aliases — The Master Table](#27-instruction-aliases--the-master-table)
28. [AArch32 (ARM/Thumb) Key Differences](#28-aarch32-armthumb-key-differences)
29. [Calling Convention (AAPCS64)](#29-calling-convention-aapcs64)
30. [Common Patterns & Idioms](#30-common-patterns--idioms)
31. [Pointer Authentication (PAC)](#31-pointer-authentication-pac)
32. [Branch Target Identification (BTI)](#32-branch-target-identification-bti)
33. [Scalable Vector Extension (SVE / SVE2)](#33-scalable-vector-extension-sve--sve2)
34. [Memory Tagging Extension (MTE)](#34-memory-tagging-extension-mte)
35. [Rules, Gotchas & Pitfalls](#35-rules-gotchas--pitfalls)
36. [Quick Reference Cheat Sheet](#36-quick-reference-cheat-sheet)

---

## 1. Registers

AArch64 has 31 general-purpose registers (X0-X30), a stack pointer (SP), a zero register (XZR), a program counter (PC), and 32 SIMD/FP registers (V0-V31). GPRs can be accessed as 64-bit (X) or 32-bit (W). SIMD/FP registers have multiple views: 8-bit (B), 16-bit (H), 32-bit (S), 64-bit (D), and 128-bit (Q/V).

### 1.1 General-Purpose Registers

AArch64 has 31 general-purpose registers, each 64 bits wide.

**Why 31 registers (not 32)?** The instruction encoding uses 5 bits for each register field, which can encode 32 values (0-31). But ARM uses register number 31 for two different things depending on context: it's either `SP` (stack pointer) or `XZR` (zero register). This dual use means you get 31 freely usable registers (X0-X30) plus the two context-dependent meanings of register 31 — effectively 31 GPRs plus SP and XZR. The zero register is extremely useful (it eliminates many instructions that x86 needs, like `XOR reg, reg` to clear a register), and having SP accessible in the same encoding space means load/store instructions can use SP as a base without special opcodes.
**Why condition flags instead of condition registers?** Some architectures (like PowerPC) use condition registers instead of flags. ARM uses flags because they're simpler and more compact — one set of 4 bits shared by all instructions, vs multiple condition register fields that need extra encoding bits. The downside is that flags are a single shared resource, so instructions must be carefully ordered to avoid clobbering flags before they're read.

**Caller-saved** means the function you call is free to overwrite these registers — if you need the value after the call, you must save it yourself (the "caller" saves). **Callee-saved** means the function you call must preserve these registers — if it uses them, it saves and restores them (the "callee" saves).

| 64-bit name | 32-bit name | Notes |
|-------------|-------------|-------|
| `X0`–`X7` | `W0`–`W7` | Arguments / results (caller-saved) |
| `X8` | `W8` | Indirect result location (when a function returns a large struct that doesn't fit in X0, the caller passes a pointer in X8 to where the struct should be written) |
| `X9`–`X15` | `W9`–`W15` | Temporary / scratch (caller-saved) |
| `X16` (`IP0`) | `W16` | Intra-procedure-call scratch (used by the linker for PLT stubs — trampolines that redirect calls to shared library functions) |
| `X17` (`IP1`) | `W17` | Intra-procedure-call scratch (same as X16 — the linker may clobber these between your BL and the actual function entry) |
| `X18` (`PR`) | `W18` | Platform register (reserved on some OSes) |
| `X19`–`X28` | `W19`–`W28` | Callee-saved registers |
| `X29` (`FP`) | `W29` | Frame pointer (callee-saved) |
| `X30` (`LR`) | `W30` | Link register (return address) |

**Critical rule**: Writing to a `Wn` register **zeroes the upper 32 bits** of the corresponding `Xn`. This is not sign-extension — it is always zero-extension.

**Why zero the upper 32?** Without this rule, the upper 32 bits would contain stale data from whatever previously used the X register.
Code would need explicit zero-extension after every 32-bit operation, wasting instructions. By making the hardware always zero the upper half, 32-bit operations "just work" — the 64-bit register always holds the correct zero-extended 32-bit result. This also eliminates a class of security bugs where stale upper bits leak information between contexts.

```asm
MOV W0, #-1   // W0 = 0xFFFFFFFF, X0 = 0x00000000FFFFFFFF (upper zeroed)
MOV X0, #-1   // X0 = 0xFFFFFFFFFFFFFFFF
```

### 1.2 Special Registers

| Register | Description |
|----------|-------------|
| `SP` | Stack pointer. Not a GPR — only usable as an operand by specific instruction forms: ADD/SUB immediate, ADD/SUB extended register, logical immediate (AND/ORR/EOR with bitmask immediate can write to SP), and LDR/STR addressing. NOT usable in shifted-register data processing. Must be 16-byte aligned when the stack-alignment check is enabled (`SCTLR_EL1.SA0` for EL0, `SCTLR_ELx.SA` for the current EL; Linux enables SA0 by default). Uses register encoding 31, same as XZR — the instruction opcode determines what register 31 means. |
| `XZR` / `WZR` | Zero register. Reads as zero, writes are discarded. Encoded as register 31, same as SP; the instruction context determines which. This is why you can't use SP and XZR in the same operand position of the same instruction. |
| `PC` | Program counter. Not directly accessible as a GPR in AArch64 (unlike AArch32). Readable only via `ADR`/`ADRP`. |
| `NZCV` | Condition flags (in `PSTATE`): Negative, Zero, Carry, oVerflow. |
| `FPCR` | Floating-point control register. |
| `FPSR` | Floating-point status register. |
| `DAIF` | Interrupt mask bits (Debug, SError, IRQ, FIQ). |
| `CurrentEL` | Current exception level. |
| `SPSel` | Stack pointer selection (EL0 vs ELx SP). |

**Which instructions use SP vs XZR for register 31?** This is one of the most confusing aspects of AArch64.
Here is the complete rule:

| Encoding class | Rd (reg 31) | Rn (reg 31) | Rm (reg 31) |
|---|---|---|---|
| ADD/SUB **shifted register** | XZR | XZR | XZR |
| ADD/SUB **immediate** | SP | SP | — |
| ADDS/SUBS **immediate** | XZR | SP | — |
| ADD/SUB **extended register** | SP | SP | XZR |
| ADDS/SUBS **extended register** | XZR | SP | XZR |
| AND/ORR/EOR **immediate** (no S) | SP | XZR | — |
| ANDS **immediate** | XZR | XZR | — |
| Logical **shifted register** (all) | XZR | XZR | XZR |
| ADC/SBC/ADCS/SBCS | XZR | XZR | XZR |
| MUL/MADD/MSUB/DIV | XZR | XZR | XZR |
| SMULH/UMULH/SMULL/UMULL etc. | XZR | XZR | XZR |
| BFM/UBFM/SBFM | XZR | XZR | — |
| EXTR | XZR | XZR | XZR |
| CLZ/CLS/RBIT/REV/REV16/REV32 | XZR | XZR | — |
| CSEL/CSINC/CSINV/CSNEG | XZR | XZR | XZR |
| CCMP/CCMN | — | XZR | XZR |
| MOVZ/MOVK/MOVN | XZR | — | — |
| ADR/ADRP | XZR | — | — |
| Loads/Stores (base Xn) | — | SP | — |
| Loads/Stores (data Xt/Wt) | XZR | — | — |
| BR/BLR/RET (target Xn) | — | XZR | — |
| CBZ/CBNZ/TBZ/TBNZ (test Rt) | XZR | — | — |
| MRS/MSR | XZR | — | — |
| FEAT_CSSC (ABS/SMAX/CTZ/CNT) | XZR | XZR | XZR |

**Mnemonic**: SP appears only where address arithmetic happens (ADD/SUB immediate and extended, logical immediate destinations, load/store bases). Everything else uses XZR.

### 1.3 SIMD/FP Registers

32 registers, each 128 bits wide, with multiple views:

| Name | Size | Description |
|------|------|-------------|
| `B0`–`B31` | 8 bits | Byte |
| `H0`–`H31` | 16 bits | Half-word (also FP16) |
| `S0`–`S31` | 32 bits | Single-precision float |
| `D0`–`D31` | 64 bits | Double-precision float |
| `Q0`–`Q31` | 128 bits | Quadword (NEON) |
| `V0`–`V31` | 128 bits | Vector register (NEON), with arrangement specifiers like `V0.4S`, `V0.8H`, etc. |

`Q0`, `D0`, `S0`, `H0`, `B0` all refer to the **same physical register** (different widths of V0).
**Why FP and NEON share registers**: ARM could have had separate FP and SIMD register files, but sharing means you can use NEON instructions to manipulate float bit patterns (e.g., `FMOV S0, W0` puts an integer into the FP register, then `FADD S0, S0, S1` uses it as a float) without copying between register files. It also means the same `STP`/`LDP` instructions save and restore the callee-saved D8-D15 for both FP and NEON.

**Writing a narrow view zeroes the upper bits**: If you write to `S0` (32 bits), bits [127:32] of V0 are zeroed. If you write to `D0` (64 bits), bits [127:64] are zeroed. This is analogous to the W-register zeroing rule for GPRs.

---

## 2. Instruction Encoding Basics

All AArch64 instructions are **fixed-width 32 bits** (4 bytes), **aligned** to 4-byte boundaries.

Instructions are **always little-endian** in memory, regardless of the data endianness setting. Data can be big-endian or little-endian (controlled by `SCTLR_EL1.EE`), but instruction fetch is always little-endian. In practice, nearly all AArch64 systems run little-endian for both — big-endian AArch64 is rare.

Major encoding groups (bits [28:25]), shown as a **rough high-level grouping** — these patterns are a teaching aid, not a formal decode rule:

| Bits [28:25] | Rough group |
|---|---|
| `100x` | Data processing — immediate |
| `101x` | Branches, exception generation, system |
| `x1x0` | Loads and stores |
| `x101` | Data processing — register |
| `x111` | Data processing — SIMD and FP |

These patterns do not cover the whole encoding space (for example, `0000` is reserved and `0010` is used by SVE). For exact decoding, consult the "Top-level A64 instruction set encoding" table in the ARM Architecture Reference Manual (DDI 0487).

This fixed encoding is why many things that seem like they should be simple (e.g., loading a 64-bit constant) require multiple instructions or special tricks.

---

## 3. The S Suffix & Condition Flags

Most AArch64 data-processing instructions come in two forms: one that silently computes the result, and one (with an `S` suffix) that also updates the four condition flags (N, Z, C, V). Understanding when flags are set — and what they mean — is essential for branches, conditional selects, and multi-precision arithmetic.

### 3.1 The PSTATE Condition Flags

The processor has a set of state bits called **PSTATE** (Process State) that track things like the current exception level, interrupt masks, and condition flags. PSTATE is not a single register you can read — individual fields are accessed through special instructions and system registers. The four **condition flags** (N, Z, C, V) are the most important for everyday programming — they are how the CPU remembers the result of a comparison or arithmetic operation so a later instruction can act on it. Each flag is a single bit that is either **0** (clear/false) or **1** (set/true).

These flags are updated by flag-setting instructions (like `ADDS`, `SUBS`, `ANDS`, `CMP`, `TST`) and read by conditional instructions (like `B.EQ`, `CSEL`, `CCMP`).

| Flag | Name | Set to 1 when… |
|------|------|-----------|
| **N** | Negative | Result bit [63] (or [31] for 32-bit ops) is 1 |
| **Z** | Zero | Result is zero |
| **C** | Carry | Unsigned overflow occurred (carry out) |
| **V** | oVerflow | Signed overflow occurred (2's complement) |

### 3.2 The S Suffix

Most data-processing instructions have two forms:

```asm
ADD  X0, X1, X2   // X0 = X1 + X2, flags UNCHANGED
ADDS X0, X1, X2   // X0 = X1 + X2, flags UPDATED (N, Z, C, V)
SUB  X0, X1, X2   // X0 = X1 - X2, flags UNCHANGED
SUBS X0, X1, X2   // X0 = X1 - X2, flags UPDATED
```

The `S` suffix means "set flags." Without it, the instruction does not touch `NZCV`.
**Instructions that ALWAYS set flags** (no non-S form):

- `CMP` (alias for `SUBS` with `XZR`/`WZR` destination)
- `CMN` (alias for `ADDS` with `XZR`/`WZR` destination)
- `TST` (alias for `ANDS` with `XZR`/`WZR` destination)

**Instructions that NEVER set flags** (no S form exists):

- `UDIV`, `SDIV`
- `MUL`, `MADD`, `MSUB`, `SMULL`, `UMULL`, `SMULH`, `UMULH`
- All loads and stores
- All branches
- `MOV`, `MVN` (no `MOVS` exists — to set flags after a move, use `TST Xn, Xn` or `ANDS Xd, Xn, Xn`)

### 3.3 How Flags Are Set for ADD/SUB

For `ADDS Xd, Xn, Xm`:

- **N** = bit 63 of result
- **Z** = 1 if result == 0
- **C** = 1 if unsigned addition produced a carry (i.e., result < Xn, treating as unsigned)
- **V** = 1 if signed overflow (both operands same sign, result different sign)

For `SUBS Xd, Xn, Xm` (computes Xn - Xm, which is Xn + NOT(Xm) + 1, where NOT means flipping every bit — every 0 becomes 1 and every 1 becomes 0):

- **N** = bit 63 of result
- **Z** = 1 if result == 0
- **C** = 1 if **no borrow** occurred (i.e., Xn >= Xm unsigned). Note: ARM uses inverted carry for subtraction.
- **V** = 1 if signed overflow

**Key subtlety**: ARM's carry flag for subtraction is **inverted** compared to x86. `SUBS` sets C=1 when there is NO borrow (i.e., the first operand is greater than or equal to the second, unsigned). This catches many people off guard.

**Why inverted carry?** ARM implements subtraction as `Xn + NOT(Xm) + 1`. The carry-out of this addition naturally equals 1 when `Xn >= Xm` (no borrow) and 0 when `Xn < Xm` (borrow). ARM uses this carry-out directly rather than inverting it. This simplifies the hardware — the ALU's carry-out is the C flag without any extra logic. x86 inverts it to create a "borrow" flag, which is more intuitive but requires an extra NOT gate. The ARM convention also means `HS` (Higher or Same, unsigned >=) directly tests `C==1`, which is the natural ALU output.
### 3.4 How Flags Are Set for Logical Operations

For `ANDS`, `BICS`:

- **N** = MSB (most significant bit — the leftmost bit, bit 63 for 64-bit or bit 31 for 32-bit) of result
- **Z** = 1 if result == 0
- **C** = 0 (always cleared)
- **V** = 0 (always cleared)

This means after a `TST` (which is `ANDS XZR, ...`), C and V are always 0.

---

## 4. Condition Codes

Used with conditional branches (`B.cond`), conditional selects (`CSEL`, `CSINC`, etc.), and in AArch32 with conditional execution of most instructions.

| Code | Meaning | Flags |
|------|---------|-------|
| `EQ` | Equal / zero | Z == 1 |
| `NE` | Not equal / non-zero | Z == 0 |
| `CS` / `HS` | Carry set / unsigned higher or same | C == 1 |
| `CC` / `LO` | Carry clear / unsigned lower | C == 0 |
| `MI` | Minus / negative | N == 1 |
| `PL` | Plus / positive or zero | N == 0 |
| `VS` | Overflow set | V == 1 |
| `VC` | Overflow clear | V == 0 |
| `HI` | Unsigned higher | C == 1 && Z == 0 |
| `LS` | Unsigned lower or same | C == 0 \|\| Z == 1 |
| `GE` | Signed greater or equal | N == V |
| `LT` | Signed less than | N != V |
| `GT` | Signed greater than | Z == 0 && N == V |
| `LE` | Signed less or equal | Z == 1 \|\| N != V |
| `AL` | Always (default) | Any |
| `NV` | Never (behaves as AL in AArch64) | — |

**Aliases**: `HS` is the same as `CS`. `LO` is the same as `CC`. They exist for readability — use `HS`/`LO` for unsigned comparisons, `CS`/`CC` when you care about the raw carry.

**Why signed comparisons use N==V** (not just N): After `CMP X0, X1`, the result's sign bit (N) tells you whether `X0 - X1` is negative. If there's no overflow, a negative result means X0 < X1, so N alone works. But with signed overflow, the sign bit is wrong — subtracting a large negative from a large positive overflows, giving a negative result even though X0 > X1. The V flag detects this overflow. When V=1, the sign bit is "inverted" from the true mathematical answer.
So `N == V` correctly means "greater or equal": either both are 0 (positive result, no overflow = truly >=) or both are 1 (negative result, but overflow inverted it = actually >=).

**Signed vs. unsigned after CMP**:

```asm
CMP X0, X1
B.HI label   // branch if X0 >  X1 (unsigned)
B.HS label   // branch if X0 >= X1 (unsigned)
B.LO label   // branch if X0 <  X1 (unsigned)
B.LS label   // branch if X0 <= X1 (unsigned)
B.GT label   // branch if X0 >  X1 (signed)
B.GE label   // branch if X0 >= X1 (signed)
B.LT label   // branch if X0 <  X1 (signed)
B.LE label   // branch if X0 <= X1 (signed)
```

**Traced example — what REALLY happens after CMP:**

```asm
// If X0 = 5 and X1 = 3:
CMP X0, X1     // SUBS XZR, X0, X1 → 5 - 3 = 2
               // N=0 (result positive), Z=0 (result not zero)
               // C=1 (no borrow: 5 >= 3), V=0 (no signed overflow)
               // Flags: N=0 Z=0 C=1 V=0
B.GT label     // GT = (Z==0 && N==V) = (true && 0==0) = true → TAKEN ✓
B.HI label     // HI = (C==1 && Z==0) = (true && true) = true → TAKEN ✓
B.GE label     // GE = (N==V) = (0==0) = true → TAKEN ✓

// If X0 = 3 and X1 = 5:
CMP X0, X1     // 3 - 5: result wraps to 0xFFFF...FFFE
               // N=1 (bit 63 set), Z=0, C=0 (borrow: 3 < 5), V=0
B.LT label     // LT = (N!=V) = (1!=0) = true → TAKEN ✓
B.LO label     // LO = (C==0) = true → TAKEN ✓

// If X0 = 5 and X1 = 5:
CMP X0, X1     // 5 - 5 = 0
               // N=0, Z=1, C=1 (no borrow: 5 >= 5), V=0
B.EQ label     // EQ = (Z==1) = true → TAKEN ✓
B.LE label     // LE = (Z==1 || N!=V) = (true || false) = true → TAKEN ✓
B.HS label     // HS = (C==1) = true → TAKEN ✓ (5 is "higher or same" as 5)
```

---

## 5. Data Processing — Arithmetic

### Syntax Notation Used in This Document

Every instruction in this document shows **all valid forms** with explicit operand constraints. Here is how to read the notation:

| Notation | Meaning |
|---|---|
| `Xd\|XZR` | Register field where register 31 = XZR (zero register). `Xd` is any of X0–X30, or XZR. |
| `Xd\|SP` | Register field where register 31 = SP (stack pointer). `Xd` is any of X0–X30, or SP. |
| `Wd\|WZR` | 32-bit form of the above. Upper 32 bits of the corresponding Xd are zeroed on write. |
| `Wd\|WSP` | 32-bit form where register 31 = WSP (32-bit view of SP). |
| `{...}` | **Optional.** Everything inside braces can be omitted. For example, `{, LSL #12}` means the `, LSL #12` part is optional — if omitted, no shift is applied. |
| `A\|B\|C` | **Choose one.** Exactly one of the listed options. For example, `LSL\|LSR\|ASR` means you must pick one of those three shifts. |
| `#0-63` | Immediate value range. `#0-63` means any integer from 0 to 63 inclusive. |
| `#imm12` | A 12-bit unsigned immediate (0–4095). |
| `#simm9` | A 9-bit signed immediate (−256 to +255). |
| `#simm` | A signed immediate (range depends on context — stated in the comment). |
| `#pimm` | A positive (unsigned) scaled immediate (range depends on access size — stated in the comment). |
| `Sd`, `Sn`, `Sm` | Single-precision (32-bit) FP/SIMD register. S0–S31. |
| `Dd`, `Dn`, `Dm` | Double-precision (64-bit) FP/SIMD register. D0–D31. |
| `Hd`, `Hn`, `Hm` | Half-precision (16-bit) FP/SIMD register. H0–H31. Requires FEAT_FP16 for arithmetic. |
| `Qt`, `St`, `Dt`, `Bt`, `Ht` | SIMD/FP register at various widths for loads/stores: Q=128-bit, D=64-bit, S=32-bit, H=16-bit, B=8-bit. |
| `Vn.4S`, `V0.16B`, etc. | NEON vector register with arrangement specifier: `.4S` = 4 lanes of 32-bit, `.16B` = 16 lanes of 8-bit, `.2D` = 2 lanes of 64-bit, etc. |

**"These are ALL the valid forms"**: Every syntax block in this document shows every valid encoding of that instruction. If a form is not listed, it does not exist. For example, ADD has three separate encoding classes (shifted register, immediate, extended register) — each is listed in its own subsection with every valid operand combination.

FP/SIMD registers (Sd/Dd/Hd) do NOT have the SP/XZR ambiguity — register 31 in the FP register file is simply register 31 (V31/D31/S31), with no special meaning.
### 5.1 ADD / SUB — Register Form

`ADD` adds two values. `SUB` subtracts one value from another. These are the most fundamental arithmetic instructions.

```
ADD  Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63}   // Xd = Xn + (shifted Xm) [64-bit]
ADD  Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR #0-31}   // Wd = Wn + (shifted Wm) [32-bit]
SUB  Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63}   // Xd = Xn - (shifted Xm)
SUB  Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR #0-31}   // Wd = Wn - (shifted Wm)

ADDS Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63}   // + set flags [64-bit]
ADDS Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR #0-31}   // + set flags [32-bit]
SUBS Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63}   // + set flags [64-bit]
SUBS Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR #0-31}   // + set flags [32-bit]
```

`Xd` is the destination, `Xn` is the first source, `Xm` is the second source. The `{...}` part is optional — if omitted, no shift is applied and the instruction is a plain register add/subtract.

In the **shifted register** encoding, register 31 in any field means **XZR** (the zero register), NOT SP. So `ADD X0, XZR, X5` is valid (adds zero + X5 → moves X5 into X0), but you **cannot** use SP here — for that you need the immediate or extended register encoding (§5.2 / §5.3). When a disassembler shows `add x0, xzr, x0`, that's this encoding with Xn = XZR.

**Why no ROR?** Arithmetic shifted register only allows LSL, LSR, and ASR — NOT ROR. This is because rotate-then-add has no common use case in compiled code, and omitting it freed up an encoding bit for other purposes. Logical instructions (AND/ORR/EOR/BIC, §6) DO allow ROR because rotate-and-mask is useful in crypto and hash functions.
**What the register form REALLY does — traced:**

```asm
// Plain register (no shift):
// If X1 = 100, X2 = 25:
ADD X0, X1, X2          // X0 = 100 + 25 = 125

// With shift — "add X1 plus (X2 shifted left by 2)":
// If X1 = 0x1000 (base address), X2 = 5 (index):
ADD X0, X1, X2, LSL #2  // X0 = 0x1000 + (5 << 2) = 0x1000 + 20 = 0x1014
                        // This computes base + index*4 (array of 4-byte elements)

// With ASR — useful for signed division combined with addition:
// If X1 = 100, X2 = -8:
ADD X0, X1, X2, ASR #1  // X0 = 100 + (-8 >> 1) = 100 + (-4) = 96
```

**32-bit note**: When using Wd forms, flags reflect the 32-bit result (N = bit 31, C/V from 32-bit arithmetic). The upper 32 bits of the Xd register are always zeroed — this is true for ALL instructions that write to a W register.

### 5.2 ADD / SUB — Immediate Form

The immediate form adds or subtracts a constant value encoded directly in the instruction. This is the most common form for stack adjustments (`ADD SP, SP, #16`), small increments, and compile-time-known offsets. Register 31 means **SP** here (not XZR), which is why `ADD SP, SP, #16` works.

```
ADD  Xd|SP, Xn|SP, #imm12{, LSL #12}     // 64-bit; imm12 = 0–4095, optionally shifted left by 12
ADD  Wd|WSP, Wn|WSP, #imm12{, LSL #12}   // 32-bit; same encoding constraints
SUB  Xd|SP, Xn|SP, #imm12{, LSL #12}
SUB  Wd|WSP, Wn|WSP, #imm12{, LSL #12}

ADDS Xd|XZR, Xn|SP, #imm12{, LSL #12}    // + set flags (Rd is XZR not SP) [64-bit]
ADDS Wd|WZR, Wn|WSP, #imm12{, LSL #12}   // + set flags [32-bit]
SUBS Xd|XZR, Xn|SP, #imm12{, LSL #12}    // + set flags [64-bit]
SUBS Wd|WZR, Wn|WSP, #imm12{, LSL #12}   // + set flags [32-bit]
```

The immediate is a **12-bit unsigned value** (0–4095), optionally shifted left by 12 bits. This encoding is identical for both 32-bit and 64-bit forms — the same range of immediates is available. So the encodable values are `0–4095` OR `0–4095 shifted left by 12` (i.e., multiples of 4096 up to 4095×4096 = 16,773,120).
**What the hardware actually encodes**: The instruction has a 12-bit immediate field and a 1-bit shift flag. The shift flag is either 0 (no shift) or 1 (shift the 12-bit value left by 12 positions). When you write a large number like `#0x123000`, the assembler breaks it down for you — it figures out that 0x123000 = 0x123 shifted left by 12, so it stores `imm12 = 0x123` with the shift flag set. You never see this in source code, but you might see it in a disassembler.

```asm
// What you write:        // What the hardware actually encodes:
ADD X0, X1, #42           // imm12 = 42, shift = 0 → X1 + 42
ADD X0, X1, #0x1000       // imm12 = 1, shift = 1 → X1 + (1 << 12) = X1 + 4096
ADD X0, X1, #0x123000     // imm12 = 0x123, shift = 1 → X1 + (0x123 << 12)
ADD X0, X1, #5000         // ERROR: 5000 = 0x1388
                          // 0x1388 > 4095, so it doesn't fit in 12 bits unshifted
                          // 0x1388 is not a multiple of 4096, so shift doesn't help
                          // The assembler cannot encode this — it will error
```

A disassembler may show `ADD X0, X1, #0x123, LSL #12` instead of `ADD X0, X1, #0x123000` — they mean the same thing, it's just showing the raw encoding fields.

The assembler may silently convert `ADD Xd, Xn, #-5` into `SUB Xd, Xn, #5` if the negative immediate can be encoded as a positive immediate of the opposite instruction. This is a common assembler convenience.

### 5.3 ADD / SUB — Extended Register Form

The extended register form sign-extends or zero-extends a narrow value (8/16/32-bit) from the second source register to the full width, optionally shifts it left by 0–4, then adds/subtracts. This is how the hardware computes array addresses like `base + (int32_index * element_size)` in one instruction. Register 31 means **SP** in Rd/Rn (so `ADD SP, SP, X0` works) but **XZR** in Rm. When you write `ADD X0, SP, X1, LSL #3`, the assembler automatically picks this encoding (not shifted register), because SP is only valid here.
```
ADD  Xd|SP, Xn|SP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW {#0-4}        // 64-bit, 32-bit index
ADD  Xd|SP, Xn|SP, Xm|XZR, UXTX|SXTX|LSL {#0-4}                        // 64-bit, 64-bit index
ADD  Wd|WSP, Wn|WSP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW|LSL {#0-4}
SUB  Xd|SP, Xn|SP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW {#0-4}
SUB  Xd|SP, Xn|SP, Xm|XZR, UXTX|SXTX|LSL {#0-4}                        // 64-bit, 64-bit index
SUB  Wd|WSP, Wn|WSP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW|LSL {#0-4}

ADDS Xd|XZR, Xn|SP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW {#0-4}       // + set flags (Rd=XZR not SP)
ADDS Xd|XZR, Xn|SP, Xm|XZR, UXTX|SXTX|LSL {#0-4}
ADDS Wd|WZR, Wn|WSP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW|LSL {#0-4}
SUBS Xd|XZR, Xn|SP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW {#0-4}       // + set flags
SUBS Xd|XZR, Xn|SP, Xm|XZR, UXTX|SXTX|LSL {#0-4}
SUBS Wd|WZR, Wn|WSP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW|LSL {#0-4}
```

The extend operations are:

| Extend | Meaning |
|--------|---------|
| `UXTB` | Unsigned extend byte (bits [7:0]) |
| `UXTH` | Unsigned extend halfword (bits [15:0]) |
| `UXTW` | Unsigned extend word (bits [31:0]) |
| `UXTX` | Unsigned extend doubleword (bits [63:0], effectively no extension) |
| `SXTB` | Signed extend byte |
| `SXTH` | Signed extend halfword |
| `SXTW` | Signed extend word |
| `SXTX` | Signed extend doubleword |
| `LSL` | Alias for UXTX in 64-bit form (identity extension), alias for UXTW in 32-bit form. Used when you want a plain shift without any narrowing/sign extension — e.g., `ADD X0, SP, X1, LSL #3` |

The `{#0-4}` shift is applied **after** extension — it left-shifts the extended value by 0–4 positions (i.e., multiply by 1, 2, 4, 8, or 16).
This is extremely useful for array indexing:

```asm
// X1 = base address, W2 = 32-bit index
// Access array of 8-byte elements: base + sign_extend(index) * 8
LDR X0, [X1, W2, SXTW #3]   // #3 means shift left by 3 = multiply by 8

// In ADD form:
ADD X0, X1, W2, SXTW #3     // X0 = X1 + sign_extend(W2) << 3
```

**What the extended register form REALLY does — traced step by step:**

```asm
// ADD X0, X1, W2, SXTW #3
// If X1 = 0x00010000 (base address = 65536) and W2 = 0xFFFFFFFE (-2 as signed 32-bit):
//
// Step 1: Sign-extend W2 from 32 to 64 bits:
//   W2 = 0xFFFFFFFE → sign bit (bit 31) = 1 → extend with 1s
//   Extended = 0xFFFFFFFF_FFFFFFFE = -2 as 64-bit
//
// Step 2: Shift left by 3 (multiply by 8):
//   0xFFFFFFFF_FFFFFFFE << 3 = 0xFFFFFFFF_FFFFFFF0 = -16
//
// Step 3: Add to X1:
//   X0 = 0x00010000 + 0xFFFFFFFF_FFFFFFF0 = 0x00000000_0000FFF0
//      = 0x10000 - 16 = 0xFFF0 = 65520
//
// This is: base_address + (signed_index * element_size)
// It computed &array[-2] for 8-byte elements — going backwards 2 elements from the base.

// Without SXTW, you'd need to sign-extend manually:
SXTW X3, W2              // X3 = sign_extend(W2)
ADD  X0, X1, X3, LSL #3  // X0 = X1 + X3 * 8
// The extended register form does both in one instruction.
```

**Why this form exists**: Array indexing with 32-bit indices into 64-bit address space is extremely common. C/C++ code uses `int` (32-bit) for array indices, but pointers are 64-bit. The extended register form does the sign/zero extension AND the element-size multiplication in a single instruction, saving 1-2 instructions per array access.

**Note on SP**: Both the immediate form and the extended register form accept `SP` as source and destination (register 31 = SP). The shifted register encoding uses XZR (not SP) for register 31. In practice, if you write `ADD X0, SP, X1, LSL #2`, the assembler automatically selects the extended register encoding (where `LSL` is an alias for `UXTX`), so it works.
This is why a disassembler may show `add x0, sp, x0` or `add x0, sp, x0, lsl #2` — these use the **extended register** encoding (not shifted register), because SP can only appear in that encoding. The distinction only matters if you're hand-encoding machine code.

**How to tell which encoding was used from disassembly:**

```asm
// 8b0003e0   add x0, xzr, x0         ← Shifted register encoding (Rn field = 31 = XZR)
// 8b2063e0   add x0, sp, x0          ← Extended register encoding (Rn field = 31 = SP, extend = UXTX)
// 8b206be0   add x0, sp, x0, lsl #2  ← Extended register encoding (Rn = SP, extend = UXTX #2)
// These are DIFFERENT opcodes even though they both say "add".
```

### 5.4 ADC / SBC — Add/Subtract with Carry

`ADC` (Add with Carry) adds two registers **plus** the current carry flag (C). The carry flag is a single bit in the processor's flags register — it is either 0 or 1, and it was set by a previous flag-setting instruction like `ADDS` or `SUBS`. This lets you chain additions across multiple registers to handle numbers bigger than 64 bits.

`SBC` (Subtract with Carry) subtracts using the carry flag as a "borrow" indicator. It computes `Xn + NOT(Xm) + C`. When C=1 (no borrow from previous subtraction), this simplifies to `Xn - Xm`. When C=0 (there was a borrow), this gives `Xn - Xm - 1`, propagating the borrow.

```
ADC  Xd|XZR, Xn|XZR, Xm|XZR   // Xd = Xn + Xm + C [64-bit]
ADC  Wd|WZR, Wn|WZR, Wm|WZR   // Wd = Wn + Wm + C [32-bit]
SBC  Xd|XZR, Xn|XZR, Xm|XZR   // Xd = Xn + NOT(Xm) + C [64-bit]
SBC  Wd|WZR, Wn|WZR, Wm|WZR   // Wd = Wn + NOT(Wm) + C [32-bit]

ADCS Xd|XZR, Xn|XZR, Xm|XZR   // + set flags [64-bit]
ADCS Wd|WZR, Wn|WZR, Wm|WZR   // + set flags [32-bit]
SBCS Xd|XZR, Xn|XZR, Xm|XZR   // + set flags [64-bit]
SBCS Wd|WZR, Wn|WZR, Wm|WZR   // + set flags [32-bit]
```

Here, `C` is the carry flag value (0 or 1) from the PSTATE condition flags, as set by the most recent flag-setting instruction. No shift or immediate forms exist for ADC/SBC.
These are essential for **multi-word arithmetic** (e.g., 128-bit addition): ```asm // 128-bit addition: (X1:X0) + (X3:X2) -> (X1:X0) ADDS X0, X0, X2 // add low 64 bits, set carry ADC X1, X1, X3 // add high 64 bits + carry ``` **What this REALLY does — traced with values:** ```asm // Add 0x00000000_00000001:FFFFFFFF_FFFFFFFE + 0x00000000_00000000:00000000_00000005 // X1:X0 = 0x0000000000000001 : 0xFFFFFFFFFFFFFFFE // X3:X2 = 0x0000000000000000 : 0x0000000000000005 ADDS X0, X0, X2 // 0xFFFFFFFFFFFFFFFE + 5 = 0x0000000000000003 (wraps! C=1) ADC X1, X1, X3 // 0x0000000000000001 + 0 + 1(carry) = 0x0000000000000002 // Result: X1:X0 = 0x0000000000000002:0000000000000003 = correct 128-bit sum ``` SBC is used similarly for multi-word subtraction. Note that SBC uses the carry flag in the same inverted sense as ARM subtraction: C=1 means no borrow. **What SBC REALLY does — traced:** ```asm // 128-bit subtraction: (X1:X0) - (X3:X2) → (X1:X0) // Subtract 0x00000000_00000001:0000000000000002 - 0x00000000_00000000:0000000000000005 SUBS X0, X0, X2 // 0x0000000000000002 - 5 = wraps to 0xFFFFFFFFFFFFFFFD // C=0 (borrow occurred: 2 < 5) SBC X1, X1, X3 // X1 + NOT(X3) + C = 1 + NOT(0) + 0 = 1 + 0xFFFFFFFFFFFFFFFF + 0 // = 0 (wraps, carry out but we ignore it) // Result: 0x0000000000000000:FFFFFFFFFFFFFFFD = correct (2 - 5 = -3 as 128-bit signed) ``` ### 5.5 NEG / NEGS — Negate `NEG` computes the two's complement negation of a value — it flips the sign. `NEG Xd, Xm` is equivalent to `0 - Xm`. It is an alias for `SUB Xd, XZR, Xm` (subtracting the value from the zero register). 
``` NEG Xd|XZR, Xm|XZR{, LSL|LSR|ASR #0-63} // Alias for: SUB Xd|XZR, XZR, Xm|XZR{, LSL|LSR|ASR #0-63} [64-bit] NEG Wd|WZR, Wm|WZR{, LSL|LSR|ASR #0-31} // Alias for: SUB Wd|WZR, WZR, Wm|WZR{, LSL|LSR|ASR #0-31} [32-bit] NEGS Xd|XZR, Xm|XZR{, LSL|LSR|ASR #0-63} // Alias for: SUBS Xd|XZR, XZR, Xm|XZR{, LSL|LSR|ASR #0-63} NEGS Wd|WZR, Wm|WZR{, LSL|LSR|ASR #0-31} ``` ### 5.6 NGC / NGCS — Negate with Carry `NGC` negates a value while incorporating the carry flag, used for multi-word negation. It is an alias for `SBC Xd, XZR, Xm`, which computes `0 + NOT(Xm) + C`. In a multi-word negate, the first word uses `NEGS` (which sets the carry), and subsequent words use `NGC` to propagate the borrow. ``` NGC Xd|XZR, Xm|XZR // Alias for: SBC Xd|XZR, XZR, Xm|XZR [64-bit] NGC Wd|WZR, Wm|WZR // Alias for: SBC Wd|WZR, WZR, Wm|WZR [32-bit] NGCS Xd|XZR, Xm|XZR // Alias for: SBCS Xd|XZR, XZR, Xm|XZR NGCS Wd|WZR, Wm|WZR ``` Useful in multi-word negation. --- ## 6. Data Processing — Logical Bitwise operations for masking, setting, clearing, and toggling bits. These use a special "bitmask immediate" encoding that can represent many (but not all) bit patterns. ### 6.1 Basic Logical Instructions These perform bitwise operations — they operate on each bit position independently: - `AND`: Result bit is 1 only if **both** input bits are 1. Used for masking (extracting specific bits). - `ORR`: Result bit is 1 if **either or both** input bits are 1 (inclusive OR). Used for setting specific bits. - `EOR`: Result bit is 1 if the input bits are **different** (exclusive OR). Used for toggling bits and simple encryption. - `ANDS`: Same as AND, but also updates the condition flags (N, Z, C=0, V=0). 
``` AND Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Xd = Xn & (shifted Xm) [64-bit] AND Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} // Wd = Wn & (shifted Wm) [32-bit] ORR Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Xd = Xn | (shifted Xm) [64-bit] ORR Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} EOR Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Xd = Xn ^ (shifted Xm) [64-bit] EOR Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} ANDS Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // AND + set flags [64-bit] ANDS Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} // AND + set flags [32-bit] ``` **Why all four shifts?** Unlike arithmetic instructions (ADD/SUB, which only allow LSL/LSR/ASR), logical instructions also allow **ROR** because rotate-and-mask patterns are common in cryptography, hash functions, and CRC computations. There is **no ORRS or EORS** instruction in AArch64. Only `ANDS` and `BICS` have flag-setting variants. If you need flags after ORR/EOR, follow with `TST` or `CMP`. ### 6.2 BIC — Bit Clear `BIC` stands for "Bit Clear." It ANDs the first operand with the bitwise NOT of the second — every bit that is 1 in Xm gets cleared (set to 0) in the result. Think of it as using Xm as a mask of which bits to turn off. ``` BIC Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Xd = Xn & ~(shifted Xm) [64-bit] BIC Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} // Wd = Wn & ~(shifted Wm) [32-bit] BICS Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // same + set flags [64-bit] BICS Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} ``` **Important**: `BIC` only has a shifted-register form, not an immediate form. 
To clear bits by immediate, use `AND` with the inverted bitmask: ```asm // BIC X0, X0, #0xFF ← ILLEGAL, no BIC immediate form AND X0, X0, #0xFFFFFFFFFFFFFF00 // Correct: AND with inverted mask ``` ### 6.3 ORN / EON — OR-NOT / XOR-NOT `ORN` performs OR with the bitwise NOT of the second operand: it flips every bit of Xm, then ORs the result with Xn (`Xd = Xn | ~Xm`). `EON` does the same but with XOR: `Xd = Xn ^ ~Xm`. These save an instruction when you need to NOT a value before ORing or XORing it. ``` ORN Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Xd = Xn | ~(shifted Xm) [64-bit] ORN Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} EON Xd|XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Xd = Xn ^ ~(shifted Xm) [64-bit] EON Wd|WZR, Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} ``` These have no flag-setting forms and no immediate forms. ### 6.4 MVN — Move NOT (Bitwise NOT) `MVN` (Move NOT) flips every bit in the source register: every 0 becomes 1, every 1 becomes 0. This is called a bitwise NOT (written as `~Xm` in C). It is an alias for `ORN Xd, XZR, Xm` — ORing zero with `~Xm` just gives `~Xm`. ``` MVN Xd|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Alias for: ORN Xd|XZR, XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} [64-bit] MVN Wd|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} // Alias for: ORN Wd|WZR, WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} [32-bit] ``` **32-bit note**: `MVN W0, W1` inverts 32 bits and zeroes the upper 32 of X0. Different from `MVN X0, X1` which inverts all 64 bits. ### 6.5 Logical — Immediate Form The immediate form performs bitwise AND/ORR/EOR with a constant encoded in the instruction. Unlike ADD/SUB immediate (which uses a simple 12-bit value), logical immediate uses a special **bitmask encoding** that can represent many useful bit patterns (masks, alternating bits, aligned ranges) but NOT arbitrary constants. Register 31 in Rd means **SP** (not XZR) for the non-flag-setting forms, making `AND SP, X0, #0xFFF...` valid. Register 31 in Rn means **XZR**. 
``` AND Xd|SP, Xn|XZR, #bitmask_imm // 64-bit; Rd is SP, Rn is XZR AND Wd|WSP, Wn|WZR, #bitmask_imm ORR Xd|SP, Xn|XZR, #bitmask_imm ORR Wd|WSP, Wn|WZR, #bitmask_imm EOR Xd|SP, Xn|XZR, #bitmask_imm EOR Wd|WSP, Wn|WZR, #bitmask_imm ANDS Xd|XZR, Xn|XZR, #bitmask_imm // flag-setting; Rd is XZR not SP [64-bit] ANDS Wd|WZR, Wn|WZR, #bitmask_imm ``` The bitmask immediate is **not** an arbitrary constant. Since every instruction must fit in 32 bits and the opcode and register fields already take up most of that space, there aren't enough bits left to store an arbitrary 64-bit constant. Instead, ARM uses a clever encoding that can represent a useful subset of bit patterns — things like masks, alternating bits, and aligned ranges — using only 13 bits. **How it works, step by step:** 1. **Pick an element size** `e`: must be 2, 4, 8, 16, 32, or 64 bits. 2. **Start with `s` consecutive 1-bits** at the bottom of an `e`-bit element, where `s` is at least 1 and at most `e-1`. (You can't have all zeros or all ones within the element.) For example, with `e=8` and `s=3`, you start with `00000111`. 3. **Right-rotate** that pattern within the `e`-bit element by `r` positions, where `r` is 0 to `e-1`. Bits that fall off the right wrap around to the left. For example, rotating `00000111` right by 1 gives `10000011` (the bottom 1 wraps to the top). 4. **Replicate** the `e`-bit element across the full 64-bit (or 32-bit) register. For example, with `e=8` and pattern `10000011`, you get `10000011_10000011_10000011_10000011_10000011_10000011_10000011_10000011` = `0x8383838383838383`. 
**Worked examples:** ``` // Example 1: e=64, s=8, r=0 // Step 2: 64-bit element with 8 ones at bottom: 0x00000000000000FF // Step 3: rotate right by 0 → unchanged: 0x00000000000000FF // Step 4: element is already 64 bits, no replication needed // Result: 0x00000000000000FF (= 0xFF) // Example 2: e=64, s=8, r=8 // Step 2: 8 ones at bottom: 0x00000000000000FF // Step 3: rotate right by 8 → 0xFF00000000000000 // (the 8 set bits at positions 0-7 wrap around to positions 56-63) // Step 4: no replication (64-bit element) // Result: 0xFF00000000000000 // Example 3: e=8, s=4, r=0 // Step 2: 8-bit element with 4 ones at bottom: 00001111 // Step 3: rotate right by 0 → 00001111 // Step 4: replicate 8 times: 0x0F0F0F0F0F0F0F0F // Result: 0x0F0F0F0F0F0F0F0F // Example 4: e=2, s=1, r=0 // Step 2: 2-bit element with 1 one at bottom: 01 // Step 3: rotate right by 0 → 01 // Step 4: replicate 32 times: 01010101...01 = 0x5555555555555555 // Result: 0x5555555555555555 // Example 5: e=2, s=1, r=1 // Step 2: 2-bit element with 1 one at bottom: 01 // Step 3: rotate right by 1 → 10 (the 1 wraps from bottom to top) // Step 4: replicate 32 times: 10101010...10 = 0xAAAAAAAAAAAAAAAA // Result: 0xAAAAAAAAAAAAAAAA // Example 6: e=64, s=32, r=32 // Step 2: 32 ones at bottom: 0x00000000FFFFFFFF // Step 3: rotate right by 32 → 0xFFFFFFFF00000000 // Step 4: no replication // Result: 0xFFFFFFFF00000000 ``` **What the hardware actually encodes**: The instruction stores these three parameters in 13 bits: 6 bits for `imms` (encodes both `e` and `s`), 6 bits for `immr` (encodes `r`), and 1 bit called `N` (helps determine `e`). The exact encoding is complex — consult the ARM ARM for the decode table — but the concept above is what it represents. **32-bit vs 64-bit difference**: For Wd forms, the element size can only go up to 32 (not 64), and the pattern replicates to fill 32 bits. 
This means some bitmask immediates valid for Xd are NOT valid for Wd: ```asm ORR X0, XZR, #0xFFFFFFFF00000000 // VALID: element=64, ones=32, rotate=32 ORR W0, WZR, #0xFFFFFFFF00000000 // ILLEGAL: needs element=64, but 32-bit only allows up to 32 ORR W0, WZR, #0xFFFF0000 // VALID: element=32, ones=16, rotate=16 ``` **Quick test — is my value encodable?** Ask: can I describe it as a run of consecutive 1-bits, optionally rotated, within a 2/4/8/16/32/64-bit chunk, tiled across the register? If yes, it's encodable. If not (like `0x12345678` or `5`), it's not. Not encodable: `0x12345678`, `5`, `0x1234`, anything without a repeating-rotated-ones pattern. Note that `AND`/`ORR`/`EOR` immediate forms accept **SP** as the destination register (not XZR), while `ANDS` immediate accepts XZR but NOT SP. This is because register 31 means different things in different contexts: in most instructions it means XZR (the zero register), but in certain instructions like `ADD` immediate and logical immediate (non-flag-setting), it means SP (the stack pointer). The hardware uses the opcode to decide which interpretation to use. --- ## 7. Shift & Rotate Operations Shift instructions move all bits in a register left or right by a specified number of positions. They are fundamental to assembly — used for multiplication/division by powers of 2, bit extraction, and building complex values. **Why shifts matter**: Shifting left by N is the same as multiplying by 2^N (but much faster — one cycle vs many for a multiply). Shifting right divides by 2^N. Compilers use shifts extensively: `x * 12` becomes `(x << 3) + (x << 2)` (two shifts and an add), which is faster than a multiply on many cores. Shifts are also how you access individual bits and build/parse packed data formats (network headers, pixel formats, bitfields). ### 7.1 Dedicated Shift Instructions These shift a register by a constant amount known at assemble time. 
They are all **aliases** — the hardware encodes them as bitfield (UBFM/SBFM) or extract (EXTR) instructions. The assembler and disassembler translate between the friendly names and the raw encodings automatically. ``` LSL Xd|XZR, Xn|XZR, #0-63 // Logical Shift Left (immediate) [alias: UBFM] LSL Wd|WZR, Wn|WZR, #0-31 LSR Xd|XZR, Xn|XZR, #0-63 // Logical Shift Right (immediate) [alias: UBFM] LSR Wd|WZR, Wn|WZR, #0-31 ASR Xd|XZR, Xn|XZR, #0-63 // Arithmetic Shift Right (immediate) [alias: SBFM] ASR Wd|WZR, Wn|WZR, #0-31 ROR Xd|XZR, Xn|XZR, #0-63 // Rotate Right (immediate) [alias: EXTR] ROR Wd|WZR, Wn|WZR, #0-31 ``` **These are all aliases.** They are not separate opcodes — the hardware encodes them as bitfield or extract instructions: | Instruction | Actually encodes as | |---|---| | `LSL Xd, Xn, #s` | `UBFM Xd, Xn, #(-s MOD 64), #(63-s)` | | `LSR Xd, Xn, #s` | `UBFM Xd, Xn, #s, #63` | | `ASR Xd, Xn, #s` | `SBFM Xd, Xn, #s, #63` | | `ROR Xd, Xn, #s` | `EXTR Xd, Xn, Xn, #s` | | `LSL Wd, Wn, #s` | `UBFM Wd, Wn, #(-s MOD 32), #(31-s)` | | `LSR Wd, Wn, #s` | `UBFM Wd, Wn, #s, #31` | | `ASR Wd, Wn, #s` | `SBFM Wd, Wn, #s, #31` | | `ROR Wd, Wn, #s` | `EXTR Wd, Wn, Wn, #s` | ### 7.2 Variable (Register) Shifts These shift a register by an amount stored in another register (determined at runtime). The shift amount is taken modulo the register width: `Xm MOD 64` for 64-bit, `Wm MOD 32` for 32-bit — so shifting by 65 is the same as shifting by 1. The real instruction mnemonics are LSLV/LSRV/ASRV/RORV; the assembler accepts LSL/LSR/ASR/ROR with three register operands as aliases. 
```
LSL Xd|XZR, Xn|XZR, Xm|XZR // Alias for: LSLV Xd|XZR, Xn|XZR, Xm|XZR [64-bit]
LSL Wd|WZR, Wn|WZR, Wm|WZR // Alias for: LSLV Wd|WZR, Wn|WZR, Wm|WZR [32-bit]
LSR Xd|XZR, Xn|XZR, Xm|XZR // Alias for: LSRV Xd|XZR, Xn|XZR, Xm|XZR
LSR Wd|WZR, Wn|WZR, Wm|WZR
ASR Xd|XZR, Xn|XZR, Xm|XZR // Alias for: ASRV Xd|XZR, Xn|XZR, Xm|XZR
ASR Wd|WZR, Wn|WZR, Wm|WZR
ROR Xd|XZR, Xn|XZR, Xm|XZR // Alias for: RORV Xd|XZR, Xn|XZR, Xm|XZR
ROR Wd|WZR, Wn|WZR, Wm|WZR
```

`LSLV`, `LSRV`, `ASRV`, `RORV` are the real instruction mnemonics (all register fields use XZR for register 31, never SP). The assembler resolves `LSL Xd, Xn, Xm` (three registers) as `LSLV` and `LSL Xd, Xn, #imm` (register + immediate) as `UBFM`. The shift amount uses only the lower 6 bits of `Xm` (for 64-bit) or lower 5 bits of `Wm` (for 32-bit). The actual shift is `Xm MOD 64` or `Wm MOD 32`.

**32-bit note**: Even though the shift register is `Wm`, only its low 5 bits matter. `LSL W0, W1, W2` where W2=33 shifts by 33 MOD 32 = 1.

### 7.3 Shift Semantics

- **LSL #n**: Shifts left, filling vacated bits with 0. Bits shifted out of the MSB are lost. Equivalent to unsigned multiply by 2^n (with truncation).
- **LSR #n**: Shifts right, filling vacated bits with 0. Equivalent to unsigned divide by 2^n (truncating toward zero).
- **ASR #n**: Shifts right, filling vacated bits with copies of the original MSB (sign bit). Equivalent to signed divide by 2^n (rounding toward negative infinity, NOT toward zero — this differs from C's `/` operator for negative numbers).
- **ROR #n**: Rotates right — bits shifted out the bottom re-enter at the top. No information is lost.

```asm
// ASR rounding toward -infinity example:
// If X0 = -7 (0xFFFFFFFFFFFFFFF9)
ASR X1, X0, #1       // X1 = -4 (not -3!)
// -7 / 2 = -3.5, rounded toward -infinity = -4
// C's -7/2 = -3 (rounded toward 0)
```

---

## 8. Shifted Register & Extended Register Forms

Many instructions accept a modified second operand — shifted or extended before the operation.
This happens in the same cycle as the main operation (the "barrel shifter") at no extra cost. ### 8.1 Shifted Register Operand Many data-processing instructions accept a final operand of the form `Xm, <shift> #<amount>`. In **all** shifted register encodings, register 31 means **XZR** (never SP): ``` <op> Xd|XZR, Xn|XZR, Xm|XZR, LSL #amount <op> Xd|XZR, Xn|XZR, Xm|XZR, LSR #amount <op> Xd|XZR, Xn|XZR, Xm|XZR, ASR #amount <op> Xd|XZR, Xn|XZR, Xm|XZR, ROR #amount // Only for logical ops (AND/ORR/EOR/BIC/ORN/EON) ``` The shift is applied to `Xm` before the operation. This is called a "barrel shift" — the hardware has a dedicated shifter circuit built into the data path, so the shift happens in the same clock cycle as the main operation (no extra cost on most implementations). **Which instructions support which shifts:** | Instruction class | LSL | LSR | ASR | ROR | |---|---|---|---|---| | ADD/SUB (shifted reg) | ✓ | ✓ | ✓ | ✗ | | AND/ORR/EOR/BIC/ORN/EON | ✓ | ✓ | ✓ | ✓ | | CMP/CMN (shifted reg form, alias of SUBS/ADDS) | ✓ | ✓ | ✓ | ✗ | | TST (shifted reg form, alias of ANDS) | ✓ | ✓ | ✓ | ✓ | | NEG/NEGS (alias of SUB/SUBS with XZR) | ✓ | ✓ | ✓ | ✗ | | MVN (alias of ORN with XZR) | ✓ | ✓ | ✓ | ✓ | ### 8.2 Extended Register Operand Only ADD/SUB (and their S variants) and CMP/CMN support extended register. In the extended register encoding, register 31 in the `Xd` and `Xn` positions means **SP** (not XZR), while register 31 in the `Rm` position means **XZR**. For the flag-setting variants (ADDS/SUBS/CMP/CMN), `Xd` uses XZR instead of SP: ``` ADD Xd|SP, Xn|SP, Wm|WZR, UXTB {#0-4} // Zero-extend byte from Wm, then shift left by 0–4 ADD Xd|SP, Xn|SP, Wm|WZR, SXTW {#0-4} // Sign-extend word from Wm, then shift left by 0–4 SUB Xd|SP, Xn|SP, Wm|WZR, SXTW {#0-4} // Subtract the extended value CMP Xn|SP, Wm|WZR, SXTW {#0-4} // Compare with extended value (SUBS XZR, ...) 
``` The `{#0-4}` shift is applied **after** extension: `#0` = no shift (×1), `#1` = ×2, `#2` = ×4, `#3` = ×8, `#4` = ×16. This covers all common C data type sizes. This exists specifically for array indexing and address arithmetic. **What each extend does — concrete:** ```asm // If W3 = 0x800000AB: ADD X0, X1, W3, UXTB // Zero-extend byte: take bits [7:0] = 0xAB // Extended value = 0x00000000_000000AB // X0 = X1 + 0xAB ADD X0, X1, W3, SXTB // Sign-extend byte: take bits [7:0] = 0xAB (bit 7=1 → negative) // Extended value = 0xFFFFFFFF_FFFFFFAB = -85 // X0 = X1 - 85 ADD X0, X1, W3, UXTW // Zero-extend word: take all 32 bits = 0x800000AB // Extended value = 0x00000000_800000AB (positive as 64-bit!) // X0 = X1 + 0x800000AB ADD X0, X1, W3, SXTW // Sign-extend word: take all 32 bits, bit 31=1 → negative // Extended value = 0xFFFFFFFF_800000AB = -2147483477 // X0 = X1 - 2147483477 // With shift amount (multiply after extending): ADD X0, X1, W3, UXTB #2 // Zero-extend byte (0xAB), then shift left by 2 // = 0xAB << 2 = 0x2AC // X0 = X1 + 0x2AC ``` **UXTB vs UXTH vs UXTW vs UXTX**: Each extracts a different-width chunk from the bottom of the register. UXTB takes 8 bits, UXTH takes 16, UXTW takes 32, UXTX takes all 64 (effectively just a shift). The S variants (SXTB, SXTH, SXTW, SXTX) do the same but sign-extend instead of zero-extend. ### 8.3 When to Use Which Form - **Shifted register**: When you need to combine an ALU operation with a shift (common in hash functions, crypto, bitfield manipulation). - **Extended register**: When mixing 32-bit and 64-bit values, or computing addresses from an index that's smaller than 64 bits. - **Immediate**: When the constant fits the encoding constraints. **Side-by-side: three ways to compute `base + offset * 8`:** ```asm // All three compute X0 = X1 + X2*8, but suit different situations: // 1. Shifted register (when X2 is already 64-bit): ADD X0, X1, X2, LSL #3 // X0 = X1 + (X2 << 3) // Use when: X2 is a 64-bit value // 2. 
Extended register (when index is 32-bit): ADD X0, X1, W2, SXTW #3 // X0 = X1 + sign_extend(W2) << 3 // Use when: W2 is a 32-bit signed index (like C's int) ADD X0, X1, W2, UXTW #3 // X0 = X1 + zero_extend(W2) << 3 // Use when: W2 is a 32-bit unsigned index (like C's unsigned) // 3. Immediate (when offset is a constant): ADD X0, X1, #40 // X0 = X1 + 40 // Use when: the offset is known at compile time (40 = 5*8) ``` **What shifted register REALLY does — traced example:** ```asm // ADD X0, X1, X2, LSL #3 // "Add X1 and (X2 shifted left by 3)" — one instruction, one cycle // // If X1 = 0x1000 (base address) and X2 = 5 (index): // X2 LSL 3 = 5 × 8 = 40 = 0x28 // X0 = 0x1000 + 0x28 = 0x1028 // This computes &array[5] for an 8-byte element array in one instruction. // AND X0, X1, X2, ROR #16 // "AND X1 with (X2 rotated right by 16)" // // If X1 = 0xFFFF0000FFFF0000 and X2 = 0x00FF00FF00FF00FF: // X2 ROR 16 = 0x00FF00FF00FF00FF rotated right 16 = 0x00FF00FF00FF00FF (symmetric!) // X0 = X1 & (X2 ROR 16) — useful in crypto/hash mixing ``` --- ## 9. Move Instructions & Aliases `MOV` is the most heavily aliased instruction in AArch64 — it maps to different real instructions depending on the operand. Understanding these aliases is essential for reading disassembly. ### 9.1 MOV — The Most Aliased Instruction `MOV` in AArch64 is **never** its own instruction. 
It always assembles as something else: | What you write | What it actually is | When | |---|---|---| | `MOV Xd\|XZR, Xm\|XZR` | `ORR Xd\|XZR, XZR, Xm\|XZR` | Register-to-register move (shifted-reg encoding; reg 31 = XZR) | | `MOV Wd\|WZR, Wm\|WZR` | `ORR Wd\|WZR, WZR, Wm\|WZR` | 32-bit (zeroes upper 32) | | `MOV Xd\|XZR, #imm` | `MOVZ Xd\|XZR, #imm` | If imm fits in 16 bits at some position | | `MOV Wd\|WZR, #imm` | `MOVZ Wd\|WZR, #imm` | Same (only 2 positions: LSL #0, #16) | | `MOV Xd\|XZR, #imm` | `MOVN Xd\|XZR, #adjusted` | If NOT(imm) fits in 16 bits | | `MOV Wd\|WZR, #imm` | `MOVN Wd\|WZR, #adjusted` | NOT applied at 32-bit width | | `MOV Xd\|SP, #imm` | `ORR Xd\|SP, XZR, #bitmask_imm` | If imm is a valid bitmask immediate (note: Rd = **SP** here, not XZR!) | | `MOV Wd\|WSP, #imm` | `ORR Wd\|WSP, WZR, #bitmask_imm` | 32-bit bitmask immediate | | `MOV Xd\|SP, SP` | `ADD Xd\|SP, SP, #0` | Moving from SP (immediate encoding; reg 31 = SP in both Rd and Rn) | | `MOV SP, Xn\|SP` | `ADD SP, Xn\|SP, #0` | Moving to SP | ### 9.2 MOVZ — Move Wide with Zero ``` MOVZ Xd|XZR, #imm16{, LSL #0|#16|#32|#48} // 64-bit: place 16-bit value at one of 4 positions MOVZ Wd|WZR, #imm16{, LSL #0|#16} // 32-bit: only 2 positions available ``` Places a 16-bit immediate into the specified 16-bit slot and **zeroes** all other bits. **32-bit constraint**: The Wd form only allows `LSL #0` or `LSL #16` (two 16-bit slots in a 32-bit register). The Xd form allows `LSL #0`, `#16`, `#32`, or `#48` (four slots). 
```asm MOVZ X0, #0xABCD, LSL #16 // X0 = 0x00000000ABCD0000 MOVZ W0, #0xABCD, LSL #16 // W0 = 0xABCD0000, X0 = 0x00000000ABCD0000 (upper zeroed) MOVZ W0, #0xABCD, LSL #32 // ILLEGAL — only 0/16 for Wd ``` ### 9.3 MOVK — Move Wide with Keep ``` MOVK Xd|XZR, #imm16{, LSL #0|#16|#32|#48} // 64-bit: insert 16-bit value at one of 4 positions MOVK Wd|WZR, #imm16{, LSL #0|#16} // 32-bit: only 2 positions available ``` Places a 16-bit immediate into the specified slot, **keeping** all other bits unchanged. **32-bit constraint**: Same as MOVZ — the Wd form only has two 16-bit slots. Building a 32-bit constant requires at most 2 instructions: ```asm // Load 0x12345678 into W0: MOVZ W0, #0x5678 // W0 = 0x00005678 MOVK W0, #0x1234, LSL #16 // W0 = 0x12345678 // Load 0x123456789ABCDEF0 into X0 (needs 4): MOVZ X0, #0xDEF0 // X0 = 0x000000000000DEF0 MOVK X0, #0x9ABC, LSL #16 // X0 = 0x000000009ABCDEF0 MOVK X0, #0x5678, LSL #32 // X0 = 0x00005678_9ABCDEF0 MOVK X0, #0x1234, LSL #48 // X0 = 0x12345678_9ABCDEF0 ``` ### 9.4 MOVN — Move Wide with NOT ``` MOVN Xd|XZR, #imm16{, LSL #0|#16|#32|#48} // 64-bit: place, then bitwise-NOT all 64 bits MOVN Wd|WZR, #imm16{, LSL #0|#16} // 32-bit: NOT applies to 32-bit result, upper 32 zeroed ``` Like MOVZ but inverts all bits after placing the immediate. Useful for loading values like -1, -2, etc. **32-bit form**: The NOT applies to the 32-bit result, and the upper 32 bits of Xd are zeroed (standard W-register write behavior): ```asm MOVN X0, #0 // X0 = ~0x0000000000000000 = 0xFFFFFFFFFFFFFFFF = -1 MOVN X0, #1 // X0 = ~0x0000000000000001 = 0xFFFFFFFFFFFFFFFE = -2 MOVN W0, #0 // W0 = ~0x00000000 = 0xFFFFFFFF, X0 = 0x00000000FFFFFFFF (NOT -1 as 64-bit!) MOVN W0, #1 // W0 = ~0x00000001 = 0xFFFFFFFE, X0 = 0x00000000FFFFFFFE ``` **RE (reverse engineering) trap**: `MOVN W0, #0` gives `X0 = 0x00000000FFFFFFFF`, NOT `0xFFFFFFFFFFFFFFFF`. If the compiler wants 64-bit -1, it uses `MOVN X0, #0`. 
Seeing `MOVN W0` in disassembly means the original code was working with 32-bit types.

### 9.5 MOV (bitmask immediate)

When `MOV Xd, #imm` has an immediate that is a valid bitmask immediate but is not reachable with a single `MOVZ`/`MOVN` (the assembler prefers those when they fit), it encodes as:

```
ORR Xd|SP, XZR, #bitmask_imm // Note: Rd=SP in logical immediate (non-S)!
ORR Wd|WSP, WZR, #bitmask_imm
```

```asm
MOV X0, #0x00FF00FF00FF00FF // → ORR X0, XZR, #0x00FF00FF00FF00FF
MOV X0, #0xAAAAAAAAAAAAAAAA // → ORR X0, XZR, #0xAAAAAAAAAAAAAAAA
// (MOV X0, #0xFF assembles as MOVZ X0, #0xFF instead — the wide-immediate
// forms take precedence when the value fits in one 16-bit slot.)
```

### 9.6 LDR (literal) for Arbitrary Constants

When no encoding trick works, the assembler uses a **literal pool** load. A literal pool is a small area of constant data that the assembler places in memory near your code (usually right after a function). Instead of encoding the constant inside the instruction, the CPU loads it from this nearby data using a PC-relative load:

```asm
LDR X0, =0x123456789ABCDEF0 // Pseudo-instruction
// Assembler places the constant in a nearby literal pool and generates:
// LDR X0, [PC, #offset_to_literal]
```

The `=` syntax is a GNU assembler (gas) convenience. Some assemblers use different syntax.

**Why literal pools exist**: AArch64 instructions are fixed at 32 bits, so there's simply not enough room to embed a full 64-bit constant. The best the ISA can do inline is MOVZ+MOVK (up to 4 instructions = 16 bytes of code). A literal pool load uses just 1 instruction (4 bytes of code) + 8 bytes of data, which is smaller for complex constants and faster to execute.

**How the assembler decides**: When you write `LDR X0, =val`, the assembler checks if `val` can be encoded more efficiently as a `MOV` (via MOVZ, MOVN, or bitmask immediate). If so, it emits the `MOV` instead. Only if no single-instruction encoding works does it fall back to a literal pool load. Some assemblers (like LLVM's integrated assembler) are smarter than others about this.

**Literal pool range**: The LDR (literal) instruction uses a 19-bit signed offset (±1 MB). The assembler must place the literal pool close enough to the load.
For large functions, it may need to insert pools mid-function (after unconditional branches, so execution doesn't fall into the data). **Multiple loads of the same constant**: The assembler typically deduplicates — if you write `LDR X0, =0x1234` in three places, only one copy of 0x1234 appears in the literal pool. ### 9.7 ADR / ADRP — PC-Relative Address Loading ``` ADR Xd|XZR, label // Xd = PC + offset (±1 MB range, byte-aligned) ADRP Xd|XZR, label // Xd = (PC & ~0xFFF) + (offset << 12) (±4 GB, page-aligned) ``` `ADR` loads the exact address of a label into a register, using a 21-bit signed offset from PC (±1 MB range). `ADRP` loads the address of the **4 KB page** containing the label. Memory is divided into 4096-byte (0x1000) pages. `ADRP` zeroes the bottom 12 bits of PC (the `& ~0xFFF` part — `~0xFFF` is `0xFFFFFFFFFFFFF000`, a mask that clears the low 12 bits) and then adds a page-granularity offset. This gives ±4 GB range but only page-level precision. You then use `ADD` with `:lo12:` to add back the offset within the page: ```asm ADRP X0, my_global // X0 = page containing my_global ADD X0, X0, :lo12:my_global // X0 = exact address of my_global LDR X1, [X0] // X1 = value at my_global ``` This ADRP+ADD pattern is the standard way to access global variables in position-independent code (PIC) — code that works correctly regardless of where the OS loads it in memory. Since ADRP computes addresses relative to PC, the code doesn't contain any hardcoded absolute addresses. --- ## 10. Comparison & Test Instructions **Why separate comparison instructions exist**: You could compare two values using `SUBS` and ignoring the result, but the comparison instructions (`CMP`, `CMN`, `TST`) make intent clear and — crucially — write to the zero register instead of a GPR. This means they don't consume a register for an unwanted result. `CMP X0, X1` is literally `SUBS XZR, X0, X1` — the subtraction happens, flags are set, and the result is discarded into XZR. 
### 10.1 CMP — Compare `CMP` subtracts the second operand from the first and sets the condition flags (N, Z, C, V) based on the result, but **discards the result** — it writes to the zero register. It is used before conditional branches or conditional selects to set up the flags. `CMP Xn, Xm` is an alias for `SUBS XZR, Xn, Xm`. ``` CMP Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63} // Alias for: SUBS XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63} [64-bit shifted-reg] CMP Wn|WZR, Wm|WZR{, LSL|LSR|ASR #0-31} // 32-bit shifted-reg CMP Xn|SP, #imm12{, LSL #12} // Alias for: SUBS XZR, Xn|SP, #imm12{, LSL #12} [immediate] CMP Wn|WSP, #imm12{, LSL #12} // 32-bit immediate CMP Xn|SP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW {#0-4} // extended-reg CMP Xn|SP, Xm|XZR, UXTX|SXTX|LSL {#0-4} // extended-reg, 64-bit Rm ``` The result is discarded (written to XZR/WZR), only flags are kept. **32-bit note**: `CMP Wn, Wm` sets flags based on 32-bit subtraction: N = bit 31 of result, C/V from 32-bit arithmetic. This matters for signed comparisons — `CMP W0, W1; B.GT` checks if W0 > W1 as signed 32-bit values, regardless of the upper 32 bits of X0/X1. ### 10.2 CMN — Compare Negative `CMN` ("Compare Negative") **adds** the two operands and sets flags, discarding the result. It is an alias for `ADDS XZR, Xn, Xm`. It is useful when you want to compare against a negative number — since `CMP` can only encode positive immediates, `CMN X0, #5` effectively tests if `X0 == -5`. ``` CMN Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63} // Alias for: ADDS XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR #0-63} [shifted-reg] CMN Wn|WZR, Wm|WZR{, LSL|LSR|ASR #0-31} // 32-bit shifted-reg CMN Xn|SP, #imm12{, LSL #12} // Alias for: ADDS XZR, Xn|SP, #imm12{, LSL #12} [immediate] CMN Wn|WSP, #imm12{, LSL #12} // 32-bit immediate CMN Xn|SP, Wm|WZR, UXTB|UXTH|UXTW|SXTB|SXTH|SXTW {#0-4} // extended-reg CMN Xn|SP, Xm|XZR, UXTX|SXTX|LSL {#0-4} // extended-reg, 64-bit Rm ``` `CMN` is like `CMP` but adds instead of subtracts. 
Equivalent to `CMP Xn, #-Xm` in terms of flag setting (but NOT identical for all edge cases due to signed overflow differences). Use case: `CMP X0, #-5` can't be encoded (negative immediate), but `CMN X0, #5` can. ### 10.3 TST — Test Bits `TST` performs a bitwise AND of two operands and sets flags, discarding the result. It is an alias for `ANDS XZR, Xn, op2`. Used to check if specific bits are set: after `TST X0, #1`, the zero flag (Z) tells you whether bit 0 was set (Z=0 means the bit was set; Z=1 means it was clear). ``` TST Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} // Alias for: ANDS XZR, Xn|XZR, Xm|XZR{, LSL|LSR|ASR|ROR #0-63} [shifted-reg] TST Wn|WZR, Wm|WZR{, LSL|LSR|ASR|ROR #0-31} // 32-bit shifted-reg TST Xn|XZR, #bitmask_imm // Alias for: ANDS XZR, Xn|XZR, #bitmask_imm [immediate] TST Wn|WZR, #bitmask_imm // 32-bit immediate ``` Sets N and Z based on the AND result (C and V are cleared). ```asm TST X0, #1 // Test bit 0 (is X0 odd?) B.NE is_odd // branch if bit was set (Z==0) TST X0, #0xF // Test lower nibble B.EQ lower_zero // branch if lower nibble is all zeros ``` ### 10.4 CCMP / CCMN — Conditional Compare `CCMP` checks a condition (from the current flags), and only performs its comparison if the condition is true. If the condition is false, it sets the flags to a value you choose via `#nzcv` (a 4-bit constant: bit 3=N, bit 2=Z, bit 1=C, bit 0=V). This lets you chain multiple comparisons into compound boolean expressions (AND / OR) without any branches. `CCMN` is the same but adds instead of subtracts (like CMN vs CMP). ``` CCMP Xn|XZR, Xm|XZR, #nzcv, cond // If cond true: compare Xn, Xm. Else: flags = #nzcv. 
[64-bit] CCMP Wn|WZR, Wm|WZR, #nzcv, cond CCMP Xn|XZR, #imm5, #nzcv, cond // Same with 5-bit immediate (0–31) [64-bit] CCMP Wn|WZR, #imm5, #nzcv, cond CCMN Xn|XZR, Xm|XZR, #nzcv, cond // Conditional CMN [64-bit] CCMN Wn|WZR, Wm|WZR, #nzcv, cond CCMN Xn|XZR, #imm5, #nzcv, cond // immediate form [64-bit] CCMN Wn|WZR, #imm5, #nzcv, cond ``` **This is one of the most powerful and unique instructions in AArch64.** It enables complex compound conditions without branches. The idea: CCMP checks a condition first. If that condition is true, it performs a normal comparison and sets flags. If the condition is false, it sets the flags to a value you specify in the `#nzcv` operand — this lets you control the outcome of the final branch. **Gotcha**: CCMP reads the current NZCV flags to evaluate `cond`, so a flag-setting instruction (CMP, SUBS, ANDS, TST, or another CCMP) **must** come before it. CCMP without a prior flag-setting instruction reads whatever stale flags happen to be in PSTATE — a bug that's hard to catch because it might work by accident during testing. ```asm // Equivalent of: if (x == 5 && y == 10) CMP X0, #5 CCMP X1, #10, #0, EQ // Only compare X1 with 10 if X0==5; else set flags=0000 (NE) B.EQ both_match // Walkthrough: // If X0 == 5: EQ is true → CCMP compares X1 vs 10 → B.EQ taken only if X1 == 10 // If X0 != 5: EQ is false → flags set to #0 (all zero, Z=0) → B.EQ not taken (Z must be 1 for EQ) // Equivalent of: if (x == 5 || y == 10) CMP X0, #5 CCMP X1, #10, #0b0100, NE // Only compare X1 with 10 if X0!=5; else set Z=1 (EQ) B.EQ either_match // Walkthrough: // If X0 == 5: NE is false → flags set to #0b0100 (Z=1) → B.EQ taken (first condition was true) // If X0 != 5: NE is true → CCMP compares X1 vs 10 → B.EQ taken only if X1 == 10 ``` The `#nzcv` operand is a 4-bit value specifying the flag state if the condition is false: bit 3 = N, bit 2 = Z, bit 1 = C, bit 0 = V. CCMP chains can implement arbitrary boolean combinations of comparisons without branching. 
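The chained-comparison logic above can be sanity-checked with a small C model of the flag behavior. The `Flags` struct and all helper names below are ours, not anything defined by the architecture — this is a sketch of the semantics, not the encoding, and only the flags needed for EQ/NE chains are modeled:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal NZCV model: only N, Z, C are derived; V is left 0 here. */
typedef struct { int n, z, c, v; } Flags;

/* CMP Xn, #imm — flags from Xn - imm. */
static Flags cmp(uint64_t xn, uint64_t imm) {
    uint64_t r = xn - imm;
    Flags f = { (int64_t)r < 0, r == 0, xn >= imm, 0 };
    return f;
}

static int cond_eq(Flags f) { return f.z; }   /* EQ: Z == 1 */
static int cond_ne(Flags f) { return !f.z; }  /* NE: Z == 0 */

/* CCMP Xn, #imm, #nzcv, cond — compare only if cond holds on the
 * CURRENT flags; otherwise set the flags straight from #nzcv. */
static Flags ccmp(Flags cur, int (*cond)(Flags), uint64_t xn,
                  uint64_t imm, unsigned nzcv) {
    if (cond(cur)) return cmp(xn, imm);
    Flags f = { (nzcv >> 3) & 1, (nzcv >> 2) & 1, (nzcv >> 1) & 1, nzcv & 1 };
    return f;
}

/* if (x == 5 && y == 10): CMP x,#5; CCMP y,#10,#0,EQ; B.EQ */
static int both_match(uint64_t x, uint64_t y) {
    Flags f = cmp(x, 5);
    f = ccmp(f, cond_eq, y, 10, 0x0);  /* skipped -> Z=0 -> EQ fails */
    return cond_eq(f);
}

/* if (x == 5 || y == 10): CMP x,#5; CCMP y,#10,#0b0100,NE; B.EQ */
static int either_match(uint64_t x, uint64_t y) {
    Flags f = cmp(x, 5);
    f = ccmp(f, cond_ne, y, 10, 0x4);  /* skipped -> Z=1 -> EQ holds */
    return cond_eq(f);
}
```

Tracing `both_match(4, 10)` through the model reproduces the walkthrough above: the first compare leaves Z=0, EQ is false, so the CCMP forces flags to `0000` and the final EQ test fails regardless of `y`.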
--- ## 11. Multiply & Divide ARM multiply and divide instructions. Unlike x86, ARM division never traps — divide-by-zero returns 0, and there is no remainder instruction (you compute it with MSUB). ### 11.1 MUL / MADD / MSUB `MUL` multiplies two registers and stores the low 64 (or 32) bits of the result. For multiplying two 64-bit values, the full mathematical result could be 128 bits, but `MUL` only keeps the low 64 — the same bits whether the inputs are signed or unsigned. `MADD` (Multiply-Add) computes `Xa + Xn × Xm` in one instruction. `MSUB` (Multiply-Subtract) computes `Xa - Xn × Xm`. `MUL` is actually an alias for `MADD` with the accumulator set to the zero register. ``` MUL Xd|XZR, Xn|XZR, Xm|XZR // Xd = Xn * Xm (low 64 bits) [64-bit] MUL Wd|WZR, Wn|WZR, Wm|WZR // Wd = Wn * Wm (low 32 bits) [32-bit] // Both alias for: MADD Rd, Rn, Rm, RZR MADD Xd|XZR, Xn|XZR, Xm|XZR, Xa|XZR // Xd = Xa + (Xn * Xm) [64-bit] MADD Wd|WZR, Wn|WZR, Wm|WZR, Wa|WZR // Wd = Wa + (Wn * Wm) [32-bit] MSUB Xd|XZR, Xn|XZR, Xm|XZR, Xa|XZR // Xd = Xa - (Xn * Xm) [64-bit] MSUB Wd|WZR, Wn|WZR, Wm|WZR, Wa|WZR // Wd = Wa - (Wn * Wm) [32-bit] MNEG Xd|XZR, Xn|XZR, Xm|XZR // Xd = -(Xn * Xm) Alias: MSUB Xd, Xn, Xm, XZR MNEG Wd|WZR, Wn|WZR, Wm|WZR // Wd = -(Wn * Wm) Alias: MSUB Wd, Wn, Wm, WZR ``` These produce the **low** 64 bits of the 128-bit product. They work for both signed and unsigned (the low bits are the same for both). **None of these set flags.** There is no `MULS` in AArch64. **Overflow behavior**: All multiply instructions silently wrap on overflow — there is no trap, no flag, no indication. `MADD` computes the mathematically exact `Xa + (Xn × Xm)` and then truncates to the low 64 (or 32) bits. If you need to detect multiply overflow, use `UMULH`/`SMULH` (§11.2) and check whether the high half is zero (unsigned) or all-sign-bits (signed). ### 11.2 Wide Multiply (64×64→128) When you multiply two 64-bit numbers, the result can be up to 128 bits. `MUL` gives you the low 64 bits. 
`SMULH` (Signed Multiply High) and `UMULH` (Unsigned Multiply High) give you the **upper** 64 bits. Together, `MUL` + `UMULH` (or `SMULH`) give you the full 128-bit product. ``` SMULH Xd|XZR, Xn|XZR, Xm|XZR // Xd = high 64 bits of signed(Xn) * signed(Xm) UMULH Xd|XZR, Xn|XZR, Xm|XZR // Xd = high 64 bits of unsigned(Xn) * unsigned(Xm) ``` To get a full 128-bit product: ```asm // Unsigned 128-bit: X1:X0 = X2 * X3 MUL X0, X2, X3 // low 64 bits UMULH X1, X2, X3 // high 64 bits ``` ### 11.3 Long Multiply (32×32→64) These multiply two 32-bit values and produce a full 64-bit result, with no overflow possible. `SMULL` treats the inputs as signed; `UMULL` treats them as unsigned. The result is always in a 64-bit X register. Useful when you know the inputs are 32-bit but need the full product. ``` SMULL Xd|XZR, Wn|WZR, Wm|WZR // Xd = sign_extend(Wn) * sign_extend(Wm) // Alias for: SMADDL Xd, Wn, Wm, XZR UMULL Xd|XZR, Wn|WZR, Wm|WZR // Xd = zero_extend(Wn) * zero_extend(Wm) // Alias for: UMADDL Xd, Wn, Wm, XZR SMADDL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa + sign_extend(Wn) * sign_extend(Wm) UMADDL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa + zero_extend(Wn) * zero_extend(Wm) SMSUBL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa - sign_extend(Wn) * sign_extend(Wm) UMSUBL Xd|XZR, Wn|WZR, Wm|WZR, Xa|XZR // Xd = Xa - zero_extend(Wn) * zero_extend(Wm) SMNEGL Xd|XZR, Wn|WZR, Wm|WZR // Alias for: SMSUBL Xd, Wn, Wm, XZR UMNEGL Xd|XZR, Wn|WZR, Wm|WZR // Alias for: UMSUBL Xd, Wn, Wm, XZR ``` ### 11.4 Division `UDIV` divides unsigned integers. `SDIV` divides signed integers. Both truncate toward zero (drop the fractional part). Unlike x86, ARM division **never raises an exception** — dividing by zero simply returns 0. There is no remainder instruction; you compute it as `remainder = dividend - (quotient * divisor)` using `MSUB`. 
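Because C's own `/` operator is undefined for division by zero and for `INT64_MIN / -1`, the AArch64 behavior described here has to be modeled explicitly. A small C sketch (the helper names are ours) of `UDIV`/`SDIV` plus the UDIV+MSUB remainder pattern:

```c
#include <assert.h>
#include <stdint.h>

/* UDIV: unsigned divide; divide-by-zero returns 0 (no trap). */
static uint64_t udiv64(uint64_t n, uint64_t m) {
    return m == 0 ? 0 : n / m;
}

/* SDIV: signed divide, truncating toward zero. The two special cases
 * are UB in C, so they must be handled before using C division. */
static int64_t sdiv64(int64_t n, int64_t m) {
    if (m == 0) return 0;                     /* AArch64: result is 0 */
    if (n == INT64_MIN && m == -1) return n;  /* wraps to INT64_MIN   */
    return n / m;                             /* C truncates toward 0 */
}

/* Remainder via the UDIV + MSUB idiom: r = n - (n / m) * m. */
static uint64_t urem64(uint64_t n, uint64_t m) {
    uint64_t q = udiv64(n, m);
    return n - q * m;   /* MSUB Xd, Xq, Xm, Xn */
}
```

Note a side effect of the idiom that follows from divide-by-zero returning 0: `urem64(n, 0)` yields `n - 0*0 = n`, which is also what the UDIV+MSUB sequence produces on hardware.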
**Why no remainder instruction?** Division hardware already computes both quotient and remainder internally, but exposing both from one instruction would require 2 destination registers, which ARM's encoding doesn't support. Instead, compilers emit `UDIV` + `MSUB` — the CPU can often fuse or optimize this pair internally. **Why no flags from multiply/divide?** Multiply overflow is ambiguous (do you mean the low-half overflowed, or the full product didn't fit?), and divide-by-zero is a software concern, not a hardware trap. ARM chose simplicity: if you need overflow detection, check explicitly with `UMULH`/`SMULH`. ``` UDIV Xd|XZR, Xn|XZR, Xm|XZR // Xd = Xn / Xm (unsigned, truncate toward zero) [64-bit] UDIV Wd|WZR, Wn|WZR, Wm|WZR // Wd = Wn / Wm [32-bit] SDIV Xd|XZR, Xn|XZR, Xm|XZR // Xd = Xn / Xm (signed, truncate toward zero) [64-bit] SDIV Wd|WZR, Wn|WZR, Wm|WZR // Wd = Wn / Wm [32-bit] ``` **No flags are set. No exceptions on divide-by-zero.** Division by zero returns **0** in AArch64. **32-bit overflow**: `SDIV Wd` of `INT32_MIN / -1` returns `INT32_MIN` (0x80000000). Same wrapping behavior as 64-bit. To get the remainder (modulo), there is no `MOD` instruction. Use: ```asm // X0 = X1 % X2 (unsigned) UDIV X3, X1, X2 // X3 = X1 / X2 MSUB X0, X3, X2, X1 // X0 = X1 - (X3 * X2) = remainder ``` **What MSUB REALLY does here — traced:** ```asm // Compute 17 % 5: // X1 = 17, X2 = 5 UDIV X3, X1, X2 // X3 = 17 / 5 = 3 (truncated) MSUB X0, X3, X2, X1 // X0 = X1 - (X3 * X2) = 17 - (3 * 5) = 17 - 15 = 2 // X0 = 2 ✓ (17 mod 5 = 2) ``` `MSUB Xd, Xn, Xm, Xa` computes `Xa - (Xn × Xm)`. The accumulator `Xa` is the dividend, and `Xn × Xm` is the quotient times divisor — subtracted from the dividend gives the remainder. **Signed division overflow**: `SDIV` of `INT64_MIN / -1` returns `INT64_MIN` (not an exception). The mathematically correct answer (+2^63) doesn't fit in a signed 64-bit integer, so it wraps. --- ## 12. 
Sign Extension & Zero Extension When you have a small value (e.g., an 8-bit byte) and need to put it in a larger register (e.g., 64-bit), you need to "extend" it. **Zero extension** fills the upper bits with zeros — used for unsigned values. **Sign extension** fills the upper bits with copies of the value's sign bit (the MSB) — used for signed values, preserving the negative/positive meaning. **Why extension is needed**: Registers are 64 bits wide, but data types in real programs are often 8, 16, or 32 bits. When you load a byte from memory into a 64-bit register, the hardware must decide what to put in the other 56 bits. For unsigned values, zeros make the register hold the correct unsigned interpretation (e.g., byte 0xFF = 255). For signed values, sign-extending preserves the mathematical value (e.g., signed byte 0xFF = -1, which sign-extended to 64 bits is 0xFFFFFFFFFFFFFFFF = -1). Using the wrong extension is a common source of bugs — this is why ARM provides both `LDR` (zero-extending) and `LDRSW`/`LDRSH`/`LDRSB` (sign-extending) load instructions. For example, the byte `0x80` (which is -128 as a signed byte): zero-extending gives `0x0000000000000080` (128 unsigned), but sign-extending gives `0xFFFFFFFFFFFFFF80` (-128 signed). ### 12.1 Dedicated Extend Aliases **64-bit destination (extend to 64 bits):** ``` SXTB Xd|XZR, Wn|WZR // Sign-extend byte → 64 bits. Alias: SBFM Xd, Xn, #0, #7 SXTH Xd|XZR, Wn|WZR // Sign-extend halfword → 64. Alias: SBFM Xd, Xn, #0, #15 SXTW Xd|XZR, Wn|WZR // Sign-extend word → 64. Alias: SBFM Xd, Xn, #0, #31 ``` **32-bit destination (extend to 32 bits):** ``` SXTB Wd|WZR, Wn|WZR // Sign-extend byte → 32 bits. Alias: SBFM Wd, Wn, #0, #7 SXTH Wd|WZR, Wn|WZR // Sign-extend halfword → 32. Alias: SBFM Wd, Wn, #0, #15 UXTB Wd|WZR, Wn|WZR // Zero-extend byte → 32 bits. Alias: UBFM Wd, Wn, #0, #7 // (also AND Wd, Wn, #0xFF) UXTH Wd|WZR, Wn|WZR // Zero-extend halfword → 32. 
Alias: UBFM Wd, Wn, #0, #15 ``` **RE note — SXTB Wd vs SXTB Xd**: Both exist and produce different results when the byte's sign bit (bit 7) is set: ```asm // If W1 low byte = 0x80: SXTB W0, W1 // W0 = 0xFFFFFF80, X0 = 0x00000000FFFFFF80 (sign to 32, zero to 64) SXTB X0, W1 // X0 = 0xFFFFFFFFFFFFFF80 (sign-extended all the way to 64) ``` The Wd form sign-extends within 32 bits, then the W-register write zeroes the upper 32 — so you get a positive 64-bit value with a negative 32-bit interpretation. **Where are UXTW and UXTX?** - `UXTW` as a standalone instruction **does not exist** because writing to a W register **automatically** zero-extends to 64 bits. So `MOV W0, W1` already does a UXTW. `UXTW` only appears as a modifier in extended register forms (see §8.2 — the `ADD Xd, Xn, Wm, UXTW` form, where the extension happens as part of the address/arithmetic computation). - `UXTX` is effectively a no-op (64-bit to 64-bit zero extension). **Where is SXTW Wd, Wn?** — It doesn't exist. `SXTW` is inherently a 32→64 operation; extending a 32-bit value to 32 bits is a no-op. ### 12.2 Implicit Extension Remember the fundamental rule: any instruction writing to `Wd` automatically zero-extends the result into `Xd`. This means: ```asm ADD W0, W1, W2 // Result in W0 → upper 32 bits of X0 are zeroed LDR W0, [X1] // Loads 32 bits → upper 32 bits of X0 are zeroed ``` For **sign** extension, you must be explicit: ```asm LDRSW X0, [X1] // Load 32-bit signed, sign-extend to 64 bits LDRSH X0, [X1] // Load 16-bit signed, sign-extend to 64 bits LDRSB X0, [X1] // Load 8-bit signed, sign-extend to 64 bits ``` --- ## 13. Bitfield Operations (BFM family) The BFM (Bitfield Move) family is the Swiss Army knife of ARM. Many instructions you use daily — shifts, extends, bitfield extracts — are actually aliases for these three base instructions. Understanding BFM helps you read disassembly where the disassembler shows the raw instruction instead of the friendly alias. 
**Why ARM uses this design**: Instead of having separate opcodes for LSL, LSR, ASR, SXTB, UXTB, UBFX, SBFIZ, and a dozen more, ARM encodes them all as variants of three base instructions (UBFM, SBFM, BFM) with different immediate parameters. This saves precious opcode space (remember, everything must fit in 32 bits) and means the hardware only needs one circuit for all bitfield operations. The downside: the relationship between the friendly alias and the actual `immr`/`imms` encoding is confusing. That's what this section explains. A "bitfield" is a contiguous range of bits within a register. These instructions extract, insert, or move bitfields with optional sign or zero extension. ### 13.1 The Simple Mental Model **Forget immr/imms for a moment.** At a high level, the BFM family does just two things: 1. **Extract**: Pull a range of bits out of a register, put them at bit 0, and fill the rest (with zeros, sign bits, or leave unchanged). 2. **Insert/Shift**: Take some low bits from a register, shift them left to a new position, and fill the rest. That's it. Every BFM alias — LSL, LSR, ASR, SXTB, UBFX, BFI, etc. — is one of these two operations. 
The three flavors differ only in what they do with the bits OUTSIDE the field: | Instruction | Bits outside the field | |---|---| | **UBFM** | Filled with **zeros** | | **SBFM** | Filled with copies of the field's **sign bit** (MSB of the extracted field) | | **BFM** | **Left unchanged** from Xd's previous value (insert into existing register) | **The friendly aliases you actually write** (and what they mean): | What you write | What it does | Really encodes as | |---|---|---| | `UBFX X0, X1, #4, #8` | Extract 8 bits starting at bit 4, zero the rest | UBFM (extract case) | | `SBFX X0, X1, #4, #8` | Extract 8 bits starting at bit 4, sign-extend | SBFM (extract case) | | `UBFIZ X0, X1, #8, #4` | Take low 4 bits, shift left by 8, zero the rest | UBFM (shift case) | | `BFI X0, X1, #8, #4` | Take low 4 bits, shift left by 8, insert into X0 | BFM (shift case) | | `LSR X0, X1, #5` | Shift right by 5 (= extract bits [63:5], zero top) | UBFM (extract case) | | `LSL X0, X1, #5` | Shift left by 5 (= take low 59 bits, shift left) | UBFM (shift case) | | `ASR X0, X1, #5` | Arithmetic shift right (= extract + sign-extend) | SBFM (extract case) | | `SXTB X0, W1` | Sign-extend byte to 64 bits | SBFM (extract bits [7:0]) | | `UXTB W0, W1` | Zero-extend byte to 32 bits | UBFM (extract bits [7:0]) | **You almost never write UBFM/SBFM/BFM directly.** You write the aliases. But disassemblers sometimes show the raw form, so you need to understand how `immr` and `imms` map to the aliases. That's the next subsection. ### 13.2 The immr/imms Encoding (How the Hardware Sees It) The hardware doesn't know about "UBFX" or "LSL" — it only sees `UBFM Xd, Xn, #immr, #imms`. The names stand for: `immr` = **immediate rotate** (how much to rotate the source right), `imms` = **immediate mask size** (how many bits to keep, roughly). 
```
SBFM Xd|XZR, Xn|XZR, #immr, #imms // Signed Bitfield Move   [64-bit, immr/imms: 0–63]
UBFM Xd|XZR, Xn|XZR, #immr, #imms // Unsigned Bitfield Move [64-bit, immr/imms: 0–63]
BFM  Xd|XZR, Xn|XZR, #immr, #imms // Bitfield Move (insert) [64-bit, immr/imms: 0–63]
SBFM Wd|WZR, Wn|WZR, #immr, #imms // [32-bit, immr/imms: 0–31]
UBFM Wd|WZR, Wn|WZR, #immr, #imms // [32-bit, immr/imms: 0–31]
BFM  Wd|WZR, Wn|WZR, #immr, #imms // [32-bit, immr/imms: 0–31]
```

The behavior depends on the relationship between `immr` and `imms`:

**When imms >= immr** (the **extract** case):
- Extract bits [imms:immr] from Xn — starting at bit position `immr`, up through bit `imms`. Width = `imms - immr + 1`.
- Place the field at bit 0 of Xd.
- UBFM: zero the rest. SBFM: sign-extend from bit [imms]. BFM: leave Xd's other bits unchanged.

**When imms < immr** (the **shift/insert** case):
- Take the low `imms + 1` bits from Xn.
- Place them at bit position `64 - immr` — effectively shift left by `64 - immr` (for the 32-bit W forms the pivot is `32 - immr`).
- UBFM: zero the rest. SBFM: sign-extend. BFM: leave Xd's other bits unchanged.

**Why two cases?** The hardware actually does one thing: rotate Xn right by `immr`, then apply a bitmask of width `imms+1`. Depending on how the rotation and mask interact, this looks like either an "extract from the middle" or a "shift up from the bottom." The two cases are just the two ways the mask can land relative to the rotation.
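The two cases can be written out as a small C model (function names are ours; this sketches the described behavior for the 64-bit forms, not the actual hardware datapath). Feeding the traced examples below through it reproduces the same results:

```c
#include <assert.h>
#include <stdint.h>

/* Low `w` bits set; w == 64 must be special-cased (1ULL << 64 is UB in C). */
static uint64_t mask(unsigned w) {
    return w >= 64 ? ~0ULL : (1ULL << w) - 1;
}

/* Sign-extend the low `w` bits of v to 64 bits. */
static uint64_t sext(uint64_t v, unsigned w) {
    uint64_t m = 1ULL << (w - 1);
    return (v ^ m) - m;
}

/* Shared step: pick out the field and where it lands, per the two cases. */
static void field(uint64_t xn, unsigned immr, unsigned imms,
                  uint64_t *f, unsigned *pos, unsigned *w) {
    if (imms >= immr) {              /* extract case */
        *w = imms - immr + 1;
        *f = (xn >> immr) & mask(*w);
        *pos = 0;
    } else {                         /* shift/insert case */
        *w = imms + 1;
        *f = xn & mask(*w);
        *pos = 64 - immr;
    }
}

static uint64_t ubfm64(uint64_t xn, unsigned immr, unsigned imms) {
    uint64_t f; unsigned pos, w;
    field(xn, immr, imms, &f, &pos, &w);
    return f << pos;                 /* everything outside the field: zeros */
}

static uint64_t sbfm64(uint64_t xn, unsigned immr, unsigned imms) {
    uint64_t f; unsigned pos, w;
    field(xn, immr, imms, &f, &pos, &w);
    return sext(f, w) << pos;        /* sign bits above, zeros below */
}

static uint64_t bfm64(uint64_t xd, uint64_t xn, unsigned immr, unsigned imms) {
    uint64_t f; unsigned pos, w;
    field(xn, immr, imms, &f, &pos, &w);
    uint64_t m = mask(w) << pos;     /* only the field changes in Xd */
    return (xd & ~m) | (f << pos);
}
```

With this model, `ubfm64(x, s, 63)` is `LSR #s`, `sbfm64(x, s, 63)` is `ASR #s`, and `ubfm64(x, (64 - s) % 64, 63 - s)` is `LSL #s` — the alias tables in §13.3–13.5 fall out directly.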
**The raw instructions traced with hex values:** ```asm // ═══════════════════════════════════════════════════════════ // UBFM — Unsigned Bitfield Move (zero-fills non-field bits) // ═══════════════════════════════════════════════════════════ // Case 1: imms >= immr (EXTRACT case) // UBFM X0, X1, #8, #15 (immr=8, imms=15) // Extract bits [15:8] from X1 (width = 15-8+1 = 8 bits), place at bit 0, zero rest // // If X1 = 0x00000000_0000ABCD: // Bits [15:8] of 0xABCD: 0xAB (binary: 10101011) // Place at bit 0: 0x000000AB // Zero the rest: 0x00000000_000000AB // X0 = 0x00000000_000000AB // This is what the assembler shows as: UBFX X0, X1, #8, #8 // Case 2: imms < immr (INSERT-IN-ZERO / SHIFT case) // UBFM X0, X1, #56, #7 (immr=56, imms=7) // Take bits [7:0] from X1 (width = 7+1 = 8 bits), place at bit 64-56 = 8 // // If X1 = 0x00000000_000000FF: // Bits [7:0]: 0xFF // Place at bit 8: 0x0000FF00 // Zero the rest: 0x00000000_0000FF00 // X0 = 0x00000000_0000FF00 // This is what the assembler shows as: UBFIZ X0, X1, #8, #8 // (or equivalently: LSL X0, X1, #8 if width covered the whole register) ``` ```asm // ═══════════════════════════════════════════════════════════ // SBFM — Signed Bitfield Move (sign-extends from field's top bit) // ═══════════════════════════════════════════════════════════ // Case 1: imms >= immr (EXTRACT + SIGN-EXTEND case) // SBFM X0, X1, #8, #15 (immr=8, imms=15) // Extract bits [15:8], place at bit 0, sign-extend from bit [imms] = bit 15 // // If X1 = 0x00000000_0000ABCD: // Bits [15:8]: 0xAB (bit 7 of the field = 1 → "negative") // Sign-extend: 0xFFFFFFFF_FFFFFFAB // X0 = 0xFFFFFFFF_FFFFFFAB // This is: SBFX X0, X1, #8, #8 // // If X1 = 0x00000000_00001234: // Bits [15:8]: 0x12 (bit 7 of the field = 0 → "positive") // Sign-extend (no change): 0x00000000_00000012 // This is: SBFX X0, X1, #8, #8 // Special case: SBFM X0, X1, #0, #7 = SXTB X0, W1 // Extract bits [7:0], sign-extend from bit 7 // // If X1 = 0x00000000_000000C0: // Bits [7:0]: 0xC0 (bit 
7 = 1 → negative byte) // Sign-extend to 64: 0xFFFFFFFF_FFFFFFC0 = -64 (signed) // Special case: SBFM X0, X1, #0, #31 = SXTW X0, W1 // Extract bits [31:0], sign-extend from bit 31 // // If X1 = 0x00000000_80000000: // Bits [31:0]: 0x80000000 (bit 31 = 1 → negative word) // Sign-extend: 0xFFFFFFFF_80000000 = INT32_MIN as 64-bit ``` ```asm // ═══════════════════════════════════════════════════════════ // BFM — Bitfield Move (INSERT: modifies only the target field in Xd) // ═══════════════════════════════════════════════════════════ // Unlike UBFM/SBFM which write ALL bits of Xd, BFM only modifies the // destination bitfield and leaves all other bits of Xd UNCHANGED. // This is why BFM is used for "insert" operations. // Case 1: imms >= immr (EXTRACT-AND-INSERT-LOW case) // BFM X0, X1, #8, #15 (immr=8, imms=15) // Extract bits [15:8] from X1, insert at bits [7:0] of X0 (other bits unchanged) // // If X0 = 0xDEADBEEF_DEADBEEF and X1 = 0x00000000_0000ABCD: // Bits [15:8] of X1: 0xAB // Replace bits [7:0] of X0 with 0xAB: // X0 = 0xDEADBEEF_DEADBEAB (only low 8 bits changed!) // This is: BFXIL X0, X1, #8, #8 // Case 2: imms < immr (INSERT-AT-POSITION case) // BFM X0, X1, #56, #7 (immr=56, imms=7) // Take bits [7:0] from X1, insert at bit 8 of X0 (other bits unchanged) // // If X0 = 0xDEADBEEF_DEADBEEF and X1 = 0x00000000_000000FF: // Bits [7:0] of X1: 0xFF // Insert at bits [15:8] of X0: // X0 = 0xDEADBEEF_DEADFFEF (only bits [15:8] changed!) 
// This is: BFI X0, X1, #8, #8 ``` **Summary: how the three differ on the SAME operation:** ``` // All three extract bits [15:8] from X1 (= 0xAB from 0xABCD) and place at bit 0: // X0 starts as 0xDEADBEEF_DEADBEEF for BFM, doesn't matter for UBFM/SBFM UBFM X0, X1, #8, #15 // X0 = 0x00000000_000000AB (zero-filled) SBFM X0, X1, #8, #15 // X0 = 0xFFFFFFFF_FFFFFFAB (sign-extended, because 0xAB has bit 7 set) BFM X0, X1, #8, #15 // X0 = 0xDEADBEEF_DEADBEAB (only bits [7:0] replaced, rest kept) ``` ### 13.3 Aliases of UBFM **Don't memorize these tables** — use them as a reference when you see raw UBFM/SBFM/BFM in a disassembler and need to figure out which friendly instruction it corresponds to. | Alias | Actual encoding (64-bit) | Actual encoding (32-bit) | |---|---|---| | `LSL Rd, Rn, #s` | `UBFM Xd, Xn, #(-s MOD 64), #(63-s)` | `UBFM Wd, Wn, #(-s MOD 32), #(31-s)` | | `LSR Rd, Rn, #s` | `UBFM Xd, Xn, #s, #63` | `UBFM Wd, Wn, #s, #31` | | `UBFX Rd, Rn, #lsb, #w` | `UBFM Xd, Xn, #lsb, #(lsb+w-1)` | `UBFM Wd, Wn, #lsb, #(lsb+w-1)` | | `UBFIZ Rd, Rn, #lsb, #w` | `UBFM Xd, Xn, #(-lsb MOD 64), #(w-1)` | `UBFM Wd, Wn, #(-lsb MOD 32), #(w-1)` | | `UXTB Wd, Wn` | — | `UBFM Wd, Wn, #0, #7` | | `UXTH Wd, Wn` | — | `UBFM Wd, Wn, #0, #15` | ### 13.4 Aliases of SBFM | Alias | Actual encoding (64-bit) | Actual encoding (32-bit) | |---|---|---| | `ASR Rd, Rn, #s` | `SBFM Xd, Xn, #s, #63` | `SBFM Wd, Wn, #s, #31` | | `SBFX Rd, Rn, #lsb, #w` | `SBFM Xd, Xn, #lsb, #(lsb+w-1)` | `SBFM Wd, Wn, #lsb, #(lsb+w-1)` | | `SBFIZ Rd, Rn, #lsb, #w` | `SBFM Xd, Xn, #(-lsb MOD 64), #(w-1)` | `SBFM Wd, Wn, #(-lsb MOD 32), #(w-1)` | | `SXTB Wd, Wn` | — | `SBFM Wd, Wn, #0, #7` | | `SXTB Xd, Wn` | `SBFM Xd, Xn, #0, #7` | — | | `SXTH Wd, Wn` | — | `SBFM Wd, Wn, #0, #15` | | `SXTH Xd, Wn` | `SBFM Xd, Xn, #0, #15` | — | | `SXTW Xd, Wn` | `SBFM Xd, Xn, #0, #31` | — (no 32-bit form; SXTW is inherently 32→64) | **RE note**: A disassembler may show `SBFM W0, W1, #0, #7` — that's just `SXTB W0, W1` (sign-extend 
byte to 32 bits). But `SBFM X0, X1, #0, #7` is `SXTB X0, W1` (sign-extend byte to 64 bits). The register width tells you the target size of the extension. ### 13.5 Aliases of BFM | Alias | Actual encoding (64-bit) | Actual encoding (32-bit) | |---|---|---| | `BFI Rd, Rn, #lsb, #w` | `BFM Xd, Xn, #(-lsb MOD 64), #(w-1)` | `BFM Wd, Wn, #(-lsb MOD 32), #(w-1)` | | `BFXIL Rd, Rn, #lsb, #w` | `BFM Xd, Xn, #lsb, #(lsb+w-1)` | `BFM Wd, Wn, #lsb, #(lsb+w-1)` | ### 13.6 Practical BFM Examples **What each instruction REALLY does — traced with concrete values:** ```asm // ═══ UBFX — Unsigned Bitfield Extract ═══ // "Pull out a range of bits, zero-extend the rest" // UBFX X0, X1, #4, #8 → extract 8 bits starting at bit 4 // // If X1 = 0x00000000_0000ABCD: // Binary of low 16 bits: 1010_1011_1100_1101 // Bits [11:4]: 1011_1100 // Zero-extend to 64 bits: 0x00000000_000000BC // X0 = 0x00000000_000000BC UBFX X0, X1, #4, #8 // ═══ SBFX — Signed Bitfield Extract ═══ // "Pull out a range of bits, sign-extend from the top bit of the field" // // If X1 = 0x00000000_0000ABCD (same value): // Bits [11:4]: 1011_1100 (bit 11 = 1, so the field is "negative") // Sign-extend to 64 bits: 0xFFFFFFFF_FFFFFFBC = -68 (signed) // // If X1 = 0x00000000_00001234: // Bits [11:4]: 0010_0011 (bit 11 = 0, so "positive") // X0 = 0x00000000_00000023 = 35 SBFX X0, X1, #4, #8 ``` ```asm // ═══ BFI — Bitfield Insert ═══ // "Take low bits from source, plug them into a specific position in destination" // BFI X0, X1, #8, #8 → take low 8 bits of X1, insert at bits [15:8] of X0 // // If X0 = 0x00000000_12345678 and X1 = 0x00000000_000000FF: // Low 8 bits of X1: 0xFF // Insert at bits [15:8] of X0: replace the "56" in 0x12345678 // X0 = 0x00000000_1234FF78 BFI X0, X1, #8, #8 // ═══ BFXIL — Bitfield Extract and Insert Low ═══ // "Extract a range from source, insert at bit 0 of destination, keep upper bits" // // If X0 = 0xAAAAAAAA_AAAAAAAA and X1 = 0x00000000_00AB0000: // Bits [23:16] of X1: 0xAB // Insert at bits 
[7:0] of X0: 0xAAAAAAAA_AAAAAAAB (only low 8 bits changed) BFXIL X0, X1, #16, #8 ``` ```asm // ═══ UBFIZ — Unsigned Bitfield Insert in Zero ═══ // "Take low bits from source, shift them left, zero everything else" // // If X1 = 0x00000000_000000AB: // Low 8 bits: 0xAB, shift left by 16 → X0 = 0x00000000_00AB0000 UBFIZ X0, X1, #16, #8 // ═══ SBFIZ — Signed Bitfield Insert in Zero ═══ // "Take low bits, shift left, sign-extend from the top bit of the field" // // If X1 = 0x00000000_000000FF (low 8 bits: 0xFF, bit 7=1 → "negative"): // Shift left by 16: 0x00000000_00FF0000 // Sign-extend from bit 23: X0 = 0xFFFFFFFF_FFFF0000 // // If X1 = 0x00000000_0000007F (bit 7=0 → "positive"): // Shift left by 16: X0 = 0x00000000_007F0000 (no sign-extension needed) SBFIZ X0, X1, #16, #8 // ═══ Clearing a bitfield ═══ // BFI X0, XZR, #8, #8 → insert 8 zero bits at bits [15:8] // If X0 = 0x00000000_FFFFFFFF → X0 = 0x00000000_FFFF00FF BFI X0, XZR, #8, #8 ``` **How to read BFM in disassembly**: If you see raw `UBFM X0, X1, #4, #11`, check: is imms >= immr? Yes (11 >= 4), so it's an extract: bits [11:4], width = 11-4+1 = 8. This is `UBFX X0, X1, #4, #8`. If you see `UBFM X0, X1, #60, #3`, check: imms < immr? Yes (3 < 60), so it's an insert-in-zero: low 4 bits shifted left by 64-60 = 4. This is `UBFIZ X0, X1, #4, #4`. ```asm // === 32-bit equivalents === // Same operations but with 32-bit registers — upper 32 of Xd always zeroed // If W1 = 0x0000ABCD: UBFX W0, W1, #4, #8 // W0 = 0x000000BC, X0 = 0x00000000_000000BC SBFX W0, W1, #4, #8 // W0 = 0xFFFFFFBC (sign-extend to 32), X0 = 0x00000000_FFFFFFBC // RE trap: SBFX W0 vs SBFX X0 // SBFX W0, W1, #4, #8 → sign-extends to bit 31, then upper 32 of X0 zeroed // SBFX X0, X1, #4, #8 → sign-extends all the way to bit 63 // These give DIFFERENT results when the sign bit (bit 11) is set! // If X1 = 0x00000000_0000ABCD: // SBFX W0 → W0 = 0xFFFFFFBC, X0 = 0x00000000_FFFFFFBC (positive 64-bit!) 
// SBFX X0 → X0 = 0xFFFFFFFF_FFFFFFBC (negative 64-bit!) ``` ### 13.7 EXTR — Extract from Pair ``` EXTR Xd|XZR, Xn|XZR, Xm|XZR, #0-63 // 64-bit: treat Xn:Xm as a 128-bit value (Xn is the high half, // Xm is the low half), then extract 64 bits starting at bit #lsb EXTR Wd|WZR, Wn|WZR, Wm|WZR, #0-31 // 32-bit: treat Wn:Wm as a 64-bit value, extract 32 bits at #lsb ``` `#lsb` is the bit position in the low register (Xm/Wm) where extraction starts (0–63 for 64-bit, 0–31 for 32-bit). The result is bits [lsb+63 : lsb] of the 128-bit concatenation (wrapping from Xm into Xn). When `Xn == Xm` (or `Wn == Wm`), this is `ROR Rd, Rn, #lsb` (rotate right). **Traced example:** ```asm // If X1 = 0x00000000_000ABCDE and X2 = 0x12345678_9ABC0000: EXTR X0, X1, X2, #20 // Concatenation X1:X2 = 0x00000000000ABCDE:123456789ABC0000 (128 bits) // Extract 64 bits starting at bit 20 of the low register (X2): // Bottom 44 bits: X2[63:20] = 0x12345678_9ABC0000 >> 20 = 0x123456789AB // Top 20 bits: X1[19:0] = 0xABCDE // Combined: 0xABCDE_123456789AB → X0 = 0xABCDE123456789AB // In practice: EXTR shifts X2 right by 20, and fills the vacated top 20 bits // with the bottom 20 bits of X1. // Rotate right (Xn == Xm): // If X0 = 0x00000000_0000000F: EXTR X0, X0, X0, #4 // ROR X0, X0, #4 // Bits 3:0 (= 0xF) rotate to bits 63:60 // X0 = 0xF000000000000000 ``` ```asm // 64-bit rotate right X0 by 5: EXTR X0, X0, X0, #5 // same as ROR X0, X0, #5 // 32-bit rotate right W0 by 5: EXTR W0, W0, W0, #5 // same as ROR W0, W0, #5 — upper 32 of X0 zeroed // Extract 64 bits from the middle of X1:X2 EXTR X0, X1, X2, #20 // bits [83:20] of the 128-bit value X1:X2 ``` --- ## 14. Bit Manipulation Instructions Instructions for counting, reversing, and manipulating individual bits. These are essential for bitmap operations, hash functions, and low-level data structure manipulation. ### 14.1 CLZ — Count Leading Zeros `CLZ` counts how many consecutive zero bits there are starting from the most significant bit (left side). 
For example, `CLZ` of `0x00F0...` would be 8 (eight zeros before the first 1). If the entire register is zero, the result is 64 (or 32 for Wd). Useful for finding the position of the highest set bit. ``` CLZ Xd|XZR, Xn|XZR // Xd = number of leading zero bits in Xn (0-64) CLZ Wd|WZR, Wn|WZR // Wd = number of leading zero bits in Wn (0-32) ``` If `Xn == 0`, result is 64 (or 32 for Wn). Use case: finding the highest set bit, computing floor(log2(x)): ```asm // floor(log2(X0)) = 63 - CLZ(X0), for X0 > 0 CLZ X1, X0 MOV X2, #63 SUB X1, X2, X1 // X1 = floor(log2(X0)) ``` ### 14.2 CLS — Count Leading Sign Bits `CLS` counts the number of leading sign bits in a register, minus 1. A "leading sign bit" is a bit that matches the MSB, counting from the top. For positive numbers (MSB=0), it counts leading zeros minus 1. For negative numbers (MSB=1), it counts leading ones minus 1. The result tells you how many redundant sign bits there are — useful for determining how many bits are actually needed to represent a value. ``` CLS Xd|XZR, Xn|XZR // Count leading bits that match the sign bit, minus 1 (range 0–63) CLS Wd|WZR, Wn|WZR // Same for 32-bit (range 0–31) ``` ### 14.3 RBIT — Reverse Bits `RBIT` reverses the order of all bits in a register — bit 0 swaps with bit 63, bit 1 with bit 62, etc. The main use case is computing a count of trailing zeros (CTZ): reverse the bits with `RBIT`, then count leading zeros with `CLZ` — the leading zeros of the reversed value equal the trailing zeros of the original. ``` RBIT Xd|XZR, Xn|XZR // Reverse all 64 bits (bit 0 ↔ bit 63, etc.) RBIT Wd|WZR, Wn|WZR // Reverse all 32 bits ``` Useful for CRC calculations and trailing-zero counts: ```asm // Count trailing zeros (CTZ) — baseline approach: RBIT X1, X0 // Reverse bits CLZ X1, X1 // Count leading zeros of reversed = trailing zeros of original ``` **With FEAT_CSSC:** A dedicated `CTZ Xd, Xn` / `CTZ Wd, Wn` instruction exists, eliminating the RBIT+CLZ sequence. 
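The RBIT + CLZ sequence can be verified with a small C model — bit-loop helpers of our own standing in for the instructions, kept deliberately naive so the correspondence is obvious:

```c
#include <assert.h>
#include <stdint.h>

/* RBIT: reverse all 64 bits (bit 0 <-> bit 63, bit 1 <-> bit 62, ...). */
static uint64_t rbit64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        r |= ((x >> i) & 1) << (63 - i);
    return r;
}

/* CLZ: count leading zeros; 64 when the input is zero. */
static unsigned clz64(uint64_t x) {
    unsigned n = 0;
    for (int i = 63; i >= 0 && !((x >> i) & 1); i--)
        n++;
    return n;
}

/* CTZ via the sequence above: the leading zeros of the reversed
 * value equal the trailing zeros of the original. */
static unsigned ctz64(uint64_t x) {
    return clz64(rbit64(x));
}
```

The same `clz64` also checks the floor-log2 idiom from §14.1: `63 - clz64(x)` is the index of the highest set bit for any nonzero `x`.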
### 14.4 REV — Reverse Bytes `REV` reverses the byte order of a register — this converts between little-endian and big-endian. `REV16` reverses bytes within each 16-bit halfword. `REV32` reverses bytes within each 32-bit word (Xd form only, since `REV Wd` already does 32-bit reversal). ``` REV Xd|XZR, Xn|XZR // Reverse byte order (64-bit endian swap) REV Wd|WZR, Wn|WZR // Reverse byte order (32-bit endian swap) REV16 Xd|XZR, Xn|XZR // Reverse bytes within each 16-bit halfword (64-bit) REV16 Wd|WZR, Wn|WZR // Reverse bytes within each 16-bit halfword (32-bit) REV32 Xd|XZR, Xn|XZR // Reverse bytes within each 32-bit word (64-bit ONLY — no Wd form) ``` **Note**: `REV32` only has an Xd form because `REV Wd, Wn` already does a 32-bit byte swap. `REV32 Xd, Xn` swaps bytes within each 32-bit half independently. ```asm // If X0 = 0x0102030405060708: REV X1, X0 // X1 = 0x0807060504030201 (full 64-bit byte swap) REV W1, W0 // W1 = 0x08070605 (32-bit byte swap of low word) // X1 = 0x0000000008070605 (upper zeroed) REV16 X1, X0 // X1 = 0x0201040306050807 (swap within each 16-bit chunk) REV32 X1, X0 // X1 = 0x0403020108070605 (swap within each 32-bit chunk) ``` ### 14.5 CNT — Population Count **With FEAT_CSSC (optional from ARMv8.7-A):** A scalar `CNT` instruction exists: ```asm CNT Xd|XZR, Xn|XZR // Xd = popcount(Xn) CNT Wd|WZR, Wn|WZR // Wd = popcount(Wn) ``` **Without FEAT_CSSC (baseline AArch64):** No scalar popcount exists. Use the NEON workaround: ```asm // Count set bits in X0: FMOV D0, X0 // Move X0 into SIMD register D0 CNT V0.8B, V0.8B // Count bits in each byte (NEON vector CNT) ADDV B0, V0.8B // Sum all byte counts UMOV W1, V0.B[0] // Move result to GPR ``` ### 14.6 FEAT_CSSC — Common Short Sequence Compression FEAT_CSSC (optional from ARMv8.7-A / ARMv9.2-A, mandatory from ARMv8.9-A / ARMv9.4-A) adds scalar instructions that previously required multi-instruction sequences. 
These exist because compilers kept generating the same 2-4 instruction patterns, so ARM added single instructions to replace them. ``` // Absolute value (previously: CMP + CNEG, 2 instructions) ABS Xd|XZR, Xn|XZR // Xd = |Xn| (signed absolute value) ABS Wd|WZR, Wn|WZR // Min/Max (previously: CMP + CSEL, 2 instructions each) SMAX Xd|XZR, Xn|XZR, Xm|XZR // Xd = max(Xn, Xm) signed SMAX Wd|WZR, Wn|WZR, Wm|WZR SMIN Xd|XZR, Xn|XZR, Xm|XZR // Xd = min(Xn, Xm) signed SMIN Wd|WZR, Wn|WZR, Wm|WZR UMAX Xd|XZR, Xn|XZR, Xm|XZR // Xd = max(Xn, Xm) unsigned UMAX Wd|WZR, Wn|WZR, Wm|WZR UMIN Xd|XZR, Xn|XZR, Xm|XZR // Xd = min(Xn, Xm) unsigned UMIN Wd|WZR, Wn|WZR, Wm|WZR // Also with immediate: SMAX Xd|XZR, Xn|XZR, #simm8 // Signed max with 8-bit signed immediate (-128 to 127) SMAX Wd|WZR, Wn|WZR, #simm8 SMIN Xd|XZR, Xn|XZR, #simm8 // Signed min with 8-bit signed immediate SMIN Wd|WZR, Wn|WZR, #simm8 UMAX Xd|XZR, Xn|XZR, #uimm8 // Unsigned max with 8-bit unsigned immediate (0 to 255) UMAX Wd|WZR, Wn|WZR, #uimm8 UMIN Xd|XZR, Xn|XZR, #uimm8 // Unsigned min with 8-bit unsigned immediate UMIN Wd|WZR, Wn|WZR, #uimm8 // Count trailing zeros (previously: RBIT + CLZ, 2 instructions) CTZ Xd|XZR, Xn|XZR // Xd = number of trailing zeros (0-64) CTZ Wd|WZR, Wn|WZR // Wd = number of trailing zeros (0-32) // Scalar population count (previously: FMOV + CNT + ADDV + UMOV, 4 instructions) CNT Xd|XZR, Xn|XZR // Xd = popcount(Xn) CNT Wd|WZR, Wn|WZR ``` All Wd forms follow the standard W-register rule: upper 32 bits of Xd are zeroed. None of these set flags. **Why these exist**: Compilers emit CMP+CSEL for min/max thousands of times in typical code. SMAX/SMIN/UMAX/UMIN cut the count in half, improving both code size and throughput. Similarly, CTZ (count trailing zeros) is used in every `ffs()`-style operation and bitmap scanner. --- ## 15. Load & Store Instructions Loads copy data from memory into a register. Stores copy data from a register into memory. 
The syntax `[Xn]` means "the memory address stored in register Xn." Think of the square brackets as a dereference — like `*ptr` in C. ### 15.1 Basic Loads These read data from memory at the address in `Xn` and place it into the destination register. The base register `Xn` can be **SP** (stack pointer) — this is how stack-relative loads work (e.g., `LDR X0, [SP, #8]`). The destination `Xt` can be **XZR** — loading into XZR discards the value (used for prefetch side-effects or consuming cache lines). Smaller loads (byte, halfword, word) are automatically zero-extended or sign-extended to fill the full register. **LDR — Load Register (64-bit):** ``` LDR Xt|XZR, [Xn|SP] // Base register LDR Xt|XZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 8, 0–32760) LDR Xt|XZR, [Xn|SP, #simm9]! // Pre-index (−256 to +255) LDR Xt|XZR, [Xn|SP], #simm9 // Post-index (−256 to +255) LDR Xt|XZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#3}}] // Register offset (64-bit index) LDR Xt|XZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#3}] // Extended register (32-bit index) LDR Xt|XZR, label // PC-relative literal (±1 MB) ``` **LDR — Load Register (32-bit, zero-extends to 64):** ``` LDR Wt|WZR, [Xn|SP] // Base register LDR Wt|WZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 4, 0–16380) LDR Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDR Wt|WZR, [Xn|SP], #simm9 // Post-index LDR Wt|WZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#2}}] // Register offset LDR Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#2}] // Extended register LDR Wt|WZR, label // PC-relative literal ``` **LDRH — Load Halfword (16-bit, zero-extends to 32/64):** ``` LDRH Wt|WZR, [Xn|SP] // Base register LDRH Wt|WZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 2, 0–8190) LDRH Wt|WZR, [Xn|SP, #simm9]! 
// Pre-index LDRH Wt|WZR, [Xn|SP], #simm9 // Post-index LDRH Wt|WZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#1}}] // Register offset LDRH Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#1}] // Extended register ``` **LDRB — Load Byte (8-bit, zero-extends to 32/64):** ``` LDRB Wt|WZR, [Xn|SP] // Base register LDRB Wt|WZR, [Xn|SP, #pimm] // Unsigned offset (0–4095, no scaling) LDRB Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDRB Wt|WZR, [Xn|SP], #simm9 // Post-index LDRB Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset LDRB Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0}] // Extended register ``` **LDRSW — Load Signed Word (32-bit, sign-extends to 64):** ``` LDRSW Xt|XZR, [Xn|SP] // Base register LDRSW Xt|XZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 4, 0–16380) LDRSW Xt|XZR, [Xn|SP, #simm9]! // Pre-index LDRSW Xt|XZR, [Xn|SP], #simm9 // Post-index LDRSW Xt|XZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#2}}] // Register offset LDRSW Xt|XZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#2}] // Extended register LDRSW Xt|XZR, label // PC-relative literal ``` **LDRSH — Load Signed Halfword (16-bit, sign-extends to 32 or 64):** ``` LDRSH Xt|XZR, [Xn|SP{, #pimm}] // Sign-extend to 64 (all addressing modes) LDRSH Xt|XZR, [Xn|SP, #simm9]! // Pre-index LDRSH Xt|XZR, [Xn|SP], #simm9 // Post-index LDRSH Xt|XZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#1}}] // Register offset LDRSH Xt|XZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#1}] // Extended register LDRSH Wt|WZR, [Xn|SP{, #pimm}] // Sign-extend to 32 (all addressing modes) LDRSH Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDRSH Wt|WZR, [Xn|SP], #simm9 // Post-index LDRSH Wt|WZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#1}}] // Register offset LDRSH Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#1}] // Extended register ``` **LDRSB — Load Signed Byte (8-bit, sign-extends to 32 or 64):** ``` LDRSB Xt|XZR, [Xn|SP{, #pimm}] // Sign-extend to 64 (all addressing modes) LDRSB Xt|XZR, [Xn|SP, #simm9]! 
// Pre-index LDRSB Xt|XZR, [Xn|SP], #simm9 // Post-index LDRSB Xt|XZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset LDRSB Xt|XZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0}] // Extended register LDRSB Wt|WZR, [Xn|SP{, #pimm}] // Sign-extend to 32 (all addressing modes) LDRSB Wt|WZR, [Xn|SP, #simm9]! // Pre-index LDRSB Wt|WZR, [Xn|SP], #simm9 // Post-index LDRSB Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset LDRSB Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0}] // Extended register ``` **SIMD/FP Loads (all addressing modes):** ``` LDR Bt, [Xn|SP{, #pimm}] // Load 8-bit FP/SIMD LDR Ht, [Xn|SP{, #pimm}] // Load 16-bit FP/SIMD LDR St, [Xn|SP{, #pimm}] // Load 32-bit FP/SIMD (single-precision) LDR Dt, [Xn|SP{, #pimm}] // Load 64-bit FP/SIMD (double-precision) LDR Qt, [Xn|SP{, #pimm}] // Load 128-bit SIMD // All FP/SIMD loads support: [Xn|SP, #simm9]!, [Xn|SP], #simm9, // [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#s}}], [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#s}] // where #s = log2(access_size). Also: LDR St/Dt/Qt, label (PC-relative). ``` Note: `LDRSH Wd` vs `LDRSH Xd` — the register width determines whether sign extension goes to 32 or 64 bits. The `Xd` variant sign-extends all the way to 64 bits; the `Wd` variant sign-extends to 32, then the W-register write zeroes the upper 32. ### 15.2 Basic Stores These write data from a register into memory. Only the relevant low bytes are written — there is no sign extension for stores. **STR — Store Register (64-bit):** ``` STR Xt|XZR, [Xn|SP] // Base register (XZR stores zero) STR Xt|XZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 8, 0–32760) STR Xt|XZR, [Xn|SP, #simm9]! 
// Pre-index (−256 to +255) STR Xt|XZR, [Xn|SP], #simm9 // Post-index (−256 to +255) STR Xt|XZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#3}}] // Register offset STR Xt|XZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#3}] // Extended register ``` **STR — Store Register (32-bit):** ``` STR Wt|WZR, [Xn|SP] // Base register (WZR stores zero) STR Wt|WZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 4, 0–16380) STR Wt|WZR, [Xn|SP, #simm9]! // Pre-index STR Wt|WZR, [Xn|SP], #simm9 // Post-index STR Wt|WZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#2}}] // Register offset STR Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#2}] // Extended register ``` **STRH — Store Halfword (16-bit):** ``` STRH Wt|WZR, [Xn|SP] // Base register STRH Wt|WZR, [Xn|SP, #pimm] // Unsigned offset (multiple of 2, 0–8190) STRH Wt|WZR, [Xn|SP, #simm9]! // Pre-index STRH Wt|WZR, [Xn|SP], #simm9 // Post-index STRH Wt|WZR, [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#1}}] // Register offset STRH Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#1}] // Extended register ``` **STRB — Store Byte (8-bit):** ``` STRB Wt|WZR, [Xn|SP] // Base register STRB Wt|WZR, [Xn|SP, #pimm] // Unsigned offset (0–4095, no scaling) STRB Wt|WZR, [Xn|SP, #simm9]! // Pre-index STRB Wt|WZR, [Xn|SP], #simm9 // Post-index STRB Wt|WZR, [Xn|SP, Xm|XZR{, LSL #0|SXTX #0}] // Register offset STRB Wt|WZR, [Xn|SP, Wm|WZR, SXTW|UXTW {#0}] // Extended register ``` **SIMD/FP Stores (all addressing modes):** ``` STR Bt, [Xn|SP{, #pimm}] // Store 8-bit FP/SIMD STR Ht, [Xn|SP{, #pimm}] // Store 16-bit FP/SIMD STR St, [Xn|SP{, #pimm}] // Store 32-bit FP/SIMD STR Dt, [Xn|SP{, #pimm}] // Store 64-bit FP/SIMD STR Qt, [Xn|SP{, #pimm}] // Store 128-bit SIMD // All support: [Xn|SP, #simm9]!, [Xn|SP], #simm9, // [Xn|SP, Xm|XZR{, LSL|SXTX {#0|#s}}], [Xn|SP, Wm|WZR, SXTW|UXTW {#0|#s}] ``` There are no "sign-extending" stores — stores just write the low bytes, sign doesn't matter. ### 15.3 Addressing Modes All loads and stores need a memory address. 
AArch64 provides several ways to compute that address, called **addressing modes**. The syntax `[Xn, #offset]` means "the address in Xn plus offset." In all load/store addressing modes, the base register `Xn` can be **SP** (register 31 = SP in this context). The offset register `Xm` or `Wm` uses XZR/WZR for register 31. | Mode | Syntax | Effective address | Base updated? | |---|---|---|---| | Base register | `[Xn\|SP]` | Xn | No | | Immediate offset | `[Xn\|SP, #imm]` | Xn + imm | No | | Pre-index | `[Xn\|SP, #imm]!` | Xn + imm | Yes, **before** access | | Post-index | `[Xn\|SP], #imm` | Xn | Yes, **after** access | | Register offset | `[Xn\|SP, Xm\|XZR]` | Xn + Xm | No | | Shifted register | `[Xn\|SP, Xm\|XZR, LSL #s]` | Xn + (Xm << s) | No | | Extended register | `[Xn\|SP, Wm\|WZR, SXTW {#s}]` | Xn + sign_extend(Wm) << s | No | | PC-relative literal | `label` | PC + offset | No | **Immediate offset** (`LDR Xd, [Xn, #imm]`): The most common form. The offset is added to the base register to compute the address. **What the hardware actually encodes**: The instruction has a 12-bit unsigned offset field, but the value stored is **divided by the access size**. For 64-bit LDR, the hardware stores `offset ÷ 8`, so the byte offset you write must be a multiple of 8 (range 0–32760). For 32-bit LDR, it stores `offset ÷ 4` (range 0–16380). For LDRH, `offset ÷ 2` (range 0–8190). For LDRB, the offset is unscaled (range 0–4095). This is why `LDR X0, [X1, #7]` is illegal — 7 is not a multiple of 8. **Use `LDUR` for non-multiples and negative offsets** (see §15.4). 
```asm // What you write: // What the hardware encodes: LDR X0, [X1, #0] // imm12 = 0 → address = X1 + 0 LDR X0, [X1, #8] // imm12 = 1 → address = X1 + (1 × 8) = X1 + 8 LDR X0, [X1, #32760] // imm12 = 4095 → address = X1 + (4095 × 8) LDR X0, [X1, #7] // ERROR: 7 is not a multiple of 8 LDR W0, [X1, #4] // imm12 = 1 → address = X1 + (1 × 4) = X1 + 4 LDRB W0, [X1, #100] // imm12 = 100 → address = X1 + 100 (byte, no scaling) ``` For negative offsets or offsets that aren't multiples of the access size, use `LDUR`/`STUR` instead (§15.4) — they use an unscaled signed 9-bit offset. **Pre-index** (`[Xn, #imm]!`): The `!` means "update the base register." The base is updated to `Xn + imm` **before** the memory access. Used for "push" operations. The offset is a signed 9-bit value (−256 to +255), NOT the scaled 12-bit field — this is why you can write `STR X0, [SP, #-16]!` with a negative offset. **Post-index** (`[Xn], #imm`): The offset is outside the brackets. The memory access uses the original Xn, then Xn is updated to `Xn + imm` **after** the access. Used for "pop" operations. Same signed 9-bit range as pre-index. ```asm // Pre-index: update base BEFORE the access LDR X0, [X1, #16]! // X1 = X1 + 16, then load from new X1 // Post-index: update base AFTER the access LDR X0, [X1], #16 // Load from X1, then X1 = X1 + 16 ``` **Stack push/pop patterns:** ```asm // Push X0 onto stack (pre-decrement): STR X0, [SP, #-16]! // SP -= 16, then store X0 // Pop X0 from stack (post-increment): LDR X0, [SP], #16 // Load X0, then SP += 16 ``` **Traced stack walkthrough:** ``` // Initial: SP = 0x1000, X0 = 0xDEAD, X1 = 0xBEEF STR X0, [SP, #-16]! // Step 1: SP = 0x1000 - 16 = 0xFF0 // Step 2: Store 0xDEAD at address 0xFF0 // Memory at 0xFF0: 0xDEAD STR X1, [SP, #-16]! 
// SP = 0xFF0 - 16 = 0xFE0, store 0xBEEF at 0xFE0 // Stack: 0xFE0→0xBEEF, 0xFF0→0xDEAD LDR X2, [SP], #16 // Load from 0xFE0 → X2 = 0xBEEF, then SP = 0xFE0 + 16 = 0xFF0 LDR X3, [SP], #16 // Load from 0xFF0 → X3 = 0xDEAD, then SP = 0xFF0 + 16 = 0x1000 // Stack restored: SP back to 0x1000, values popped in reverse order ``` **Register offset with shift/extend**: In the register-offset and extended-register forms, the shift `#s` can be either **0** (unshifted) or the **log2 of the access size** (scaled). No other values are encodable. | Access | Scaled shift | Valid `#s` values | |---|---|---| | LDR Xt (64-bit) | `#3` (×8) | `#0` or `#3` | | LDR Wt (32-bit) | `#2` (×4) | `#0` or `#2` | | LDRH (16-bit) | `#1` (×2) | `#0` or `#1` | | LDRB (8-bit) | `#0` (×1) | `#0` only | ```asm // Scaled: X2 is an element index, hardware multiplies by element size LDR X0, [X1, X2, LSL #3] // X0 = mem[X1 + X2*8] // Unshifted: X2 is a raw byte offset LDR X0, [X1, X2] // X0 = mem[X1 + X2] // Extended register with scaling (32-bit index into 64-bit address space): LDR W0, [X1, W3, SXTW #2] // W0 = mem[X1 + sign_extend(W3)*4] // Extended register without scaling: LDR W0, [X1, W3, SXTW] // W0 = mem[X1 + sign_extend(W3)] ``` ### 15.4 LDUR / STUR — Unscaled Offset **The problem LDUR solves**: Regular `LDR Xd, [Xn, #offset]` uses a **scaled** 12-bit unsigned offset — the hardware stores `offset ÷ access_size`, so for a 64-bit load the byte offset must be a multiple of 8, for a 32-bit load a multiple of 4, etc. This means `LDR X0, [X1, #5]` is **illegal** — 5 is not a multiple of 8. Similarly, `LDR X0, [X1, #-8]` is illegal because the 12-bit field is unsigned (no negatives). `LDUR` and `STUR` solve both problems: they use an **unscaled** signed 9-bit offset, meaning the offset is a raw byte count (not divided by anything) and can be negative. 
``` LDUR Xt|XZR, [Xn|SP, #simm9] // Load 64 bits from address Xn + offset (-256 to +255) LDUR Wt|WZR, [Xn|SP, #simm9] // Load 32 bits, unscaled offset (zero-extends to 64) STUR Xt|XZR, [Xn|SP, #simm9] // Store 64 bits to address Xn + offset STUR Wt|WZR, [Xn|SP, #simm9] // Store 32 bits, unscaled offset LDURB Wt|WZR, [Xn|SP, #simm9] // Load byte, unscaled offset LDURH Wt|WZR, [Xn|SP, #simm9] // Load halfword, unscaled offset LDURSW Xt|XZR, [Xn|SP, #simm9] // Load signed word, sign-extend to 64, unscaled LDURSB Xt|XZR, [Xn|SP, #simm9] // Load signed byte, sign-extend to 64, unscaled LDURSB Wt|WZR, [Xn|SP, #simm9] // Load signed byte, sign-extend to 32, unscaled LDURSH Xt|XZR, [Xn|SP, #simm9] // Load signed halfword, sign-extend to 64, unscaled LDURSH Wt|WZR, [Xn|SP, #simm9] // Load signed halfword, sign-extend to 32, unscaled STURB Wt|WZR, [Xn|SP, #simm9] // Store byte, unscaled offset STURH Wt|WZR, [Xn|SP, #simm9] // Store halfword, unscaled offset // FP/SIMD unscaled: LDUR Bt, [Xn|SP, #simm9] // Load 8-bit FP/SIMD, unscaled LDUR Ht, [Xn|SP, #simm9] // Load 16-bit FP/SIMD, unscaled LDUR St, [Xn|SP, #simm9] // Load 32-bit FP/SIMD, unscaled LDUR Dt, [Xn|SP, #simm9] // Load 64-bit FP/SIMD, unscaled LDUR Qt, [Xn|SP, #simm9] // Load 128-bit SIMD, unscaled STUR Bt, [Xn|SP, #simm9] // Store 8-bit FP/SIMD, unscaled STUR Ht, [Xn|SP, #simm9] // Store 16-bit FP/SIMD, unscaled STUR St, [Xn|SP, #simm9] // Store 32-bit FP/SIMD, unscaled STUR Dt, [Xn|SP, #simm9] // Store 64-bit FP/SIMD, unscaled STUR Qt, [Xn|SP, #simm9] // Store 128-bit SIMD, unscaled ``` **Traced example — when you NEED LDUR:** ```asm // Struct with packed/unaligned fields: // struct { uint8_t type; uint64_t value; } __attribute__((packed)); // type at offset 0 (1 byte), value at offset 1 (NOT aligned to 8!) 
// // X0 = pointer to struct LDRB W1, [X0] // W1 = type (offset 0 — fine, byte access is always aligned) LDUR X2, [X0, #1] // X2 = value (offset 1 — NOT a multiple of 8, so LDR can't encode it) // LDUR uses raw byte offset: address = X0 + 1 // Accessing stack locals at negative offsets: // X29 (FP) points to saved frame, locals are below it LDUR X1, [X29, #-8] // Load local at FP-8 (negative offset, LDR can't encode negative) LDUR W2, [X29, #-20] // Load local at FP-20 // Compare: what LDR can and can't do: LDR X0, [X1, #8] // OK: 8 is a multiple of 8, encodable as imm12=1 LDR X0, [X1, #32760] // OK: max scaled offset (4095 × 8) // LDR X0, [X1, #5] // ILLEGAL: 5 is not a multiple of 8 // LDR X0, [X1, #-8] // ILLEGAL: LDR immediate offset is unsigned (no negatives) LDUR X0, [X1, #5] // OK: unscaled, raw byte offset 5 LDUR X0, [X1, #-8] // OK: unscaled, signed offset -8 ``` **Why two separate instructions?** Encoding efficiency. LDR's scaled 12-bit unsigned offset covers a large range (0 to 32,760 for 64-bit) which handles the vast majority of struct field and array accesses. LDUR's 9-bit signed offset covers the remaining cases (negative offsets, offsets that aren't multiples of the access size) in a smaller range (−256 to +255). Having both means common cases (positive, naturally-scaled) get the big range, and uncommon cases still work. Note: "unscaled" refers to the offset encoding, not memory alignment — whether an access faults on an unaligned address depends on `SCTLR.A`, not on whether you used LDR or LDUR. **How the assembler handles the overlap**: For offsets that BOTH can encode (e.g., `#0`, `#8`, `#16`), the assembler typically picks LDR (the scaled form). For negative offsets or non-multiples, it picks LDUR. GNU `as` does this automatically if you just write `LDR X0, [X1, #-8]` — it silently emits LDUR. But in disassembly, you'll see the explicit `LDUR` mnemonic. 
**Note on pre-index and post-index**: The `[Xn, #imm]!` (pre-index) and `[Xn], #imm` (post-index) forms also use unscaled signed 9-bit offsets — they share the same encoding space as LDUR/STUR. So `STR X0, [SP, #-16]!` works with a negative offset because pre-index uses the 9-bit signed field, not the 12-bit unsigned field.

**Writeback with base == destination**: For loads with writeback (`LDR Xt, [Xn, #imm]!` or `LDR Xt, [Xn], #imm`), `Xt` and `Xn` must be **different registers**. If they are the same, the architecture declares the case CONSTRAINED UNPREDICTABLE — the register may end up holding the loaded value, the updated address, or an UNKNOWN value, and different CPUs may behave differently. Stores with writeback where `Xt` == `Xn` are similarly CONSTRAINED UNPREDICTABLE: the value stored may be the old `Xt` or an UNKNOWN value. Avoid both cases.

**LDR vs LDUR at a glance (for 64-bit access):**

| | LDR (scaled) | LDUR (unscaled) | Pre/Post-index |
|---|---|---|---|
| Offset field | 12-bit unsigned | 9-bit signed | 9-bit signed |
| Stored as | offset ÷ 8 | raw bytes | raw bytes |
| Range | 0 to +32,760 | −256 to +255 | −256 to +255 |
| Must be multiple of 8? | **Yes** | No | No |
| Negative offset? | **No** | Yes | Yes |
| Updates base register? | No | No | Yes |
| Use case | Most struct/array access | Packed structs, negative offsets | Push/pop, walking memory |

**LDTR / STTR — Unprivileged Load/Store:**

`LDTR` and `STTR` perform loads and stores using the permissions of **EL0 (user mode)**, even when executing at EL1 (kernel). This is how the kernel safely accesses user-provided pointers — if the address is invalid or user-inaccessible, LDTR generates a fault that the kernel can catch, instead of silently accessing kernel memory.
```asm LDTR Xt|XZR, [Xn|SP, #simm9] // Load 64-bit with EL0 permission check LDTR Wt|WZR, [Xn|SP, #simm9] // Load 32-bit with EL0 permission check STTR Xt|XZR, [Xn|SP, #simm9] // Store 64-bit with EL0 permission check STTR Wt|WZR, [Xn|SP, #simm9] // Store 32-bit with EL0 permission check LDTRB Wt|WZR, [Xn|SP, #simm9] // Byte version LDTRH Wt|WZR, [Xn|SP, #simm9] // Halfword version LDTRSW Xt|XZR, [Xn|SP, #simm9] // Sign-extending word → 64-bit LDTRSB Xt|XZR, [Xn|SP, #simm9] // Sign-extending byte → 64-bit LDTRSB Wt|WZR, [Xn|SP, #simm9] // Sign-extending byte → 32-bit LDTRSH Xt|XZR, [Xn|SP, #simm9] // Sign-extending halfword → 64-bit LDTRSH Wt|WZR, [Xn|SP, #simm9] // Sign-extending halfword → 32-bit STTRB Wt|WZR, [Xn|SP, #simm9] // Store byte, unprivileged STTRH Wt|WZR, [Xn|SP, #simm9] // Store halfword, unprivileged ``` **Why LDTR exists**: When a user passes a pointer to a syscall, the kernel must validate it. Using regular `LDR` would access the address with kernel privileges — if the user passes a kernel address, the load succeeds and leaks kernel data. `LDTR` uses user-mode permissions, so invalid or privileged addresses fault safely. ### 15.5 LDR (literal) — PC-Relative Load Loads a value from a fixed address relative to the current instruction. The assembler computes the offset from PC to the label automatically. Used to load constants from "literal pools" — small data areas placed near the code. ``` LDR Xt|XZR, label // Load 64 bits from PC + offset (±1 MB) LDR Wt|WZR, label // Load 32 bits from PC + offset LDR Sd, label // Load single-precision FP from PC + offset LDR Dd, label // Load double-precision FP from PC + offset LDR Qd, label // Load 128-bit SIMD from PC + offset ``` ### 15.6 Alignment Requirements AArch64 is generally more tolerant of unaligned access than older ARM, but alignment still matters: **Default behavior**: Loads and stores to naturally-aligned addresses always work. 
For unaligned accesses, the behavior depends on the `SCTLR_EL1.A` bit (Alignment check enable). When A=0 (the default on Linux), most unaligned accesses work but are potentially slower — the CPU may split them into multiple bus transactions. When A=1, unaligned accesses generate an alignment fault exception. **What "naturally aligned" means**: An N-byte access is naturally aligned when the address is a multiple of N. So a 4-byte LDR W needs address % 4 == 0, an 8-byte LDR X needs address % 8 == 0, and a 16-byte LDP needs address % 8 == 0 (aligned to the element size, not the pair size). **Always require alignment** (regardless of SCTLR.A — these fault even with alignment checking disabled): - `LDXR`/`STXR` (exclusive): must be naturally aligned or you get an alignment fault. The hardware's exclusive monitor only tracks aligned addresses. - `LDAR`/`STLR` (acquire/release): must be naturally aligned. - SP must be 16-byte aligned whenever it is used as a base address — but this check is only active when `SCTLR_EL1.SA0` (for EL0) or `SCTLR_ELx.SA` (for the current EL) is enabled. Linux enables SA0 by default, so EL0 code faults on unaligned SP. Bare-metal or custom kernels may have it disabled. - Atomic instructions (LSE: `LDADD`, `CAS`, `SWP`, etc.): must be naturally aligned. **Follow SCTLR.A** (alignment-checked only when A=1): - `LDP`/`STP`: the address should be aligned to the element size (8 for Xt, 4 for Wt). With SCTLR.A=0 (default on Linux), unaligned LDP/STP to Normal memory is architecturally permitted but may be slower or non-atomic. With SCTLR.A=1, unaligned LDP/STP faults. Best practice: always align. **Why alignment matters for atomics**: The CPU guarantees atomicity only for aligned accesses at the natural size. An 8-byte store to an 8-byte-aligned address is guaranteed to be visible to other cores as a single atomic write. 
An unaligned 8-byte store might be split into two 4-byte stores, and another core could observe half the old value and half the new value — a torn read. ### 15.7 PRFM — Prefetch Memory `PRFM` (Prefetch Memory) is a **hint** that tells the CPU to start loading data into cache before you actually need it. This can hide memory latency for predictable access patterns. The CPU is free to ignore the hint — it never changes program behavior and never causes faults, even if the address is invalid. ``` PRFM <prfop>, [Xn|SP, #imm] // Immediate offset (scaled, like LDR) PRFM <prfop>, [Xn|SP, Xm|XZR{, LSL #3}] // Register offset PRFM <prfop>, label // PC-relative (literal) PRFUM <prfop>, [Xn|SP, #simm9] // Unscaled offset ``` The `<prfop>` specifies what kind of prefetch: | Operation | Meaning | |---|---| | `PLDL1KEEP` | Prefetch for **load**, into **L1** cache, **temporal** (expect reuse) | | `PLDL1STRM` | Prefetch for **load**, into **L1**, **streaming** (one-time use, don't pollute cache) | | `PLDL2KEEP` | Prefetch for load, into L2 cache, temporal | | `PLDL2STRM` | Prefetch for load, into L2, streaming | | `PLDL3KEEP` | Prefetch for load, into L3, temporal | | `PSTL1KEEP` | Prefetch for **store** (get exclusive ownership), L1, temporal | | `PSTL1STRM` | Prefetch for store, L1, streaming | **When to use**: Prefetch helps when you know you'll access memory in a predictable pattern (e.g., walking an array) and the access is far enough ahead that the CPU's hardware prefetcher hasn't caught up. Typical use: prefetch the next cache line 2-4 iterations ahead in a loop. ```asm // Prefetch 256 bytes ahead while processing an array: loop: PRFM PLDL1KEEP, [X0, #256] // Hint: fetch data 256 bytes ahead LDR X1, [X0] // Process current element // ... work with X1 ... ADD X0, X0, #8 CMP X0, X2 B.LT loop ``` **Why PST (store prefetch) exists**: When you're about to write to a cache line, the CPU needs exclusive ownership of it (the MESI/MOESI "E" or "M" state). 
PST tells the CPU to acquire ownership early, avoiding a stall when the store actually happens. Useful for zeroing large buffers or initializing arrays. --- ## 16. Load/Store Pair, Non-Temporal & Exclusive Extensions to the basic load/store: pair operations (load/store two registers at once), non-temporal hints (bypass cache), and exclusive access (for implementing atomics). ### 16.1 LDP / STP — Load/Store Pair `LDP` (Load Pair) loads two registers from consecutive memory locations in a single instruction. `STP` (Store Pair) stores two registers. These are more efficient than two separate loads/stores, and they are the standard way to save and restore registers in function prologues and epilogues. ``` LDP Xt1|XZR, Xt2|XZR, [Xn|SP] // Load two 64-bit registers LDP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm] // Signed offset (multiple of 8, range −512 to +504) LDP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! // Pre-index LDP Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index STP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm] // Store pair (XZR stores zero) STP Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! // Pre-index STP Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index ``` **32-bit pair forms:** ``` LDP Wt1|WZR, Wt2|WZR, [Xn|SP] // Load two 32-bit registers LDP Wt1|WZR, Wt2|WZR, [Xn|SP, #simm] // Signed offset (multiple of 4, range −256 to +252) LDP Wt1|WZR, Wt2|WZR, [Xn|SP, #simm]! // Pre-index LDP Wt1|WZR, Wt2|WZR, [Xn|SP], #simm // Post-index STP Wt1|WZR, Wt2|WZR, [Xn|SP, #simm] // Store pair 32-bit STP Wt1|WZR, Wt2|WZR, [Xn|SP, #simm]! // Pre-index STP Wt1|WZR, Wt2|WZR, [Xn|SP], #simm // Post-index ``` **LDPSW — Load Pair Signed Word (sign-extend each 32-bit value to 64):** ``` LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP] // Load two signed 32-bit → 64-bit LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP, #simm] // Signed offset (multiple of 4, range −256 to +252) LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP, #simm]! 
// Pre-index LDPSW Xt1|XZR, Xt2|XZR, [Xn|SP], #simm // Post-index ``` **FP/SIMD pair forms:** ``` LDP St1, St2, [Xn|SP{, #simm}] // Load pair single (offset: multiple of 4, range −256 to +252) LDP Dt1, Dt2, [Xn|SP{, #simm}] // Load pair double (offset: multiple of 8, range −512 to +504) LDP Qt1, Qt2, [Xn|SP{, #simm}] // Load pair quad (offset: multiple of 16, range −1024 to +1008) STP St1, St2, [Xn|SP{, #simm}] // Store pair single (same offset range as LDP St) STP Dt1, Dt2, [Xn|SP{, #simm}] // Store pair double (same offset range as LDP Dt) STP Qt1, Qt2, [Xn|SP{, #simm}] // Store pair quad (same offset range as LDP Qt) // All FP/SIMD pair forms support pre-index [Xn|SP, #simm]! and post-index [Xn|SP], #simm. ``` **What the hardware actually encodes**: Like LDR, the offset is stored divided by the access size. LDP has a 7-bit signed offset field. For 64-bit pairs: the hardware stores `offset ÷ 8`, so the byte offset must be a multiple of 8 (range: −512 to +504, since a 7-bit signed value is −64 to +63, times 8). For 32-bit pairs: `offset ÷ 4` (range: −256 to +252). So `STP X29, X30, [SP, #-16]!` encodes the offset as −16 ÷ 8 = −2. **Gotcha**: `LDP Xt1, Xt2, [Xn|SP]` — the two destination registers `Xt1` and `Xt2` **must be different**. `LDP X0, X0, [X1]` is unpredictable (the CPU doesn't know which value to keep in X0). The base register can be **SP** (this is the standard prologue/epilogue pattern). The data registers can be **XZR** (to discard one or both loaded values). **Writeback constraint**: For both LDP and STP with pre/post-index (`!` or post-index form), the base register `Xn` must not be the same as either data register. For LDP, this is because both the loaded value and the updated base address would try to write to the same register. For STP, the CPU might update the base before reading the data register, corrupting the stored value. Violating this is CONSTRAINED UNPREDICTABLE — it may work on one implementation and fail on another. 
**Function prologue/epilogue pattern:** ```asm // Prologue: save FP and LR STP X29, X30, [SP, #-16]! // Push FP and LR, decrement SP MOV X29, SP // Set frame pointer // Epilogue: restore FP and LR LDP X29, X30, [SP], #16 // Pop FP and LR, increment SP RET ``` **Traced prologue/epilogue (what REALLY happens to memory):** ``` // Initial state: SP = 0x4000, X29 = 0xOLD_FP, X30 = 0xRETURN_ADDR STP X29, X30, [SP, #-16]! // Step 1: SP = 0x4000 - 16 = 0x3FF0 (pre-decrement) // Step 2: mem[0x3FF0] = X29 = 0xOLD_FP (first register at lower address) // mem[0x3FF8] = X30 = 0xRETURN_ADDR (second register at higher address) // SP is now 0x3FF0 MOV X29, SP // X29 = 0x3FF0 (frame pointer points to saved FP/LR pair) // ... function body uses X29-relative offsets for local variables ... LDP X29, X30, [SP], #16 // Step 1: X29 = mem[0x3FF0] = 0xOLD_FP (restore from lower address) // X30 = mem[0x3FF8] = 0xRETURN_ADDR (restore from higher address) // Step 2: SP = 0x3FF0 + 16 = 0x4000 (post-increment, SP restored) RET // Branch to X30 = 0xRETURN_ADDR ``` ### 16.2 LDNP / STNP — Non-Temporal Pair "Non-temporal" means the CPU is told this data won't be needed again soon. The CPU may skip caching it, which avoids polluting the cache during large streaming operations like copying a big buffer. 
``` LDNP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // Non-temporal load pair 64-bit (offset: multiple of 8, range −512 to +504) STNP Xt1|XZR, Xt2|XZR, [Xn|SP{, #simm}] // Non-temporal store pair 64-bit LDNP Wt1|WZR, Wt2|WZR, [Xn|SP{, #simm}] // Non-temporal load pair 32-bit (offset: multiple of 4, range −256 to +252) STNP Wt1|WZR, Wt2|WZR, [Xn|SP{, #simm}] // Non-temporal store pair 32-bit LDNP St1, St2, [Xn|SP{, #simm}] // Non-temporal load pair single FP LDNP Dt1, Dt2, [Xn|SP{, #simm}] // Non-temporal load pair double FP LDNP Qt1, Qt2, [Xn|SP{, #simm}] // Non-temporal load pair quad STNP St1, St2, [Xn|SP{, #simm}] // Non-temporal store pair single FP STNP Dt1, Dt2, [Xn|SP{, #simm}] // Non-temporal store pair double FP STNP Qt1, Qt2, [Xn|SP{, #simm}] // Non-temporal store pair quad ``` The offset encoding is the same as LDP/STP (7-bit signed, scaled by element size). Only the signed-offset form exists — no pre-index or post-index. ### 16.3 LDXR / STXR — Exclusive (for atomics) Exclusive loads and stores are the building blocks for lock-free atomic operations. `LDXR` (Load Exclusive) reads a value from memory and sets up an **exclusive monitor** — a hardware mechanism that watches the address. `STXR` (Store Exclusive) attempts to write back — but it only succeeds if no other CPU core has written to that address since the `LDXR`. If it fails, the status register `Ws` is set to 1; if it succeeds, `Ws` is 0. You retry the whole sequence until the store succeeds. **Why this works**: The exclusive monitor is a simple 1-bit flag per core (plus the tracked address). `LDXR` sets the flag. Any write to that cache line by any core (including DMA devices) clears it. `STXR` checks the flag — if clear, someone else modified the data, so the store is aborted. This is how you build atomic read-modify-write without locks. 
``` LDXR Xt|XZR, [Xn|SP] // Load exclusive 64-bit (start exclusive monitor) LDXR Wt|WZR, [Xn|SP] // Load exclusive 32-bit STXR Ws|WZR, Xt|XZR, [Xn|SP] // Store exclusive 64-bit (Ws = 0 if success, 1 if failed) STXR Ws|WZR, Wt|WZR, [Xn|SP] // Store exclusive 32-bit LDXRB Wt|WZR, [Xn|SP] // Load exclusive byte STXRB Ws|WZR, Wt|WZR, [Xn|SP] // Store exclusive byte LDXRH Wt|WZR, [Xn|SP] // Load exclusive halfword STXRH Ws|WZR, Wt|WZR, [Xn|SP] // Store exclusive halfword LDXP Xt1|XZR, Xt2|XZR, [Xn|SP] // Load exclusive pair 64-bit (128-bit atomic read) LDXP Wt1|WZR, Wt2|WZR, [Xn|SP] // Load exclusive pair 32-bit (64-bit atomic read) STXP Ws|WZR, Xt1|XZR, Xt2|XZR, [Xn|SP] // Store exclusive pair 64-bit STXP Ws|WZR, Wt1|WZR, Wt2|WZR, [Xn|SP] // Store exclusive pair 32-bit ``` **Alignment requirement**: The address `[Xn|SP]` **must be naturally aligned** — aligned to the access size (4 bytes for Wt, 8 bytes for Xt, 16 bytes for LDXP/STXP of Xt). Unaligned exclusive access generates an alignment fault regardless of `SCTLR.A`. This is because the exclusive monitor tracks at cache-line granularity, and unaligned accesses could span two cache lines, making atomicity impossible. **Rules for the exclusive sequence** (violating these may cause the store to always fail): 1. The LDXR and STXR must target the **same address and size**. 2. Between LDXR and STXR, **avoid** accessing other memory locations — other loads/stores may cause the exclusive monitor to be cleared on some implementations, which makes the STXR fail and forces a retry. The ARM architecture permits (but does not require) the monitor to be cleared by any other memory access, so keeping the sequence to pure register operations maximizes portability and success rate. 3. Do not branch to code that might be context-switched (the OS clears the monitor on context switch via `CLREX`). 4. Keep the sequence **short** — long sequences increase the chance of another core invalidating the monitor. 5. 
`STXR`'s status register `Ws` **must be a different register** from both `Xt` (data) and `Xn` (address base). If they overlap, the behavior is constrained unpredictable — it might work on one CPU and fail on another. Note: `Ws` can technically be WZR (to discard the status), but then you can't check whether the store succeeded, which defeats the point of the exclusive. (The architecture's Rs≠Rn check explicitly exempts n=31, so WZR status with an SP base is not itself unpredictable — just useless.) 6. **Don't nest LDXR**: A second `LDXR` to a different address cancels the first exclusive monitor. There's only one monitor per core — the last LDXR wins.

**Why LDXR/STXR (instead of just CAS)?** Base ARMv8.0 shipped without single-instruction CAS because the exclusive pair approach is simpler in hardware — the CPU just needs a "monitor" flag per cache line, not a full read-modify-write pipeline. LDXR/STXR also works for arbitrary read-modify-write patterns (not just compare-and-swap). CAS was added later in ARMv8.1 (LSE) because the exclusive retry loop wastes bus bandwidth under high contention — see §24.1.

**Classic exclusive retry loop (here, an atomic increment):**

```asm
// Atomically increment [X0]:
retry:
    LDXR X1, [X0]       // Load exclusive
    ADD  X1, X1, #1     // Modify
    STXR W2, X1, [X0]   // Store exclusive
    CBNZ W2, retry      // Retry if store failed
```

**Traced execution with contention (what REALLY happens):**

```
// [X0] = 42 initially. Core A and Core B both try to increment.
Core A:                              Core B:
LDXR X1, [X0] → X1=42                LDXR X1, [X0] → X1=42
(monitor set on cache line)          (monitor set on cache line)
ADD X1, X1, #1 → X1=43               ADD X1, X1, #1 → X1=43
STXR W2, X1, [X0]
  W2=0 (success! first to store)
  [X0] = 43
                                     STXR W2, X1, [X0]
                                       W2=1 (FAIL — Core A's store cleared our monitor)
                                     CBNZ W2, retry → back to LDXR
                                     LDXR X1, [X0] → X1=43 (sees Core A's write)
                                     ADD X1, X1, #1 → X1=44
                                     STXR W2, X1, [X0]
                                       W2=0 (success)
                                       [X0] = 44
// Final: [X0] = 44. Both increments applied. No lost update.
``` **CLREX — Clear Exclusive Monitor:** ``` CLREX // Clear the local exclusive monitor without storing ``` The OS kernel uses `CLREX` during context switches to ensure a thread doesn't carry a stale exclusive state from before it was scheduled out. **Exclusive Pair — 128-bit atomics:** ```asm LDXP Xt1|XZR, Xt2|XZR, [Xn|SP] // Load exclusive pair 64-bit (128 bits total) LDXP Wt1|WZR, Wt2|WZR, [Xn|SP] // Load exclusive pair 32-bit (64 bits total) STXP Ws|WZR, Xt1|XZR, Xt2|XZR, [Xn|SP] // Store exclusive pair 64-bit (Ws = 0 success, 1 fail) STXP Ws|WZR, Wt1|WZR, Wt2|WZR, [Xn|SP] // Store exclusive pair 32-bit LDAXP Xt1|XZR, Xt2|XZR, [Xn|SP] // Load-acquire exclusive pair 64-bit LDAXP Wt1|WZR, Wt2|WZR, [Xn|SP] // Load-acquire exclusive pair 32-bit STLXP Ws|WZR, Xt1|XZR, Xt2|XZR, [Xn|SP] // Store-release exclusive pair 64-bit STLXP Ws|WZR, Wt1|WZR, Wt2|WZR, [Xn|SP] // Store-release exclusive pair 32-bit ``` These load/store two 64-bit registers atomically as a 128-bit value. Used for lock-free 128-bit operations (e.g., doubly-linked list insertion where you need to atomically update both a pointer and a counter). The address must be 16-byte aligned. On ARMv8.1+ with LSE, `CASP` (compare-and-swap pair) is preferred for 128-bit CAS. --- ## 17. Branching & Control Flow Branches change the program counter (PC) — they make the CPU jump to a different instruction instead of continuing to the next one. They implement `if/else`, loops, and function calls. **Why so many branch types?** Each has a different range, and larger ranges require more encoding bits. `B` uses 26 bits for ±128 MB — enough for jumps within any reasonable function or between nearby functions. `B.cond` uses 19 bits for ±1 MB — conditions are usually short-range (within a function). `TBZ`/`TBNZ` uses only 14 bits for ±32 KB — testing a single bit is a tight, local operation. 
`CBZ`/`CBNZ` exist because "compare to zero and branch" is the single most common branch pattern in compiled code, and fusing it into one instruction saves both code size and branch predictor entries. `BR`/`BLR` use a full register for unlimited range — needed for function pointers, virtual dispatch, and PLT stubs. ### 17.1 Unconditional Branches `B` is a simple jump — go to a label unconditionally. `BL` ("Branch with Link") is a function call — it saves the return address in X30 before jumping, so `RET` can get back. `BR`/`BLR` are the same but take the target address from a register (indirect). ``` B label // Branch (PC-relative, ±128 MB) BL label // Branch with Link: X30 = return address, then branch (±128 MB) BR Xn|XZR // Branch to address in Xn (indirect) BLR Xn|XZR // Branch with Link to address in Xn RET {Xn|XZR} // Return: branch to Xn (default X30) // Functionally identical to BR X30, but hints branch predictor ``` **B vs BL vs BR vs BLR**: `B`/`BL` use an immediate offset (PC-relative, range limited). `BR`/`BLR` use a register (any address in the 64-bit space). `BL`/`BLR` save the return address in X30; `B`/`BR` don't. `RET` is functionally `BR X30` but gives the branch predictor a hint that this is a function return (not a computed jump), improving prediction accuracy. The `{Xn|XZR}` means the operand is optional — if omitted, it defaults to X30. **No conditional call**: AArch64 has **no conditional BL** (no `BL.cond`). You cannot conditionally call a function in one instruction. To call conditionally, branch around the BL: `B.NE skip; BL func; skip:`. This is a deliberate simplification from AArch32 (where almost every instruction could be conditional). **What the hardware actually encodes**: Since all AArch64 instructions are 4 bytes and 4-byte aligned, the branch target is always a multiple of 4 bytes away. So the hardware stores the offset **divided by 4** (the instruction count, not the byte count). 
A 26-bit signed field holding instruction counts gives a range of ±2^25 instructions = ±33,554,432 instructions × 4 bytes = ±128 MB. The same trick applies to all PC-relative branches: `B.cond` stores a 19-bit instruction count (±1 MB), `TBZ`/`TBNZ` stores a 14-bit instruction count (±32 KB). **BL vs BLR**: Both store the return address in X30 (LR). `BL` is PC-relative, `BLR` is indirect. **RET vs BR X30**: Both branch to X30 (by default), but `RET` tells the branch predictor this is a function return, enabling the return address stack to predict correctly. Always use `RET` for function returns. ### 17.2 Conditional Branches `B.cond` branches only if the condition (based on the NZCV flags) is true. The flags must be set by a prior instruction like `CMP`, `ADDS`, `SUBS`, `TST`, etc. If the condition is false, execution continues to the next instruction. ``` B.cond label // Branch if condition is true (±1 MB range) ``` Where `cond` is any condition code from the table in section 4 (`EQ`, `NE`, `LT`, `GE`, etc.). ### 17.3 Compare and Branch `CBZ` (Compare and Branch if Zero) and `CBNZ` (Compare and Branch if Not Zero) combine a zero-test with a branch in a single instruction. They do NOT set the condition flags — they just test the register and branch. They save you from writing a separate `CMP Xn, #0` + `B.EQ`/`B.NE` pair. ``` CBZ Xn|XZR, label // Branch if Xn == 0 (±1 MB) CBNZ Xn|XZR, label // Branch if Xn != 0 CBZ Wn|WZR, label // Branch if Wn == 0 CBNZ Wn|WZR, label // Branch if Wn != 0 ``` These do NOT set flags. They compare to zero and branch in a single instruction, saving a `CMP` + `B.EQ`/`B.NE` pair. ### 17.4 Test Bit and Branch ``` TBZ Xn|XZR, #0-63, label // Branch if bit #bit of Xn is 0 (±32 KB range) TBNZ Xn|XZR, #0-63, label // Branch if bit #bit of Xn is 1 TBZ Wn|WZR, #0-31, label // 32-bit form TBNZ Wn|WZR, #0-31, label ``` **Encoding note**: The register width determines the valid bit range. 
`TBZ Wn, #bit` requires bit 0–31; `TBZ Xn, #bit` allows 0–63. Some assemblers/disassemblers always show the Xn form when bit >= 32 and the Wn form when bit <= 31, even if you wrote it differently. Very useful for testing a single flag bit: ```asm TBZ X0, #31, positive // Branch if bit 31 (sign bit of 32-bit) is 0 TBNZ X0, #0, is_odd // Branch if bit 0 (LSB — least significant bit, the rightmost bit) is 1 ``` Note the smaller range (±32 KB) compared to B.cond (±1 MB) or B (±128 MB). ### 17.5 Branch Ranges Summary | Instruction | Range | |---|---| | `B` / `BL` | ±128 MB | | `B.cond` / `CBZ` / `CBNZ` | ±1 MB | | `TBZ` / `TBNZ` | ±32 KB | | `BR` / `BLR` / `RET` | Full 64-bit address space | If a conditional branch target is out of range, the assembler/linker may invert the condition and use a trampoline: ```asm // Instead of: B.EQ far_away (out of range) B.NE skip B far_away // unconditional B has ±128 MB range skip: ``` --- ## 18. Conditional Select & Increment AArch64 replaces AArch32's conditional execution (predicated instructions) with conditional select instructions. These choose between two values based on the condition flags, without branching. This is how compilers implement branchless `if/else` — the CPU always executes the instruction, but the result depends on the flags. ### 18.1 CSEL — Conditional Select `CSEL` picks one of two register values based on a condition. If the condition is true, the first source is selected; otherwise, the second. Like a hardware ternary operator: `Xd = cond ? Xn : Xm`. The `cond` operand is any condition code from the table in §4 (EQ, NE, LT, GE, GT, LE, HI, LS, etc.) — it tests the current NZCV flags, so you typically need a CMP/TST/ADDS before the CSEL. ``` CSEL Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : Xm [64-bit] CSEL Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? 
Wn : Wm [32-bit, upper 32 of Xd zeroed] ``` ```asm CMP X0, X1 CSEL X2, X0, X1, LE // X2 = min(X0, X1) signed CSEL X3, X0, X1, GE // X3 = max(X0, X1) signed CSEL X4, X0, X1, HI // X4 = max(X0, X1) unsigned ``` **What CSEL REALLY does — traced:** ```asm // If X0 = 10, X1 = 20: CMP X0, X1 // 10 - 20: N=1, Z=0, V=0 → N!=V so LT; N==V false so not-GE CSEL X2, X0, X1, LE // LE true (Z=1||N!=V = 0||1 = true) → X2 = X0 = 10 (the min) ✓ CSEL X3, X0, X1, GE // GE false (N==V = 1==0 = false) → X3 = X1 = 20 (the max) ✓ ``` ### 18.2 CSINC — Conditional Select Increment `CSINC` selects the first source if the condition is true, otherwise selects the second source **plus 1**. Its most common alias is `CSET`, which sets a register to 1 if a condition is true and 0 otherwise — this is how compilers convert comparisons to boolean values (like C's `result = (a > b)`). ``` CSINC Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : (Xm + 1) [64-bit] CSINC Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : (Wm + 1) [32-bit] ``` Aliases: ``` CINC Xd|XZR, Xn|XZR, cond // Xd = cond ? Xn+1 : Xn. Encodes as: CSINC Xd, Xn, Xn, invert(cond) CINC Wd|WZR, Wn|WZR, cond CSET Xd|XZR, cond // Xd = cond ? 1 : 0. Encodes as: CSINC Xd, XZR, XZR, invert(cond) CSET Wd|WZR, cond ``` **Why the inverted condition?** This confuses everyone, but it's forced by the encoding. `CSINC Xd, Xn, Xm, cond` means "if cond is true, select Xn (unchanged); if cond is false, select Xm+1." For `CSET Rd, GT` (set to 1 if greater), we want: result=1 when GT, result=0 when not-GT. We encode this as `CSINC Rd, XZR, XZR, LE` — when LE is true (i.e., GT is false), we select XZR=0 (unchanged); when LE is false (i.e., GT is true), we select XZR+1=1. The inversion happens because the "interesting" operation (the +1) is on the false path of CSINC, so to make the +1 happen when our desired condition is true, we must invert it. 
The same logic applies to CINC, CINV, CSETM, and CNEG — they all apply their operation (increment, invert, negate) on the **false** path, so the alias inverts the condition to put the operation where you want it. ```asm CMP X0, #10 CSET W1, GT // W1 = (X0 > 10) ? 1 : 0 (common pattern for bool conversion) CSET X1, GT // X1 = same but 64-bit result ``` ### 18.3 CSINV — Conditional Select Invert `CSINV` selects the first source if the condition is true, otherwise selects the bitwise NOT of the second source. `CSETM Rd, cond` (set to all-ones if true, zero if false) is the most common alias — it produces a bitmask useful for branchless bitwise selection. ``` CSINV Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : ~Xm [64-bit] CSINV Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : ~Wm [32-bit] ``` Aliases: ``` CINV Xd|XZR, Xn|XZR, cond // Xd = cond ? ~Xn : Xn. Encodes as: CSINV Xd, Xn, Xn, invert(cond) CINV Wd|WZR, Wn|WZR, cond CSETM Xd|XZR, cond // Xd = cond ? -1 : 0. Encodes as: CSINV Xd, XZR, XZR, invert(cond) CSETM Wd|WZR, cond // 32-bit (Wd = 0xFFFFFFFF, NOT 64-bit -1) ``` **32-bit note**: `CSETM W0, cond` sets W0 to 0xFFFFFFFF (not 0xFFFFFFFFFFFFFFFF). X0 upper 32 bits are zeroed. ### 18.4 CSNEG — Conditional Select Negate `CSNEG` selects the first source if the condition is true, otherwise selects the two's complement negation of the second source. `CNEG Rd, Rn, cond` (negate if condition true, keep otherwise) is the key alias — it's how compilers implement branchless `abs()`. ``` CSNEG Xd|XZR, Xn|XZR, Xm|XZR, cond // Xd = cond ? Xn : -Xm [64-bit] CSNEG Wd|WZR, Wn|WZR, Wm|WZR, cond // Wd = cond ? Wn : -Wm [32-bit] ``` Alias: ``` CNEG Xd|XZR, Xn|XZR, cond // Xd = cond ? -Xn : Xn. 
Encodes as: CSNEG Xd, Xn, Xn, invert(cond) CNEG Wd|WZR, Wn|WZR, cond ``` ### 18.5 Branchless Patterns with Conditional Select ```asm // Absolute value: CMP X0, #0 CNEG X0, X0, LT // if (X0 < 0) X0 = -X0 // Clamp to range [0, 255]: CMP X0, #0 CSEL X0, XZR, X0, LT // X0 = max(X0, 0) MOV X1, #255 CMP X0, X1 CSEL X0, X1, X0, GT // X0 = min(X0, 255) // Convert bool to 0 or 1: CMP X0, #0 CSET W0, NE // W0 = (X0 != 0) ? 1 : 0 // Convert bool to 0 or -1 (all-ones mask): CMP X0, #0 CSETM W0, NE // W0 = (X0 != 0) ? 0xFFFFFFFF : 0 ``` **Traced examples for the aliases:** ```asm // ═══ CINC — Conditional Increment ═══ // CINC X0, X1, EQ = CSINC X0, X1, X1, NE (note inverted condition) // "If EQ, increment X1; otherwise keep X1 unchanged" // // If Z=1 (EQ is true): NE is false → CSINC takes false path → X0 = X1 + 1 // If Z=0 (EQ is false): NE is true → CSINC takes true path → X0 = X1 // // Concrete: X1 = 10, flags from CMP that set Z=1 (equal): // CINC X0, X1, EQ → X0 = 11 (incremented because EQ was true) // ═══ CNEG — Conditional Negate ═══ // CNEG X0, X1, LT = CSNEG X0, X1, X1, GE (inverted) // "If LT, negate X1; otherwise keep X1" // // Concrete: X1 = -5, flags from CMP that set LT true: // CNEG X0, X1, LT → X0 = 5 (negated because LT was true) // This is exactly how branchless abs() works: CMP + CNEG // ═══ CSETM — Conditional Set Mask ═══ // CSETM X0, NE = CSINV X0, XZR, XZR, EQ (inverted) // "Set to all-ones if NE, zero otherwise" // // If NE true: EQ false → CSINV takes false path → X0 = ~XZR = 0xFFFFFFFFFFFFFFFF // If NE false: EQ true → CSINV takes true path → X0 = XZR = 0 // // Why CSETM is useful: the all-ones mask (0xFFFF...F) can be used with AND/ORR // for branchless conditional operations on bitfields. In C, this is like: // mask = (cond) ? ~0ULL : 0ULL; // result = (value & mask) | (other & ~mask); ``` --- ## 19. System Registers & Special Instructions System registers control hardware features like interrupt masking, cache behavior, and virtual memory. 
They are not part of the general-purpose register file — you access them with dedicated `MRS` (read) and `MSR` (write) instructions. ### 19.1 MRS / MSR — System Register Access `MRS` (Move to Register from System) copies a system register into a general-purpose register. `MSR` (Move to System Register) copies a general-purpose register into a system register. Some system registers are read-only, some are write-only, and many are only accessible at higher exception levels (kernel, hypervisor). ``` MRS Xt|XZR, <sysreg> // Move system register to GPR MSR <sysreg>, Xt|XZR // Move GPR to system register MSR <sysreg>, #imm4 // Immediate to specific PSTATE fields ``` Common system registers: ```asm MRS X0, NZCV // Read condition flags MSR NZCV, X0 // Write condition flags MRS X0, FPCR // Floating-point control MRS X0, FPSR // Floating-point status MRS X0, CurrentEL // Current exception level (bits [3:2]) MRS X0, DAIF // Interrupt mask flags MRS X0, CNTFRQ_EL0 // Timer frequency MRS X0, CNTVCT_EL0 // Virtual timer count (high-resolution timestamp) MRS X0, CTR_EL0 // Cache type register MRS X0, DCZID_EL0 // Data cache zero ID MRS X0, TPIDR_EL0 // Thread ID register (user-accessible, used for thread-local storage) ``` ### 19.2 NOP, YIELD, WFE, WFI, SEV ``` NOP // No operation (often used for alignment or timing) YIELD // Hint: yield to other hardware threads sharing this core (spin-lock hint) WFE // Wait For Event (low-power wait) WFI // Wait For Interrupt (deeper low-power wait) SEV // Send Event (wake up WFE waiters) SEVL // Send Event Local (wake up local core from WFE) ``` **Spin-lock pattern with WFE:** ```asm spin: LDAXR W1, [X0] // Load-acquire exclusive CBNZ W1, wait // If locked, wait STXR W2, W3, [X0] // Try to store our value CBNZ W2, spin // If exclusive failed, retry B got_lock wait: WFE // Low-power wait until event B spin // Try again ``` ### 19.3 SVC / HVC / SMC — Exception Generation These instructions deliberately trigger an exception to call into a higher 
privilege level. `SVC` (Supervisor Call) is how user programs make system calls to the kernel — the 16-bit immediate has no effect on execution, but the hardware copies it into the ESR's ISS field so the handler can read it. `HVC` calls the hypervisor. `SMC` calls secure firmware. `BRK` triggers a debug breakpoint.

```
SVC #imm16 // Supervisor Call (EL0 → EL1 system call)
HVC #imm16 // Hypervisor Call (EL1 → EL2)
SMC #imm16 // Secure Monitor Call (EL1 → EL3)
BRK #imm16 // Breakpoint (debug exception)
HLT #imm16 // Halt (debug, external debugger)
```

Linux system call convention:

```asm
MOV X8, #64     // syscall number (e.g., 64 = write)
MOV X0, #1      // fd = stdout
ADR X1, message // buffer
MOV X2, #14     // length
SVC #0          // trigger syscall
// Return value in X0
```

### 19.4 HINT — Hint Space

Many "instructions" are actually specific encodings of HINT:

| Instruction | Encoding |
|---|---|
| `NOP` | `HINT #0` |
| `YIELD` | `HINT #1` |
| `WFE` | `HINT #2` |
| `WFI` | `HINT #3` |
| `SEV` | `HINT #4` |
| `SEVL` | `HINT #5` |
| `BTI` | `HINT #32/34/36/38` |
| `PACIA/PACIB/...` | Various HINT encodings (PAC) |

**Why HINT encoding?** Older CPUs that don't support a feature (like PAC or BTI) will execute the HINT as a NOP — the program still runs, just without the security benefit. This provides backward compatibility: a PAC-enabled binary runs safely on old hardware (no crashes, just no protection).

### 19.5 SYS / SYSL — System Instructions

`SYS` and `SYSL` are the generic system instruction encodings that all cache, TLB, and address translation operations are aliases for. You rarely write `SYS` directly — you write the friendly alias (like `DC ZVA`), and the assembler encodes it as `SYS`.

```
SYS #op1, Cn, Cm, #op2{, Xt|XZR} // System instruction with optional input register
SYSL Xt|XZR, #op1, Cn, Cm, #op2  // System instruction with output to Xt
```

### 19.6 Cache Maintenance Operations

These are all aliases for `SYS` instructions.
Cache maintenance is needed when writing self-modifying code (JIT compilers), setting up DMA transfers, or when the instruction and data caches see different views of memory. **Data Cache (DC) operations:** ```asm DC ZVA, Xt|XZR // Zero a cache line (Xt = address). Fastest way to zero memory. // Zeroes a block of memory the size of a data cache line (typically 64 bytes). // The block must be naturally aligned to the cache line size. // Read DCZID_EL0 to get the block size. DC CVAC, Xt|XZR // Clean to Point of Coherency (write dirty data back to main memory) DC CVAU, Xt|XZR // Clean to Point of Unification (for instruction fetch coherency) DC CIVAC, Xt|XZR // Clean and Invalidate to Point of Coherency DC IVAC, Xt|XZR // Invalidate (discard data, EL1+ only — dangerous, can lose dirty data) ``` **Instruction Cache (IC) operations:** ```asm IC IALLU // Invalidate all instruction caches (EL1+) IC IVAU, Xt|XZR // Invalidate instruction cache by address to Point of Unification ``` **Why you need DC+IC together for JIT**: When you write machine code to memory (via stores), it goes through the data cache. But the CPU fetches instructions from the instruction cache, which is separate. To make the CPU see your new code, you must: (1) clean the data cache line to the point of unification (`DC CVAU`), so the data reaches a level visible to the I-cache; (2) invalidate the instruction cache (`IC IVAU`), so the I-cache re-fetches from the cleaned data; (3) insert barriers (`DSB ISH` + `ISB`) to ensure ordering. ```asm // After writing code to [X0]: DC CVAU, X0 // Clean data cache to Point of Unification DSB ISH // Wait for clean to complete IC IVAU, X0 // Invalidate instruction cache DSB ISH // Wait for invalidate to complete ISB // Flush pipeline, fetch new instructions ``` ### 19.7 Address Translation (AT) Translate a virtual address using the page tables, without actually accessing memory. The result goes into `PAR_EL1` (Physical Address Register). 
Useful for debugging page table issues in kernel code. ```asm AT S1E1R, X0 // Stage 1, EL1, Read: translate X0 as if EL1 read AT S1E1W, X0 // Stage 1, EL1, Write AT S1E0R, X0 // Stage 1, EL0, Read: translate as user-mode read MRS X1, PAR_EL1 // Read result (physical address + attributes, or fault info) ``` ### 19.8 TLB Invalidation (TLBI) The TLB (Translation Lookaside Buffer) is a cache of page table entries. When the OS modifies page tables (changing permissions, unmapping pages, switching address spaces), it must invalidate stale TLB entries so the CPU re-reads the updated page tables. ```asm TLBI VMALLE1 // Invalidate ALL TLB entries at EL1 (current VMID) TLBI VAE1, X0 // Invalidate TLB entry for virtual address in X0 (EL1) TLBI ASIDE1, X0 // Invalidate all entries matching ASID in X0 TLBI VALE1, X0 // Invalidate by VA, last level only (more targeted, faster) DSB ISH // Wait for invalidation to complete ISB // Ensure subsequent instruction fetches use new translations ``` **Why TLBI needs DSB+ISB**: TLBI is asynchronous — it tells the TLB to invalidate, but the invalidation may not be complete when the next instruction executes. `DSB ISH` waits for the invalidation to finish across all cores in the inner shareable domain. `ISB` then flushes the pipeline so subsequent instructions fetch with the new translations. --- ## 20. Overflow, Underflow & Carry **Why overflow detection matters**: Integer arithmetic silently wraps on overflow — `UINT64_MAX + 1 = 0`. In most code this is harmless (or intentional). But for security-critical code (buffer size calculations, array index bounds), undetected overflow causes vulnerabilities. ARM doesn't trap on overflow (unlike some architectures) — you must explicitly check using the flag-setting instructions (`ADDS`, `SUBS`) and conditional branches. This section shows how. ### 20.1 Unsigned Overflow (Carry) For unsigned arithmetic, "overflow" means the result didn't fit in 64 (or 32) bits. The carry flag (C) indicates this. 
```asm ADDS X0, X1, X2 // Unsigned: if result < X1, carry occurred B.CS overflow // CS = Carry Set = unsigned overflow SUBS X0, X1, X2 // Unsigned: if X1 < X2, borrow occurred B.CC underflow // CC = Carry Clear = unsigned underflow (borrow) ``` **Remember ARM's inverted carry for subtraction:** After SUBS, C=1 means NO borrow (X1 >= X2 unsigned), C=0 means borrow occurred (X1 < X2 unsigned). ### 20.2 Signed Overflow (V flag) Signed overflow occurs when the result of an operation doesn't fit in the signed range. The V flag indicates this. ```asm ADDS X0, X1, X2 B.VS signed_overflow // V=1 means signed overflow // Signed overflow in addition: positive + positive = negative, or negative + negative = positive // Signed overflow in subtraction: positive - negative = negative, or negative - positive = positive ``` ### 20.3 Detecting Overflow in Practice **Unsigned multiply overflow:** ```asm // Check if X0 * X1 overflows unsigned 64-bit: UMULH X2, X0, X1 // High 64 bits of product MUL X3, X0, X1 // Low 64 bits (the result we want) CBNZ X2, overflow // If high bits non-zero, overflow ``` **Signed multiply overflow:** ```asm // Check if X0 * X1 overflows signed 64-bit: SMULH X2, X0, X1 // High 64 bits (signed) MUL X3, X0, X1 // Low 64 bits // Overflow if X2 != sign-extension of X3 ASR X4, X3, #63 // X4 = all zeros or all ones (sign of X3) CMP X2, X4 B.NE overflow ``` **Multi-word addition with carry propagation:** ```asm // 128-bit: (X1:X0) + (X3:X2) → (X5:X4) ADDS X4, X0, X2 // Low 64, set carry ADCS X5, X1, X3 // High 64 + carry, set carry B.CS overflow_128 // Carry out of 128-bit B.VS signed_overflow_128 // Signed overflow of 128-bit ``` ### 20.4 Saturating Arithmetic AArch64 scalar doesn't have saturating add/sub (unlike NEON). 
You must implement it: ```asm // Unsigned saturating add: X0 = min(X1 + X2, UINT64_MAX) MOV X3, #-1 // X3 = UINT64_MAX ADDS X0, X1, X2 CSEL X0, X3, X0, CS // If carry (overflow), use UINT64_MAX; else keep result // Signed saturating add is more complex — need to handle both directions: ADDS X0, X1, X2 // On signed overflow, saturate to INT64_MAX or INT64_MIN depending on direction MOV X3, #0x7FFFFFFFFFFFFFFF // INT64_MAX (valid bitmask immediate) // The trick: ASR #63 fills the entire register with copies of the sign bit. // If X1 >= 0: X1 ASR 63 = 0x0000000000000000, so EOR with INT64_MAX = INT64_MAX // If X1 < 0: X1 ASR 63 = 0xFFFFFFFFFFFFFFFF, so EOR with INT64_MAX = 0x8000000000000000 = INT64_MIN // This selects the correct saturation direction: positive overflow → MAX, negative overflow → MIN EOR X4, X3, X1, ASR #63 // X4 = INT64_MAX if X1 positive, INT64_MIN if negative CSEL X0, X4, X0, VS // If signed overflow, use saturated value ``` --- ## 21. Exceptions, Interrupts & Exception Levels ARM has a privilege system called Exception Levels (EL0–EL3). If you're writing user-space code, you only interact with exceptions via `SVC` (system calls). If you're writing a kernel, hypervisor, or firmware, you need to understand the full exception model. Even for RE, understanding EL helps you identify what privilege level code runs at. ### 21.1 Exception Levels (EL) | Level | Typical use | Can access | |---|---|---| | EL0 | User applications | User registers, EL0 system regs | | EL1 | OS kernel | EL0 + EL1 system regs, page tables | | EL2 | Hypervisor | EL0 + EL1 + EL2 system regs | | EL3 | Secure Monitor / firmware | Everything | Higher EL = more privilege. Exceptions go UP (or stay same level), returns go DOWN. You **cannot** take an exception to a lower EL. ### 21.2 Exception Types 1. 
**Synchronous exceptions** (caused by current instruction): - **SVC/HVC/SMC**: System calls - **Instruction abort**: Bad instruction fetch (e.g., page fault, permission fault) - **Data abort**: Bad data access (e.g., page fault, alignment, permission) - **Undefined instruction**: Unrecognized encoding - **Debug exceptions**: BRK, watchpoint, breakpoint, single-step - **SP/PC alignment fault** 2. **Asynchronous exceptions** (not caused by current instruction): - **IRQ**: Normal hardware interrupt — an external device (timer, network card, keyboard) signals the CPU that it needs attention. The CPU pauses its current code, saves state, and jumps to the interrupt handler. This happens asynchronously (at any point during program execution). - **FIQ**: Fast interrupt — same concept as IRQ but with a separate, higher-priority path. Used for latency-critical handlers (e.g., secure world interrupts). - **SError**: System error (asynchronous abort, e.g., uncorrectable memory error from a previous write that was buffered) ### 21.3 Exception Handling Mechanism When an exception occurs to ELx: 1. `PSTATE` is saved to `SPSR_ELx` (Saved Program Status Register) 2. Return address is saved to `ELR_ELx` (Exception Link Register) 3. Exception Syndrome info is saved to `ESR_ELx` (tells you WHY: instruction class, fault details) 4. If it's an abort, the faulting address is in `FAR_ELx` (Fault Address Register) 5. PSTATE is modified (interrupts masked, EL set, etc.) 6. PC jumps to the exception vector **ESR_ELx decoding**: Bits [31:26] are the **Exception Class (EC)** — the top-level reason for the exception. 
Common EC values:

| EC (hex) | Meaning |
|---|---|
| 0x15 | SVC from AArch64 (system call) |
| 0x18 | MSR/MRS trap (system register access from lower EL) |
| 0x20 | Instruction abort from lower EL (page fault on instruction fetch) |
| 0x21 | Instruction abort from same EL |
| 0x24 | Data abort from lower EL (page fault on data access) |
| 0x25 | Data abort from same EL |
| 0x3C | BRK instruction (debug breakpoint) |

Bits [24:0] are the **ISS (Instruction Specific Syndrome)** — details specific to each EC. For SVC, the ISS contains the 16-bit immediate from the SVC instruction. For data aborts, the ISS tells you whether it was a read or write, the access size, and the fault type (translation, permission, alignment, etc.).

### 21.4 Exception Vector Table (VBAR_ELx)

Each EL has a vector base address register (`VBAR_EL1`, `VBAR_EL2`, `VBAR_EL3`). The vector table has 16 entries, each 128 bytes (32 instructions). The CPU picks which entry to jump to based on three things: where the exception came from, which stack pointer was active, and what type of exception it is. The four groups: **"Current EL with SP_EL0"** means the exception happened at the same EL that's handling it, and the code was using the user-mode stack pointer (unusual — most kernels switch to SP_ELx immediately). **"Current EL with SP_ELx"** is the normal case for kernel exceptions. **"Lower EL, AArch64/AArch32"** means the exception came from a less-privileged level (e.g., a user-mode `SVC` arriving at the kernel).
| Offset | Source | Type | |---|---|---| | 0x000 | Current EL with SP_EL0 | Synchronous | | 0x080 | Current EL with SP_EL0 | IRQ | | 0x100 | Current EL with SP_EL0 | FIQ | | 0x180 | Current EL with SP_EL0 | SError | | 0x200 | Current EL with SP_ELx | Synchronous | | 0x280 | Current EL with SP_ELx | IRQ | | 0x300 | Current EL with SP_ELx | FIQ | | 0x380 | Current EL with SP_ELx | SError | | 0x400 | Lower EL, AArch64 | Synchronous | | 0x480 | Lower EL, AArch64 | IRQ | | 0x500 | Lower EL, AArch64 | FIQ | | 0x580 | Lower EL, AArch64 | SError | | 0x600 | Lower EL, AArch32 | Synchronous | | 0x680 | Lower EL, AArch32 | IRQ | | 0x700 | Lower EL, AArch32 | FIQ | | 0x780 | Lower EL, AArch32 | SError | **Return from exception:** ``` ERET // PC = ELR_ELx, PSTATE = SPSR_ELx, EL drops as appropriate ``` **Practical example — minimal SVC handler (EL1 kernel handling user SVC):** ```asm // This code would be at VBAR_EL1 + 0x400 (Lower EL AArch64, Synchronous) el0_sync_handler: STP X29, X30, [SP, #-16]! // Save frame (using kernel SP) MRS X0, ESR_EL1 // Read exception syndrome LSR X1, X0, #26 // Extract EC (bits [31:26]) CMP X1, #0x15 // EC=0x15 = SVC from AArch64? B.NE not_svc // If not SVC, handle other exception // It's a syscall — X8 has the syscall number (set by user before SVC) // X0-X7 have the arguments (set by user) MRS X9, ELR_EL1 // Save return address (instruction after SVC) // ... dispatch to syscall handler based on X8 ... // ... handler puts return value in X0 ... MSR ELR_EL1, X9 // Restore return address LDP X29, X30, [SP], #16 ERET // Return to user: PC=ELR, PSTATE=SPSR, drop to EL0 ``` ### 21.5 Masking Interrupts ```asm MSR DAIFSet, #0xF // Mask all: Debug, SError (A), IRQ (I), FIQ (F) MSR DAIFClr, #0xF // Unmask all MSR DAIFSet, #0x2 // Mask IRQ only (bit 1) MSR DAIFClr, #0x2 // Unmask IRQ only ``` The bits: D=bit3, A=bit2, I=bit1, F=bit0. A set bit means **masked** (disabled). --- ## 22. 
Floating Point (SIMD/FP)

Floating-point instructions operate on the S (32-bit single) and D (64-bit double) register views of the SIMD/FP register file. They are separate from integer instructions and use a separate set of condition flag semantics for comparisons (particularly around NaN — "Not a Number" — which is a special float value representing undefined results like 0/0).

### 22.1 Basic FP Instructions

These mirror integer arithmetic but for floating-point values. `FADD` adds, `FSUB` subtracts, `FMUL` multiplies, `FDIV` divides. Unlike integer division, `FDIV` can produce fractional results. `FSQRT` computes the square root.

```asm
FADD Sd, Sn, Sm       // Single-precision add
FADD Dd, Dn, Dm       // Double-precision add
FSUB Sd, Sn, Sm       // Single-precision subtract
FSUB Dd, Dn, Dm       // Double-precision subtract
FMUL Sd, Sn, Sm       // Single-precision multiply
FMUL Dd, Dn, Dm       // Double-precision multiply
FDIV Sd, Sn, Sm       // Single-precision divide
FDIV Dd, Dn, Dm       // Double-precision divide
FNEG Sd, Sn           // Negate single (flip sign bit)
FNEG Dd, Dn           // Negate double
FABS Sd, Sn           // Absolute value single (clear sign bit)
FABS Dd, Dn           // Absolute value double
FSQRT Sd, Sn          // Square root single
FSQRT Dd, Dn          // Square root double
FRINT32Z Sd, Sn       // Round to 32-bit integer value (toward zero), stay in FP format
                      // REQUIRES FEAT_FRINTTS (check ID_AA64ISAR1_EL1.FRINTTS)
                      // Also: FRINT32X (round using FPCR mode), FRINT64Z, FRINT64X
FRINTN Sd, Sn         // Round to nearest integer (stay in FP: 3.7 → 4.0, not integer 4)
FRINTN Dd, Dn         // Double
FRINTM Sd, Sn         // Round toward -infinity (floor), stay in FP
FRINTM Dd, Dn         // Double
FRINTP Sd, Sn         // Round toward +infinity (ceil), stay in FP
FRINTP Dd, Dn         // Double
FRINTZ Sd, Sn         // Round toward zero (truncate), stay in FP
FRINTZ Dd, Dn         // Double
FRINTA Sd, Sn         // Round to nearest, ties away from zero, stay in FP
FRINTA Dd, Dn         // Double
FRINTX Sd, Sn         // Round using FPCR mode, signal inexact
FRINTX Dd, Dn         // Double
FRINTI Sd, Sn         // Round using FPCR mode
FRINTI Dd, Dn         // Double
// FRINTN/M/P/Z/A/X/I are baseline ARMv8.0. FRINT32Z/32X/64Z/64X require FEAT_FRINTTS.
```

**What these REALLY do — traced with values:**

```asm
// If S0 = 3.0 and S1 = 1.5:
FADD S2, S0, S1       // S2 = 3.0 + 1.5 = 4.5
FSUB S2, S0, S1       // S2 = 3.0 - 1.5 = 1.5
FMUL S2, S0, S1       // S2 = 3.0 × 1.5 = 4.5
FDIV S2, S0, S1       // S2 = 3.0 ÷ 1.5 = 2.0

// Special cases the hardware handles:
FDIV S2, S0, S1       // If S1 = 0.0: S2 = +infinity (not an exception!)
FDIV S2, S0, S1       // If S0 = 0.0 and S1 = 0.0: S2 = NaN (0/0 is undefined)
FSQRT S2, S0          // If S0 = -1.0: S2 = NaN (square root of negative)
FSQRT S2, S0          // If S0 = 4.0: S2 = 2.0
```

**Why FP doesn't trap on errors by default**: Unlike integer division (which returns 0 for divide-by-zero on ARM), FP operations produce IEEE 754 special values (infinity, NaN) instead of faulting. This lets algorithms handle edge cases without branch-heavy error checking. If you need to detect errors, check `FPSR` exception flags after the computation.

### 22.2 FP Multiply-Accumulate

Fused multiply-accumulate (FMA) computes `a + (b × c)` or `a - (b × c)` with only **one** rounding step at the end, making it more accurate than separate FMUL + FADD. This is the single most important instruction for numerical performance — matrix multiply, convolution, polynomial evaluation, and physics simulations all reduce to FMA loops.

```asm
FMADD Sd, Sn, Sm, Sa    // Sd = Sa + (Sn * Sm), fused (single rounding) [single]
FMADD Dd, Dn, Dm, Da    // Dd = Da + (Dn * Dm) [double]
FMSUB Sd, Sn, Sm, Sa    // Sd = Sa - (Sn * Sm) [single]
FMSUB Dd, Dn, Dm, Da    // Dd = Da - (Dn * Dm) [double]
FNMADD Sd, Sn, Sm, Sa   // Sd = -Sa - (Sn * Sm) = -(Sa + Sn*Sm) [single]
FNMADD Dd, Dn, Dm, Da   // Dd = -Da - (Dn * Dm) [double]
FNMSUB Sd, Sn, Sm, Sa   // Sd = -Sa + (Sn * Sm) = Sn*Sm - Sa [single]
FNMSUB Dd, Dn, Dm, Da   // Dd = -Da + (Dn * Dm) [double]
```

`FNMADD` and `FNMSUB` are the negated versions — they negate the entire result.
`FNMADD` negates the fused multiply-add (useful for computing `-(a×b + c)`). `FNMSUB` computes `a×b - c` (the multiply result minus the accumulator).

**Fused**: Only one rounding at the end, not after multiply then again after add. This is more accurate than separate FMUL + FADD.

**FMADD for polynomial evaluation** (Horner's method): compilers use FMADD to evaluate `a*x^2 + b*x + c` as `FMADD(FMADD(a, x, b), x, c)` — two fused multiply-adds instead of separate multiply and add chains.

### 22.3 FP Conditional Select & Moves

```asm
FCSEL Sd, Sn, Sm, cond   // Sd = cond ? Sn : Sm (based on integer NZCV flags) [single]
FCSEL Dd, Dn, Dm, cond   // Dd = cond ? Dn : Dm [double]
FCSEL Hd, Hn, Hm, cond   // Hd = cond ? Hn : Hm (FEAT_FP16) [half]
```

**What FCSEL REALLY does**: It's the FP equivalent of CSEL. The condition is tested against the integer NZCV flags (typically set by a prior FCMP or CMP), and one of the two FP registers is selected. This enables branchless FP min/max:

```asm
// FP min — no-NaN version (assumes ordered inputs):
FCMP S0, S1
FCSEL S0, S0, S1, LE    // LE includes unordered — wrong if either operand is NaN

// Ordered select (NOT a true min — picks S1 when unordered, even if S1 is NaN):
FCMP S0, S1
FCSEL S0, S0, S1, MI    // S0 only when S0 < S1 ordered; else S1

// TRUE NaN-safe min/max — use FMINNM/FMAXNM:
FMINNM Sd, Sn, Sm       // Min, returns numeric value when one operand is quiet NaN [single]
FMINNM Dd, Dn, Dm       // [double]
FMAXNM Sd, Sn, Sm       // Max, returns numeric value when one operand is quiet NaN [single]
FMAXNM Dd, Dn, Dm       // [double]
FMIN Sd, Sn, Sm         // Min (IEEE 754-2008 minimum: propagates NaN) [single]
FMIN Dd, Dn, Dm         // [double]
FMAX Sd, Sn, Sm         // Max (IEEE 754-2008 maximum: propagates NaN) [single]
FMAX Dd, Dn, Dm         // [double]

// FP abs:
FABS S0, S0             // Just use FABS — no FCSEL needed
```

**FP register-to-register moves** (no conversion, just copy):

```asm
FMOV Sd, Sn    // Copy single-precision register
FMOV Dd, Dn    // Copy double-precision register
```

These
copy the value between FP registers without touching GPRs. Unlike `MOV` (which is always an alias), `FMOV` between same-width FP registers is a real instruction. ### 22.4 FP Comparison `FCMP` compares two floating-point values and sets the NZCV flags. Unlike integer comparisons, floats have a special case: if either operand is NaN (Not a Number), the result is "unordered" — the operands cannot be compared. After `FCMP`, you can use the same condition codes as after integer `CMP`, plus `B.VS` to detect the NaN/unordered case. ```asm FCMP Sn, Sm // Compare single, set NZCV flags FCMP Dn, Dm // Compare double FCMP Sn, #0.0 // Compare single with zero FCMP Dn, #0.0 // Compare double with zero FCMPE Sn, Sm // Signaling compare single: signals Invalid Operation for ANY NaN FCMPE Dn, Dm // Signaling compare double // (FCMP only signals for signaling NaNs, not quiet NaNs) FCMPE Sn, #0.0 // Signaling compare single against zero FCMPE Dn, #0.0 // Signaling compare double against zero // FEAT_FP16: FCMP Hn, Hm / FCMP Hn, #0.0 / FCMPE Hn, Hm / FCMPE Hn, #0.0 ``` After FCMP: - Ordered and equal: Z=1, C=1, N=0, V=0 - Ordered and less than: N=1, V=0 (so N!=V → LT) - Ordered and greater: C=1, Z=0 (so HI/GT work) - Unordered (NaN involved): C=1, V=1, Z=0, N=0 Use `B.VS` to check for NaN after FCMP. **Critical NaN gotcha**: NaN is not equal to **anything**, including itself. `FCMP S0, S0` where S0=NaN sets V=1 (unordered), Z=0 (not equal). This means `B.EQ` after comparing NaN to itself is **NOT taken**. This is the standard IEEE 754 behavior and is how `isnan(x)` works: `x != x` is true only if x is NaN. In ARM assembly: `FCMP S0, S0; B.VS is_nan` (check the V flag directly). **Traced example:** ```asm // If S0 = 3.14 and S1 = 2.71: FCMP S0, S1 // 3.14 > 2.71 → flags: N=0,Z=0,C=1,V=0 B.GT greater_label // GT = (Z==0 && N==V) = (0==0 && 0==0) = true → taken ✓ B.HI greater_label // HI = (C==1 && Z==0) = true → works here, but CAUTION: // B.HI also triggers on NaN! 
Use B.GT for FP greater (NaN-safe). // If S0 = NaN: FCMP S0, S1 // NaN involved → flags: N=0,Z=0,C=1,V=1 (unordered) B.VS nan_label // VS = (V==1) = true → branch to NaN handler ✓ B.GT greater_label // GT = (Z==0 && N==V) = (0==0 && 0==1) = false → NOT taken ✓ // (NaN is not greater than anything — GT excludes NaN) B.LT less_label // LT = (N!=V) = (0!=1) = TRUE → TAKEN! // CAREFUL: B.LT IS taken for NaN! Use B.MI for "less than, not NaN" ``` **Which conditions are NaN-safe after FCMP?** This matters because NaN sets N=0,Z=0,C=1,V=1. Any condition that evaluates to true with these flags will fire on NaN: | FP comparison you want | NaN-safe (excludes NaN) | NaN-UNSAFE (includes NaN) | |---|---|---| | Greater than | `B.GT` | `B.HI` | | Greater or equal | `B.GE` | `B.HS` / `B.CS` | | Less than | `B.MI` | `B.LT` | | Less or equal | `B.LS` | `B.LE` | | Equal | `B.EQ` | — | | Not equal | — | `B.NE` (includes NaN — usually what you want) | | Unordered (is NaN?) | `B.VS` | — | **Why this works**: FCMP never sets N=1 and V=1 simultaneously — there's no "signed overflow" in FP comparison. So `B.MI` (N==1) only fires for ordered-less-than, never for NaN (which sets N=0). Similarly, `B.GT` (Z==0 && N==V) excludes NaN because NaN sets V=1 but N=0 (so N≠V). The unsigned conditions (`HI`, `HS`) are unsafe because NaN sets C=1, which is the same as "unsigned higher." ### 22.5 FP ↔ Integer Conversion These convert between integer and floating-point representations. The value is mathematically converted (not just bit-reinterpreted). For example, `SCVTF Sd, Wn` takes the signed integer in Wn and produces the nearest float in Sd. The reverse (`FCVTZS`) converts a float to an integer, rounding toward zero (truncating the fractional part, like a C cast `(int)f`). Other rounding modes are also available. 
**What these REALLY do:** ```asm // Integer → Float: // If W0 = 42 (integer): SCVTF S1, W0 // S1 = 42.0 (float representation of the integer 42) UCVTF S1, W0 // Same result for positive numbers // If W0 = -7 (signed integer): SCVTF S1, W0 // S1 = -7.0 (signed conversion, preserves negative) UCVTF S1, W0 // S1 = 4294967289.0 (unsigned! -7 as unsigned 32-bit = huge number) // Float → Integer: // If S0 = 3.7: FCVTZS W1, S0 // W1 = 3 (truncate toward zero — drops the .7) FCVTNS W1, S0 // W1 = 4 (round to nearest, .7 rounds up) FCVTMS W1, S0 // W1 = 3 (floor — round toward minus infinity) FCVTPS W1, S0 // W1 = 4 (ceiling — round toward plus infinity) // If S0 = -3.7: FCVTZS W1, S0 // W1 = -3 (truncate toward zero — NOT -4!) FCVTMS W1, S0 // W1 = -4 (floor — round toward minus infinity) ``` ```asm // Float → Signed integer (round toward zero): FCVTZS Wd|WZR, Sn // Single → signed 32-bit FCVTZS Xd|XZR, Sn // Single → signed 64-bit FCVTZS Wd|WZR, Dn // Double → signed 32-bit FCVTZS Xd|XZR, Dn // Double → signed 64-bit // Float → Unsigned integer (round toward zero): FCVTZU Wd|WZR, Sn // Single → unsigned 32-bit FCVTZU Xd|XZR, Sn // Single → unsigned 64-bit FCVTZU Wd|WZR, Dn // Double → unsigned 32-bit FCVTZU Xd|XZR, Dn // Double → unsigned 64-bit // Signed integer → Float: SCVTF Sd, Wn|WZR // Signed 32-bit → single SCVTF Sd, Xn|XZR // Signed 64-bit → single (may lose precision) SCVTF Dd, Wn|WZR // Signed 32-bit → double (lossless) SCVTF Dd, Xn|XZR // Signed 64-bit → double // Unsigned integer → Float: UCVTF Sd, Wn|WZR // Unsigned 32-bit → single UCVTF Sd, Xn|XZR // Unsigned 64-bit → single (may lose precision) UCVTF Dd, Wn|WZR // Unsigned 32-bit → double (lossless) UCVTF Dd, Xn|XZR // Unsigned 64-bit → double // Other rounding modes — each has ALL 4 width combinations (Wd/Sn, Xd/Sn, Wd/Dn, Xd/Dn): // Signed: FCVTAS Wd|WZR, Sn // Round to nearest, ties away from zero [only Wd,Sn shown — all 4 combos valid] FCVTNS Wd|WZR, Sn // Round to nearest, ties to even FCVTMS Wd|WZR, 
Sn // Round toward −∞ (floor) FCVTPS Wd|WZR, Sn // Round toward +∞ (ceiling) FCVTAS Xd|XZR, Dn // [Xd,Dn form — all combos: Wd/Sn, Xd/Sn, Wd/Dn, Xd/Dn] FCVTNS Xd|XZR, Dn FCVTMS Xd|XZR, Dn FCVTPS Xd|XZR, Dn // Unsigned: FCVTAU Wd|WZR, Sn // Round to nearest, ties away (unsigned) FCVTNU Wd|WZR, Sn // Round to nearest, ties to even (unsigned) FCVTMU Wd|WZR, Sn // Round toward −∞ (unsigned) FCVTPU Wd|WZR, Sn // Round toward +∞ (unsigned) FCVTAU Xd|XZR, Dn // [all 4 combos valid per instruction] FCVTNU Xd|XZR, Dn FCVTMU Xd|XZR, Dn FCVTPU Xd|XZR, Dn // FEAT_FP16: All conversion instructions also have Hn (half-precision) source/dest forms. ``` ### 22.6 FP ↔ GPR Moves (no conversion) `FMOV` copies raw bits between a general-purpose register and a floating-point register **without any conversion**. The bit pattern is preserved exactly. This is different from `SCVTF`/`FCVTZS` which mathematically convert the value. `FMOV Sd, #fimm` loads a floating-point constant directly, but only a limited set of 256 values are encodable. **What FMOV REALLY does vs SCVTF — critical difference:** ```asm // If W0 = 0x40400000 (which happens to be the IEEE 754 encoding of 3.0): FMOV S1, W0 // S1 = 3.0 (raw bit copy — 0x40400000 IS 3.0 in float) SCVTF S2, W0 // S2 = 1077936128.0 (treats W0 as integer 0x40400000 = 1077936128, // converts that integer to float) // These give COMPLETELY different results! FMOV preserves bits, SCVTF converts values. // Going the other direction: // If S0 = 3.0 (bit pattern 0x40400000): FMOV W1, S0 // W1 = 0x40400000 (raw bits of the float) FCVTZS W2, S0 // W2 = 3 (mathematical conversion: 3.0 → 3) ``` **When to use FMOV vs SCVTF**: Use `SCVTF` when converting between number types (int→float). Use `FMOV` when you need to manipulate the raw bits of a float (e.g., extracting the exponent, comparing float bit patterns, or passing floats through integer registers in a calling convention). 
```asm FMOV Sd, Wn|WZR // Copy bits: GPR → single FP (no conversion) FMOV Wd|WZR, Sn // Copy bits: single FP → GPR FMOV Dd, Xn|XZR // Copy bits: GPR → double FP FMOV Xd|XZR, Dn // Copy bits: double FP → GPR FMOV Vd.D[1], Xn|XZR // Copy bits: GPR → upper 64 bits of 128-bit V register FMOV Xd|XZR, Vn.D[1] // Copy bits: upper 64 bits of V register → GPR FMOV Sd, #fimm // Load FP immediate (limited set of 256 values) FMOV Dd, #fimm // Double-precision immediate (same 256 values) FMOV Hd, #fimm // Half-precision immediate (FEAT_FP16, same 256 values) ``` The FP immediate (`#fimm`) can encode values of the form: `±(1 + m/16) × 2^(n)` where 0 ≤ m ≤ 15 and -3 ≤ n ≤ 4. This gives 256 possible values. NOT arbitrary floats. Some examples: ``` // m=0, n=0: ±(1 + 0/16) × 2^0 = ±1.0 // m=0, n=1: ±(1 + 0/16) × 2^1 = ±2.0 // m=0, n=-1: ±(1 + 0/16) × 2^-1 = ±0.5 // m=8, n=0: ±(1 + 8/16) × 2^0 = ±1.5 // m=0, n=4: ±(1 + 0/16) × 2^4 = ±16.0 // m=15, n=4: ±(1 + 15/16) × 2^4 = ±31.0 // m=0, n=-3: ±(1 + 0/16) × 2^-3 = ±0.125 // You CANNOT encode 0.0, 0.1, 0.3, or π with FMOV immediate ``` ### 22.7 FP Precision Conversion These convert between different FP widths (half ↔ single ↔ double). Widening conversions (half→single, single→double) are lossless. Narrowing conversions (double→single, single→half) may lose precision and round. ```asm FCVT Dd, Sn // Single → Double (lossless, no precision lost) FCVT Sd, Dn // Double → Single (may lose precision, rounds) FCVT Hd, Sn // Single → Half (may lose precision) FCVT Sd, Hn // Half → Single (lossless) FCVT Hd, Dn // Double → Half FCVT Dd, Hn // Half → Double (lossless) ``` ### 22.8 Half-Precision (FP16) Operations **FEAT_FP16** (ARMv8.2-A and later) adds native arithmetic on 16-bit floats. Without this feature, half-precision registers (H0–H31) can only be used as a storage format — you convert to single/double to compute, then convert back. 
With FEAT_FP16, you get direct arithmetic: ```asm // Half-precision arithmetic (FEAT_FP16): FADD Hd, Hn, Hm // 16-bit float add FSUB Hd, Hn, Hm FMUL Hd, Hn, Hm FDIV Hd, Hn, Hm FSQRT Hd, Hn FMADD Hd, Hn, Hm, Ha // Fused multiply-add FABS Hd, Hn FNEG Hd, Hn FCMP Hn, Hm FCVTZS Wd|WZR, Hn // FP16 → signed 32-bit int FCVTZS Xd|XZR, Hn // FP16 → signed 64-bit int FCVTZU Wd|WZR, Hn // FP16 → unsigned 32-bit int FCVTZU Xd|XZR, Hn // FP16 → unsigned 64-bit int SCVTF Hd, Wn|WZR // Signed 32-bit int → FP16 SCVTF Hd, Xn|XZR // Signed 64-bit int → FP16 UCVTF Hd, Wn|WZR // Unsigned 32-bit int → FP16 UCVTF Hd, Xn|XZR // Unsigned 64-bit int → FP16 FMOV Hd, Wn|WZR // Copy raw bits GPR → FP16 (no conversion) FMOV Wd|WZR, Hn // Copy raw bits FP16 → GPR ``` **Why FP16 matters**: Machine learning inference uses FP16 (and even smaller formats) because neural network weights don't need full precision. FP16 gives 2× the throughput of FP32 at half the memory bandwidth, which is often the bottleneck. ARM also supports BFloat16 (BF16, via FEAT_BF16), which has the same 8-bit exponent as FP32 but only 7 mantissa bits — it trades precision for range, which works well for training. **FP16 format**: 1 sign bit, 5 exponent bits, 10 mantissa bits. Range: +/-65504, smallest normal: ~6.1e-5. The limited range means overflow to infinity is common — this is acceptable in ML but dangerous in general-purpose code. ### 22.9 FP Rounding Modes (FPCR) The `FPCR` (Floating-Point Control Register) controls the rounding mode via bits [23:22]: | FPCR.RMode | Meaning | |---|---| | 00 | Round to Nearest, ties to Even (default — IEEE 754) | | 01 | Round toward Plus Infinity (ceiling) | | 10 | Round toward Minus Infinity (floor) | | 11 | Round toward Zero (truncation) | ```asm MRS X0, FPCR // Read current FP control ORR X0, X0, #(0b01 << 22) // Set round-toward-plus-infinity MSR FPCR, X0 // Write back ``` Most code uses the default (Round to Nearest, ties to Even) and never touches FPCR. 
The `FCVTZS`/`FCVTZU` instructions always round toward zero regardless of FPCR — the "Z" in their name stands for "Zero" (the rounding mode, not the zero register). **Flush-to-Zero (FZ bit)**: FPCR bit [24]. When set, **denormalized** (subnormal) float results are flushed to zero instead of being represented as tiny non-zero values. Denormals are numbers smaller than the smallest normal float (e.g., below ~1.18e-38 for single-precision). Processing denormals is slow on many CPUs (up to 100x slower) because the hardware traps to microcode. Setting FZ=1 avoids this penalty at the cost of losing precision near zero. Most games and media applications set FZ=1; scientific code leaves it at 0 for accuracy. ```asm // Enable flush-to-zero: MRS X0, FPCR ORR X0, X0, #(1 << 24) // Set FZ bit MSR FPCR, X0 ``` **FPSR (FP Status Register)**: Records **cumulative** exception flags from FP operations — these flags are "sticky" (once set, they stay set until you clear them). Check FPSR after a sequence of FP operations to see if anything unusual happened: | FPSR bit | Flag | Meaning | |---|---|---| | [0] | IOC | Invalid Operation (0/0, sqrt of negative, NaN input) | | [1] | DZC | Division by Zero (finite ÷ 0 → ±infinity) | | [2] | OFC | Overflow (result too large for the format) | | [3] | UFC | Underflow (result too small, became denormal or zero) | | [4] | IXC | Inexact (result was rounded — extremely common, almost always set) | | [7] | IDC | Input Denormal (a denormal input was consumed) | ```asm MRS X0, FPSR // Read cumulative FP exception flags TST X0, #1 // Check IOC (Invalid Operation) B.NE had_invalid_op // Branch if any FP operation was invalid MSR FPSR, XZR // Clear all flags ``` --- ## 23. NEON / Advanced SIMD Overview NEON (also called Advanced SIMD) processes multiple data elements in parallel using a single instruction — this is SIMD (Single Instruction, Multiple Data). A 128-bit V register can hold, for example, four 32-bit integers or sixteen 8-bit bytes. 
One NEON `ADD V0.4S, V1.4S, V2.4S` adds four pairs of 32-bit integers simultaneously. **Why SIMD matters**: Scalar code processes one value per instruction. If you need to add 1000 pairs of 32-bit numbers, that's 1000 ADD instructions. With NEON `.4S`, it's 250 ADD instructions — 4× throughput from the same number of instructions. For byte-level operations (image processing, string scanning), `.16B` gives 16× throughput. This is why compilers auto-vectorize loops and why hand-written NEON dominates in codecs, crypto, and ML inference. **How lanes work**: Each V register is divided into **lanes** (also called elements). `V0.4S` means V0 is viewed as 4 lanes of 32-bit (S) values. An operation like `ADD V0.4S, V1.4S, V2.4S` adds lane 0 of V1 to lane 0 of V2 into lane 0 of V0, lane 1 to lane 1, etc. — all independently, in parallel. There is no carry or overflow between lanes. **64-bit (D) vs 128-bit (Q) operations**: The lower specifiers (`.8B`, `.4H`, `.2S`, `.1D`) operate on the lower 64 bits of the register only — the upper 64 bits of the destination are **zeroed**. The higher specifiers (`.16B`, `.8H`, `.4S`, `.2D`) use all 128 bits. Using 64-bit operations is useful when you have small amounts of data or want to avoid touching the upper half. 
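A scalar C model of `ADD V0.4S, V1.4S, V2.4S` makes the lane independence concrete: each lane wraps on its own, and no carry crosses a lane boundary (function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of ADD V0.4S, V1.4S, V2.4S: four independent 32-bit adds. */
static void add_4s(uint32_t d[4], const uint32_t a[4], const uint32_t b[4]) {
    for (int lane = 0; lane < 4; lane++)
        d[lane] = a[lane] + b[lane];   /* wraps within the lane; no carry out */
}
```

If lane 0 overflows (0xFFFFFFFF + 1 = 0), lane 1 is untouched, which is exactly what the hardware guarantees.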
**Common NEON housekeeping:** ```asm // Zero a vector register (two ways): MOVI V0.4S, #0 // Set all lanes to zero (preferred — single instruction) EOR V0.16B, V0.16B, V0.16B // XOR with self = zero (also works, sometimes preferred by compilers) // Set all lanes to a constant: MOVI V0.4S, #0xFF // All 32-bit lanes = 0xFF (only certain immediates encodable) MOVI V0.16B, #0x55 // All bytes = 0x55 // Broadcast a GPR value to all lanes: DUP V0.4S, W0 // Fill all 4 lanes with W0's value DUP V0.2D, X0 // Fill both 64-bit lanes with X0's value // Broadcast one lane to all lanes: DUP V0.4S, V1.S[2] // Fill all 4 lanes with lane 2 of V1 ``` ### 23.1 Vector Arrangement Specifiers The suffix like `.4S` or `.8B` tells the CPU how to interpret the 128-bit register: how many elements and what size each element is. | Specifier | Element size | Elements per 64-bit D | Elements per 128-bit Q | |---|---|---|---| | `.8B` / `.16B` | 8-bit | 8 | 16 | | `.4H` / `.8H` | 16-bit | 4 | 8 | | `.2S` / `.4S` | 32-bit | 2 | 4 | | `.1D` / `.2D` | 64-bit | 1 | 2 | ### 23.2 Common NEON Instructions ```asm // Vector add/sub ADD V0.4S, V1.4S, V2.4S // 4× 32-bit integer add SUB V0.4S, V1.4S, V2.4S // 4× 32-bit integer subtract FADD V0.4S, V1.4S, V2.4S // 4× single-precision FP add FSUB V0.4S, V1.4S, V2.4S // 4× single-precision FP subtract // Vector multiply MUL V0.4S, V1.4S, V2.4S // 4× 32-bit integer multiply (low 32 bits of each product) FMUL V0.4S, V1.4S, V2.4S // 4× FP multiply // Vector fused multiply-accumulate (CRITICAL for performance — used everywhere by compilers): FMLA V0.4S, V1.4S, V2.4S // V0 += V1 * V2 (per-lane, fused — V0 accumulates) FMLS V0.4S, V1.4S, V2.4S // V0 -= V1 * V2 (per-lane, fused subtract) // FMLA is the single most important NEON instruction for numerical code — matrix multiply, // convolution, FIR filters, and physics all reduce to FMLA loops. 
// Widening operations (narrow inputs → wider outputs, no overflow possible) // The "L" stands for "Long" — the result is longer than the inputs. SMULL V0.4S, V1.4H, V2.4H // 4× signed 16→32 multiply (lower 4 lanes of input) SMULL2 V0.4S, V1.8H, V2.8H // Same but upper 4 lanes of input // (the "2" suffix means "use the upper half of the source registers") UADDL V0.4S, V1.4H, V2.4H // 4× unsigned 16→32 add (result can't overflow because it's wider) // Narrowing (wider inputs → narrow outputs, may lose data) // The "N" stands for "Narrow" — the result is narrower than the inputs. XTN V0.4H, V1.4S // Extract narrow: take lower 16 bits of each 32-bit lane (truncate) SQXTN V0.4H, V1.4S // Saturating narrow (signed): clamp each 32-bit value to INT16 range // before truncating — prevents silent wraparound // Reduction (collapse all lanes into a single scalar) ADDV Sd, V0.4S // Sum all 4 lanes into one 32-bit scalar SADDLV Dd, V0.4S // Widening sum: sum 4× 32-bit lanes into one 64-bit scalar // (prevents overflow — useful when summing large values) SMAXV Sd, V0.4S // Signed maximum across all 4 lanes → single scalar SMINV Sd, V0.4S // Signed minimum across all 4 lanes → single scalar // UMAXV/UMINV also exist for unsigned. No ADDV for .2D (only .8B/.16B/.4H/.8H/.4S). 
// Compare (result is a bitmask: all-ones if true, all-zeros if false) CMEQ V0.4S, V1.4S, V2.4S // Per-lane: 0xFFFFFFFF if equal, 0 otherwise CMGT V0.4S, V1.4S, V2.4S // Per-lane: all-ones if V1[i] > V2[i] signed CMHI V0.4S, V1.4S, V2.4S // Per-lane: all-ones if V1[i] > V2[i] unsigned (Higher) CMEQ V0.4S, V1.4S, #0 // Per-lane: compare against zero // Also: CMGE (>=), CMHS (unsigned >=), CMLE (<=), CMLT (<), CMTST (any bits in common) ``` ```asm // Table lookup (byte-level permutation — like x86 PSHUFB) TBL V0.16B, {V1.16B}, V2.16B // V0[i] = V1[V2[i]], or 0 if V2[i] >= 16 // Zip/unzip (interleave/deinterleave) ZIP1 V0.4S, V1.4S, V2.4S // Interleave lower halves: V1[0],V2[0],V1[1],V2[1] ZIP2 V0.4S, V1.4S, V2.4S // Interleave upper halves: V1[2],V2[2],V1[3],V2[3] UZP1 V0.4S, V1.4S, V2.4S // Even elements: V1[0],V1[2],V2[0],V2[2] UZP2 V0.4S, V1.4S, V2.4S // Odd elements: V1[1],V1[3],V2[1],V2[3] // Transpose (reorganize rows/columns — used in matrix operations) TRN1 V0.4S, V1.4S, V2.4S // Transpose even: V1[0],V2[0],V1[2],V2[2] TRN2 V0.4S, V1.4S, V2.4S // Transpose odd: V1[1],V2[1],V1[3],V2[3] // Extract (byte-level sliding window across two registers) EXT V0.16B, V1.16B, V2.16B, #4 // Slide: V0 = bytes [4..15] of V1 concat [0..3] of V2 // Insert/extract element (move between scalar and lane) INS V0.S[2], W0 // Insert GPR value into lane 2 of V0 UMOV W0, V0.S[2] // Extract lane 2 of V0 into GPR (zero-extended) SMOV X0, V0.H[3] // Extract lane 3 as signed halfword, sign-extend to 64-bit // (SMOV always sign-extends; UMOV always zero-extends) DUP V0.4S, W0 // Broadcast: fill all 4 lanes with the value in W0 DUP V0.4S, V1.S[0] // Broadcast: fill all 4 lanes with lane 0 of V1 // Shift (per-lane) SHL V0.4S, V1.4S, #5 // Shift left each lane by 5 USHR V0.4S, V1.4S, #5 // Unsigned shift right each lane by 5 SSHR V0.4S, V1.4S, #5 // Signed (arithmetic) shift right each lane by 5 USHL V0.4S, V1.4S, V2.4S // Shift left by amount in V2 (per-lane, variable) ``` ```asm // Bitwise 
(per-lane or per-bit — same thing for bitwise ops) AND V0.16B, V1.16B, V2.16B // Bitwise AND (always .16B for bitwise — the element size // doesn't matter since it's bit-by-bit) ORR V0.16B, V1.16B, V2.16B // Bitwise OR EOR V0.16B, V1.16B, V2.16B // Bitwise XOR NOT V0.16B, V1.16B // Bitwise NOT (invert all bits) // Bitwise select family — the SIMD equivalent of CSEL: // These use a mask to select bits from two sources. Combined with CMEQ/CMGT // (which produce all-ones/all-zeros masks), they give branchless per-lane selection. BSL V0.16B, V1.16B, V2.16B // Bitwise select: where V0 has 1, take from V1; where 0, take from V2 // V0 = (V1 & V0_original) | (V2 & ~V0_original) BIT V0.16B, V1.16B, V2.16B // Bitwise insert if true: where V2 has 1, take from V1 into V0 // V0 = (V1 & V2) | (V0_original & ~V2) BIF V0.16B, V1.16B, V2.16B // Bitwise insert if false: where V2 has 0, take from V1 into V0 // V0 = (V0_original & V2) | (V1 & ~V2) ``` **Why compare results are all-ones / all-zeros** (not 1/0): The result is a bitmask meant to be used directly with bitwise select. `CMEQ` + `BSL` gives you a branchless per-lane conditional select — the all-ones mask selects from V1, all-zeros selects from V2. This is the SIMD equivalent of CSEL. ### 23.3 Saturating Arithmetic NEON has saturating versions of most arithmetic — when a result overflows, it clamps to the maximum (or minimum) representable value instead of wrapping. Scalar AArch64 does NOT have this (you must build it from CMP+CSEL), which is why NEON saturating ops are so valuable. ```asm SQADD V0.4S, V1.4S, V2.4S // Signed saturating add: each lane clamps to [INT32_MIN, INT32_MAX] UQADD V0.8H, V1.8H, V2.8H // Unsigned saturating add: each lane clamps to [0, UINT16_MAX] SQSUB V0.4S, V1.4S, V2.4S // Signed saturating subtract UQSUB V0.16B, V1.16B, V2.16B // Unsigned saturating subtract: clamp to 0 (never wraps negative) ``` `SQ` prefix = signed saturating, `UQ` prefix = unsigned saturating. 
These work with all element sizes (.8B, .4H, .2S, etc.). There are also saturating versions of shifts (`SQSHL`, `UQSHL`), narrowing operations (`SQXTN` — saturating narrow: each element is clamped to the target range before truncating), and high-half multiplies (`SQRDMULH` — saturating rounding doubling multiply returning the high half, used heavily in fixed-point DSP).

**Why saturating arithmetic**: Audio/image processing needs it constantly. If you add two pixel values (0-255) and the result is 300, you want 255 (clamp), not 44 (wrap). Without saturation, every pixel operation would need a clamp sequence. `UQADD` does it in one instruction for 16 bytes at once.

### 23.4 NEON Load/Store

```asm
LD1 {V0.4S}, [X0]                        // Load 1 register (16 bytes)
LD1 {V0.4S, V1.4S}, [X0]                 // Load 2 registers (32 bytes, consecutive in memory)
LD1 {V0.4S, V1.4S, V2.4S}, [X0]          // Load 3
LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X0]   // Load 4

// Structure loads (automatic deinterleaving):
LD2 {V0.4S, V1.4S}, [X0]                 // Load 8 words, deinterleave: V0={w0,w2,w4,w6}, V1={w1,w3,w5,w7}
LD3 {V0.4S, V1.4S, V2.4S}, [X0]          // Deinterleave 3 streams (e.g., RGB pixels)
LD4 ...                                  // Deinterleave 4 streams (e.g., RGBA pixels)

// Single lane:
LD1 {V0.S}[2], [X0]                      // Load one 32-bit element into lane 2, other lanes unchanged
ST1 {V0.S}[2], [X0]                      // Store single lane

// Post-index (advance pointer after load):
LD1 {V0.4S}, [X0], #16                   // Load, then X0 += 16 (16 = sizeof(V0.4S))
LD1 {V0.4S}, [X0], X1                    // Load, then X0 += X1
```

**Why LD2/LD3/LD4 exist**: Real-world data is often interleaved — RGB pixels are stored as R,G,B,R,G,B,... in memory. Without LD3, you'd load all the data, then spend many instructions shuffling R values into one register, G into another, B into another. LD3 does this deinterleaving in hardware during the load, which is dramatically faster. ST2/ST3/ST4 do the reverse (interleave on store).
### 23.5 Practical NEON Examples **Sum an array of 32-bit integers:** ```asm // X0 = array pointer, X1 = count (multiple of 4 for simplicity) MOVI V0.4S, #0 // Accumulator = {0, 0, 0, 0} loop: LD1 {V1.4S}, [X0], #16 // Load 4 ints, advance pointer ADD V0.4S, V0.4S, V1.4S // Add to accumulator (4 adds in parallel) SUBS X1, X1, #4 // count -= 4 B.GT loop // Horizontal reduction: sum the 4 lanes ADDV S0, V0.4S // S0 = V0[0] + V0[1] + V0[2] + V0[3] UMOV W0, V0.S[0] // Move scalar result to GPR ``` **Byte-level: count occurrences of a byte in a buffer:** ```asm // X0 = buffer, X1 = length (multiple of 16), W2 = byte to find // Result in W0 DUP V1.16B, W2 // Broadcast search byte to all 16 lanes MOVI V2.16B, #0 // Accumulator (byte lanes, max 255 iterations before overflow) loop: LD1 {V0.16B}, [X0], #16 // Load 16 bytes, advance pointer CMEQ V3.16B, V0.16B, V1.16B // Compare: 0xFF where match, 0x00 where not // 0xFF = -1 signed. Subtracting -1 from accumulator = adding 1: SUB V2.16B, V2.16B, V3.16B // Accumulator += 1 for each matching byte SUBS X1, X1, #16 B.GT loop // Horizontal sum: add all 16 byte lanes into one scalar UADDLV H0, V2.16B // Widening sum: 16 bytes → one 16-bit result UMOV W0, V0.H[0] // Move to GPR ``` Note: the byte accumulator overflows after 255 matching bytes per lane. For large buffers, periodically drain with UADDLV into a wider accumulator, or use 16-bit lanes from the start. **NEON memcpy (64 bytes per iteration):** ```asm // X0 = dst, X1 = src, X2 = byte count (multiple of 64) loop: LDP Q0, Q1, [X1] // Load 32 bytes LDP Q2, Q3, [X1, #32] // Load next 32 bytes STP Q0, Q1, [X0] // Store 32 bytes STP Q2, Q3, [X0, #32] // Store next 32 bytes ADD X0, X0, #64 ADD X1, X1, #64 SUBS X2, X2, #64 B.GT loop ``` This copies 64 bytes per iteration using LDP/STP with Q (128-bit) registers, which is how optimized `memcpy` implementations work on ARM. --- ## 24. 
Atomic & Synchronization Instructions In a multi-core system, two CPU cores might try to modify the same memory location at the same time. Atomic instructions guarantee that a read-modify-write sequence happens as one indivisible operation — no other core can see a half-finished update. These are the building blocks for locks, lock-free data structures, and reference counting. ### 24.1 ARMv8.1 Atomics (LSE — Large System Extensions) LSE adds single-instruction atomics that are faster than the older LDXR/STXR loop approach. Each instruction reads the old value, performs an operation (add, OR, swap, etc.), and writes the new value — all atomically. The suffix `A` means acquire ordering, `L` means release ordering, `AL` means both. **Why LSE is faster than LDXR/STXR**: The exclusive loop must retry if another core touches the cache line. Under high contention (many cores competing for the same lock), retries waste cycles. LSE atomics are handled by the cache coherency hardware itself — the cache controller performs the read-modify-write without the retry loop, reducing bus traffic and latency. These require the LSE feature (check `ID_AA64ISAR0_EL1.Atomic`): ```asm // Atomic add: mem[Xn] += Xs, old value returned in Xt LDADD Xs|XZR, Xt|XZR, [Xn|SP] // Load old, add, store new (relaxed — no ordering) LDADDA Xs|XZR, Xt|XZR, [Xn|SP] // Acquire semantics (ordered after this load) LDADDL Xs|XZR, Xt|XZR, [Xn|SP] // Release semantics (ordered before this store) LDADDAL Xs|XZR, Xt|XZR, [Xn|SP] // Acquire + Release (full barrier for this operation) // Similarly for other ops: LDCLR Xs|XZR, Xt|XZR, [Xn|SP] // Atomic AND-NOT: mem[Xn] &= ~Xs (clear bits marked by Xs) LDSET Xs|XZR, Xt|XZR, [Xn|SP] // Atomic OR: mem[Xn] |= Xs (set bits marked by Xs) LDEOR Xs|XZR, Xt|XZR, [Xn|SP] // Atomic XOR: mem[Xn] ^= Xs (toggle bits) // Each has A/L/AL variants (LDCLRA, LDCLRL, LDCLRAL, etc.) 
// All have 32-bit forms: LDADD Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit atomic add (relaxed) LDADDA Ws|WZR, Wt|WZR, [Xn|SP] // + acquire LDADDL Ws|WZR, Wt|WZR, [Xn|SP] // + release LDADDAL Ws|WZR, Wt|WZR, [Xn|SP] // + acquire+release LDCLR Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit atomic AND-NOT (+ A/L/AL variants) LDSET Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit atomic OR (+ A/L/AL variants) LDEOR Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit atomic XOR (+ A/L/AL variants) // All return the OLD value in Xt. Xs is the operand, [Xn|SP] is the memory address. // If you don't need the old value, use ST variants (no return register, slightly faster): STADD Xs|XZR, [Xn|SP] // Atomic add, don't return old value (fire-and-forget) STADDL Xs|XZR, [Xn|SP] // + release ordering STSET Xs|XZR, [Xn|SP] // Atomic OR, don't return old value STSETL Xs|XZR, [Xn|SP] // + release STCLR Xs|XZR, [Xn|SP] // Atomic AND-NOT, don't return old value STCLRL Xs|XZR, [Xn|SP] // + release STEOR Xs|XZR, [Xn|SP] // Atomic XOR, don't return old value STEORL Xs|XZR, [Xn|SP] // + release // All have 32-bit forms (Ws|WZR) and byte/halfword forms (STADDB/STADDLB/STADDH/STADDLH etc.) 
// Compare-and-swap: CAS Xs|XZR, Xt|XZR, [Xn|SP] // If [Xn]==Xs, store Xt; Xs = old value either way CASA Xs|XZR, Xt|XZR, [Xn|SP] // + acquire CASL Xs|XZR, Xt|XZR, [Xn|SP] // + release CASAL Xs|XZR, Xt|XZR, [Xn|SP] // + acquire+release CAS Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit CAS CASA Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit + acquire CASL Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit + release CASAL Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit + acquire+release // Swap: SWP Xs|XZR, Xt|XZR, [Xn|SP] // Xt = old [Xn], [Xn] = Xs (unconditional swap) SWPA Xs|XZR, Xt|XZR, [Xn|SP] // + acquire SWPL Xs|XZR, Xt|XZR, [Xn|SP] // + release SWPAL Xs|XZR, Xt|XZR, [Xn|SP] // + acquire+release SWP Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit swap SWPA Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit + acquire SWPL Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit + release SWPAL Ws|WZR, Wt|WZR, [Xn|SP] // 32-bit + acquire+release // Compare-and-swap pair (128-bit atomic): // NOTE: Xs and Xt must be EVEN-numbered registers (the pair is Xs:X(s+1) and Xt:X(t+1)). // Register 31 (XZR) CANNOT be the first register of a pair (31 is odd). // However, X30 is valid as the first, making X31=XZR the implicit second of the pair. CASP Xs, X(s+1), Xt, X(t+1), [Xn|SP] // 128-bit CAS (Xs:X(s+1) = expected, Xt:X(t+1) = desired) CASPA Xs, X(s+1), Xt, X(t+1), [Xn|SP] // + acquire CASPL Xs, X(s+1), Xt, X(t+1), [Xn|SP] // + release CASPAL Xs, X(s+1), Xt, X(t+1), [Xn|SP] // + acquire+release // Xs,Xt ∈ {X0,X2,X4,...,X28,X30}. When Xs=X30, X(s+1)=XZR (reads as 0, writes discarded). ``` **Alignment**: All LSE atomics require natural alignment — 4-byte alignment for W operations, 8-byte for X, 16-byte for CASP. 
**Byte and halfword atomics**: Every LSE instruction also has B (byte) and H (halfword) forms: ```asm LDADDB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte add: mem[Xn] += Ws (8-bit), old byte in Wt LDADDH Ws|WZR, Wt|WZR, [Xn|SP] // Atomic halfword add: mem[Xn] += Ws (16-bit) LDADDAB Ws|WZR, Wt|WZR, [Xn|SP] // Byte add with acquire SWPB Ws|WZR, Wt|WZR, [Xn|SP] // Atomic byte swap CASB Ws|WZR, Wt|WZR, [Xn|SP] // Byte compare-and-swap // Similarly: LDCLRB, LDSETB, LDEORB, STADDB, and all ordering variants // Similarly: LDCLRH, LDSETH, LDEORH, STADDH, and all ordering variants ``` These are essential for lock bytes, flag bytes, and any sub-word atomic field. ### 24.2 Load-Acquire / Store-Release In a multi-core system, memory operations can appear to happen in a different order than you wrote them (due to CPU reordering for performance). `LDAR` (Load-Acquire) and `STLR` (Store-Release) enforce ordering. **Why CPUs reorder memory**: Modern CPUs have store buffers, write-combining buffers, and out-of-order pipelines. A store might sit in a buffer while later loads execute. This is invisible to single-threaded code but breaks multi-threaded algorithms that depend on the order of writes being visible to other cores. Acquire/release ordering is how you tell the CPU "this ordering matters." 
```asm LDAR Xt|XZR, [Xn|SP] // Load-acquire 64-bit LDAR Wt|WZR, [Xn|SP] // Load-acquire 32-bit STLR Xt|XZR, [Xn|SP] // Store-release 64-bit STLR Wt|WZR, [Xn|SP] // Store-release 32-bit // Byte and halfword variants (for lock bytes, flag bytes, etc.): LDARB Wt|WZR, [Xn|SP] // Load-acquire byte LDARH Wt|WZR, [Xn|SP] // Load-acquire halfword STLRB Wt|WZR, [Xn|SP] // Store-release byte STLRH Wt|WZR, [Xn|SP] // Store-release halfword LDAXR Xt|XZR, [Xn|SP] // Load-acquire exclusive 64-bit (LDAR + LDXR combined) LDAXR Wt|WZR, [Xn|SP] // Load-acquire exclusive 32-bit STLXR Ws|WZR, Xt|XZR, [Xn|SP] // Store-release exclusive 64-bit (STLR + STXR combined) STLXR Ws|WZR, Wt|WZR, [Xn|SP] // Store-release exclusive 32-bit // Byte and halfword exclusive: LDAXRB Wt|WZR, [Xn|SP] // Load-acquire exclusive byte LDAXRH Wt|WZR, [Xn|SP] // Load-acquire exclusive halfword STLXRB Ws|WZR, Wt|WZR, [Xn|SP] // Store-release exclusive byte STLXRH Ws|WZR, Wt|WZR, [Xn|SP] // Store-release exclusive halfword ``` **Alignment**: `LDAR`/`STLR` require natural alignment, same as LDXR/STXR. These implement the C11/C++11 memory ordering model: - `LDAR` ≈ `memory_order_acquire` load - `STLR` ≈ `memory_order_release` store On ARMv8, `STLR` has a stronger guarantee than plain release: all `STLR` stores are ordered before any subsequent `LDAR` loads, even to unrelated addresses. This specific STLR→LDAR ordering (called RCsc — Release Consistency with sequential consistency for special operations) is what allows compilers to map `memory_order_seq_cst` loads to `LDAR` and seq_cst stores to `STLR`. This works because of ARM's specific hardware guarantee, NOT because acquire + release equals seq_cst in general — in the abstract C++ memory model, they do not. ### 24.3 LDAPR — Load-Acquire RCpc (Weaker Acquire) ARMv8.3-A adds `LDAPR` (Load-Acquire, Processor Consistent), which is a **weaker** acquire than `LDAR`. 
The difference: `LDAR` is RCsc (sequential consistency for special ops — it orders with respect to all prior `STLR`s). `LDAPR` is RCpc (processor consistency — it only orders with respect to `STLR` to the **same address**). ```asm LDAPR Xt|XZR, [Xn|SP] // Load-acquire RCpc 64-bit (weaker than LDAR) LDAPR Wt|WZR, [Xn|SP] // Load-acquire RCpc 32-bit LDAPRB Wt|WZR, [Xn|SP] // Byte version LDAPRH Wt|WZR, [Xn|SP] // Halfword version ``` **Why LDAPR exists**: `LDAR` is stronger than what C++ `memory_order_acquire` actually requires. C++ acquire only needs ordering with respect to the matching release on the same variable, not all releases everywhere. `LDAPR` gives exactly this weaker guarantee, which is cheaper on hardware. Compilers targeting ARMv8.3+ can map `memory_order_acquire` to `LDAPR` instead of `LDAR`, improving performance. `memory_order_seq_cst` still requires `LDAR`. ### 24.4 Mutex / Spinlock Patterns **Simple spinlock (using LDXR/STXR with proper acquire/release):** ```asm // Lock: X0 = address of lock word (0 = unlocked, 1 = locked) lock: MOV W3, #1 spin: LDAXR W1, [X0] // Load-acquire exclusive (see latest value + acquire ordering) CBNZ W1, wait // If locked, go to wait loop STXR W2, W3, [X0] // Try to store 1 (lock it) CBNZ W2, spin // If exclusive failed, retry from LDAXR RET // Lock acquired — acquire ordering ensures all reads/writes // in the critical section see data from before the lock wait: // Spin without exclusive — reduces bus traffic (no cache-line bouncing) LDR W1, [X0] // Plain load (no exclusive monitor overhead) CBNZ W1, wait // Still locked? 
Keep waiting B spin // Unlocked — try to acquire // Unlock: just store 0 with release ordering unlock: STLR WZR, [X0] // Store-release: all critical section writes complete // before the lock appears unlocked to other cores RET ``` **Why LDAXR in the lock, STLR in the unlock**: The lock acquire needs acquire semantics so that everything read inside the critical section sees data published before the previous `STLR` unlock. The unlock needs release semantics so all writes inside the critical section are visible before the lock appears free. This is the classic acquire/release pair for mutual exclusion. **Why the WFE spin loop (from §19.2) is better**: The `wait` loop above burns CPU cycles. The version with `WFE` puts the core in a low-power state until another core sends an event (the unlock path should use `SEV` after `STLR` to wake waiters). **LSE-based lock (faster under contention):** ```asm lock_lse: MOV W1, #1 SWPA W1, W1, [X0] // Atomic swap with acquire: W1 = old value, [X0] = 1 CBNZ W1, lock_lse // If old value was 1 (locked), retry RET // Lock acquired unlock_lse: STLR WZR, [X0] // Store-release zero RET ``` --- ## 25. Memory Barriers & Ordering Memory barriers (also called fences) are instructions that enforce ordering of memory operations. They don't access memory themselves — they constrain the order in which surrounding loads and stores become visible. This matters on multi-core systems where each core has its own cache and memory operations can be reordered. ### 25.1 Barrier Instructions `DMB` (Data Memory Barrier): Ensures that all memory accesses before the barrier are visible before any memory accesses after it. Does NOT wait for them to complete — just orders them. `DSB` (Data Synchronization Barrier): Stronger than DMB — it waits for all preceding memory accesses to actually complete before any instruction after the barrier executes. 
`ISB` (Instruction Synchronization Barrier): Flushes the CPU pipeline, ensuring all subsequent instructions are fetched fresh. Needed after modifying page tables, writing self-modifying code, or changing system registers that affect instruction execution. ```asm DMB option // Data Memory Barrier DSB option // Data Synchronization Barrier ISB // Instruction Synchronization Barrier ``` Options (shareability domain + access types). The **domain** controls which CPUs/devices the barrier applies to: Inner Shareable (ISH) covers all cores that share the same cache — this is what you almost always want for multi-threaded code. Outer Shareable (OSH) extends to GPUs and DMA (Direct Memory Access) devices. Full System (SY/LD/ST) covers everything. Non-shareable (NSH) is for single-core scenarios. | Option | Meaning | |---|---| | `OSHLD` | Outer Shareable, loads only | | `OSHST` | Outer Shareable, stores only | | `OSH` | Outer Shareable, all | | `NSHLD` | Non-shareable, loads only | | `NSHST` | Non-shareable, stores only | | `NSH` | Non-shareable, all | | `ISHLD` | Inner Shareable, loads only | | `ISHST` | Inner Shareable, stores only | | `ISH` | Inner Shareable, all (most common) | | `LD` | Full system, loads only | | `ST` | Full system, stores only | | `SY` | Full system, all (strongest) | ```asm DMB ISH // All loads/stores before this complete before any after (inner shareable) DMB ISHST // All stores before complete before stores after DSB SY // Full system sync: nothing crosses this barrier, waits for completion ISB // Flush pipeline, re-fetch instructions (needed after modifying code/page tables) ``` **DMB vs DSB**: DMB only orders memory accesses relative to each other. DSB waits for all preceding memory accesses to actually complete before continuing. DSB is stronger and slower. ISB additionally flushes the pipeline. ### 25.2 The ARM Memory Model ARM uses a **weakly ordered** memory model. 
This means the CPU is allowed to reorder memory accesses for performance, as long as the reordering is invisible to the current core's own execution. Other cores, however, may see the reordered result. **What reorderings ARM allows** (observable by other cores): - **Load-Load**: A later load can complete before an earlier load. (Rare in practice on most ARMs, but architecturally allowed.) - **Load-Store**: A later store can complete before an earlier load. - **Store-Load**: A later load can complete before an earlier store. (This is the most common and impactful reordering.) - **Store-Store**: A later store can become visible before an earlier store. **What ARM does NOT reorder**: - **Data-dependent loads**: If load B's address depends on the value loaded by load A, then B always sees A's result. This is called "address dependency ordering" and it's guaranteed by ARM hardware. Example: `LDR X1, [X0]; LDR X2, [X1]` — the second load always uses the value from the first, even without barriers. - **Overlapping accesses**: Loads and stores to the same address from the same core always appear in program order to that core. **Why this matters**: On x86, the memory model is much stronger (Total Store Order — stores are never reordered with each other). Code that works on x86 by accident may break on ARM because ARM's weaker model exposes more reorderings. This is why correct multi-threaded code must use acquire/release or barriers. **The one-page summary**: Use `LDAR`/`STLR` for synchronization variables (locks, flags, message passing). Use `DMB ISH` when you need a full fence. Don't use barriers for single-threaded code — they're expensive and unnecessary. When in doubt, use C11 atomics and let the compiler figure it out. 
**Concrete example — message passing race**: ```asm // Core 1 (producer): // Core 2 (consumer): // W7 = 1 (preloaded) STR X1, [X3] // Write data loop: STR W7, [X4] // Set flag = 1 LDR W5, [X4] // Read flag CBZ W5, loop // Wait for flag LDR X6, [X3] // Read data — MAY SEE STALE DATA! ``` The bug: ARM can reorder the two stores on Core 1, so Core 2 sees flag=1 before the data is written. Fix: use `STLR` for the flag (release) and `LDAR` for reading it (acquire): ```asm // Core 1 (fixed): // Core 2 (fixed): STR X1, [X3] // Write data loop: STLR W7, [X4] // Release-store flag LDAR W5, [X4] // Acquire-load flag CBZ W5, loop LDR X6, [X3] // Guaranteed to see the data ``` The `STLR` ensures the data write is visible before the flag. The `LDAR` ensures the data read happens after the flag is seen. --- ## 26. Pseudo-instructions & Assembler Directives Pseudo-instructions are things you write in assembly source that don't map to a single hardware instruction — the assembler translates them into one or more real instructions. Directives (starting with `.`) control the assembler itself — section placement, alignment, data emission — rather than generating instructions. **Why pseudo-instructions exist**: The ISA has strict encoding constraints (fixed 32-bit instructions, limited immediate ranges). Pseudo-instructions like `LDR X0, =constant` and `MOV X0, #large_value` hide this complexity — you write what you mean, and the assembler figures out the best encoding. Without them, you'd need to manually decompose every large constant into MOVZ/MOVK sequences. 

### 26.1 Common Pseudo-instructions (GNU as)

```asm
LDR X0, =0x12345678    // Load arbitrary constant (assembler picks best encoding or literal pool)
ADR X0, label          // (real instruction, but often used like a pseudo-instruction)
ADRP X0, label         // (real instruction)
MOV X0, #large_const   // Assembler picks MOVZ/MOVN/MOVK/ORR as needed
```

### 26.2 GNU Assembler Directives

```asm
.text                      // Code section
.data                      // Data section
.bss                       // Uninitialized data section
.section .rodata           // Read-only data (GAS has no bare .rodata directive)
.global main               // Make symbol globally visible
.type main, %function      // Symbol is a function
.size main, .-main         // Size = current address minus start of main
.align 4                   // Align to 2^4 = 16 bytes (power-of-two on AArch64 ELF)
.balign 16                 // Align to 16 bytes (explicit)
.p2align 4                 // Align to 2^4 = 16 bytes (power of 2)
.byte 0x42                 // Emit 1 byte
.hword 0x1234              // Emit 2 bytes (halfword)
.word 0x12345678           // Emit 4 bytes
.dword 0x123456789ABCDEF0  // Emit 8 bytes (AKA .quad or .xword)
.ascii "Hello"             // String without null terminator (just raw bytes)
.asciz "Hello"             // String WITH null terminator (a 0x00 byte at the end)
                           // Also called "null-terminated" or "C string". Same as .string
.equ BUFFER_SIZE, 1024     // Define constant
.set MY_CONST, 42          // Same as .equ
.macro my_push reg         // Define macro
    STR \reg, [SP, #-16]!
.endm
.if CONDITION              // Conditional assembly
.else
.endif
.include "other.s"         // Include file
.section .note.GNU-stack,"",@progbits  // Mark stack as non-executable
```

### 26.3 Relocation Operators

Used with ADRP/ADD/LDR to reference symbols:

```asm
ADRP X0, symbol                 // Page address of symbol
ADD X0, X0, :lo12:symbol        // Low 12 bits (page offset)

// GOT (Global Offset Table) access — used for shared library symbols whose
// address isn't known until the dynamic linker resolves them at runtime:
ADRP X0, :got:symbol
LDR X0, [X0, :got_lo12:symbol]

// Thread-local storage (TLS) — for variables that have a separate copy per thread
// (like C's _Thread_local or __thread).
The runtime provides a descriptor function: ADRP X0, :tlsdesc:symbol LDR X1, [X0, :tlsdesc_lo12:symbol] ADD X0, X0, :tlsdesc_lo12:symbol BLR X1 ``` ### 26.4 Practical Tools & Workflow **Assembling and linking:** ```bash # Assemble a .s file to object file: aarch64-linux-gnu-as -o program.o program.s # Link to executable: aarch64-linux-gnu-ld -o program program.o # Or combine with GCC (handles C runtime startup): aarch64-linux-gnu-gcc -o program program.s # Cross-compile C to assembly (to study compiler output): aarch64-linux-gnu-gcc -S -O2 -o output.s input.c ``` **Disassembly (reading compiled binaries):** ```bash # Disassemble an ELF (Executable and Linkable Format) binary: aarch64-linux-gnu-objdump -d program # With source interleaving (if compiled with -g): aarch64-linux-gnu-objdump -dS program # LLVM disassembler (often better formatting): llvm-objdump -d program # Disassemble a single function: aarch64-linux-gnu-objdump -d program | sed -n '/<my_function>:/,/^$/p' ``` **Testing on x86 with emulation:** ```bash # Run AArch64 binary on x86 using QEMU user-mode emulation: qemu-aarch64 ./program # Or with a specific library path: qemu-aarch64 -L /usr/aarch64-linux-gnu ./program ``` ### 26.5 Volatile and Compiler Barriers In C/C++, `volatile` tells the compiler "don't optimize away or reorder this memory access." In assembly, there's no `volatile` keyword — every load and store you write is exactly what the CPU executes. But when writing **inline assembly** in C, you need to understand how `volatile` maps: ```c // C volatile load → compiler emits a plain LDR (no optimization, no reordering by compiler) volatile int *ptr = ...; int val = *ptr; // Compiler MUST emit: LDR Wn, [Xptr] // It cannot cache the value, combine with other loads, or skip it. // For HARDWARE memory ordering (multi-core visibility), volatile is NOT enough. 
// You need atomic operations or explicit barriers: __atomic_load_n(ptr, __ATOMIC_ACQUIRE); // → LDAR __atomic_store_n(ptr, val, __ATOMIC_RELEASE); // → STLR ``` **Key distinction**: `volatile` prevents the **compiler** from reordering. Memory barriers (`DMB`, `LDAR`/`STLR`) prevent the **CPU** from reordering. For single-core memory-mapped I/O, `volatile` is sufficient. For multi-core synchronization, you need both. --- ## 27. Instruction Aliases — The Master Table This is the comprehensive list of "instructions" that are actually aliases for other instructions. Both 64-bit and 32-bit forms are shown for every alias. Register 31 alternatives (`|XZR`, `|WZR`, `|SP`, `|WSP`) are shown per the encoding rules — see the table in §1.2. | Alias | Real instruction | Notes | |---|---|---| | **Move** | | | | `MOV Xd\|XZR, Xm\|XZR` | `ORR Xd\|XZR, XZR, Xm\|XZR` | Reg-to-reg (shifted-reg encoding) | | `MOV Wd\|WZR, Wm\|WZR` | `ORR Wd\|WZR, WZR, Wm\|WZR` | 32-bit (zeroes upper 32) | | `MOV Xd\|SP, SP` | `ADD Xd\|SP, SP, #0` | From SP (immediate encoding; reg 31 = SP in Rd) | | `MOV SP, Xn\|SP` | `ADD SP, Xn\|SP, #0` | To SP (immediate encoding; reg 31 = SP in Rn) | | `MOV Xd\|XZR, #imm` | `MOVZ Xd\|XZR, #imm{, LSL #s}` | 16-bit imm fits (s=0/16/32/48) | | `MOV Wd\|WZR, #imm` | `MOVZ Wd\|WZR, #imm{, LSL #s}` | 16-bit imm fits (s=0/16 only) | | `MOV Xd\|XZR, #imm` | `MOVN Xd\|XZR, #~imm{, LSL #s}` | Inverted fits in 16 bits | | `MOV Wd\|WZR, #imm` | `MOVN Wd\|WZR, #~imm{, LSL #s}` | Inverted fits (32-bit NOT, s=0/16) | | `MOV Xd\|SP, #imm` | `ORR Xd\|SP, XZR, #bitmask_imm` | Bitmask immediate (Rd=SP!) | | `MOV Wd\|WSP, #imm` | `ORR Wd\|WSP, WZR, #bitmask_imm` | Bitmask imm (32-bit, Rd=WSP!) 
| | `MVN Xd\|XZR, Xm\|XZR{, LSL\|LSR\|ASR\|ROR #0-63}` | `ORN Xd\|XZR, XZR, Xm\|XZR{, LSL\|LSR\|ASR\|ROR #0-63}` | Bitwise NOT | | `MVN Wd\|WZR, Wm\|WZR{, LSL\|LSR\|ASR\|ROR #0-31}` | `ORN Wd\|WZR, WZR, Wm\|WZR{, LSL\|LSR\|ASR\|ROR #0-31}` | 32-bit | | **Negate** | | | | `NEG Xd\|XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | `SUB Xd\|XZR, XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | Negate | | `NEG Wd\|WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | `SUB Wd\|WZR, WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | 32-bit | | `NEGS Xd\|XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | `SUBS Xd\|XZR, XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | Negate + flags | | `NEGS Wd\|WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | `SUBS Wd\|WZR, WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | 32-bit | | `NGC Xd\|XZR, Xm\|XZR` | `SBC Xd\|XZR, XZR, Xm\|XZR` | Negate with carry | | `NGC Wd\|WZR, Wm\|WZR` | `SBC Wd\|WZR, WZR, Wm\|WZR` | 32-bit | | `NGCS Xd\|XZR, Xm\|XZR` | `SBCS Xd\|XZR, XZR, Xm\|XZR` | Negate with carry + flags | | `NGCS Wd\|WZR, Wm\|WZR` | `SBCS Wd\|WZR, WZR, Wm\|WZR` | 32-bit | | **Compare / Test** | | | | `CMP Xn\|XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | `SUBS XZR, Xn\|XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | Compare (shifted-reg) | | `CMP Wn\|WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | `SUBS WZR, Wn\|WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | 32-bit shifted-reg | | `CMP Xn\|SP, #imm12{, LSL #12}` | `SUBS XZR, Xn\|SP, #imm12{, LSL #12}` | Compare (immediate) | | `CMP Wn\|WSP, #imm12{, LSL #12}` | `SUBS WZR, Wn\|WSP, #imm12{, LSL #12}` | 32-bit immediate | | `CMP Xn\|SP, Rm\|XZR, extend` | `SUBS XZR, Xn\|SP, Rm\|XZR, extend` | Compare (extended-reg) | | `CMP Wn\|WSP, Wm\|WZR, extend` | `SUBS WZR, Wn\|WSP, Wm\|WZR, extend` | 32-bit extended-reg | | `CMN Xn\|XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | `ADDS XZR, Xn\|XZR, Xm\|XZR{, LSL\|LSR\|ASR #0-63}` | Compare negative (shifted-reg) | | `CMN Wn\|WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | `ADDS WZR, Wn\|WZR, Wm\|WZR{, LSL\|LSR\|ASR #0-31}` | 32-bit shifted-reg | | `CMN Xn\|SP, #imm12{, LSL #12}` | `ADDS 
XZR, Xn\|SP, #imm12{, LSL #12}` | Compare negative (immediate) | | `CMN Wn\|WSP, #imm12{, LSL #12}` | `ADDS WZR, Wn\|WSP, #imm12{, LSL #12}` | 32-bit immediate | | `CMN Xn\|SP, Rm\|XZR, extend` | `ADDS XZR, Xn\|SP, Rm\|XZR, extend` | Compare negative (extended-reg) | | `CMN Wn\|WSP, Wm\|WZR, extend` | `ADDS WZR, Wn\|WSP, Wm\|WZR, extend` | 32-bit extended-reg | | `TST Xn\|XZR, Xm\|XZR{, LSL\|LSR\|ASR\|ROR #0-63}` | `ANDS XZR, Xn\|XZR, Xm\|XZR{, LSL\|LSR\|ASR\|ROR #0-63}` | Test bits (shifted-reg) | | `TST Wn\|WZR, Wm\|WZR{, LSL\|LSR\|ASR\|ROR #0-31}` | `ANDS WZR, Wn\|WZR, Wm\|WZR{, LSL\|LSR\|ASR\|ROR #0-31}` | 32-bit shifted-reg | | `TST Xn\|XZR, #bitmask_imm` | `ANDS XZR, Xn\|XZR, #bitmask_imm` | Test bits (immediate) | | `TST Wn\|WZR, #bitmask_imm` | `ANDS WZR, Wn\|WZR, #bitmask_imm` | 32-bit immediate | | **Multiply** | | | | `MUL Xd\|XZR, Xn\|XZR, Xm\|XZR` | `MADD Xd\|XZR, Xn\|XZR, Xm\|XZR, XZR` | Multiply | | `MUL Wd\|WZR, Wn\|WZR, Wm\|WZR` | `MADD Wd\|WZR, Wn\|WZR, Wm\|WZR, WZR` | 32-bit | | `MNEG Xd\|XZR, Xn\|XZR, Xm\|XZR` | `MSUB Xd\|XZR, Xn\|XZR, Xm\|XZR, XZR` | Multiply-negate | | `MNEG Wd\|WZR, Wn\|WZR, Wm\|WZR` | `MSUB Wd\|WZR, Wn\|WZR, Wm\|WZR, WZR` | 32-bit | | `SMULL Xd\|XZR, Wn\|WZR, Wm\|WZR` | `SMADDL Xd\|XZR, Wn\|WZR, Wm\|WZR, XZR` | Signed long multiply | | `UMULL Xd\|XZR, Wn\|WZR, Wm\|WZR` | `UMADDL Xd\|XZR, Wn\|WZR, Wm\|WZR, XZR` | Unsigned long multiply | | `SMNEGL Xd\|XZR, Wn\|WZR, Wm\|WZR` | `SMSUBL Xd\|XZR, Wn\|WZR, Wm\|WZR, XZR` | Signed long multiply-negate | | `UMNEGL Xd\|XZR, Wn\|WZR, Wm\|WZR` | `UMSUBL Xd\|XZR, Wn\|WZR, Wm\|WZR, XZR` | Unsigned long multiply-negate | | **Shifts (immediate)** | | | | `LSL Xd\|XZR, Xn\|XZR, #s` | `UBFM Xd\|XZR, Xn\|XZR, #(-s MOD 64), #(63-s)` | Shift left | | `LSL Wd\|WZR, Wn\|WZR, #s` | `UBFM Wd\|WZR, Wn\|WZR, #(-s MOD 32), #(31-s)` | 32-bit | | `LSR Xd\|XZR, Xn\|XZR, #s` | `UBFM Xd\|XZR, Xn\|XZR, #s, #63` | Shift right | | `LSR Wd\|WZR, Wn\|WZR, #s` | `UBFM Wd\|WZR, Wn\|WZR, #s, #31` | 32-bit | | `ASR 
Xd\|XZR, Xn\|XZR, #s` | `SBFM Xd\|XZR, Xn\|XZR, #s, #63` | Arith shift right | | `ASR Wd\|WZR, Wn\|WZR, #s` | `SBFM Wd\|WZR, Wn\|WZR, #s, #31` | 32-bit | | `ROR Xd\|XZR, Xn\|XZR, #s` | `EXTR Xd\|XZR, Xn\|XZR, Xn\|XZR, #s` | Rotate right | | `ROR Wd\|WZR, Wn\|WZR, #s` | `EXTR Wd\|WZR, Wn\|WZR, Wn\|WZR, #s` | 32-bit | | **Shifts (register)** | | | | `LSL Xd\|XZR, Xn\|XZR, Xm\|XZR` | `LSLV Xd\|XZR, Xn\|XZR, Xm\|XZR` | Shift left (register) | | `LSL Wd\|WZR, Wn\|WZR, Wm\|WZR` | `LSLV Wd\|WZR, Wn\|WZR, Wm\|WZR` | 32-bit | | `LSR Xd\|XZR, Xn\|XZR, Xm\|XZR` | `LSRV Xd\|XZR, Xn\|XZR, Xm\|XZR` | Shift right (register) | | `LSR Wd\|WZR, Wn\|WZR, Wm\|WZR` | `LSRV Wd\|WZR, Wn\|WZR, Wm\|WZR` | 32-bit | | `ASR Xd\|XZR, Xn\|XZR, Xm\|XZR` | `ASRV Xd\|XZR, Xn\|XZR, Xm\|XZR` | Arith shift right (register) | | `ASR Wd\|WZR, Wn\|WZR, Wm\|WZR` | `ASRV Wd\|WZR, Wn\|WZR, Wm\|WZR` | 32-bit | | `ROR Xd\|XZR, Xn\|XZR, Xm\|XZR` | `RORV Xd\|XZR, Xn\|XZR, Xm\|XZR` | Rotate right (register) | | `ROR Wd\|WZR, Wn\|WZR, Wm\|WZR` | `RORV Wd\|WZR, Wn\|WZR, Wm\|WZR` | 32-bit | | **Extension** | | | | `SXTB Xd\|XZR, Wn\|WZR` | `SBFM Xd\|XZR, Xn\|XZR, #0, #7` | Sign-extend byte → 64 | | `SXTB Wd\|WZR, Wn\|WZR` | `SBFM Wd\|WZR, Wn\|WZR, #0, #7` | Sign-extend byte → 32 | | `SXTH Xd\|XZR, Wn\|WZR` | `SBFM Xd\|XZR, Xn\|XZR, #0, #15` | Sign-extend halfword → 64 | | `SXTH Wd\|WZR, Wn\|WZR` | `SBFM Wd\|WZR, Wn\|WZR, #0, #15` | Sign-extend halfword → 32 | | `SXTW Xd\|XZR, Wn\|WZR` | `SBFM Xd\|XZR, Xn\|XZR, #0, #31` | Sign-extend word → 64 (no Wd form) | | `UXTB Wd\|WZR, Wn\|WZR` | `UBFM Wd\|WZR, Wn\|WZR, #0, #7` | Zero-extend byte | | `UXTH Wd\|WZR, Wn\|WZR` | `UBFM Wd\|WZR, Wn\|WZR, #0, #15` | Zero-extend halfword | | **Bitfield** | | | | `UBFX Xd\|XZR, Xn\|XZR, #l, #w` | `UBFM Xd\|XZR, Xn\|XZR, #l, #(l+w-1)` | Unsigned BF extract | | `UBFX Wd\|WZR, Wn\|WZR, #l, #w` | `UBFM Wd\|WZR, Wn\|WZR, #l, #(l+w-1)` | 32-bit | | `SBFX Xd\|XZR, Xn\|XZR, #l, #w` | `SBFM Xd\|XZR, Xn\|XZR, #l, #(l+w-1)` | Signed BF extract 
| | `SBFX Wd\|WZR, Wn\|WZR, #l, #w` | `SBFM Wd\|WZR, Wn\|WZR, #l, #(l+w-1)` | 32-bit | | `UBFIZ Xd\|XZR, Xn\|XZR, #l, #w` | `UBFM Xd\|XZR, Xn\|XZR, #(-l MOD 64), #(w-1)` | Unsigned BF insert in zero | | `UBFIZ Wd\|WZR, Wn\|WZR, #l, #w` | `UBFM Wd\|WZR, Wn\|WZR, #(-l MOD 32), #(w-1)` | 32-bit | | `SBFIZ Xd\|XZR, Xn\|XZR, #l, #w` | `SBFM Xd\|XZR, Xn\|XZR, #(-l MOD 64), #(w-1)` | Signed BF insert in zero | | `SBFIZ Wd\|WZR, Wn\|WZR, #l, #w` | `SBFM Wd\|WZR, Wn\|WZR, #(-l MOD 32), #(w-1)` | 32-bit | | `BFI Xd\|XZR, Xn\|XZR, #l, #w` | `BFM Xd\|XZR, Xn\|XZR, #(-l MOD 64), #(w-1)` | Bitfield insert | | `BFI Wd\|WZR, Wn\|WZR, #l, #w` | `BFM Wd\|WZR, Wn\|WZR, #(-l MOD 32), #(w-1)` | 32-bit | | `BFXIL Xd\|XZR, Xn\|XZR, #l, #w` | `BFM Xd\|XZR, Xn\|XZR, #l, #(l+w-1)` | BF extract and insert low | | `BFXIL Wd\|WZR, Wn\|WZR, #l, #w` | `BFM Wd\|WZR, Wn\|WZR, #l, #(l+w-1)` | 32-bit | | **Conditional select aliases** | | | | `CINC Xd\|XZR, Xn\|XZR, cond` | `CSINC Xd\|XZR, Xn\|XZR, Xn\|XZR, inv(cond)` | Conditional increment | | `CINC Wd\|WZR, Wn\|WZR, cond` | `CSINC Wd\|WZR, Wn\|WZR, Wn\|WZR, inv(cond)` | 32-bit | | `CSET Xd\|XZR, cond` | `CSINC Xd\|XZR, XZR, XZR, inv(cond)` | Conditional set | | `CSET Wd\|WZR, cond` | `CSINC Wd\|WZR, WZR, WZR, inv(cond)` | 32-bit | | `CINV Xd\|XZR, Xn\|XZR, cond` | `CSINV Xd\|XZR, Xn\|XZR, Xn\|XZR, inv(cond)` | Conditional invert | | `CINV Wd\|WZR, Wn\|WZR, cond` | `CSINV Wd\|WZR, Wn\|WZR, Wn\|WZR, inv(cond)` | 32-bit | | `CSETM Xd\|XZR, cond` | `CSINV Xd\|XZR, XZR, XZR, inv(cond)` | Conditional set mask | | `CSETM Wd\|WZR, cond` | `CSINV Wd\|WZR, WZR, WZR, inv(cond)` | 32-bit | | `CNEG Xd\|XZR, Xn\|XZR, cond` | `CSNEG Xd\|XZR, Xn\|XZR, Xn\|XZR, inv(cond)` | Conditional negate | | `CNEG Wd\|WZR, Wn\|WZR, cond` | `CSNEG Wd\|WZR, Wn\|WZR, Wn\|WZR, inv(cond)` | 32-bit | | **System** | | | | `NOP` | `HINT #0` | No operation | | `YIELD` | `HINT #1` | Yield | | `WFE` | `HINT #2` | Wait for event | | `WFI` | `HINT #3` | Wait for interrupt | | `SEV` | 
`HINT #4` | Send event |
| `SEVL` | `HINT #5` | Send event local |
| `RET` | `RET X30` | Return (default LR) |
| `PACIASP` | `PACIA X30, SP` | Sign LR with key A |
| `AUTIASP` | `AUTIA X30, SP` | Authenticate LR with key A |
| **System instruction aliases** | | |
| `AT S1E1R, Xt\|XZR` | `SYS #0, C7, C8, #0, Xt\|XZR` | Address translate |
| `DC ZVA, Xt\|XZR` | `SYS #3, C7, C4, #1, Xt\|XZR` | Data cache zero |
| `IC IVAU, Xt\|XZR` | `SYS #3, C7, C5, #1, Xt\|XZR` | Instruction cache invalidate |
| `TLBI ...` | Various `SYS` encodings | TLB invalidate |

---

## 28. AArch32 (ARM/Thumb) Key Differences

### 28.1 Conditional Execution

In AArch32 (ARM state), **almost every instruction** can be conditional:

```asm
// AArch32:
CMP R0, #10
ADDGT R1, R1, #1   // Only executes if R0 > 10
MOVLE R1, #0       // Only executes if R0 <= 10
```

AArch64 **removed** this. You must use `CSEL`/`B.cond`/etc. instead.

### 28.2 S Suffix in AArch32

In AArch32, the S suffix is optional on most instructions (as in AArch64), and it combines freely with condition codes. In unified syntax (UAL) the S comes **before** the condition:

```asm
ADDSGT R1, R1, #1  // Conditionally add AND set flags (pre-UAL assemblers wrote ADDGTS)
```

### 28.3 Register Differences

- AArch32: R0–R15, where R13=SP, R14=LR, R15=PC
- PC is a general-purpose register! You can do `ADD PC, PC, R0` (computed branch). This doesn't exist in AArch64.
- Writing to PC is a branch. This is why there's no separate `RET` in AArch32 — you just do `BX LR` or `MOV PC, LR`.

### 28.4 Barrel Shifter Everywhere

In AArch32, EVERY data-processing instruction's second operand can include a shift:

```asm
ADD R0, R1, R2, LSL R3  // R0 = R1 + (R2 << R3) — register-controlled shift
```

AArch64 limits shifts to specific forms per instruction class, and **never** allows register-controlled shifts in the operand position (you need `LSLV` separately).

### 28.5 Thumb / Thumb-2

Thumb is a compressed 16-bit instruction set (subset of ARM). Thumb-2 adds 32-bit instructions to Thumb, making it nearly as capable as ARM state but more code-dense.
Modern ARM Cortex-M processors only support Thumb. AArch64 **has no Thumb mode**. It is always 32-bit fixed-width A64 instructions. --- ## 29. Calling Convention (AAPCS64) The **AAPCS64** (Arm Architecture Procedure Call Standard for AArch64) defines how functions pass arguments, return values, and which registers they must preserve. This is essential for understanding compiled code and for writing assembly that interoperates with C. **Why these specific register assignments?** X0-X7 for arguments gives 8 register-passed arguments before spilling to the stack — enough for the vast majority of functions (most have ≤4 arguments). Having the return value in X0 (the same as the first argument) is common across architectures because many functions transform their first argument and return the result. The split between caller-saved (X9-X15: temporaries the callee can freely trash) and callee-saved (X19-X28: preserved across calls) is a balance — too many callee-saved means every small function wastes time saving/restoring; too few means callers waste time saving around every call. X29 as frame pointer enables debuggers and stack unwinders to walk the call stack. X30 as link register holds the return address from `BL`/`BLR`. 
### 29.1 Parameter Passing | Register | Usage | |---|---| | X0–X7 | Arguments and return values | | X0 | First argument / return value | | X1 | Second argument / second return value (for 128-bit returns) | | X8 | Indirect result location (struct return pointer) | | X9–X15 | Temporary (caller-saved) | | X16–X17 | Intra-procedure scratch (PLT stubs, caller-saved) | | X18 | Platform register (reserved — do not use) | | X19–X28 | Callee-saved | | X29 | Frame pointer (callee-saved) | | X30 | Link register (overwritten by BL/BLR — must be saved by callee if it makes calls) | | SP | Stack pointer (16-byte aligned at public interfaces) | ### 29.2 SIMD/FP Parameter Passing - V0–V7 (D0–D7 / S0–S7 / Q0–Q7): FP/SIMD arguments and return values - V8–V15: Callee-saved (only the **lower 64 bits** D8–D15 are callee-saved; upper 64 bits are scratch) - V16–V31: Temporary (caller-saved) ### 29.3 Stack Frame The stack grows **downward** in memory — pushing data decreases SP, popping increases it. This is a convention shared with x86 and most other architectures. "Top of stack" means the lowest address (where SP points). ``` High address ┌──────────────────────┐ │ Caller's frame │ ├──────────────────────┤ │ Arguments (if >8) │ ← Passed on stack ├──────────────────────┤ │ Return address (X30) │ ← Saved by callee │ Old frame ptr (X29) │ ← X29 points here ├──────────────────────┤ │ Callee-saved regs │ ├──────────────────────┤ │ Local variables │ ├──────────────────────┤ │ Outgoing args (if >8) │ ← For calls this function makes └──────────────────────┘ ← SP (must be 16-byte aligned) Low address ``` **Standard prologue/epilogue:** ```asm my_function: // Prologue STP X29, X30, [SP, #-64]! // Save FP, LR; allocate 64 bytes MOV X29, SP // Set frame pointer STP X19, X20, [SP, #16] // Save callee-saved regs STP X21, X22, [SP, #32] // ... save more if needed ... // Function body... 
// Epilogue LDP X21, X22, [SP, #32] LDP X19, X20, [SP, #16] LDP X29, X30, [SP], #64 // Restore FP, LR; deallocate RET ``` **No red zone on AArch64**: Unlike x86-64 (which has a 128-byte "red zone" below SP that leaf functions can use without adjusting SP), the AAPCS64 does **not** define a red zone. Signal handlers and interrupts can clobber memory below SP at any time. You MUST adjust SP before storing anything on the stack. Some platform ABIs (like Apple's) DO define a red zone — check your target platform. **Stack canaries** (stack protector): Compilers insert a random value ("canary") between local variables and the saved frame pointer/return address. Before returning, the function checks if the canary was overwritten — if so, a buffer overflow occurred and the program aborts. In AArch64 assembly, you'll see loads from a thread-local `__stack_chk_guard` symbol at the start, and a comparison before `RET`. --- ## 30. Common Patterns & Idioms This section shows how common C/C++ constructs translate to AArch64 assembly. Understanding these patterns is essential for reading compiler output and writing efficient assembly. ### 30.1 If/Else Compilers translate `if/else` into either a **branching** version (using `B.cond`) or a **branchless** version (using `CSEL`). The branchless version avoids branch misprediction penalties and is preferred for simple value assignments. The branching version is better when the if/else bodies are complex (many instructions). ```asm // C: if (x > 10) { a = 1; } else { a = 2; } // X0 = x, result in W1 // Branching version: CMP X0, #10 B.LE else_branch // If x <= 10, skip to else MOV W1, #1 // a = 1 (if body) B end_if else_branch: MOV W1, #2 // a = 2 (else body) end_if: // Branchless version (compiler usually prefers this for simple assignments): CMP X0, #10 MOV W1, #1 // Prepare "if" value MOV W2, #2 // Prepare "else" value CSEL W1, W1, W2, GT // W1 = (x > 10) ? 
1 : 2 ``` ### 30.2 Loops Compilers prefer **do-while** style loops (condition at the bottom) because they use one branch per iteration instead of two. A `for` loop is converted to: check if zero iterations needed (branch over), then do-while. `CBZ`/`CBNZ` are commonly used for zero-test loop exits because they combine the comparison and branch into one instruction. ```asm // C: for (int i = 0; i < n; i++) { sum += array[i]; } // X0 = array pointer, X1 = n, result in X2 MOV X2, #0 // sum = 0 MOV X3, #0 // i = 0 loop: CMP X3, X1 // i < n? B.GE loop_end // If i >= n, exit loop LDR X4, [X0, X3, LSL #3] // X4 = array[i] (8-byte elements, index scaled by 8) ADD X2, X2, X4 // sum += array[i] ADD X3, X3, #1 // i++ B loop // Back to top loop_end: // While loop: while (x != 0) { x = x >> 1; count++; } MOV W1, #0 // count = 0 while_loop: CBZ W0, while_end // If x == 0, exit (CBZ = Compare and Branch if Zero) LSR W0, W0, #1 // x >>= 1 ADD W1, W1, #1 // count++ B while_loop while_end: // Do-while: more efficient because the branch is at the bottom (one branch per iteration): MOV X3, #0 do_loop: LDR X4, [X0, X3, LSL #3] ADD X2, X2, X4 ADD X3, X3, #1 CMP X3, X1 B.LT do_loop // Loop while i < n (branch at bottom = 1 branch/iter) ``` ### 30.3 Array and Struct Access Arrays use the shifted/extended register addressing modes — the index is scaled by the element size using `LSL #n`. Structs use immediate offsets from a base pointer — each field has a fixed offset known at compile time. Arrays of structs combine both: compute the struct pointer from the index, then use an immediate offset for the field. 
```asm // Array access: int64_t array[100]; val = array[i]; // X0 = array base, X1 = index i LDR X2, [X0, X1, LSL #3] // X2 = array[i] (each element is 8 bytes, LSL #3 = ×8) // Struct access: // struct { int32_t x; int32_t y; int64_t z; } point; // x at +0, y at +4, z at +8 // X0 = pointer to struct LDR W1, [X0] // W1 = point.x (offset 0) LDR W2, [X0, #4] // W2 = point.y (offset 4) LDR X3, [X0, #8] // X3 = point.z (offset 8) // Array of structs: points[i].y (struct size = 16 bytes, y at offset 4) // X0 = array base, W1 = index i ADD X2, X0, W1, UXTW #4 // X2 = base + i*16 (UXTW #4 = zero-extend and shift left 4 = ×16) LDR W3, [X2, #4] // W3 = points[i].y ``` ### 30.4 Branchless Min/Max ```asm // min(X0, X1) → X0 (signed) CMP X0, X1 CSEL X0, X0, X1, LE // max(X0, X1) → X0 (unsigned) CMP X0, X1 CSEL X0, X0, X1, HI ``` ### 30.5 Branchless Absolute Value ```asm // abs(X0) → X0 (signed) CMP X0, #0 CNEG X0, X0, LT ``` ### 30.6 Division by Constant (Multiply by Reciprocal) Compilers do this automatically, but understanding it helps when reading disassembly: ```asm // X0 = X1 / 10 (unsigned) // The compiler finds a "magic multiplier" M and shift s such that // UMULH(n, M) >> s == n / d for all n in range. 
// For d=10: M = 0xCCCCCCCCCCCCCCCD, s = 3 MOV X2, #0xCCCD MOVK X2, #0xCCCC, LSL #16 MOVK X2, #0xCCCC, LSL #32 MOVK X2, #0xCCCC, LSL #48 UMULH X0, X1, X2 // high 64 bits of X1 × magic LSR X0, X0, #3 // post-shift ``` ### 30.7 Swap Two Registers ```asm // Using EOR (no temp register needed, but 3 instructions): EOR X0, X0, X1 EOR X1, X0, X1 EOR X0, X0, X1 // Better — just use a temp: MOV X2, X0 MOV X0, X1 MOV X1, X2 ``` ### 30.8 Test Power of Two ```asm // Check if X0 is a power of 2 (and not zero): // Power of 2 means exactly one bit set: X0 != 0 && (X0 & (X0-1)) == 0 SUB X1, X0, #1 // X1 = X0 - 1 TST X0, X1 // X0 & (X0-1): sets Z=1 if zero (candidate) CCMP X0, #0, #4, EQ // If Z=1: compare X0 vs 0 (sets Z=1 if X0==0, Z=0 if X0!=0) // If Z=0: set flags to #4 (Z=1), so B.NE won't fire B.NE is_power_of_two // Taken only if (X0 & (X0-1))==0 AND X0!=0 ``` ### 30.9 Align Address ```asm // Align X0 down to 16-byte boundary: AND X0, X0, #~0xF // Clear low 4 bits (bitmask immediate: 0xFFFFFFFFFFFFFFF0) // Align X0 up to 16-byte boundary: ADD X0, X0, #15 AND X0, X0, #~0xF ``` ### 30.10 Position-Independent Hello World (Linux) ```asm .global _start .text _start: // write(1, msg, len) MOV X8, #64 // __NR_write MOV X0, #1 // fd = stdout ADR X1, msg // buffer (PC-relative) MOV X2, #14 // length SVC #0 // exit(0) MOV X8, #93 // __NR_exit MOV X0, #0 // status = 0 SVC #0 .data msg: .asciz "Hello, world!\n" ``` ### 30.11 Jump Table (Switch Statement) Compilers translate large `switch` statements into jump tables — an array of branch offsets indexed by the switch value. This is O(1) instead of a chain of comparisons. 
```asm // switch (X0) { case 0: ...; case 1: ...; case 2: ...; case 3: ...; } // X0 = switch value (already range-checked to 0-3) ADR X1, jump_table // X1 = address of the jump table LDRH W2, [X1, X0, LSL #1] // Load 16-bit offset for case X0 (each entry is 2 bytes) ADR X3, case_base // X3 = base address for offset computation ADD X3, X3, W2, UXTH // X3 = base + zero-extended offset BR X3 // Jump to the case handler .align 2 jump_table: .hword case0 - case_base // 16-bit offset to case 0 handler .hword case1 - case_base // 16-bit offset to case 1 handler .hword case2 - case_base .hword case3 - case_base case_base: case0: // ... handler for case 0 ... case1: // ... handler for case 1 ... ``` **What REALLY happens**: The CPU loads a small offset from a table in memory (indexed by the switch value), adds it to a base address, and does an indirect branch. The `ADR` + table approach generates position-independent code. Compilers may also use `TBB`/`TBH` (AArch32) or the `ADR`+`ADD`+`BR` pattern (AArch64). In disassembly, seeing `BR Xn` after an `LDR` from a table-like structure is the tell-tale sign of a switch statement. ### 30.12 Atomic Reference Counting Reference counting (used in `std::shared_ptr`, Python objects, Linux kernel `kref`) atomically increments/decrements a counter. When it reaches zero, the object is freed. ```asm // Increment reference count (relaxed ordering is fine — no data dependency): LDADD X1, X2, [X0] // Atomically: old=[X0], [X0]+=X1 (X1=1 for refcount++) // Or without LSE: LDXR/ADD/STXR loop // Decrement and check for zero (needs release ordering on the decrement, // acquire ordering before freeing — to ensure all accesses to the object // are visible before we free it): MOV X1, #-1 // Decrement by 1 (add -1) LDADDAL X1, X2, [X0] // Atomically: old=X2=[X0], [X0]+=-1 // Acquire+Release: ensures all prior accesses complete // and the zero-check sees the final count CMP X2, #1 // Was old value 1? 
(means new value is 0) B.EQ free_object // If refcount hit zero, free the object ``` **Why release on decrement?** The release ordering ensures that all reads/writes to the object's data (done while holding a reference) are visible to whoever ends up freeing the object. Without release, the free path might not see all the modifications made by other threads that already dropped their references. **Why acquire before free?** The acquire on the final decrement (via LDADDAL) ensures the freeing thread sees all modifications made by all other threads that previously decremented the count. ### 30.13 Byte/Halfword Atomics for Flags Sometimes you only need a 1-byte or 2-byte atomic (e.g., a boolean flag, a status byte): ```asm // Set a byte flag with release ordering: MOV W1, #1 STLRB W1, [X0] // Release-store a single byte // Read a byte flag with acquire ordering: LDARB W1, [X0] // Acquire-load a single byte CBZ W1, not_set // Atomic byte swap (LSE): MOV W1, #1 SWPAB W1, W2, [X0] // Atomically swap byte, acquire semantics // W2 = old byte value, [X0] = 1 ``` ### 30.14 Leaf Function Optimization A **leaf function** is one that doesn't call any other functions. Since it never executes `BL` (which overwrites X30/LR), it doesn't need to save/restore LR. It also doesn't need to set up a frame pointer if it doesn't use the stack. This makes leaf functions very cheap: ```asm // Non-leaf function (must save LR because it calls other functions): my_func: STP X29, X30, [SP, #-16]! // Save FP+LR (4 bytes + memory access) MOV X29, SP BL other_func // This overwrites X30 LDP X29, X30, [SP], #16 // Restore RET // Leaf function (no calls → no save/restore needed): add_two: ADD X0, X0, X1 // Just do the work RET // X30 still has the return address from our caller ``` **Why this matters**: Most small helper functions (getters, simple math, comparisons) are leaf functions. The compiler skips the prologue/epilogue entirely, making them just 1-2 instructions. 
When reading disassembly, a function with no `STP`/`LDP` at the start/end is a leaf function.

### 30.15 Tail Call Optimization

When the last thing a function does is call another function and return its result, the compiler can replace `BL target; RET` with just `B target`. This reuses the current stack frame instead of creating a new one — saving the prologue/epilogue of the tail-called function AND the call/return overhead.

```asm
// Without tail call optimization:
wrapper:
    STP X29, X30, [SP, #-16]!
    MOV X29, SP
    // ... setup arguments ...
    BL real_function            // Call (pushes return address)
    LDP X29, X30, [SP], #16     // Restore
    RET                         // Return to our caller

// With tail call optimization:
wrapper:
    // ... setup arguments ...
    B real_function             // Jump directly — real_function will RET to OUR caller
```

**Why `B` instead of `BL`**: `BL` saves the return address in X30. But if we're about to return anyway, the correct return address is already in X30 (from our caller). `B` preserves X30, so `real_function`'s `RET` returns directly to our caller, skipping us entirely.

**How to recognize a tail call in disassembly**: A function that ends with `B <other_function>` (unconditional branch to a different function) instead of `BL` + `RET` is a tail call. The function may restore callee-saved registers first (LDP X29, X30), then `B` to the target. If you see a function with no `RET` at the end, look for a `B` — that's the tail call. Conditional tail calls look like `B.cond <other_function>` followed by a fallthrough to a different return path.

### 30.16 Random Number Generation (FEAT_RNG)

ARMv8.5-A adds hardware random number generation:

```asm
MRS X0, RNDR    // X0 = hardware random number (sets NZCV; check the flags)
// If successful: NZCV = 0000 (Z=0). If entropy unavailable: Z=1.
B.EQ retry // If Z=1, entropy pool depleted — retry or fall back MRS X0, RNDRRS // Reseeded random: forces a reseed before generating // Same Z-flag convention as RNDR ``` **Why RNDR exists**: Cryptographic applications need true random numbers for key generation, nonces, and ASLR. Before FEAT_RNG, ARM code had to call into kernel or firmware for randomness. RNDR provides user-space access to hardware entropy without syscall overhead. ### 30.17 Crypto & Dot Product Instructions (Brief) AArch64 has dedicated NEON instructions for common crypto and ML operations. These are optional features — check `ID_AA64ISAR0_EL1` for availability: ```asm // AES (FEAT_AES): AESE V0.16B, V1.16B // AES single-round encrypt AESD V0.16B, V1.16B // AES single-round decrypt AESMC V0.16B, V0.16B // AES mix columns AESIMC V0.16B, V0.16B // AES inverse mix columns // SHA (FEAT_SHA256): SHA256H Q0, Q1, V2.4S // SHA-256 hash update (part 1) SHA256H2 Q0, Q1, V2.4S // SHA-256 hash update (part 2) SHA256SU0 V0.4S, V1.4S // SHA-256 schedule update 0 SHA256SU1 V0.4S, V1.4S, V2.4S // SHA-256 schedule update 1 // Dot product (FEAT_DotProd — ARMv8.2-A): UDOT V0.4S, V1.16B, V2.16B // Unsigned 8-bit dot product: 4 dot products of 4 bytes each SDOT V0.4S, V1.16B, V2.16B // Signed version // Each 32-bit lane of V0 accumulates: V1[4i+0]*V2[4i+0] + V1[4i+1]*V2[4i+1] + ... // This is 16 multiply-accumulates per instruction — critical for ML inference ``` **Why hardware AES/SHA**: Software AES takes ~10 cycles/byte. Hardware `AESE`+`AESMC` does a full round in 2 instructions. For servers doing HTTPS, this is the difference between CPU-bound and I/O-bound TLS. **Why dot product**: Neural network inference is dominated by matrix multiply, which decomposes into dot products. UDOT processes 16 byte-multiplies per instruction, giving 4-8× speedup over scalar code for INT8 inference. 
**CRC32 (FEAT_CRC32 — optional in ARMv8.0, mandatory from ARMv8.1):**

```asm
CRC32B  W0, W0, W1   // Update CRC-32 with 1 byte from W1
CRC32H  W0, W0, W1   // Update CRC-32 with 2 bytes
CRC32W  W0, W0, W1   // Update CRC-32 with 4 bytes
CRC32X  W0, W0, X1   // Update CRC-32 with 8 bytes
CRC32CB W0, W0, W1   // Same but CRC-32C (Castagnoli) — used by iSCSI, ext4, btrfs
```

Each instruction takes the current CRC in W0, feeds in data from W1/X1, and produces the updated CRC. These replace ~20 instructions of table-lookup CRC computation per byte.

---

## 31. Pointer Authentication (PAC)

PAC (ARMv8.3-A) protects against **Return-Oriented Programming (ROP)** and **Jump-Oriented Programming (JOP)** attacks by cryptographically signing pointers. The idea: before using a pointer (like a return address), the CPU verifies a cryptographic signature embedded in the pointer's unused upper bits. If an attacker overwrites the pointer, the signature won't match, and the CPU faults.

**Why PAC exists**: Stack buffer overflows let attackers overwrite return addresses. Without PAC, the CPU blindly follows the corrupted return address. With PAC, the corrupted address has a wrong signature and the authentication instruction faults before the branch.

### 31.1 How PAC Works

ARM64 pointers typically use only 48 bits for the actual address (bits [47:0]). The upper bits [63:48] must be all-zeros or all-ones, matching bit 47 (the canonical-address rule). PAC stores a cryptographic hash (the **Pointer Authentication Code**) in those otherwise-redundant bits: typically bits [54:48], extending into [63:56] when Top Byte Ignore (FEAT_TBI) is not in use.

```asm
// Sign a pointer (add PAC):
PACIA Xd|XZR, Xn|SP   // Sign Xd (e.g. return address) using key A and Xn|SP as context
                      // The PAC is computed from: the pointer, the context (SP), and a secret key
PACIB Xd|XZR, Xn|SP   // Same with key B
PACDA Xd|XZR, Xn|SP   // Sign data pointer with key A
PACDB Xd|XZR, Xn|SP   // Sign data pointer with key B
PACIA1716             // Sign X17 using key A with X16 as context
PACIASP               // Alias for PACIA X30, SP (sign LR)
PACIBSP               // Alias for PACIB X30, SP
PACIAZ                // Alias for PACIA X30, XZR (zero context)

// Authenticate (verify + strip PAC):
AUTIA Xd|XZR, Xn|SP   // Verify Xd's PAC against key A and Xn|SP; if valid, strip the PAC
                      // If invalid: the upper bits are corrupted, causing a fault on use
AUTIB Xd|XZR, Xn|SP   // Same with key B
AUTDA Xd|XZR, Xn|SP   // Authenticate data pointer with key A
AUTDB Xd|XZR, Xn|SP   // Authenticate data pointer with key B
AUTIASP               // Alias for AUTIA X30, SP
AUTIBSP               // Alias for AUTIB X30, SP
AUTIAZ                // Alias for AUTIA X30, XZR

// Strip PAC without authenticating (base FEAT_PAuth — ARMv8.3-A):
XPACI Xd|XZR          // Strip PAC from instruction address (Xd is modified in-place)
XPACD Xd|XZR          // Strip PAC from data address

// Combined branch instructions:
RETAA                 // Authenticate LR with key A + SP, then RET (AUTIA + RET)
RETAB                 // Same with key B
BRAA Xn|XZR, Xm|SP    // Authenticate Xn with key A + Xm as context, then branch
BRAB Xn|XZR, Xm|SP    // Same with key B
BLRAA Xn|XZR, Xm|SP   // Authenticate + branch with link (key A)
BLRAB Xn|XZR, Xm|SP   // Same with key B
BRAAZ Xn|XZR          // Authenticate Xn with key A + zero context, then branch
BRABZ Xn|XZR          // Same with key B
BLRAAZ Xn|XZR         // Authenticate + branch with link, zero context (key A)
BLRABZ Xn|XZR         // Same with key B
```

**Two key families**: Key A (`PACIA`, `AUTIA`) and Key B (`PACIB`, `AUTIB`). The kernel sets the secret keys via system registers (`APIAKeyLo_EL1`, etc.). User code never sees the keys directly.
### 31.2 PAC in Practice

Compilers emit PAC instructions in function prologues/epilogues:

```asm
my_func:
    PACIASP                     // Sign X30 with key A, using SP as context
    STP X29, X30, [SP, #-16]!   // Save signed LR
    MOV X29, SP
    // ... function body ...
    LDP X29, X30, [SP], #16
    AUTIASP                     // Authenticate X30 — faults if tampered
    RET
```

`PACIASP` is an alias for `PACIA X30, SP`. On CPUs without PAC, these instructions execute as NOPs (they're HINT encodings), so PAC-enabled binaries run safely on older hardware — they just lack the protection.

---

## 32. Branch Target Identification (BTI)

BTI (ARMv8.5-A) prevents **Jump-Oriented Programming (JOP)** by restricting which instructions can be the target of an indirect branch (`BR`, `BLR`). When BTI is enabled for a memory page (via page table attributes), an indirect branch that lands on an instruction other than a suitable `BTI` instruction causes a fault.

**Why BTI exists**: Even with PAC protecting return addresses, an attacker might redirect an indirect call (function pointer, virtual method) to the middle of a function — skipping the prologue, landing on a "gadget" that does something useful to the attacker. BTI ensures indirect branches can only land at explicitly marked entry points.

```asm
BTI c    // Valid target for indirect CALL (BLR)
BTI j    // Valid target for indirect JUMP (BR)
BTI jc   // Valid target for both BR and BLR
BTI      // With no targets: NOT a valid landing pad for any indirect branch
         // (executes as a NOP but does not satisfy the BTI check)
```

Functions that can be called via function pointers need `BTI c` or `BTI jc` at their entry (`PACIASP`/`PACIBSP` also count as valid indirect-call targets in the default configuration). These are HINT instructions — on older CPUs without BTI, they execute as NOPs.

---

## 33. Scalable Vector Extension (SVE / SVE2)

SVE (ARMv8.2-A optional, mandatory in ARMv9-A) is ARM's answer to future-proof SIMD. Unlike NEON's fixed 128-bit vectors, SVE supports **variable-length vectors** from 128 to 2048 bits (in 128-bit increments).
Code written for SVE works on any SVE implementation without recompilation — the hardware determines the vector length at runtime. SVE2 (ARMv9-A) extends SVE with more operations (making it a full NEON replacement). **Why SVE exists**: NEON vectors are fixed at 128 bits. If ARM makes a chip with 512-bit data paths, NEON can't use them — you'd need new instructions and recompilation. SVE's variable-length model means the same binary automatically uses wider vectors on more capable hardware. ### 33.1 Key SVE Concepts **Vector Length (VL)**: The number of bits in each Z register. Hardware-defined, read via `RDVL` (Read Vector Length). Always a multiple of 128. Your code must NOT assume a specific VL — it must work for any VL. The VL can be set by the OS (up to the hardware maximum) via `ZCR_EL1`. **Z registers**: `Z0`–`Z31`, each VL bits wide. The lower 128 bits of Zn overlap with the NEON Vn register. SVE uses these for all vector data. **P registers (predicates)**: `P0`–`P15`, each VL/8 bits wide (one bit per byte-lane). Predicates control which lanes are active — inactive lanes don't produce results and don't cause faults. This eliminates the need for "remainder loops" at the end of vectorized loops. **FFR (First Fault Register)**: Used for speculative memory access — lets you try to load a whole vector, and the hardware tells you which lanes faulted (instead of crashing). 
### 33.2 SVE Programming Model ```asm // SVE loop: add two arrays, works for ANY vector length // X0 = dst, X1 = src_a, X2 = src_b, X3 = count MOV X4, #0 // i = 0 loop: WHILELT P0.S, X4, X3 // P0 = predicate: which lanes have i+lane < count B.NONE done // If no active lanes, we're done LD1W {Z0.S}, P0/Z, [X1, X4, LSL #2] // Load active elements from src_a LD1W {Z1.S}, P0/Z, [X2, X4, LSL #2] // Load active elements from src_b ADD Z0.S, Z0.S, Z1.S // Add (only active lanes matter) ST1W {Z0.S}, P0, [X0, X4, LSL #2] // Store active elements to dst INCW X4 // i += VL/32 (number of 32-bit elements per vector) B loop done: ``` **Why `WHILELT` and predicates matter**: In a traditional NEON loop, if your array has 1000 elements and vectors hold 4 elements, you do 250 iterations cleanly. But if it's 1001 elements, you need a separate scalar loop for the last 1. SVE predicates handle this automatically — the last iteration simply has a predicate that activates only 1 lane. ### 33.3 SVE2 SVE2 (ARMv9-A) adds operations from NEON that SVE was missing: byte-level permutations, polynomial multiply, complex number multiply-accumulate, histograms, and crypto (SM4, SHA3). SVE2 is intended to be a complete superset of NEON functionality, so eventually all NEON code can be replaced by SVE2 code that also benefits from wider vectors. --- ## 34. Memory Tagging Extension (MTE) MTE (ARMv8.5-A) detects **memory safety bugs** — use-after-free, buffer overflow, and similar errors — by associating a 4-bit **tag** with each 16-byte region of memory and each pointer. If a pointer's tag doesn't match the memory's tag, the CPU faults (or logs the mismatch, depending on configuration). This catches bugs that would otherwise be silent data corruption or security vulnerabilities. **Why MTE exists**: C/C++ memory bugs are the #1 source of security vulnerabilities. Tools like AddressSanitizer (ASan) detect them but with 2× memory overhead and 2× slowdown. 
MTE provides similar detection with roughly 3-8% overhead, making it usable in production. ### 34.1 How MTE Works Each pointer carries a 4-bit tag in bits [59:56] (the "logical tag"). Each 16-byte aligned block of memory has a corresponding 4-bit tag stored in a separate "tag memory" (managed by the hardware, not directly visible in the address space). When you access memory, the CPU compares the pointer's tag with the memory's tag — a mismatch indicates a bug. ```asm // Allocate tagged memory: IRG Xd|SP, Xn|SP{, Xm|XZR} // Insert Random tag into pointer: Xd = Xn with a random tag in bits [59:56] // Optional Xm excludes specific tags from the random selection STG Xt|SP, [Xn|SP{, #simm}] // Store Allocation Tag: set the memory's tag to match Xt's pointer tag // Covers 16 bytes starting at the aligned address ST2G Xt|SP, [Xn|SP{, #simm}] // Store tags for 2 consecutive 16-byte granules (32 bytes) STZ2G Xt|SP, [Xn|SP{, #simm}] // Store tags AND zero the memory (combine tag set + memset(0)) STZG Xt|SP, [Xn|SP{, #simm}] // Store tag AND zero one 16-byte granule // Load allocation tag: LDG Xt|XZR, [Xn|SP{, #simm}] // Load the memory's tag into Xt's pointer tag bits // Add/subtract with tag manipulation: ADDG Xd|SP, Xn|SP, #uimm6, #uimm4 // Xd = Xn + (uimm6 * 16), with tag = Xn_tag + uimm4 SUBG Xd|SP, Xn|SP, #uimm6, #uimm4 // Xd = Xn - (uimm6 * 16), with tag = Xn_tag - uimm4 // Tag mask: GMI Xd|XZR, Xn|SP, Xm|XZR // Get tag mask: Xd = Xm | (1 << tag_of(Xn)) SUBP Xd|XZR, Xn|SP, Xm|SP // Subtract pointers, ignoring tags: Xd = Xn - Xm (tag bits stripped) SUBPS Xd|XZR, Xn|SP, Xm|SP // Same + set flags ``` ### 34.2 MTE Modes MTE can operate in three modes (configured per-thread via `SCTLR_EL1` / `PSTATE.TCO`): - **Synchronous**: Tag mismatch causes an immediate synchronous exception. Best for debugging — gives you the exact faulting instruction. - **Asynchronous**: Tag mismatches are accumulated and reported later (e.g., at the next system call). 
Lower overhead than synchronous, useful for production.
- **Off**: Tags are ignored. Used to disable MTE for performance-critical code.

**MTE in practice**: Memory allocators (like Android's Scudo) tag heap allocations with random tags. When you `free()` a block, the allocator changes the memory's tag. If code later accesses the freed block through a stale pointer, the pointer's old tag won't match the new memory tag → fault → bug caught.

---

## 35. Rules, Gotchas & Pitfalls

This section collects every non-obvious rule and common mistake in one place. Each shows what goes wrong and why.

### 35.1 The W-Register Zeroing Rule (and when it bites)

**The rule**: Any instruction that writes to `Wd` zeroes bits [63:32] of `Xd`. Always. No exceptions.

```asm
// This LOOKS like it only modifies the low 32 bits:
LDR X0, =0xDEADBEEF12345678 // X0 = 0xDEADBEEF12345678
                            // (literal-pool pseudo: MOV cannot encode this immediate)
ADD W0, W0, #1              // W0 = 0x12345679, BUT X0 = 0x0000000012345679
                            // The 0xDEADBEEF is GONE. Zeroed by the W-register write.

// MOVK Wd also zeros upper 32 — this surprises people:
MOV X0, #0xFFFFFFFF00000000 // X0 = 0xFFFFFFFF00000000 (this one IS a valid bitmask immediate)
MOVK W0, #0x1234            // W0 low 16 = 0x1234 (keeps bits [31:16] of W0)
                            // BUT upper 32 of X0 zeroed → X0 = 0x0000000000001234
                            // The 0xFFFFFFFF is GONE.

// To modify only 16 bits while preserving the full 64-bit value, use:
MOVK X0, #0x1234            // This truly keeps all other 48 bits
```

### 35.2 Signed Extension: Wd vs Xd Gives Different Results

```asm
// If W1 = 0x0000ABCD and you extract bits [11:4] = 0xBC (bit 7 = 1):
SBFX W0, W1, #4, #8 // Sign-extend to 32 bits: W0 = 0xFFFFFFBC
                    // Then W→X zeroing: X0 = 0x00000000FFFFFFBC
                    // X0 is POSITIVE (as 64-bit signed)!

SBFX X0, X1, #4, #8 // Sign-extend to 64 bits: X0 = 0xFFFFFFFFFFFFFFBC
                    // X0 is NEGATIVE (as 64-bit signed)!

// These give DIFFERENT mathematical values for the same input.
// Use Xd when you need the signed value for 64-bit arithmetic.
// Use Wd when you're staying in 32-bit land and the upper bits don't matter.
``` ### 35.3 CMP Wn vs CMP Xn: Different Flags, Different Branches ```asm // X0 = 0x00000001_80000000 (upper word = 1, lower word = 0x80000000) CMP W0, #0 // Compares 0x80000000 as 32-bit: this is INT32_MIN (negative!) B.LT negative_32 // TAKEN — 0x80000000 is negative as a 32-bit signed value CMP X0, #0 // Compares 0x0000000180000000 as 64-bit: this is +6442450944 (positive!) B.LT negative_64 // NOT taken — it's positive as a 64-bit signed value // The SAME register value gives OPPOSITE comparison results depending on W vs X. // Rule: match your CMP width to your data type. If the value is a 32-bit int, use CMP Wn. ``` ### 35.4 ARM Carry Is Inverted for Subtraction ```asm CMP X0, X1 // SUBS XZR, X0, X1 // After CMP, C=1 means X0 >= X1 (unsigned) — NO borrow // After CMP, C=0 means X0 < X1 (unsigned) — borrow occurred // This is OPPOSITE to x86: // x86: CF=1 after CMP means a < b (borrow) // ARM: C=1 after CMP means a >= b (no borrow) // Consequence: if you're porting x86 code that checks CF after SUB, // you need to invert the condition. x86's JC (Jump if Carry) = ARM's B.CC (not B.CS). ``` ### 35.5 NEG / ABS of INT_MIN Wraps to Itself ```asm // NEG X0, X0 when X0 = INT64_MIN = 0x8000000000000000: // -(-2^63) = +2^63, but that doesn't fit in signed 64-bit (max is 2^63-1) // Result: X0 = 0x8000000000000000 = INT64_MIN again! // This means branchless abs() has an edge case: // CMP X0, #0; CNEG X0, X0, LT // If X0 = INT64_MIN → after CNEG, X0 is STILL INT64_MIN (negative!). // abs(INT64_MIN) cannot be represented. This is a fundamental limitation of two's complement. 
``` ### 35.6 NaN Breaks FP Comparisons (and Some Conditions Include It) ```asm // After FCMP with NaN, flags = N=0, Z=0, C=1, V=1 (unordered) // This means some conditions are TRUE even though NaN is not really comparable: // Conditions that EXCLUDE NaN (safe for ordered comparison): // B.EQ → not taken ✓ (NaN is not equal to anything) // B.GT → not taken ✓ (NaN is not greater than anything) // B.GE → not taken ✓ (NaN is not greater-or-equal) // B.MI → not taken ✓ (use instead of B.LT for "less than, excluding NaN") // B.LS → not taken ✓ (use instead of B.LE for "less-or-equal, excluding NaN") // Conditions that INCLUDE NaN (will trigger on unordered!): // B.NE → TAKEN! (NaN is "not equal" — be careful) // B.LT → TAKEN! ← SURPRISE! LT means "less than OR unordered" // B.LE → TAKEN! ← SURPRISE! LE means "less-or-equal OR unordered" // B.HI → TAKEN! ← HI means "greater OR unordered" // B.VS → TAKEN (this is the NaN detector) // This is a common trap: if you write "FCMP S0, S1; B.LT less_than", // the branch IS taken when either operand is NaN — even though NaN // is not less than anything! Use B.MI instead for "less than, not NaN". FCMP S0, S0 // Comparing a value to ITSELF // If S0 is NaN: flags = unordered (V=1) // B.EQ → NOT taken (NaN != NaN) // To check if a value is NaN: FCMP S0, S0 // Compare with self B.VS is_nan // VS = unordered = NaN (the only value that doesn't equal itself) // Safe FP comparison pattern (handles NaN correctly): FCMP S0, S1 B.VS handle_nan // Check for NaN FIRST B.MI is_less // Then: ordered less-than (MI, not LT!) B.GT is_greater // Ordered greater-than (GT is safe) // Fall through: equal ``` **Why this happens**: ARM's condition codes were designed for integer comparisons. When reused for FP, the "unordered" result (NaN) maps to flags that accidentally satisfy some conditions. Specifically, NaN sets V=1, and `LT` checks `N!=V` which is true when V=1 and N=0. 
ARM intentionally arranged this so that each condition has an inverse that covers the "unordered" case: GT and LE are inverses (GT excludes NaN, LE includes it), GE and LT are inverses (GE excludes NaN, LT includes it). ### 35.7 SP and XZR Share Encoding 31 ```asm // Register 31 means SP in some instructions and XZR in others. // The instruction's opcode determines which. You CANNOT choose. ADD X0, SP, #16 // Reg 31 = SP here (ADD immediate allows SP) ADD X0, XZR, X1 // Reg 31 = XZR here (ADD shifted-register uses XZR for reg 31) // SUBTLE: when the base is SP, the assembler uses the EXTENDED register form, // where LSL is an alias for UXTX. So these ARE valid: ADD X0, SP, X1, LSL #2 // VALID — assembler encodes as ADD (extended): SP + UXTX(X1, #2) CMP SP, X0 // VALID — assembler encodes as CMP (extended): SUBS XZR, SP, X0, UXTX // These are genuinely ILLEGAL (no encoding exists): // AND X0, SP, X1 ← shifted-register AND doesn't accept SP as source // ORR X0, SP, X1 ← same (but ORR IMMEDIATE can write to SP: ORR SP, X0, #imm) // ADDS X0, SP, X1, LSL #5 ← extended register shift max is #4, so #5 is out of range // Rule of thumb: // SP is usable in: ADD/SUB immediate, ADD/SUB extended register (including LSL alias), // logical immediate (AND/ORR/EOR #bitmask as destination), LDR/STR addressing, // CMP/CMN extended register // XZR is used in: shifted-register forms, as the discard destination for CMP/TST/CMN ``` ### 35.8 TST Clears C and V ```asm // After TST (= ANDS XZR), C=0 and V=0 ALWAYS. // This matters when TST is followed by CCMP: TST X0, #1 // Sets Z based on bit 0. But also: C=0, V=0! CCMP X1, #5, #0, NE // If NE (bit 0 was set): compare X1 vs 5 // If EQ (bit 0 was clear): flags = #0 (NZCV=0000) // The C=0,V=0 from TST won't affect anything here because CCMP overwrites flags. // But if you chain TST → B.HI (unsigned higher), remember HI needs C=1 && Z=0. // TST always clears C, so B.HI after TST is ALWAYS not taken! 
// (B.NE is what you want after TST — it checks Z, which TST does set correctly.)
```

### 35.9 Divide By Zero Returns 0, Not an Exception

```asm
UDIV    X0, X1, XZR       // X0 = X1 / 0 = 0 (no exception, no trap, no NaN — just 0)
SDIV    X0, X1, XZR       // Same: 0

// This is DIFFERENT from:
//   - x86: divide by zero triggers a #DE exception
//   - FP:  FDIV S0, S1, S2 with S2=0 gives ±infinity (IEEE 754), not 0

// If you need to catch divide-by-zero, check before dividing:
CBZ     X2, div_by_zero_handler
UDIV    X0, X1, X2
```

### 35.10 SDIV Overflow: INT_MIN / -1

```asm
// SDIV X0, X1, X2 where X1 = INT64_MIN, X2 = -1:
//   Mathematical result: +2^63, which overflows signed 64-bit (max is 2^63 - 1)
//   ARM returns: INT64_MIN (0x8000000000000000) — it wraps!
//   No exception, no flag, just a silently wrong result.
// Same issue for 32-bit: SDIV W0, W1, W2 with W1=INT32_MIN, W2=-1 → INT32_MIN
```

### 35.11 Branch Range Limits

```asm
// Each branch type has a different range. If your target is out of range,
// the assembler or linker reports an error (or silently inserts a veneer):

B       far_away          // ±128 MB — almost always enough
B.EQ    far_away          // ±1 MB   — CAN fail for large functions or distant targets
CBZ     X0, far_away      // ±1 MB   — same range as B.cond
TBZ     X0, #3, far       // ±32 KB  — VERY limited! Easily exceeded in large functions

// Fix for out-of-range B.cond: invert and trampoline
//   Instead of:  B.EQ far_away              (out of range)
//   Write:       B.NE skip; B far_away; skip:
```

### 35.12 Extended Register Shift Is Only 0–4

```asm
// ADD X0, X1, W2, SXTW #5   ← ILLEGAL! Max shift is #4
// The #amount in extended register form is 0, 1, 2, 3, or 4.
// This covers element sizes 1, 2, 4, 8, 16 bytes — enough for any C data type.
// If you need a larger shift, use a separate LSL instruction first.
```

### 35.13 LDXR/STXR Rules

```asm
// Between LDXR and STXR, AVOID these (they may cause STXR to always fail):
//   1. Accessing other memory addresses (may clear the exclusive monitor on some CPUs)
//   2. Calling functions (they access memory and may trigger context switches)
//   3. Executing too many instructions (increases the window for monitor to be cleared)

// The ARM architecture PERMITS the monitor to be cleared by other memory accesses,
// so even if it works on your CPU today, it may fail on a different implementation.

// BAD (may cause infinite retry on some implementations):
LDXR    X1, [X0]
LDR     X3, [X4]          // ← Other memory access — may clear the monitor
ADD     X1, X1, #1
STXR    W2, X1, [X0]      // STXR may always fail → infinite retry loop

// GOOD (only register operations between LDXR and STXR):
LDXR    X1, [X0]
ADD     X1, X1, #1        // Pure register operation — safe
STXR    W2, X1, [X0]

// Addresses MUST be naturally aligned (this one IS absolute — not a guideline):
//   LDXR Xt → 8-byte aligned, LDXR Wt → 4-byte aligned
//   LDXP Xt → 16-byte aligned
//   Unaligned → alignment fault (always, regardless of SCTLR.A)
```

---

## 36. Quick Reference Cheat Sheet

### Instruction Format Summary

```
                    ┌─ Shifted Register ───────── ADD Xd|XZR, Xn|XZR, Xm|XZR, LSL #n
                    │                             ADD Wd|WZR, Wn|WZR, Wm|WZR, LSL #n
Data Processing ────┼─ Extended Register ──────── ADD Xd|SP, Xn|SP, Wm|WZR, SXTW #n
                    │                             ADD Wd|WSP, Wn|WSP, Wm|WZR, SXTW #n
                    ├─ Immediate ──────────────── ADD Xd|SP, Xn|SP, #imm12{, LSL #12}
                    │                             ADD Wd|WSP, Wn|WSP, #imm12{, LSL #12}
                    └─ Bitmask Immediate ──────── AND Xd|SP, Xn|XZR, #bitmask_imm
                                                  AND Wd|WSP, Wn|WZR, #bitmask_imm

Load/Store ─────────── LDR Xt|XZR, [Xn|SP, #imm]      STR Xt|XZR, [Xn|SP, #imm]
                       LDP Xt1|XZR, Xt2|XZR, [Xn|SP]  STP Xt1|XZR, Xt2|XZR, [Xn|SP]

Reg 31 rule ────────── Shifted register / most data-proc:  reg 31 = XZR
                       Immediate ADD/SUB, extended reg:    reg 31 = SP (Rd,Rn), XZR (Rm)
                       Logical immediate (non-S):          reg 31 = SP (Rd), XZR (Rn)
                       Load/store base:                    reg 31 = SP
```

### Flag-Setting Quick Ref

| Want flags? | Arithmetic | Logical |
|---|---|---|
| No flags | ADD/SUB | AND/ORR/EOR/BIC |
| Set flags | ADDS/SUBS | ANDS/BICS |
| Discard result | CMP (=SUBS XZR) / CMN (=ADDS XZR) | TST (=ANDS XZR) |

### Encoding Constraints Cheat Sheet

| Operand type | 64-bit (Xd) | 32-bit (Wd) |
|---|---|---|
| 12-bit immediate | 0–4095, optionally LSL #12 | Same |
| Bitmask immediate | Repeating rotated ones, element ≤64 | Element ≤32 (fewer valid patterns) |
| MOVZ/MOVK/MOVN | 16-bit value at LSL #0/16/32/48 | LSL #0/16 ONLY (2 slots) |
| Shifted register amount | 0–63 | 0–31 |
| BFM #immr, #imms | 0–63 each, MOD 64 | 0–31 each, MOD 32 |
| Branch offset (B) | ±128 MB (26-bit signed × 4) | — |
| Branch offset (B.cond) | ±1 MB (19-bit signed × 4) | — |
| Branch offset (TBZ) | ±32 KB (14-bit signed × 4), bit 0–63 | bit 0–31 for Wn form |
| LDR unsigned offset | #imm12 × element_size | Same |
| LDUR signed offset | −256 to +255 (9-bit signed) | Same |
| LDP signed offset | −512 to +504 (7-bit × 8) | −256 to +252 (7-bit × 4) |
| LDP Qt signed offset | −1024 to +1008 (7-bit × 16) | — |
| Extended register shift | #0–4 only (×1, ×2, ×4, ×8, ×16) | Same |

### Common Mnemonics Reference

```
Arithmetic:  ADD ADDS SUB SUBS ADC ADCS SBC SBCS
             MUL MADD MSUB SMULL UMULL SMULH UMULH UDIV SDIV
             ABS SMAX SMIN UMAX UMIN (FEAT_CSSC)
Logical:     AND ANDS ORR EOR BIC BICS ORN EON
Shift:       LSL LSR ASR ROR (immediate forms alias UBFM/SBFM/EXTR;
             register forms alias LSLV/LSRV/ASRV/RORV)
Move:        MOV MVN MOVZ MOVK MOVN
Compare:     CMP CMN TST CCMP CCMN
Bitfield:    SBFM UBFM BFM (base), BFI BFXIL SBFX UBFX SBFIZ UBFIZ (aliases)
Extension:   SXTB SXTH SXTW UXTB UXTH
Bit manip:   CLZ CLS RBIT REV REV16 REV32 EXTR CTZ CNT (FEAT_CSSC)
CondSelect:  CSEL CSINC CSINV CSNEG (base), CSET CSETM CINC CINV CNEG (aliases)
Load:        LDR LDRB LDRH LDRSW LDRSH LDRSB LDUR LDP LDXR LDAR LDAPR
Store:       STR STRB STRH STUR STP STXR STLR
Prefetch:    PRFM PRFUM (PLD/PST, L1/L2/L3, KEEP/STRM)
Branch:      B BL BR BLR RET B.cond CBZ CBNZ TBZ TBNZ
System:      SVC HVC SMC BRK MRS MSR NOP WFE WFI ERET
Cache:       DC ZVA/CVAC/CVAU/CIVAC, IC IALLU/IVAU
FP:          FADD FSUB FMUL FDIV FSQRT FMADD FCMP FCVT SCVTF UCVTF FMOV
NEON:        LD1-4 ST1-4 ADD FADD MUL ZIP UZP TBL INS UMOV CNT ADDV
             SQADD UQADD SQSUB UQSUB (saturating), DUP SHL BSL BIT BIF
SVE:         LD1W ST1W ADD MUL WHILELT INCW RDVL (predicated, VL-agnostic)
Atomic:      LDADD CAS SWP LDXR STXR LDAXR STLXR (LSE: +A/L/AL variants)
             STADD STSET STCLR (fire-and-forget atomics)
Barrier:     DMB DSB ISB
Security:    PACIA AUTIA PACIASP AUTIASP RETAA BTI (PAC + BTI)
MTE:         IRG STG ST2G STZ2G LDG ADDG SUBG (memory tagging)
```

---

*This document covers AArch64 (ARMv8-A/ARMv9-A) with notes on AArch32 differences. For the full authoritative reference, see the "Arm Architecture Reference Manual for A-profile architecture" (DDI 0487).*