Implementing string functions in x86-64 assembly: the string-instruction approach (2025)





Introduction

The C standard library offers a number of functions (whose declarations can be found in the string.h header) to manage null-terminated strings and arrays. These are some of the most used C functions, often implemented as builtins by the C compiler, as they are crucial to the speed of programs.

On the other hand, the x86 architecture contains "string instructions", aimed at implementing operations on strings at the hardware level. Moreover, the x86 architecture was incrementally enhanced with SIMD instructions over the years, allowing multiple bytes of data to be processed in a single instruction.

In this article, we'll inspect the implementation of string.h in the GNU standard library for x86, see how it compares with a pure assembly implementation of these functions using string instructions and SIMD, try to explain the choices made by the GNU developers, and help you write better assembly.

Disassembling a call to memcpy

One of the most popular C functions is memcpy. It copies an array of bytes to another, which is a very common operation and makes its performance particularly important. There are several ways you can perform this operation in x86 assembly. Let's see how it is implemented by gcc, using this simple C program (stdlib.h is included for EXIT_SUCCESS):

#include <stdlib.h>
#include <string.h>

#define BUF_LEN 1024

char a[BUF_LEN];
char b[BUF_LEN];

int main(void)
{
    memcpy(b, a, BUF_LEN);
    return EXIT_SUCCESS;
}

We can observe the generated assembly on godbolt, or compile the code using gcc 14.2:

gcc -O1 -g -o string main.c

and then disassemble the executable using:

objdump --source-comment="; " --disassembler-color=extended --disassembler-options=intel --no-show-raw-insn --disassemble=main string

You should get this result:

0000000000401134 <main>:
; int main(int argc, char *argv[]) {
;     memcpy(b, a, BUF_LEN);
  401134: mov    esi,0x404440
  401139: mov    edi,0x404040
  40113e: mov    ecx,0x80
  401143: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
;     return 0;
; }
  401146: mov    eax,0x0
  40114b: ret

The first surprising thing you notice is that the machine code does not contain any call to the memcpy function. It has been replaced by 3 mov instructions preceding a mysterious rep movsq instruction. rep movsq is one of the five string instructions defined in the "Intel® 64 and IA-32 Architectures Software Developer's Manual - Volume 1: Basic Architecture 5.1.8". So it is time to learn more about these string instructions.

The string instructions of x86

String instructions perform operations on array elements pointed to by rsi (source register) and rdi (destination register).

instruction   description      effect on registers
movs          Move string      *(rdi++) = *(rsi++)
cmps          Compare string   cmp *(rsi++), *(rdi++)
scas          Scan string      cmp rax, *(rdi++)
lods          Load string      rax = *(rsi++)
stos          Store string     *(rdi++) = rax

Each of these instructions must have a suffix (b, w, d, q) indicating the type of the elements pointed to by rdi and rsi (byte, word, doubleword, quadword). These instructions may also have a prefix indicating how they repeat themselves.

prefix        description                                            effect on registers
rep           Repeat while rcx is not zero                           for(; rcx != 0; rcx--)
repe/repz     Repeat while rcx is not zero and the ZF flag is set    for(; rcx != 0 && ZF; rcx--)
repne/repnz   Repeat while rcx is not zero and the ZF flag is clear  for(; rcx != 0 && !ZF; rcx--)

The repe/repz and repne/repnz prefixes are used only with the cmps and scas instructions (as they are the only ones modifying the RFLAGS register).

The movs instruction

Now that we have learned more about the string instructions, we can break down the effect of the rep movsq instruction:

1. copy the quadword pointed to by rsi to rdi;
2. add 8 to rsi and rdi so that they point to the next quadword;
3. decrement rcx and repeat until rcx == 0.

This is what we
would expect memcpy to do, except for one thing: bytes are not copied one by one, but in blocks of 8. Here, as the byte size of our arrays is a multiple of 8, we can copy the source array as an array of quadwords. This requires 8 times fewer operations than copying the array one byte at a time.

Let's change the size of the arrays to 1023 to see how the compiler reacts when the array size is no longer a multiple of 8:

0000000000401134 <main>:
; int main(int argc, char *argv[]) {
;     memcpy(b, a, BUF_LEN);
  401134: mov    esi,0x404440
  401139: mov    edi,0x404040
  40113e: mov    ecx,0x7f
  401143: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
  401146: mov    eax,DWORD PTR [rsi]
  401148: mov    DWORD PTR [rdi],eax
  40114a: movzx  eax,WORD PTR [rip+0x36eb]        # 40483c <a+0x3fc>
  401151: mov    WORD PTR [rip+0x32e4],ax         # 40443c <b+0x3fc>
  401158: movzx  eax,BYTE PTR [rip+0x36df]        # 40483e <a+0x3fe>
  40115f: mov    BYTE PTR [rip+0x32d9],al         # 40443e <b+0x3fe>
;     return 0;
; }
  401165: mov    eax,0x0
  40116a: ret

Instead of replacing rep movsq with the rep movsb instruction, gcc preferred to stop the repetition of the movsq instruction 8 bytes earlier and add mov instructions to copy a doubleword, a word, and a byte.

The cmps instruction

The cmps instruction compares the elements pointed to by rsi and rdi and sets the flags accordingly.
As cmps sets the ZF flag, we can use the repe/repz and repne/repnz prefixes to, respectively, continue until the strings differ or stop when matching characters are encountered. Let's write a basic memcmp function using this instruction:

; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n)
memcmp_cmpsb:
    mov  rcx, rdx   ; rcx = n
    xor  eax, eax   ; set the return value to zero
    xor  edx, edx   ; rdx = 0
    repe cmpsb      ; for(; rcx != 0 && ZF; rcx--)
                    ;     cmp *(rsi++), *(rdi++)
    setb al         ; if(!ZF && CF)  al = 1
    seta dl         ; if(!ZF && !CF) dl = 1
    sub  eax, edx   ; return al - dl
.exit:
    ret

We use the repe cmpsb instruction to iterate over the strings s1 and s2 until two bytes differ. When we exit the repe cmpsb instruction, the RFLAGS register is set according to the last byte comparison. We can then use the set[cc] instructions to set the bytes al and dl according to the result of that comparison.

In the same way that memcpy copies groups of 8 bytes, we can use the repe cmpsq instruction to compare bytes in groups of 8 (or cmpsd for groups of 4 bytes on 32-bit architectures).

; int memcmp_cmpsq_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n)
memcmp_cmpsq_unaligned:
    lea   rcx, [rdx + 0x7]  ; rcx = n + 7
    and   rdx, (8 - 1)      ; rdx = n % 8
    shr   rcx, 3            ; rcx = ceil(n / 8)
    xor   eax, eax          ; rax = 0
    repe  cmpsq             ; for(; rcx != 0 && ZF; rcx--)
                            ;     cmp *(rsi++), *(rdi++)
    je    .exit             ; if no difference was found, return
    mov   r8, [rdi - 0x8]   ; read the last (unaligned) quadword of s1
    mov   r9, [rsi - 0x8]   ; read the last (unaligned) quadword of s2
    test  rcx, rcx          ; if(rcx != 0)
    jnz   .cmp              ;     goto .cmp
    shl   rdx, 3            ; rdx = 8 * (n % 8)
    jz    .cmp              ; if(rdx == 0) goto .cmp
    bzhi  r8, r8, rdx       ; zero the bits of r8 above bit 8 * (n % 8)
    bzhi  r9, r9, rdx       ; zero the bits of r9 above bit 8 * (n % 8)
.cmp:
    bswap r8                ; convert r8 to big-endian for lexical comparison
    bswap r9                ; convert r9 to big-endian for lexical comparison
    cmp   r8, r9            ; lexical comparison of the quadwords
    seta  al                ; if(r8 > r9) al = 1
    setb  cl                ; if(r8 < r9) cl = 1
    sub   eax, ecx          ; return eax - ecx
.exit:
    ret

To get the result of the comparison, we need to compare the last two quadwords. However, on little-endian systems the least significant byte is the first one in memory, while we want to compare the bytes in lexical order; hence the need to convert the quadwords to big-endian using the bswap instruction.

Zero High Bits Starting with Specified Bit Position

The bzhi instruction is useful when you need to mask out the higher bits of a register. Here, when comparing the last quadword, we need to erase all bits in r8 and r9 that aren't "valid" (i.e. which are not part of the input arrays). You can find documentation about this instruction here.

Warning: this function should only be used on 8-byte-aligned blocks of memory. For production use, refer to the Benchmarking section.

The scas instruction

The scas instruction compares the content of rax with the element pointed to by rdi and sets the flags accordingly.
We can use it in a similar way to what we did for cmps, taking advantage of the repe/repz and repne/repnz prefixes. Let's write a simple strlen function using the scasb instruction:

; size_t strlen(rdi: const char *s)
strlen:
    xor   eax, eax  ; al = 0, the byte scasb scans for
    mov   rcx, -1
    repnz scasb     ; for(; rcx != 0 && !ZF; rcx--)
                    ;     cmp rax, *(rdi++)
    not   rcx       ; before this insn, rcx = -(len(rdi) + 2)
                    ; after it, rcx = ~(-(len(rdi) + 2)) = len(rdi) + 1
    dec   rcx       ; rcx = len(rdi)
    xchg  rax, rcx  ; rax = len(rdi)
    ret

Tip: the instruction sequence used here to calculate the length of a text string is a well-known technique that dates back to the 8086 CPU.

Warning: don't use this function in production code, as it can only compare bytes one by one. For production, always prefer a loop that compares groups of bytes using the largest registers (see the Benchmarking section).

The lods instruction

The lods instruction loads the element pointed to by rsi into the rax register and increments rsi to point to the next element.
As this instruction does nothing but set a register, it is never used with a prefix (the value would be overwritten on each repetition). It can, however, be used to examine a string, for instance to find a character:

; char *strchr_lodsb(rdi: const char *s, rsi: int c)
strchr_lodsb:
    xchg rdi, rsi       ; rdi = c, rsi = s
.loop:
    lodsb               ; al = *(rsi++)
    cmp  dil, al        ; if(c == al)
    je   .end           ;     goto .end
    test al, al         ; if(al != 0)
    jnz  .loop          ;     goto .loop
    xor  rax, rax       ; return 0
    ret
.end:
    lea  rax, [rsi - 1] ; return rsi - 1
    ret

Warning: for production code, always prefer to load data using the largest registers (see the Benchmarking section).

The stos instruction

The stos instruction writes the content of the rax register to the element pointed to by rdi and increments rdi to point to the next element. Note that according to the "Intel® 64 and IA-32 Architectures Software Developer's Manual - Volume 1: Basic Architecture 7.3.9.2":

"a REP STOS instruction is the fastest way to initialize a large block of memory."

In fact, this is how gcc implements memset when it knows the size and the alignment of the string:

#include <string.h>

#define BUF_LEN 1024

char a[BUF_LEN];

int main(int argc, char *argv[])
{
    memset(a, 1, BUF_LEN);
    return 0;
}

000000000040115a <main>:
; int main(int argc, char *argv[]) {
;     memset(a, 1, BUF_LEN);
  40115a: mov    edx,0x404460
  40115f: mov    ecx,0x80
  401164: movabs rax,0x101010101010101
  40116e: mov    rdi,rdx
  401171: rep stos QWORD PTR es:[rdi],rax
;     return 0;
; }
  401174: mov    eax,0x0
  401179: ret

Tip: always use the rep stosq instruction to initialize large blocks of memory. For small blocks of memory, use an unrolled loop and the largest registers.

Let's turn around

The direction flag

It wouldn't be as much fun if we couldn't go backward 😄

On x86, the flags register (RFLAGS) has a direction flag, RFLAGS.DF, which controls the direction of the string operations. This flag can be set and cleared, respectively, using the std and cld instructions.

std     ; SeT Direction
        ; here DF = 1, rdi and rsi are decremented
cld     ; CLear Direction
        ; here DF = 0, rdi and rsi are incremented

Here you have a detailed view of the RFLAGS register:

(figure: the RFLAGS register)

On Intel64, the upper 32 bits of the RFLAGS register are reserved, and the lower 32 bits are the same as the EFLAGS register of the 32-bit architectures.

(figure: the EFLAGS register on Intel64 and IA-32 architectures)

Saving and restoring RFLAGS

When you write an assembly subroutine using string instructions, you should always:

1. save the state of the RFLAGS register;
2. set or clear RFLAGS.DF;
3. do your work;
4. restore the RFLAGS register.

This way, your subroutine will work independently of the state of the direction flag and won't break other routines relying on this flag. To do so, you can push RFLAGS onto the stack using the pushfq instruction and restore it using the popfq instruction.

However, the System V Application Binary Interface for AMD64 states that:

"The direction flag DF in the %rFLAGS register must be clear (set to 'forward' direction) on function entry and return."

So, if you're writing code targeting the System V ABI, you may assume that the direction flag is clear and ensure that you keep it clear when leaving your functions.

An example with strrchr

We can code a simple version of strrchr (which
looks for the last occurrence of a character in a string) based on our strchr function, by simply setting rdi = s + len(s) + 1 and setting the direction flag.

; char *strrchr(rdi: const char *s, esi: int c)
strrchr:
	push rdi		; push 1st arg on the stack
	push rsi		; push 2nd arg on the stack

; rcx = strlen(rdi) + 1
call strlen
mov rcx, rax
add rcx, 1

std				; RFLAGS.DF = 1
pop rax			; rax = c
pop rdi			; rdi = s
add rdi, rcx	; rdi = s + len(s) + 1
xor rdx, rdx 	; rdx = 0
repne scasb		; for(; rcx != 0 and ZF == false; rcx--)
				;	 cmp al, *(rdi--)
jne .exit
mov rdx, rdi	; if(ZF == true)
sub rdx, 1		;	 rdx = rdi - 1

.exit:
	cld				; RFLAGS.DF = 0
	mov rax, rdx	; return rdx
	ret

We use the System V ABI for our functions, so there is no need to save the RFLAGS register, but we make sure to clear the direction flag before returning from the function.

Warning: you should avoid setting the direction flag, especially when using the movs and stos instructions, because, as we'll see in the last part, it may disable a whole class of optimizations known as "fast-string operations".

Vectorized string instructions

Vectorized string instructions are quite intricate and should not be used in most cases. If you're keen to discover some of the strangest instructions of x86, you can continue; but if you prefer to watch graphs, you can skip to the benchmarking section.

Implicit? Explicit? Index? Mask?

SSE4.2 introduced another set of 4 instructions: the vectorized string instructions.

instruction   description
pcmpestri     Packed Compare Explicit Length Strings, Return Index
pcmpistri     Packed Compare Implicit Length Strings, Return Index
pcmpestrm     Packed Compare Explicit Length Strings, Return Mask
pcmpistrm     Packed Compare Implicit Length Strings, Return Mask

Note: adding a v before any of these instructions makes it VEX.128 encoded and zeroes out the upper bits of the ymm registers, which may avoid some performance issues in case you forgot a vzeroupper.

A vectorized string instruction takes three operands:

pcmpestri xmm1, xmm2/m128, imm8

The first and second operands of the instruction contain the string fragments to be compared. The fragments are considered valid until:

- the first null byte, for the "Implicit Length" versions;
- the length contained in eax (for xmm1) and edx (for xmm2), for the "Explicit Length" versions.

The result is:

- the index of the first match, stored in the ecx register, for the "Index" versions;
- a mask of bits/bytes/words (depending on imm8), stored in the xmm0 register, for the "Mask" versions.

Other information is carried by the RFLAGS register:

flag       information
Carry      Result is non-zero
Sign       The first string ends
Zero       The second string ends
Overflow   Least significant bit of the result

The Parity and Adjust flags are reset.

The imm8 control byte

But this doesn't tell us which comparisons these instructions can perform. They can perform 4 basic operations:

- Find characters from a set: finds which of the bytes in the second operand belong to the set defined by the bytes in the first operand, comparing all 256 possible combinations in one operation.
- Find characters in a range: finds which of the bytes in the second operand are within the range defined by the first operand.
- Compare strings: determines whether the two strings are identical.
- Substring search: finds all occurrences of a substring defined by the first operand in the second operand.

The operation performed by the vectorized string instruction is controlled by the value of the imm8 byte:

imm8        description
·······0b   128-bit sources treated as 16 packed bytes.
·······1b   128-bit sources treated as 8 packed words.
······0·b   Packed bytes/words are unsigned.
······1·b   Packed bytes/words are signed.
····00··b   Mode is "equal any".
····01··b   Mode is "ranges".
····10··b   Mode is "equal each".
····11··b   Mode is "equal ordered".
···0····b   IntRes1 is unmodified.
···1····b   IntRes1 is negated (1's complement).
··0·····b   Negation of IntRes1 is for all 16 (8) bits.
··1·····b   Negation of IntRes1 is masked by reg/mem validity.
·0······b   Index: the index of the least significant set bit is used (regardless of the corresponding input element's validity).
            Mask: IntRes2 is returned in the least significant bits of xmm0.
·1······b   Index: the index of the most significant set bit is used (regardless of the corresponding input element's validity).
            Mask: each bit of IntRes2 is expanded to a byte/word.

The MSB of imm8 has no defined effect and should be 0.

This means that if we want to compare xmm1 and xmm2 for byte equality and get the index of the first non-matching byte, we have to set imm8 = 0b0001'1000. We can define some macros so you do not have to remember
this:

PACKED_UBYTE           equ 0b00
PACKED_UWORD           equ 0b01
PACKED_BYTE            equ 0b10
PACKED_WORD            equ 0b11

CMP_STR_EQU_ANY        equ (0b00 << 2)
CMP_STR_EQU_RANGES     equ (0b01 << 2)
CMP_STR_EQU_EACH       equ (0b10 << 2)
CMP_STR_EQU_ORDERED    equ (0b11 << 2)

CMP_STR_INV_ALL        equ (0b01 << 4)
CMP_STR_INV_VALID_ONLY equ (0b11 << 4)

CMP_STRI_FIND_LSB_SET  equ (0b00 << 6)
CMP_STRI_FIND_MSB_SET  equ (0b01 << 6)
CMP_STRM_BIT_MASK      equ (0b00 << 6)
CMP_STRM_BYTE_MASK     equ (0b01 << 6)

From these definitions, we can create the imm8 flag for a vpcmpxstrx instruction with a bitwise or. For instance, a flag for the vpcmpestri instruction to get the index of the first differing byte:

BYTEWISE_CMP equ (PACKED_UBYTE | CMP_STR_EQU_EACH | CMP_STR_INV_VALID_ONLY | CMP_STRI_FIND_LSB_SET)

A vectorized memcmp version

Now that we know how to use the vpcmpestri instruction and have defined the imm8 flag to make a byte-wise comparison of the contents of AVX registers, we can write a version of memcmp using the vectorized string instructions.

; int memcmp_vpcmpestri_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n)
memcmp_vpcmpestri_unaligned:
    xor  r10d, r10d
    mov  rax, rdx              ; rax = n
    test rdx, rdx              ; if(n == 0)
    jz   .exit                 ;     return n
    vmovdqu xmm2, [rdi + r10]  ; xmm2 = rdi[r10]
    vmovdqu xmm3, [rsi + r10]  ; xmm3 = rsi[r10]
.loop:
    ; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx.
    ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string.
    ; If all bytes are the same, rcx = 16.
    vpcmpestri xmm2, xmm3, BYTEWISE_CMP
    test cx, 0x10              ; if(rcx != 16)
    jz   .end                  ;     break
    add  r10, 0x10             ; r10 += 16
    vmovdqu xmm2, [rdi + r10]  ; xmm2 = rdi[r10]
    vmovdqu xmm3, [rsi + r10]  ; xmm3 = rsi[r10]
    sub  rdx, 0x10             ; rdx -= 16
    ja   .loop                 ; if(rdx > 0) goto .loop
    xor  eax, eax              ; rax = 0
    ret                        ; return
.end:
    xor  eax, eax              ; rax = 0
    cmp  rcx, rdx              ; if(index >= rdx)
    jae  .exit                 ;     return
    xor  edx, edx              ; rdx = 0
    add  r10, rcx              ; r10 += index
    mov  r8b, [rdi + r10]      ; r8b = rdi[r10]
    mov  r9b, [rsi + r10]      ; r9b = rsi[r10]
    cmp  r8b, r9b              ; if(r8b > r9b)
    seta al                    ;     return 1
    setb dl                    ; else
    sub  eax, edx              ;     return -1
.exit:
    ret

Benchmarking our string instructions

Benchmark 1: memcpy

At the beginning of the previous part, we disassembled a call to memcpy and saw that gcc had inlined it as a rep movsq instruction. The compiler is able to perform this optimization because the alignment and the size of both arrays are known at compile time. Let's add an indirection to the memcpy call so that the compiler can't rely on this information anymore.

#include <string.h>

#define BUF_LEN 1023

char a[BUF_LEN];
char b[BUF_LEN];

void *copy(void *restrict dest, const void *restrict src, size_t n)
{
    return memcpy(dest, src, n);
}

int main(int argc, char *argv[])
{
    copy(b, a, BUF_LEN);
    return 0;
}

objdump --source-comment="; " --disassembler-color=extended --disassembler-option=intel-mnemonic --no-show-raw-insn --disassemble=copy string

0000000000401126 <copy>:
; #define BUF_LEN 1023
;
; char a[BUF_LEN];
; char b[BUF_LEN];
;
; void *copy(void *restrict dest, const void *restrict src, size_t n) {
  401126: sub    rsp,0x8
;     return memcpy(dest, src, n);
  40112a: call   401030 <memcpy@plt>
; }
  40112f: add    rsp,0x8
  401133: ret

Another situation in which gcc will produce a proper call to the libc memcpy function is when the target architecture has vector extensions. In this situation, the compiler is aware that the memcpy implementation will use vector instructions, which may be faster than rep movs. You can test it by adding the flag -march=corei7 to your gcc command to see what code gcc will produce for an architecture with vector extensions (you can also see this in godbolt).

We can now compare different assembly versions of memcpy to its glibc implementation. I wrote 8 versions of a program copying 4 MiB of memory using: an unoptimized for loop, the glibc memcpy function, rep movsb, rep movsq, the SSE2 extension, and the AVX and AVX2 extensions. I also wrote a backward copy to compare the speed of rep movsb when RFLAGS.DF is set.

generic

#include <stddef.h>

__attribute__((optimize("O1")))
void *memcpy_dummy(void *restrict dst, const void *restrict src, size_t n)
{
    void *const ret = dst;
    for (size_t i = 0; i < n; i++)
        *((char *)dst++) = *((char *)src++);
    return ret;
}

movsb

; void *memcpy_movsb(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_movsb:
    mov rax, rdi    ; rax = dst
    mov rcx, rdx    ; rcx = n
    rep movsb       ; for(; rcx != 0; rcx--)
                    ;     *(rdi++) = *(rsi++)
    ret             ; return rax

movsb reversed

; void *memcpy_movsb_std(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_movsb_std:
    mov rax, rdi    ; rax = dst
    std             ; set the direction flag
    mov rcx, rdx    ; rcx = n
    sub rdx, 1      ; rdx = n - 1
    add rdi, rdx    ; rdi = dst + (n - 1)
    add rsi, rdx    ; rsi = src + (n - 1)
    rep movsb       ; for(; rcx != 0; rcx--)
                    ;     *(rdi--) = *(rsi--)
    cld             ; clear the direction flag
    ret             ; return rax

movb

; void *memcpy_movb(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_movb:
    mov   rax, rdi               ; save dst in rax
    test  rdx, rdx               ; if(rdx == 0)
    jz    .exit                  ;     goto .exit
    xor   ecx, ecx               ; rcx = 0
.loop:
    movzx r8d, byte [rsi + rcx]  ; r8b = rsi[rcx]
    mov   byte [rdi + rcx], r8b  ; rdi[rcx] = r8b
    add   rcx, 1                 ; rcx++
    cmp   rcx, rdx               ; if(rcx != n)
    jne   .loop                  ;     goto .loop
.exit:
    ret

movsq

; Macro used to copy fewer than 8 bytes.
; The macro takes two parameters:
;   - a register containing the byte count to copy
;   - a register containing the current 8-byte aligned offset
%macro memcpy_epilog_qword 2
    cmp   %1, 4                 ; if(%1 < 4)
    jb    %%word                ;     goto %%word
    mov   eax, [rsi + %2]       ; eax = src[%2]
    mov   [rdi + %2], eax       ; dst[%2] = eax
    add   %2, 4                 ; %2 += 4
    sub   %1, 4                 ; %1 -= 4
%%word:
    cmp   %1, 2                 ; if(%1 < 2)
    jb    %%byte                ;     goto %%byte
    movzx eax, word [rsi + %2]  ; ax = src[%2]
    mov   [rdi + %2], ax        ; dst[%2] = ax
    add   %2, 2                 ; %2 += 2
    sub   %1, 2                 ; %1 -= 2
%%byte:
    test  %1, %1                ; if(%1 == 0)
    jz    %%exit                ;     goto %%exit
    movzx eax, byte [rsi + %2]  ; al = src[%2]
    mov   [rdi + %2], al        ; dst[%2] = al
%%exit:
%endmacro

; void *memcpy_movsq(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_movsq:
    push rdi                  ; save dst; popped into rax on exit
    cmp  rdx, 8               ; if(n < 8)
    jb   .end                 ;     goto .end
    lea  rcx, [rsi + 8]       ; rcx = src + 8
    movsq                     ; copy the first quadword
    and  rsi, -8              ; rsi = align(src + 8, 8)
    sub  rcx, rsi             ; rcx = (src + 8) - align(src + 8, 8)
    sub  rdi, rcx             ; rdi = (dst + 8) - ((src + 8) - align(src + 8, 8))
    lea  rdx, [rdx + rcx - 8] ; rdx = n - (align(src + 8, 8) - src)
    mov  rcx, rdx             ; rcx = n - (align(src + 8, 8) - src)
    and  rcx, -8              ; round rcx down to a multiple of 8
    shr  rcx, 3               ; rcx /= 8
    rep  movsq                ; for(; rcx != 0; rcx--)
                              ;     *(rdi++) = *(rsi++)
    and  rdx, (8 - 1)         ; rdx = remaining byte count
.end:
    xor  ecx, ecx
    memcpy_epilog_qword rdx, rcx
.exit:
    pop  rax
    ret

movq

; void *memcpy_movq(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_movq:
    xor ecx, ecx         ; rcx = 0
    mov r9, rdx          ; r9 = n
    cmp r9, 8            ; if(n < 8)
    jb  .end             ;     goto .end
    mov r8, [rsi]        ; r8 = *src
    mov [rdi], r8        ; *dst = r8
    lea rcx, [rdi + 7]   ; rcx = dst + 7
    and rcx, -8          ; rcx = align(dst + 7, 8)
    sub rcx, rdi         ; rcx = align(dst + 7, 8) - dst
    sub r9, rcx          ; r9 = dst + n - align(dst + 7, 8)
.loop:
    mov rax, [rsi + rcx] ; rax = src[rcx]
    mov [rdi + rcx], rax ; dst[rcx] = rax
    add rcx, 8           ; rcx += 8
    sub r9, 8            ; r9 -= 8
    cmp r9, 8            ; if(r9 >= 8)
    jae .loop            ;     goto .loop
.end:
    memcpy_epilog_qword r9, rcx
    mov rax, rdi         ; return dst
    ret

avx

; Macro used to copy fewer than 16 bytes.
; The macro takes two parameters:
;   - a register containing the byte count to copy
;   - a register containing the current 16-byte aligned offset
%macro memcpy_epilog_avx 2
    cmp %1, 8
    jb  %%dword
    mov rax, [rsi + %2]
    mov [rdi + %2], rax
    add %2, 8
    sub %1, 8
%%dword:
    memcpy_epilog_qword %1, %2
%endmacro

; void *memcpy_avx(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_avx:
    xor ecx, ecx              ; rcx = 0
    cmp rdx, 16               ; if(n < 16)
    jb  .end                  ;     goto .end
    vmovdqu xmm0, [rsi]       ; copy the first
    vmovdqu [rdi], xmm0       ; 16 bytes
    lea rcx, [rsi + 15]       ; rcx = src + 15
    and rcx, -16              ; rcx = align(src + 15, 16)
    sub rcx, rsi              ; rcx = align(src + 15, 16) - src
    sub rdx, rcx              ; rdx = src + n - align(src + 15, 16)
.loop:
    vmovdqa xmm0, [rsi + rcx] ; xmm0 = rsi[rcx]
    vmovdqu [rdi + rcx], xmm0 ; rdi[rcx] = xmm0
    add rcx, 16               ; rcx += 16
    sub rdx, 16               ; rdx -= 16
    cmp rdx, 16               ; if(rdx >= 16)
    jae .loop                 ;     goto .loop
.end:
    memcpy_epilog_avx rdx, rcx
    mov rax, rdi
    ret

avx2

; Macro used to copy fewer than 32 bytes.
; The macro takes two parameters:
;   - a register containing the byte count to copy
;   - a register containing the current 32-byte aligned offset
%macro memcpy_epilog_avx2 2
    cmp %1, 16
    jb  %%qword
    movdqa xmm0, [rsi + %2]
    movdqu [rdi + %2], xmm0
    add %2, 16
    sub %1, 16
%%qword:
    memcpy_epilog_avx %1, %2
%endmacro

; void *memcpy_avx2(rdi: void dst[.n], rsi: const void src[.n], rdx: size_t n)
memcpy_avx2:
    xor ecx, ecx              ; rcx = 0
    mov r9, rdx               ; r9 = n
    cmp r9, 32                ; if(n < 32)
    jb  .end                  ;     goto .end
    vmovdqu ymm0, [rsi]       ; copy the first
    vmovdqu [rdi], ymm0       ; 32 bytes
    lea rcx, [rsi + 31]       ; rcx = src + 31
    and rcx, -32              ; rcx = align(src + 31, 32)
    sub rcx, rsi              ; rcx = align(src + 31, 32) - src
    sub r9, rcx               ; r9 = src + n - align(src + 31, 32)
.loop:
    vmovdqa ymm0, [rsi + rcx] ; ymm0 = rsi[rcx]
    vmovdqu [rdi + rcx], ymm0 ; rdi[rcx] = ymm0
    add rcx, 32               ; rcx += 32
    sub r9, 32                ; r9 -= 32
    cmp r9, 32                ; if(r9 >= 32)
    jae .loop                 ;     goto .loop
.end:
    memcpy_epilog_avx2 r9, rcx
    mov rax, rdi
    vzeroupper
    ret

Note: for the "dummy" version, I forced the optimization level to -O1. Otherwise, gcc would replace the loop in our custom copy function with a call to memcpy (at -O2) or write a vectorized loop using the SSE2 extension (at -O3). You can check this in godbolt by changing the level of optimization. This means that when writing casual C code with -O2 or -O3 level optimization, a simple for loop will often be identical to or more efficient than a call to memcpy.

Here are the results I got on my "13th Gen Intel(R) Core(TM) i7-1355U (12) @ 5.00 GHz" using the b63 micro-benchmarking tool:

(figure: memcpy benchmark results)

On my hardware, all the implementations are reasonably close; the slowest by far are the backward copy (setting RFLAGS.DF), the for loop, and the movb version (which copies only one byte at a time), but you may get different results depending on your CPU. This is an example of string instructions being nearly as fast as copying using the largest registers of the processor.

According to "Optimizing subroutines in assembly language":

"REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. The largest word size (DWORD in 32-bit mode, QWORD in 64-bit mode) is preferred. Both source and destination should be aligned by the word size or better. In many cases, however, it is faster to use vector registers. Moving data in the largest available registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors."

The speed of rep movs and rep stos relies heavily on fast-string operations, which need certain processor-dependent conditions to be met. The "Intel® 64 and IA-32 Architectures Software Developer's Manual - Volume 1: Basic Architecture 7.3.9.3" defines fast-string operations:

"To improve performance, more recent processors support modifications to the processor's operation during the string store operations initiated with the MOVS, MOVSB, STOS, and STOSB instructions. This optimized operation, called fast-string operation, is used when the execution of one of those instructions meets certain initial conditions (see below). Instructions using fast-string operation effectively operate on the string in groups that may include multiple elements of the native data size (byte, word, doubleword, or quadword)."

The general conditions for fast-string operations to happen are:

- the count of bytes must be high;
- both source and destination must be aligned (on the size of the largest registers you have);
- the direction must be forward;
- the distance between source and destination must be at least the cache line size;
- the memory type for both source and destination must be either write-back or write-combining (you can normally assume this last condition is met).

So, how does glibc implement the memcpy function? We have two ways to know how memcpy works underneath.
The first is to look at the glibc source code; the second is to dump the disassembled machine code in gdb.

We can get the source of glibc 2.40, the GNU implementation of libc, and verify its signature using these commands:

wget https://ftp.gnu.org/gnu/glibc/glibc-2.40.tar.xz https://ftp.gnu.org/gnu/glibc/glibc-2.40.tar.xz.sig
gpg --recv-keys 7273542B39962DF7B299931416792B4EA25340F8
gpg --verify glibc-2.40.tar.xz.sig glibc-2.40.tar.xz
tar xvf glibc-2.40.tar.xz

You can find the implementation of the memcpy function in the string/memcpy.c file:

/* Copy memory to memory until the specified number of bytes
   has been copied.  Overlap is NOT handled correctly.
   Copyright (C) 1991-2024 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

The GNU C Library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

The GNU C Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with the GNU C Library; if not, see https://www.gnu.org/licenses/. */

#include <string.h>
#include <memcopy.h>

#ifndef MEMCPY

# define MEMCPY memcpy

#endif

void *
MEMCPY (void *dstpp, const void *srcpp, size_t len)
{
  unsigned long int dstp = (long int) dstpp;
  unsigned long int srcp = (long int) srcpp;

/* Copy from the beginning to the end. */

  /* If there not too few bytes to copy, use word copy.  */
  if (len >= OP_T_THRES)
    {
      /* Copy just a few bytes to make DSTP aligned.  */
      len -= (-dstp) % OPSIZ;
      BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

  /* Copy whole pages from SRCP to DSTP by virtual address manipulation,
 as much as possible.  */

  PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

  /* Copy from SRCP to DSTP taking advantage of the known alignment of
 DSTP.  Number of bytes remaining is put in the third argument,
 i.e. in LEN.  This number may vary from machine to machine.  */

  WORD_COPY_FWD (dstp, srcp, len, len);

  /* Fall out and copy the tail.  */
}

  /* There are just a few bytes to copy.  Use byte memory operations.  */
  BYTE_COPY_FWD (dstp, srcp, len);

  return dstpp;
}
libc_hidden_builtin_def (MEMCPY)

Info: the PAGE_COPY_FWD_MAYBE macro is empty for all architectures except the mach architecture, so you can ignore it.

The included sysdeps/generic/memcopy.h file contains the definitions of the macros used by memcpy and an explanation of its internals:

/* The strategy of the memory functions is:

 1. Copy bytes until the destination pointer is aligned.

 2. Copy words in unrolled loops.  If the source and destination
 are not aligned in the same way, use word memory operations,
 but shift and merge two read words before writing.

 3. Copy the few remaining bytes.

This is fast on processors that have at least 10 registers for allocation by GCC, and that can access memory at reg+const in one instruction.

   I made an "exhaustive" test of this memmove when I wrote it,
   exhaustive in the sense that I tried all alignment and length
   combinations, with and without overlap.  */

For i386 (i.e. x86_32) architectures, the sysdeps/i386/memcopy.h macro definitions replace the generic ones, defining the WORD_COPY_FWD and BYTE_COPY_FWD macros in terms of string instructions:

#undef BYTE_COPY_FWD
#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)                              \
  do {                                                                     \
    int __d0;                                                              \
    asm volatile(/* Clear the direction flag, so copying goes forward.  */ \
                 "cld\n"                                                   \
                 /* Copy bytes.  */                                        \
                 "rep\n"                                                   \
                 "movsb" :                                                 \
                 "=D" (dst_bp), "=S" (src_bp), "=c" (__d0) :               \
                 "0" (dst_bp), "1" (src_bp), "2" (nbytes) :                \
                 "memory");                                                \
  } while (0)
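Outside glibc's macro plumbing, the same technique works as a standalone GNU C function on x86-64; a minimal sketch (byte_copy_fwd is our name, not glibc's):

```c
#include <stddef.h>
#include <string.h>

/* Forward byte copy with `rep movsb`, mirroring what BYTE_COPY_FWD expands
   to.  The "+D", "+S" and "+c" constraints pin the arguments to rdi, rsi
   and rcx, which the string instruction uses implicitly. */
static void byte_copy_fwd(void *dst, const void *src, size_t nbytes)
{
    __asm__ volatile("cld\n\t"   /* direction flag = 0: copy forward */
                     "rep movsb" /* while (rcx--) *(rdi++) = *(rsi++) */
                     : "+D"(dst), "+S"(src), "+c"(nbytes)
                     :
                     : "memory");
}
```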

#undef WORD_COPY_FWD
#define WORD_COPY_FWD(dst_bp, src_bp, nbytes_left, nbytes)                   \
  do                                                                         \
    {                                                                        \
      int __d0;                                                              \
      asm volatile(/* Clear the direction flag, so copying goes forward.  */ \
                   "cld\n"                                                   \
                   /* Copy longwords.  */                                    \
                   "rep\n"                                                   \
                   "movsl" :                                                 \
                   "=D" (dst_bp), "=S" (src_bp), "=c" (__d0) :               \
                   "0" (dst_bp), "1" (src_bp), "2" ((nbytes) / 4) :          \
                   "memory");                                                \
      (nbytes_left) = (nbytes) % 4;                                          \
    } while (0)

For x86_64, the implementation is much more intricate, as it requires choosing a memcpy implementation according to the vector extensions available. The sysdeps/x86_64/multiarch/memcpy.c file includes ifunc-memmove.h, which defines an IFUNC_SELECTOR function returning a pointer to a function chosen according to the characteristics of the CPU running the program:

sysdeps/x86_64/multiarch/memcpy.c

/*
Multiple versions of memcpy. All versions must be listed in ifunc-impl-list.c. Copyright (C) 2017-2024 Free Software Foundation, Inc. This file is part of the GNU C Library.

The GNU C Library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

The GNU C Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with the GNU C Library; if not, see https://www.gnu.org/licenses/. */

/* Define multiple versions only for the definition in libc.  */
#if IS_IN (libc)
# define memcpy __redirect_memcpy
# include <string.h>
# undef memcpy

# define SYMBOL_NAME memcpy
# include "ifunc-memmove.h"

libc_ifunc_redirected (__redirect_memcpy, __new_memcpy, IFUNC_SELECTOR ());

# ifdef SHARED
__hidden_ver1 (__new_memcpy, __GI_memcpy, __redirect_memcpy)
  __attribute__ ((visibility ("hidden")));
# endif

# include <shlib-compat.h>
versioned_symbol (libc, __new_memcpy, memcpy, GLIBC_2_14);
#endif

sysdeps/x86_64/multiarch/ifunc-memmove.h

/* Common definition for memcpy, mempcpy and memmove implementation.
   All versions must be listed in ifunc-impl-list.c.
   Copyright (C) 2017-2024 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

The GNU C Library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

The GNU C Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with the GNU C Library; if not, see https://www.gnu.org/licenses/. */

#include <init-arch.h>

extern __typeof (REDIRECT_NAME) OPTIMIZE (erms) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_no_vzeroupper) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned_erms) attribute_hidden;

static inline void *
IFUNC_SELECTOR (void)
{
  const struct cpu_features *cpu_features = __get_cpu_features ();

  if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS)
      || CPU_FEATURES_ARCH_P (cpu_features, Prefer_FSRM))
    return OPTIMIZE (erms);

  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512F)
      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
    {
      if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
        {
          if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
            return OPTIMIZE (avx512_unaligned_erms);

          return OPTIMIZE (avx512_unaligned);
        }

      return OPTIMIZE (avx512_no_vzeroupper);
    }

  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load, ))
    {
      if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
        {
          if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
            return OPTIMIZE (evex_unaligned_erms);

          return OPTIMIZE (evex_unaligned);
        }

  if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
{
  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
    return OPTIMIZE (avx_unaligned_erms_rtm);

  return OPTIMIZE (avx_unaligned_rtm);
}

  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
			       Prefer_No_VZEROUPPER, !))
{
  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
    return OPTIMIZE (avx_unaligned_erms);

  return OPTIMIZE (avx_unaligned);
}
}

  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3)
      /* Leave this as runtime check.  The SSSE3 is optimized almost
         exclusively for avoiding unaligned memory access during the
         copy and by and large is not better than the sse2
         implementation as a general purpose memmove.  */
      && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy))
    {
      return OPTIMIZE (ssse3);
    }

  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
    return OPTIMIZE (sse2_unaligned_erms);

return OPTIMIZE (sse2_unaligned); }You can learn more about indirect functions in glibc in this blog post. What you need to understand here is that the glibc will be able to determine at runtime what function should be called when calling memcpy based on your hardware capabilities.Let’s use gdb to find out how our call to memcpy will be resolved at runtime.Run the following commands in gdb to run until the first call to memcpy:b main run b memcpy c disassembleYou should get something like this:<+0>: endbr64 <+4>: mov rcx,QWORD PTR [rip+0x13b0f5] # 0x7ffff7fa0eb0 <+11>: lea rax,[rip+0x769e] # 0x7ffff7e6d460 <__memmove_erms> <+18>: mov edx,DWORD PTR [rcx+0x1c4] <+24>: test dh,0x90 <+27>: jne 0x7ffff7e65e41 <memcpy@@GLIBC_2.14+145> <+29>: mov esi,DWORD PTR [rcx+0xb8] <+35>: test esi,0x10000 <+41>: jne 0x7ffff7e65e48 <memcpy@@GLIBC_2.14+152> <+43>: test dh,0x2 <+46>: je 0x7ffff7e65e20 <memcpy@@GLIBC_2.14+112> <+48>: test esi,esi <+50>: js 0x7ffff7e65e88 <memcpy@@GLIBC_2.14+216> <+56>: test esi,0x800 <+62>: je 0x7ffff7e65e10 <memcpy@@GLIBC_2.14+96> <+64>: and esi,0x200 <+70>: lea rdx,[rip+0xc9d43] # 0x7ffff7f2fb40 <__memmove_avx_unaligned_erms_rtm> <+77>: lea rax,[rip+0xc9cac] # 0x7ffff7f2fab0 <__memmove_avx_unaligned_rtm> <+84>: cmovne rax,rdx <+88>: ret <+89>: nop DWORD PTR [rax+0x0] <+96>: test dh,0x8 <+99>: je 0x7ffff7e65ea8 <memcpy@@GLIBC_2.14+248> <+105>: nop DWORD PTR [rax+0x0] <+112>: test BYTE PTR [rcx+0x9d],0x2 <+119>: jne 0x7ffff7e65e78 <memcpy@@GLIBC_2.14+200> <+121>: and esi,0x200 <+127>: lea rax,[rip+0x774a] # 0x7ffff7e6d580 <__memmove_sse2_unaligned_erms> <+134>: lea rdx,[rip+0x76b3] # 0x7ffff7e6d4f0 <memcpy@GLIBC_2.2.5> <+141>: cmove rax,rdx <+145>: ret <+146>: nop WORD PTR [rax+rax1+0x0] <+152>: test dh,0x20 <+155>: jne 0x7ffff7e65ddb <memcpy@@GLIBC_2.14+43> <+157>: lea rax,[rip+0xdc23c] # 0x7ffff7f42090 <__memmove_avx512_no_vzeroupper> <+164>: test esi,esi <+166>: jns 0x7ffff7e65e41 <memcpy@@GLIBC_2.14+145> <+168>: and esi,0x200 <+174>: lea 
rdx,[rip+0xda19b] # 0x7ffff7f40000 <__memmove_avx512_unaligned_erms> <+181>: lea rax,[rip+0xda104] # 0x7ffff7f3ff70 <__memmove_avx512_unaligned> <+188>: cmovne rax,rdx <+192>: ret <+193>: nop DWORD PTR [rax+0x0] <+200>: and edx,0x20 <+203>: lea rax,[rip+0xdcc3e] # 0x7ffff7f42ac0 <__memmove_ssse3> <+210>: jne 0x7ffff7e65e29 <memcpy@@GLIBC_2.14+121> <+212>: ret <+213>: nop DWORD PTR [rax] <+216>: and esi,0x200 <+222>: lea rdx,[rip+0xd112b] # 0x7ffff7f36fc0 <__memmove_evex_unaligned_erms> <+229>: lea rax,[rip+0xd1094] # 0x7ffff7f36f30 <__memmove_evex_unaligned> <+236>: cmovne rax,rdx <+240>: ret <+241>: nop DWORD PTR [rax+0x0] <+248>: and esi,0x200 <+254>: lea rdx,[rip+0xc124b] # 0x7ffff7f27100 <__memmove_avx_unaligned_erms> <+261>: lea rax,[rip+0xc11b4] # 0x7ffff7f27070 <__memmove_avx_unaligned> <+268>: cmovne rax,rdx <+272>: retInstead of breaking inside the code of the memcpy function, the glibc runtime called the IFUNC_SELECTOR. This function will return the pointer that will be used for later calls of the memcpy function.Let’s now hit the finish gdb command to see what the selector returns in rax.Return value of IFUNC_SELECTORThe IFUNC_SELECTOR returned a pointer to __memmove_avx_unaligned_erms. 
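User code can tap into the same mechanism through GCC's ifunc function attribute on x86-64/ELF. A minimal sketch (the names answer, resolve_answer and the two impl_* functions are invented for the example):

```c
/* Two interchangeable implementations; a real program would give them
   genuinely different bodies (e.g. scalar vs AVX2). */
static int impl_baseline(void) { return 42; }
static int impl_avx2(void)     { return 42; }

/* The resolver runs once, when the dynamic loader resolves `answer`, and
   the pointer it returns is used for every later call -- the same role
   IFUNC_SELECTOR plays for memcpy. */
static int (*resolve_answer(void))(void)
{
    __builtin_cpu_init(); /* required before __builtin_cpu_supports here */
    return __builtin_cpu_supports("avx2") ? impl_avx2 : impl_baseline;
}

int answer(void) __attribute__((ifunc("resolve_answer")));
```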
This function is defined in sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S and sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S with handwritten assembly.But it’s quite difficult to figure out the execution flow while reading this, so let’s put a breakpoint on this symbol, hit continue, and run this function step by step.I highlighted the code reached during the execution of the function:__memmove_avx_unaligned_erms<+0>: endbr64 <+4>: mov rax,rdi ; Copy the pointer to destination into rax <+7>: cmp rdx,0x20 ; Compare size to 0x20 (32 bytes) <+11>: jb 0x7ffff7f27130 <__memmove_avx_unaligned_erms+48> ; If smaller jump <+13>: vmovdqu ymm0,YMMWORD PTR [rsi] ; Otherwise loads the first 32 bytes of src into ymm0 <+17>: cmp rdx,0x40 ; Compare size to 0x40 (64 bytes) <+21>: ja 0x7ffff7f271c0 <__memmove_avx_unaligned_erms+192> ; If above jump <+27>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1-0x20] <+33>: vmovdqu YMMWORD PTR [rdi],ymm0 <+37>: vmovdqu YMMWORD PTR [rdi+rdx1-0x20],ymm1 <+43>: vzeroupper <+46>: ret <+47>: nop <+48>: cmp edx,0x10 <+51>: jae 0x7ffff7f27162 <__memmove_avx_unaligned_erms+98> <+53>: cmp edx,0x8 <+56>: jae 0x7ffff7f27180 <__memmove_avx_unaligned_erms+128> <+58>: cmp edx,0x4 <+61>: jae 0x7ffff7f27155 <__memmove_avx_unaligned_erms+85> <+63>: cmp edx,0x1 <+66>: jl 0x7ffff7f27154 <__memmove_avx_unaligned_erms+84> <+68>: mov cl,BYTE PTR [rsi] <+70>: je 0x7ffff7f27152 <__memmove_avx_unaligned_erms+82> <+72>: movzx esi,WORD PTR [rsi+rdx1-0x2] <+77>: mov WORD PTR [rdi+rdx1-0x2],si <+82>: mov BYTE PTR [rdi],cl <+84>: ret <+85>: mov ecx,DWORD PTR [rsi+rdx1-0x4] <+89>: mov esi,DWORD PTR [rsi] <+91>: mov DWORD PTR [rdi+rdx1-0x4],ecx <+95>: mov DWORD PTR [rdi],esi <+97>: ret <+98>: vmovdqu xmm0,XMMWORD PTR [rsi] <+102>: vmovdqu xmm1,XMMWORD PTR [rsi+rdx1-0x10] <+108>: vmovdqu XMMWORD PTR [rdi],xmm0 <+112>: vmovdqu XMMWORD PTR [rdi+rdx1-0x10],xmm1 <+118>: ret <+119>: nop WORD PTR [rax+rax1+0x0] <+128>: mov rcx,QWORD PTR [rsi+rdx1-0x8] <+133>: mov rsi,QWORD PTR 
[rsi] <+136>: mov QWORD PTR [rdi],rsi <+139>: mov QWORD PTR [rdi+rdx1-0x8],rcx <+144>: ret <+145>: vmovdqu ymm2,YMMWORD PTR [rsi+rdx1-0x20] <+151>: vmovdqu ymm3,YMMWORD PTR [rsi+rdx1-0x40] <+157>: vmovdqu YMMWORD PTR [rdi],ymm0 <+161>: vmovdqu YMMWORD PTR [rdi+0x20],ymm1 <+166>: vmovdqu YMMWORD PTR [rdi+rdx1-0x20],ymm2 <+172>: vmovdqu YMMWORD PTR [rdi+rdx1-0x40],ymm3 <+178>: vzeroupper <+181>: ret <+182>: cs nop WORD PTR [rax+rax1+0x0] <+192>: cmp rdx,QWORD PTR [rip+0x7a051] # 0x7ffff7fa1218 <__x86_rep_movsb_threshold> <+199>: ja 0x7ffff7f273c0 <__memmove_avx_unaligned_erms+704> ; Jump if size > __x86_rep_movsb_threshold <+205>: cmp rdx,0x100 <+212>: ja 0x7ffff7f27235 <__memmove_avx_unaligned_erms+309> <+214>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+219>: cmp rdx,0x80 <+226>: jbe 0x7ffff7f27191 <__memmove_avx_unaligned_erms+145> <+228>: vmovdqu ymm2,YMMWORD PTR [rsi+0x40] <+233>: vmovdqu ymm3,YMMWORD PTR [rsi+0x60] <+238>: vmovdqu ymm4,YMMWORD PTR [rsi+rdx1-0x20] <+244>: vmovdqu ymm5,YMMWORD PTR [rsi+rdx1-0x40] <+250>: vmovdqu ymm6,YMMWORD PTR [rsi+rdx1-0x60] <+256>: vmovdqu ymm7,YMMWORD PTR [rsi+rdx1-0x80] <+262>: vmovdqu YMMWORD PTR [rdi],ymm0 <+266>: vmovdqu YMMWORD PTR [rdi+0x20],ymm1 <+271>: vmovdqu YMMWORD PTR [rdi+0x40],ymm2 <+276>: vmovdqu YMMWORD PTR [rdi+0x60],ymm3 <+281>: vmovdqu YMMWORD PTR [rdi+rdx1-0x20],ymm4 <+287>: vmovdqu YMMWORD PTR [rdi+rdx1-0x40],ymm5 <+293>: vmovdqu YMMWORD PTR [rdi+rdx1-0x60],ymm6 <+299>: vmovdqu YMMWORD PTR [rdi+rdx1-0x80],ymm7 <+305>: vzeroupper <+308>: ret <+309>: mov rcx,rdi <+312>: sub rcx,rsi <+315>: cmp rcx,rdx <+318>: jb 0x7ffff7f272f0 <__memmove_avx_unaligned_erms+496> <+324>: cmp rdx,QWORD PTR [rip+0x80fe5] # 0x7ffff7fa8230 <__x86_shared_non_temporal_threshold> <+331>: ja 0x7ffff7f27420 <__memmove_avx_unaligned_erms+800> <+337>: lea r8,[rcx+rdx1] <+341>: xor r8,rcx <+344>: shr r8,0x3f <+348>: and ecx,0xf00 <+354>: add ecx,r8d <+357>: je 0x7ffff7f272f5 <__memmove_avx_unaligned_erms+501> <+363>: vmovdqu ymm5,YMMWORD PTR 
[rsi+rdx1-0x20] <+369>: vmovdqu ymm6,YMMWORD PTR [rsi+rdx1-0x40] <+375>: mov rcx,rdi <+378>: or rdi,0x1f <+382>: vmovdqu ymm7,YMMWORD PTR [rsi+rdx1-0x60] <+388>: vmovdqu ymm8,YMMWORD PTR [rsi+rdx1-0x80] <+394>: sub rsi,rcx <+397>: inc rdi <+400>: add rsi,rdi <+403>: lea rdx,[rcx+rdx1-0x80] <+408>: nop DWORD PTR [rax+rax1+0x0] <+416>: vmovdqu ymm1,YMMWORD PTR [rsi] <+420>: vmovdqu ymm2,YMMWORD PTR [rsi+0x20] <+425>: vmovdqu ymm3,YMMWORD PTR [rsi+0x40] <+430>: vmovdqu ymm4,YMMWORD PTR [rsi+0x60] <+435>: sub rsi,0xffffffffffffff80 <+439>: vmovdqa YMMWORD PTR [rdi],ymm1 <+443>: vmovdqa YMMWORD PTR [rdi+0x20],ymm2 <+448>: vmovdqa YMMWORD PTR [rdi+0x40],ymm3 <+453>: vmovdqa YMMWORD PTR [rdi+0x60],ymm4 <+458>: sub rdi,0xffffffffffffff80 <+462>: cmp rdx,rdi <+465>: ja 0x7ffff7f272a0 <__memmove_avx_unaligned_erms+416> <+467>: vmovdqu YMMWORD PTR [rdx+0x60],ymm5 <+472>: vmovdqu YMMWORD PTR [rdx+0x40],ymm6 <+477>: vmovdqu YMMWORD PTR [rdx+0x20],ymm7 <+482>: vmovdqu YMMWORD PTR [rdx],ymm8 <+486>: vmovdqu YMMWORD PTR [rcx],ymm0 <+490>: vzeroupper <+493>: ret <+494>: xchg ax,ax <+496>: test rcx,rcx <+499>: je 0x7ffff7f272ea <__memmove_avx_unaligned_erms+490> <+501>: vmovdqu ymm5,YMMWORD PTR [rsi+0x20] <+506>: vmovdqu ymm6,YMMWORD PTR [rsi+0x40] <+511>: lea rcx,[rdi+rdx1-0x81] <+519>: vmovdqu ymm7,YMMWORD PTR [rsi+0x60] <+524>: vmovdqu ymm8,YMMWORD PTR [rsi+rdx1-0x20] <+530>: sub rsi,rdi <+533>: and rcx,0xffffffffffffffe0 <+537>: add rsi,rcx <+540>: nop DWORD PTR [rax+0x0] <+544>: vmovdqu ymm1,YMMWORD PTR [rsi+0x60] <+549>: vmovdqu ymm2,YMMWORD PTR [rsi+0x40] <+554>: vmovdqu ymm3,YMMWORD PTR [rsi+0x20] <+559>: vmovdqu ymm4,YMMWORD PTR [rsi] <+563>: add rsi,0xffffffffffffff80 <+567>: vmovdqa YMMWORD PTR [rcx+0x60],ymm1 <+572>: vmovdqa YMMWORD PTR [rcx+0x40],ymm2 <+577>: vmovdqa YMMWORD PTR [rcx+0x20],ymm3 <+582>: vmovdqa YMMWORD PTR [rcx],ymm4 <+586>: add rcx,0xffffffffffffff80 <+590>: cmp rdi,rcx <+593>: jb 0x7ffff7f27320 <__memmove_avx_unaligned_erms+544> <+595>: vmovdqu YMMWORD 
PTR [rdi],ymm0 <+599>: vmovdqu YMMWORD PTR [rdi+0x20],ymm5 <+604>: vmovdqu YMMWORD PTR [rdi+0x40],ymm6 <+609>: vmovdqu YMMWORD PTR [rdi+0x60],ymm7 <+614>: vmovdqu YMMWORD PTR [rdx+rdi1-0x20],ymm8 <+620>: vzeroupper <+623>: ret <+624>: data16 cs nop WORD PTR [rax+rax1+0x0] <+635>: nop DWORD PTR [rax+rax1+0x0] <+640>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+645>: test ecx,0xe00 <+651>: jne 0x7ffff7f273f2 <__memmove_avx_unaligned_erms+754> <+653>: mov r9,rcx <+656>: lea rcx,[rsi+rdx1-0x1] <+661>: or rsi,0x3f <+665>: lea rdi,[rsi+r91+0x1] <+670>: sub rcx,rsi <+673>: inc rsi <+676>: rep movs BYTE PTR es:[rdi],BYTE PTR ds:[rsi] <+678>: vmovdqu YMMWORD PTR [r8],ymm0 <+683>: vmovdqu YMMWORD PTR [r8+0x20],ymm1 <+689>: vzeroupper <+692>: ret <+693>: data16 cs nop WORD PTR [rax+rax1+0x0] <+704>: mov rcx,rdi ; Copy pointer to dest in rcx <+707>: sub rcx,rsi ; rcx = dest - src <+710>: cmp rcx,rdx ; if (rcx < size) <+713>: jb 0x7ffff7f272f0 <__memmove_avx_unaligned_erms+496> ; jump <+719>: mov r8,rdi ; Copy pointer to dest in r8 <+722>: cmp rdx,QWORD PTR [rip+0x80e4f] # 0x7ffff7fa8228 <__x86_rep_movsb_stop_threshold> <+729>: jae 0x7ffff7f27420 <__memmove_avx_unaligned_erms+800> ; if(size >= __x86_rep_movsb_stop_threshold) jump <+731>: test BYTE PTR [rip+0x80e3e],0x1 # 0x7ffff7fa8220 <__x86_string_control> <+738>: je 0x7ffff7f27380 <__memmove_avx_unaligned_erms+640> ; if(__x86_string_control & 0x1 == 0) jump <+740>: cmp ecx,0xffffffc0 ; if(rcx > 0xffffffc0) <+743>: ja 0x7ffff7f2726b <__memmove_avx_unaligned_erms+363> ; jump <+749>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] ; Loads the first 32th-63th bytes of src into ymm1 <+754>: sub rsi,rdi ; rsi = src - dest <+757>: add rdi,0x3f ; rdi = dest + 63 bytes <+761>: lea rcx,[r8+rdx1] ; rcx = dest + size <+765>: and rdi,0xffffffffffffffc0 ; Align rdi on 64 bytes <+769>: add rsi,rdi ; rsi = (src - dest + (dest{64 bytes aligned})) <+772>: sub rcx,rdi ; rcx = dest + size - dest{64 bytes aligned} <+775>: rep movs BYTE PTR es:[rdi],BYTE PTR 
ds:[rsi] ; OUR REP MOVSB INSTRUCTION !!! <+777>: vmovdqu YMMWORD PTR [r8],ymm0 ; Copy ymm0 content to dest <+782>: vmovdqu YMMWORD PTR [r8+0x20],ymm1 ; Copy ymm1 content to dest <+788>: vzeroupper ; Reset for SSE2 <+791>: ret <+792>: nop DWORD PTR [rax+rax1+0x0] <+800>: mov r11,QWORD PTR [rip+0x80e09] # 0x7ffff7fa8230 <__x86_shared_non_temporal_threshold> <+807>: cmp rdx,r11 <+810>: jb 0x7ffff7f27251 <__memmove_avx_unaligned_erms+337> <+816>: neg rcx <+819>: cmp rdx,rcx <+822>: ja 0x7ffff7f2726b <__memmove_avx_unaligned_erms+363> <+828>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+833>: vmovdqu YMMWORD PTR [rdi],ymm0 <+837>: vmovdqu YMMWORD PTR [rdi+0x20],ymm1 <+842>: mov r8,rdi <+845>: and r8,0x3f <+849>: sub r8,0x40 <+853>: sub rsi,r8 <+856>: sub rdi,r8 <+859>: add rdx,r8 <+862>: not ecx <+864>: mov r10,rdx <+867>: test ecx,0xf00 <+873>: je 0x7ffff7f275f0 <__memmove_avx_unaligned_erms+1264> <+879>: shl r11,0x4 <+883>: cmp rdx,r11 <+886>: jae 0x7ffff7f275f0 <__memmove_avx_unaligned_erms+1264> <+892>: and edx,0x1fff <+898>: shr r10,0xd <+902>: cs nop WORD PTR [rax+rax1+0x0] <+912>: mov ecx,0x20 <+917>: prefetcht0 BYTE PTR [rsi+0x80] <+924>: prefetcht0 BYTE PTR [rsi+0xc0] <+931>: prefetcht0 BYTE PTR [rsi+0x100] <+938>: prefetcht0 BYTE PTR [rsi+0x140] <+945>: prefetcht0 BYTE PTR [rsi+0x1080] <+952>: prefetcht0 BYTE PTR [rsi+0x10c0] <+959>: prefetcht0 BYTE PTR [rsi+0x1100] <+966>: prefetcht0 BYTE PTR [rsi+0x1140] <+973>: vmovdqu ymm0,YMMWORD PTR [rsi] <+977>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+982>: vmovdqu ymm2,YMMWORD PTR [rsi+0x40] <+987>: vmovdqu ymm3,YMMWORD PTR [rsi+0x60] <+992>: vmovdqu ymm4,YMMWORD PTR [rsi+0x1000] <+1000>: vmovdqu ymm5,YMMWORD PTR [rsi+0x1020] <+1008>: vmovdqu ymm6,YMMWORD PTR [rsi+0x1040] <+1016>: vmovdqu ymm7,YMMWORD PTR [rsi+0x1060] <+1024>: sub rsi,0xffffffffffffff80 <+1028>: vmovntdq YMMWORD PTR [rdi],ymm0 <+1032>: vmovntdq YMMWORD PTR [rdi+0x20],ymm1 <+1037>: vmovntdq YMMWORD PTR [rdi+0x40],ymm2 <+1042>: vmovntdq YMMWORD PTR 
[rdi+0x60],ymm3 <+1047>: vmovntdq YMMWORD PTR [rdi+0x1000],ymm4 <+1055>: vmovntdq YMMWORD PTR [rdi+0x1020],ymm5 <+1063>: vmovntdq YMMWORD PTR [rdi+0x1040],ymm6 <+1071>: vmovntdq YMMWORD PTR [rdi+0x1060],ymm7 <+1079>: sub rdi,0xffffffffffffff80 <+1083>: dec ecx <+1085>: jne 0x7ffff7f27495 <__memmove_avx_unaligned_erms+917> <+1091>: add rdi,0x1000 <+1098>: add rsi,0x1000 <+1105>: dec r10 <+1108>: jne 0x7ffff7f27490 <__memmove_avx_unaligned_erms+912> <+1114>: sfence <+1117>: cmp edx,0x80 <+1123>: jbe 0x7ffff7f275ba <__memmove_avx_unaligned_erms+1210> <+1125>: prefetcht0 BYTE PTR [rsi+0x80] <+1132>: prefetcht0 BYTE PTR [rsi+0xc0] <+1139>: prefetcht0 BYTE PTR [rdi+0x80] <+1146>: prefetcht0 BYTE PTR [rdi+0xc0] <+1153>: vmovdqu ymm0,YMMWORD PTR [rsi] <+1157>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+1162>: vmovdqu ymm2,YMMWORD PTR [rsi+0x40] <+1167>: vmovdqu ymm3,YMMWORD PTR [rsi+0x60] <+1172>: sub rsi,0xffffffffffffff80 <+1176>: add edx,0xffffff80 <+1179>: vmovdqa YMMWORD PTR [rdi],ymm0 <+1183>: vmovdqa YMMWORD PTR [rdi+0x20],ymm1 <+1188>: vmovdqa YMMWORD PTR [rdi+0x40],ymm2 <+1193>: vmovdqa YMMWORD PTR [rdi+0x60],ymm3 <+1198>: sub rdi,0xffffffffffffff80 <+1202>: cmp edx,0x80 <+1208>: ja 0x7ffff7f27565 <__memmove_avx_unaligned_erms+1125> <+1210>: vmovdqu ymm0,YMMWORD PTR [rsi+rdx1-0x80] <+1216>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1-0x60] <+1222>: vmovdqu ymm2,YMMWORD PTR [rsi+rdx1-0x40] <+1228>: vmovdqu ymm3,YMMWORD PTR [rsi+rdx1-0x20] <+1234>: vmovdqu YMMWORD PTR [rdi+rdx1-0x80],ymm0 <+1240>: vmovdqu YMMWORD PTR [rdi+rdx1-0x60],ymm1 <+1246>: vmovdqu YMMWORD PTR [rdi+rdx1-0x40],ymm2 <+1252>: vmovdqu YMMWORD PTR [rdi+rdx1-0x20],ymm3 <+1258>: vzeroupper <+1261>: ret <+1262>: xchg ax,ax <+1264>: and edx,0x3fff <+1270>: shr r10,0xe <+1274>: nop WORD PTR [rax+rax1+0x0] <+1280>: mov ecx,0x20 <+1285>: prefetcht0 BYTE PTR [rsi+0x80] <+1292>: prefetcht0 BYTE PTR [rsi+0xc0] <+1299>: prefetcht0 BYTE PTR [rsi+0x1080] <+1306>: prefetcht0 BYTE PTR [rsi+0x10c0] <+1313>: prefetcht0 BYTE PTR 
[rsi+0x2080] <+1320>: prefetcht0 BYTE PTR [rsi+0x20c0] <+1327>: prefetcht0 BYTE PTR [rsi+0x3080] <+1334>: prefetcht0 BYTE PTR [rsi+0x30c0] <+1341>: vmovdqu ymm0,YMMWORD PTR [rsi] <+1345>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+1350>: vmovdqu ymm2,YMMWORD PTR [rsi+0x40] <+1355>: vmovdqu ymm3,YMMWORD PTR [rsi+0x60] <+1360>: vmovdqu ymm4,YMMWORD PTR [rsi+0x1000] <+1368>: vmovdqu ymm5,YMMWORD PTR [rsi+0x1020] <+1376>: vmovdqu ymm6,YMMWORD PTR [rsi+0x1040] <+1384>: vmovdqu ymm7,YMMWORD PTR [rsi+0x1060] <+1392>: vmovdqu ymm8,YMMWORD PTR [rsi+0x2000] <+1400>: vmovdqu ymm9,YMMWORD PTR [rsi+0x2020] <+1408>: vmovdqu ymm10,YMMWORD PTR [rsi+0x2040] <+1416>: vmovdqu ymm11,YMMWORD PTR [rsi+0x2060] <+1424>: vmovdqu ymm12,YMMWORD PTR [rsi+0x3000] <+1432>: vmovdqu ymm13,YMMWORD PTR [rsi+0x3020] <+1440>: vmovdqu ymm14,YMMWORD PTR [rsi+0x3040] <+1448>: vmovdqu ymm15,YMMWORD PTR [rsi+0x3060] <+1456>: sub rsi,0xffffffffffffff80 <+1460>: vmovntdq YMMWORD PTR [rdi],ymm0 <+1464>: vmovntdq YMMWORD PTR [rdi+0x20],ymm1 <+1469>: vmovntdq YMMWORD PTR [rdi+0x40],ymm2 <+1474>: vmovntdq YMMWORD PTR [rdi+0x60],ymm3 <+1479>: vmovntdq YMMWORD PTR [rdi+0x1000],ymm4 <+1487>: vmovntdq YMMWORD PTR [rdi+0x1020],ymm5 <+1495>: vmovntdq YMMWORD PTR [rdi+0x1040],ymm6 <+1503>: vmovntdq YMMWORD PTR [rdi+0x1060],ymm7 <+1511>: vmovntdq YMMWORD PTR [rdi+0x2000],ymm8 <+1519>: vmovntdq YMMWORD PTR [rdi+0x2020],ymm9 <+1527>: vmovntdq YMMWORD PTR [rdi+0x2040],ymm10 <+1535>: vmovntdq YMMWORD PTR [rdi+0x2060],ymm11 <+1543>: vmovntdq YMMWORD PTR [rdi+0x3000],ymm12 <+1551>: vmovntdq YMMWORD PTR [rdi+0x3020],ymm13 <+1559>: vmovntdq YMMWORD PTR [rdi+0x3040],ymm14 <+1567>: vmovntdq YMMWORD PTR [rdi+0x3060],ymm15 <+1575>: sub rdi,0xffffffffffffff80 <+1579>: dec ecx <+1581>: jne 0x7ffff7f27605 <__memmove_avx_unaligned_erms+1285> <+1587>: add rdi,0x3000 <+1594>: add rsi,0x3000 <+1601>: dec r10 <+1604>: jne 0x7ffff7f27600 <__memmove_avx_unaligned_erms+1280> <+1610>: sfence <+1613>: cmp edx,0x80 <+1619>: jbe 0x7ffff7f277aa 
<__memmove_avx_unaligned_erms+1706> <+1621>: prefetcht0 BYTE PTR [rsi+0x80] <+1628>: prefetcht0 BYTE PTR [rsi+0xc0] <+1635>: prefetcht0 BYTE PTR [rdi+0x80] <+1642>: prefetcht0 BYTE PTR [rdi+0xc0] <+1649>: vmovdqu ymm0,YMMWORD PTR [rsi] <+1653>: vmovdqu ymm1,YMMWORD PTR [rsi+0x20] <+1658>: vmovdqu ymm2,YMMWORD PTR [rsi+0x40] <+1663>: vmovdqu ymm3,YMMWORD PTR [rsi+0x60] <+1668>: sub rsi,0xffffffffffffff80 <+1672>: add edx,0xffffff80 <+1675>: vmovdqa YMMWORD PTR [rdi],ymm0 <+1679>: vmovdqa YMMWORD PTR [rdi+0x20],ymm1 <+1684>: vmovdqa YMMWORD PTR [rdi+0x40],ymm2 <+1689>: vmovdqa YMMWORD PTR [rdi+0x60],ymm3 <+1694>: sub rdi,0xffffffffffffff80 <+1698>: cmp edx,0x80 <+1704>: ja 0x7ffff7f27755 <__memmove_avx_unaligned_erms+1621> <+1706>: vmovdqu ymm0,YMMWORD PTR [rsi+rdx1-0x80] <+1712>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1-0x60] <+1718>: vmovdqu ymm2,YMMWORD PTR [rsi+rdx1-0x40] <+1724>: vmovdqu ymm3,YMMWORD PTR [rsi+rdx1-0x20] <+1730>: vmovdqu YMMWORD PTR [rdi+rdx1-0x80],ymm0 <+1736>: vmovdqu YMMWORD PTR [rdi+rdx1-0x60],ymm1 <+1742>: vmovdqu YMMWORD PTR [rdi+rdx1-0x40],ymm2 <+1748>: vmovdqu YMMWORD PTR [rdi+rdx1-0x20],ymm3 <+1754>: vzeroupper <+1757>: retRemember: We are copying 4Mib of data, so rdx = 0x400000.The __memmove_avx_unaligned_erms performs several tests over the rdx register, which contains the size of our buffer (3rd argument for System V ABI).Here is each test performed and the result of the conditional jump:ConditionTaken ?size is greater than 32 bytes✅size is greater than 64 bytes✅size is greater than __x86_rep_movsb_threshold✅dest doesn’t start in src✅size is greater than __x86_rep_movsb_stop_threshold❌LSB(__x86_string_control) == 0❌dest - src is greater than 2^32-64❌✅ = taken ❌ = not takenWe can check the value of the thresholds and of the “string control” constant.Value of the comparison constants in gdbAnd after aligning rsi and rdi on 64 bytes, we call our beloved rep movsb. 
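The tests in the table above boil down to a size-based dispatch. A simplified C sketch of the decision (the threshold values here are placeholders, not the tuned ones glibc computes at startup):

```c
#include <string.h>

/* Illustrative values only; glibc derives the real thresholds from the
   CPU's cache hierarchy at startup. */
static const unsigned long rep_movsb_threshold      = 0x1000;
static const unsigned long rep_movsb_stop_threshold = 0x100000;

/* Simplified picture of the strategy __memmove_avx_unaligned_erms picks
   for a forward, non-overlapping copy of `size` bytes. */
static const char *copy_strategy(unsigned long size)
{
    if (size <= 32)
        return "scalar/xmm small copy";
    if (size <= 64)
        return "two ymm loads/stores";
    if (size <= rep_movsb_threshold)
        return "unrolled ymm loop";
    if (size < rep_movsb_stop_threshold)
        return "rep movsb";
    return "non-temporal ymm stores";
}
```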
😄

But note that this is very dependent on our memory layout:

- With size >= 0x700000, memcpy would copy the buffer in blocks of 2^14 bytes using all the ymm registers and prefetching, then finish by copying blocks of 128 bytes until the end.
- With 32 < size <= 64, it would be a simple copy of two (possibly overlapping) 32-byte blocks through ymm0 and ymm1.
- etc…

We can find an explanation for the __x86_rep_movsb_stop_threshold value in this commit message:

In the process of optimizing memcpy for AMD machines, we have found the vector move operations are outperforming enhanced REP MOVSB for data transfers above the L2 cache size on Zen3 architectures. To handle this use case, we are adding an upper bound parameter on enhanced REP MOVSB: '__x86_rep_movsb_stop_threshold'.

As per large-bench results, we are configuring this parameter to the L2 cache size for AMD machines and applicable from Zen3 architecture supporting the ERMS feature. For architectures other than AMD, it is the computed value of non-temporal threshold parameter.

Note that sysdeps/x86/dl-cacheinfo.h has been changed since this commit, and __x86_rep_movsb_stop_threshold is nowadays equal to __x86_shared_non_temporal_threshold for all x86 architectures.

You can read all the details of the choices made for the __x86_shared_non_temporal_threshold tunable in sysdeps/x86/dl-cacheinfo.h:

  /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
     is microarch specific.  The default is 1/4).  For most Intel processors
     with an initial release date between 2017 and 2023, a thread's typical
     share of the cache is from 18-64MB.  Using a reasonable size fraction of
     L3 is meant to estimate the point where non-temporal stores begin
     out-competing REP MOVSB.  As well the point where the fact that
     non-temporal stores are forced back to main memory would already occurred
     to the majority of the lines in the copy.  Note, concerns about the
     entire L3 cache being evicted by the copy are mostly alleviated by the
     fact that modern HW detects streaming patterns and provides proper LRU
     hints so that the maximum thrashing capped at 1/associativity.  */
  unsigned long int non_temporal_threshold
      = shared / cachesize_non_temporal_divisor;

  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
     likely have incorrect/incomplete cache info in which case, default to
     3/4 * per-thread L3 to avoid regressions.  */
  unsigned long int non_temporal_threshold_lowbound
      = shared_per_thread * 3 / 4;
  if (non_temporal_threshold < non_temporal_threshold_lowbound)
    non_temporal_threshold = non_temporal_threshold_lowbound;

  /* If no ERMS, we use the per-thread L3 chunking.  Normal cacheable stores
     run a higher risk of actually thrashing the cache as they don't have a
     HW LRU hint.  As well, their performance in highly parallel situations
     is noticeably worse.  Zhaoxin processors are an exception, the lowbound
     is not suitable for them based on actual test data.  */
  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS)
      && cpu_features->basic.kind != arch_kind_zhaoxin)
    non_temporal_threshold = non_temporal_threshold_lowbound;

  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value
     of 'x86_non_temporal_threshold' by LOG_4X_MEMCPY_THRESH (4) and it is
     best if that operation cannot overflow.  Minimum of 0x4040 (16448)
     because the L(large_memset_4x) loops need 64-byte to cache align and
     enough space for at least 1 iteration of 4x PAGE_SIZE unrolled loop.
     Both values are reflected in the manual.  */
  unsigned long int maximum_non_temporal_threshold = SIZE_MAX >> 4;
  unsigned long int minimum_non_temporal_threshold = 0x4040;

  /* If `non_temporal_threshold` is less than `minimum_non_temporal_threshold`
     it most likely means we failed to detect the cache info.  We don't want
     to default to `minimum_non_temporal_threshold` as such a small value,
     while correct, has bad performance.  We default to 64MB as reasonable
     default bound.  64MB is likely conservative in that most/all systems
     would choose a lower value so it should never forcing non-temporal
     stores when they otherwise wouldn't be used.  */
  if (non_temporal_threshold < minimum_non_temporal_threshold)
    non_temporal_threshold = 64 * 1024 * 1024;
  else if (non_temporal_threshold > maximum_non_temporal_threshold)
    non_temporal_threshold = maximum_non_temporal_threshold;

When the block to copy is greater than __x86_shared_non_temporal_threshold, the __memmove_avx_unaligned_erms implementation uses an unrolled loop over the AVX2 registers with prefetching and non-temporal stores, using vmovntdq (VEX-encoded move of non-temporal double quadwords).

The "Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1", section 9.6.1, gives hints on when to use non-temporal stores:

Use non-temporal stores in the cases when the data to be stored is:

- Write-once (non-temporal).
- Too large and thus cause cache thrashing.

Non-temporal stores do not invoke a cache line allocation, which means they are not write-allocate. As a result, caches are not polluted and no dirty writeback is generated to compete with useful data bandwidth. Without using non-temporal stores, bus bandwidth will suffer when caches start to be thrashed because of dirty writebacks.

The memcpy function of the glibc is a piece of code tailored to be as efficient as possible and adaptable to every hardware configuration and memory layout.

Benchmark 2: memset

As you can see with godbolt, gcc will inline a call to memset using the rep stosq instruction under certain circumstances. This is already a good hint about the efficiency of this instruction.

#include <string.h>

#define BUF_LEN (1 << 13)

char a[BUF_LEN];
int value;

int main(void)
{
    memset(a, value, BUF_LEN);
    return 0;
}

gcc -O1 -g -o memset memset.c

; int main(void) {
;     memset(a, value, BUF_LEN);
  401106: mov    edi,0x404060
  40110b: movzx  eax,BYTE PTR [rip+0x2f2e]        # 404040
  401112: movabs rdx,0x101010101010101
  40111c: imul   rax,rdx
  401120: mov    ecx,0x400
  401125: rep stos QWORD PTR es:[rdi],rax
;     return 0;
; }
  401128: mov    eax,0x0
  40112d: ret

To broadcast the first byte of value into every byte of the rax register, the compiler puts 0x0101010101010101 in rdx and multiplies rax by rdx.

; void *memset_movb(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_movb:
    test rdx, rdx        ; if(rdx == 0)
    jz .exit             ;     goto .exit
    xor ecx, ecx         ; rcx = 0
.loop:
    mov [rdi + rcx], sil ; s[rcx] = sil
    add rcx, 1           ; rcx++
    sub rdx, 1           ; rdx--
    jnz .loop            ; if(rdx != 0) goto .loop
.exit:
    mov rax, rdi         ; return s
    ret

I wrote a macro to reproduce this broadcast:

%macro setup_memset_rax 0 mov rcx, rdx ; rcx = n movzx eax, sil ; Mask the first byte of esi mov rdx, 0x0101010101010101 ; Load the multiplication mask mul rdx ; Replicate al in all rax register (rdx is erased) mov rdx, rcx ; rdx = n %endmacro

; Macro used to set less than a quadword ; The macro take two parameters: ; - A register containing the byte count to copy ; - A register containing the current 8-byte aligned offset %macro memset_epilog_qword 2 cmp %1, 4 jb %%word mov [rdi + %2], eax add %2, 4 sub %1, 4 %%word: cmp %1, 2 jb %%byte mov [rdi + %2], ax add %2, 2 sub %1, 2 %%byte: test %1, %1 jz %%exit mov [rdi + %2], al %%exit: %endmacro

; void *memset_stosq(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_stosq: setup_memset_rax mov rsi, rdi ; rsi = s cmp rdx, 8 ; if(n < 8) jb .end ; goto .end mov rcx, rdi ; rcx = s add rdx, rdi ; rdx = dst + n stosq ; *(rdi++) = rax and rdi, -8 ; rdi = align(dst + 8, 8) sub rdx, rdi ; rdx = dst + n - align(dst + 8, 8) mov rcx, rdx ; rcx = dst + n - align(dst + 8, 8) and rdx, (8 - 1) ; rdx = (dst + n - align(dst + 8, 8)) % 8 shr rcx, 3 ; rcx = (dst + n - align(dst + 8, 8)) / 8 rep stosq ; for(; rcx != 0; rcx -= 1) ; *(rdi++) = rax .end: xor ecx, ecx memset_epilog_qword rdx, rcx .exit: mov rax, rsi ; Return s ret

; void *memset_movq(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_movq: cmp rdx, 8 jb .end setup_memset_rax xor ecx, ecx .loop: mov [rdi + rcx], rax add rcx, 8 sub rdx, 8 cmp rdx, 8 jae .loop .end: memset_epilog_qword rdx, rcx mov rax, rdi ret

; Macro used to set less than a 16 bytes ; The macro take two parameters: ; - A register containing the byte count to copy ; - A register containing the current 16-byte aligned offset %macro memset_epilog_avx 2 vmovq rax, xmm0 cmp %1, 8 jb %%dword mov [rdi + %2], rax add %2, 8 sub %1, 8 %%dword: memset_epilog_qword %1, %2 %endmacro

; void *memset_avx(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_avx: movzx esi, sil ; Copy sil in eax with zero-extend vmovd xmm0, esi ; Copy the low dword (eax) into xmm0 vpxor xmm1, xmm1 ; Load a shuffle mask filled with zero indexes vpshufb xmm0, xmm0, xmm1 ; Broadcast the first byte of xmm0 to all xmm0 bytes cmp rdx, 16 jb .end vmovdqu [rdi], xmm0 lea rcx, [rdi + 15] ; rcx = src + 15 and rcx, -16 ; rcx = align((src + 15), 16) sub rcx, rdi ; rcx = align((src + 15), 16) - src sub rdx, rcx ; rdx = src + n - align((src + 15), 16) cmp rdx, 16 jb .end .loop: vmovdqa [rdi + rcx], xmm0 ; Copy the 16 bytes of xmm0 to s add rcx, 16 ; rcx += 16 sub rdx, 16 ; rdx -= 16 cmp rdx, 16 ; if(rdx >= 16) jae .loop ; goto .loop .end: memset_epilog_avx rdx, rcx .exit: mov rax, rdi ; rax = s ret

; Macro used to set less than a 32 bytes ; The macro take two parameters: ; - A register containing the byte count to copy ; - A register containing the current 32-byte aligned offset %macro memset_epilog_avx2 2 cmp %1, 16 jb %%qword vmovdqu [rdi + %2], xmm0 add %2, 16 sub %1, 16 %%qword: memset_epilog_avx %1, %2 %endmacro

; void *memset_avx2(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_avx2: xor ecx, ecx movzx esi, sil ; Copy sil in eax with zero-extend vmovd xmm1, esi ; Copy the low dword (eax) into xmm0 vpbroadcastb ymm0, xmm1 ; Broadcast the first byte of xmm1 to all ymm0 cmp rdx, 32 jb .end vmovdqu [rdi], ymm0 lea rcx, [rdi + 31] ; rcx = src + 31 and rcx, -32 ; rcx = align((src + 31), 32) sub rcx, rdi ; rcx = align((src + 31), 32) - src sub rdx, rcx ; rdx = src + n - align((src + 31), 32) cmp rdx, 32 ; if(rdx < 32) jb .end ; goto .end .loop: vmovdqa [rdi + rcx], ymm0 ; Copy the 32 bytes of ymm0 to s add rcx, 32 ; rcx += 32 sub rdx, 32 ; rdx -= 32 cmp rdx, 32 ; if(rdx >= 32) jae .loop ; goto .loop .end: memset_epilog_avx2 rdx, rcx mov rax, rdi ; rax = s vzeroupper retOnce again, I wrote 6 different versions of the memset function: an unoptimized for loop, the glibc memset function, rep stosb, rep stosq, and the AVX and AVX2 extensions. genericvoid *attribute((optimize("O1"))) memset_dummy(void *dst, unsigned int c, unsigned long n) { void *const ret = dst; for (int i = 0; i < n; i++) *((char *)dst++) = (char)c; return ret; } stosbsection .text ; void *memset_stosb(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_stosb: mov rcx, rdx ; rcx = n movzx eax, sil ; eax = c mov rsi, rdi ; rdx = s rep stosb ; for(; rcx != 0; rcx--) ; *(rdi++) = al mov rax, rsi ; return s ret stosb reversed; void *memset_stosb_std(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_stosb_std: std ; Set direction flag movzx eax, sil ; eax = c mov rsi, rdi ; rsi = s mov rcx, rdx ; rcx = n sub rdx, 1 ; rdx = n - 1 add rdi, rdx ; rdi = s + (n - 1) rep stosb ; for(; rcx != 0; rcx--) ; *(rdi--) = al mov rax, rsi ; return s cld ; ret movb; void *memset_movb(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_movb: test rdx, rdx ; if(rdx == 0) jz .exit ; goto .exit xor ecx, ecx ; rcx = 0 .loop: mov [rdi + rcx], sil ; s[rcx] = sil add rcx, 1 ; rcx++ sub rdx, 1 ; rdx-- jnz .loop ; if(rdx != 0) goto .loop 
.exit: mov rax, rdi ; return s ret stosq; Macro used to set less than a quadword ; The macro take two parameters: ; - A register containing the byte count to copy ; - A register containing the current 8-byte aligned offset %macro memset_epilog_qword 2 cmp %1, 4 jb %%word mov [rdi + %2], eax add %2, 4 sub %1, 4 %%word: cmp %1, 2 jb %%byte mov [rdi + %2], ax add %2, 2 sub %1, 2 %%byte: test %1, %1 jz %%exit mov [rdi + %2], al %%exit: %endmacro

; void *memset_stosq(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_stosq: setup_memset_rax mov rsi, rdi ; rsi = s cmp rdx, 8 ; if(n < 8) jb .end ; goto .end mov rcx, rdi ; rcx = s add rdx, rdi ; rdx = dst + n stosq ; *(rdi++) = rax and rdi, -8 ; rdi = align(dst + 8, 8) sub rdx, rdi ; rdx = dst + n - align(dst + 8, 8) mov rcx, rdx ; rcx = dst + n - align(dst + 8, 8) and rdx, (8 - 1) ; rdx = (dst + n - align(dst + 8, 8)) % 8 shr rcx, 3 ; rcx = (dst + n - align(dst + 8, 8)) / 8 rep stosq ; for(; rcx != 0; rcx -= 1) ; *(rdi++) = rax .end: xor ecx, ecx memset_epilog_qword rdx, rcx .exit: mov rax, rsi ; Return s ret movq; void *memset_movq(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_movq: cmp rdx, 8 jb .end setup_memset_rax xor ecx, ecx .loop: mov [rdi + rcx], rax add rcx, 8 sub rdx, 8 cmp rdx, 8 jae .loop .end: memset_epilog_qword rdx, rcx mov rax, rdi ret avx; Macro used to set less than a 16 bytes ; The macro take two parameters: ; - A register containing the byte count to copy ; - A register containing the current 16-byte aligned offset %macro memset_epilog_avx 2 vmovq rax, xmm0 cmp %1, 8 jb %%dword mov [rdi + %2], rax add %2, 8 sub %1, 8 %%dword: memset_epilog_qword %1, %2 %endmacro

; void *memset_avx(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_avx: movzx esi, sil ; Copy sil in eax with zero-extend vmovd xmm0, esi ; Copy the low dword (eax) into xmm0 vpxor xmm1, xmm1 ; Load a shuffle mask filled with zero indexes vpshufb xmm0, xmm0, xmm1 ; Broadcast the first byte of xmm0 to all xmm0 bytes cmp rdx, 16 jb .end vmovdqu [rdi], xmm0 lea rcx, [rdi + 15] ; rcx = src + 15 and rcx, -16 ; rcx = align((src + 15), 16) sub rcx, rdi ; rcx = align((src + 15), 16) - src sub rdx, rcx ; rdx = src + n - align((src + 15), 16) cmp rdx, 16 jb .end .loop: vmovdqa [rdi + rcx], xmm0 ; Copy the 16 bytes of xmm0 to s add rcx, 16 ; rcx += 16 sub rdx, 16 ; rdx -= 16 cmp rdx, 16 ; if(rdx >= 16) jae .loop ; goto .loop .end: memset_epilog_avx rdx, rcx .exit: mov rax, rdi ; rax = s ret avx2; Macro used to set less than a 32 bytes ; The macro take two parameters: ; - A register containing the byte count to copy ; - A register containing the current 32-byte aligned offset %macro memset_epilog_avx2 2 cmp %1, 16 jb %%qword vmovdqu [rdi + %2], xmm0 add %2, 16 sub %1, 16 %%qword: memset_epilog_avx %1, %2 %endmacro

; void *memset_avx2(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_avx2: xor ecx, ecx movzx esi, sil ; Copy sil in eax with zero-extend vmovd xmm1, esi ; Copy the low dword (eax) into xmm0 vpbroadcastb ymm0, xmm1 ; Broadcast the first byte of xmm1 to all ymm0 cmp rdx, 32 jb .end vmovdqu [rdi], ymm0 lea rcx, [rdi + 31] ; rcx = src + 31 and rcx, -32 ; rcx = align((src + 31), 32) sub rcx, rdi ; rcx = align((src + 31), 32) - src sub rdx, rcx ; rdx = src + n - align((src + 31), 32) cmp rdx, 32 ; if(rdx < 32) jb .end ; goto .end .loop: vmovdqa [rdi + rcx], ymm0 ; Copy the 32 bytes of ymm0 to s add rcx, 32 ; rcx += 32 sub rdx, 32 ; rdx -= 32 cmp rdx, 32 ; if(rdx >= 32) jae .loop ; goto .loop .end: memset_epilog_avx2 rdx, rcx mov rax, rdi ; rax = s vzeroupper ret

Once again, I wrote 6 different versions of the memset function: an unoptimized for loop, the glibc memset function, rep stosb, rep stosq, and the AVX and AVX2 extensions.

generic void *__attribute__((optimize("O1"))) memset_dummy(void *dst, unsigned int c, unsigned long n) { void *const ret = dst; for (int i = 0; i < n; i++) *((char *)dst++) = (char)c; return ret; }

stosb section .text ; void *memset_stosb(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_stosb: mov rcx, rdx ; rcx = n movzx eax, sil ; eax = c mov rsi, rdi ; rsi = s rep stosb ; for(; rcx != 0; rcx--) ; *(rdi++) = al mov rax, rsi ; return s ret

stosb reversed; void *memset_stosb_std(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_stosb_std: std ; Set direction flag movzx eax, sil ; eax = c mov rsi, rdi ; rsi = s mov rcx, rdx ; rcx = n sub rdx, 1 ; rdx = n - 1 add rdi, rdx ; rdi = s + (n - 1) rep stosb ; for(; rcx != 0; rcx--) ; *(rdi--) = al mov rax, rsi ; return s cld ; Clear direction flag ret

movb; void *memset_movb(rdi: void s[.n], rsi: int c, rdx: size_t n); memset_movb: test rdx, rdx ; if(rdx == 0) jz .exit ; goto .exit xor ecx, ecx ; rcx = 0 .loop: mov [rdi + rcx], sil ; s[rcx] = sil add rcx, 1 ; rcx++ sub rdx, 1 ; rdx-- jnz .loop ; if(rdx != 0) goto .loop
Like memcpy, it is an indirect function which, in my case, resolves into __memset_avx2_unaligned_erms.As for memcpy, I highlighted the code reached during the execution of my memset for a block of 4MiB.__memset_avx2_unaligned_erms<+0>: endbr64 <+4>: vmovd xmm0,esi <+8>: mov rax,rdi <+11>: cmp rdx,0x20 <+15>: jb 0x7ffff7f27be0 <__memset_avx2_unaligned_erms+224> <+21>: vpbroadcastb ymm0,xmm0 <+26>: cmp rdx,0x40 <+30>: ja 0x7ffff7f27b40 <__memset_avx2_unaligned_erms+64> <+32>: vmovdqu YMMWORD PTR [rdi],ymm0 <+36>: vmovdqu YMMWORD PTR [rdi+rdx1-0x20],ymm0 <+42>: vzeroupper <+45>: ret <+46>: xchg ax,ax <+48>: vmovdqu YMMWORD PTR [rdi-0x40],ymm0 <+53>: vmovdqu YMMWORD PTR [rdi-0x20],ymm0 <+58>: vzeroupper <+61>: ret <+62>: xchg ax,ax <+64>: cmp rdx,QWORD PTR [rip+0x796c9] # 0x7ffff7fa1210 <__x86_rep_stosb_threshold> <+71>: ja 0x7ffff7f27bc0 <__memset_avx2_unaligned_erms+192> <+73>: vmovdqu YMMWORD PTR [rdi],ymm0 <+77>: vmovdqu YMMWORD PTR [rdi+0x20],ymm0 <+82>: add rdi,rdx <+85>: cmp rdx,0x80 <+92>: jbe 0x7ffff7f27b30 <__memset_avx2_unaligned_erms+48> <+94>: vmovdqu YMMWORD PTR [rax+0x40],ymm0 <+99>: vmovdqu YMMWORD PTR [rax+0x60],ymm0 <+104>: add rdi,0xffffffffffffff80 <+108>: cmp rdx,0x100 <+115>: jbe 0x7ffff7f27ba0 <__memset_avx2_unaligned_erms+160> <+117>: lea rdx,[rax+0x80] <+124>: and rdx,0xffffffffffffffe0 <+128>: vmovdqa YMMWORD PTR [rdx],ymm0 <+132>: vmovdqa YMMWORD PTR [rdx+0x20],ymm0 <+137>: vmovdqa YMMWORD PTR [rdx+0x40],ymm0 <+142>: vmovdqa YMMWORD PTR [rdx+0x60],ymm0 <+147>: sub rdx,0xffffffffffffff80 <+151>: cmp rdx,rdi <+154>: jb 0x7ffff7f27b80 <__memset_avx2_unaligned_erms+128> <+156>: nop DWORD PTR [rax+0x0] <+160>: vmovdqu YMMWORD PTR [rdi],ymm0 <+164>: vmovdqu YMMWORD PTR [rdi+0x20],ymm0 <+169>: vmovdqu YMMWORD PTR [rdi+0x40],ymm0 <+174>: vmovdqu YMMWORD PTR [rdi+0x60],ymm0 <+179>: vzeroupper <+182>: ret <+183>: nop WORD PTR [rax+rax1+0x0] <+192>: movzx eax,sil <+196>: mov rcx,rdx <+199>: mov rdx,rdi <+202>: rep stos BYTE PTR es:[rdi],al <+204>: mov 
rax,rdx <+207>: vzeroupper <+210>: ret <+211>: data16 cs nop WORD PTR [rax+rax1+0x0] <+222>: xchg ax,ax <+224>: vpbroadcastb xmm0,xmm0 <+229>: cmp edx,0x10 <+232>: jge 0x7ffff7f27c00 <__memset_avx2_unaligned_erms+256> <+234>: cmp edx,0x8 <+237>: jge 0x7ffff7f27c10 <__memset_avx2_unaligned_erms+272> <+239>: cmp edx,0x4 <+242>: jge 0x7ffff7f27c20 <__memset_avx2_unaligned_erms+288> <+244>: cmp edx,0x1 <+247>: jg 0x7ffff7f27c30 <__memset_avx2_unaligned_erms+304> <+249>: jl 0x7ffff7f27bfe <__memset_avx2_unaligned_erms+254> <+251>: mov BYTE PTR [rdi],sil <+254>: ret <+255>: nop <+256>: vmovdqu XMMWORD PTR [rdi],xmm0 <+260>: vmovdqu XMMWORD PTR [rdi+rdx1-0x10],xmm0 <+266>: ret <+267>: nop DWORD PTR [rax+rax1+0x0] <+272>: vmovq QWORD PTR [rdi],xmm0 <+276>: vmovq QWORD PTR [rdi+rdx1-0x8],xmm0 <+282>: ret <+283>: nop DWORD PTR [rax+rax1+0x0] <+288>: vmovd DWORD PTR [rdi],xmm0 <+292>: vmovd DWORD PTR [rdi+rdx1-0x4],xmm0 <+298>: ret <+299>: nop DWORD PTR [rax+rax1+0x0] <+304>: mov BYTE PTR [rdi],sil <+307>: mov BYTE PTR [rdi+0x1],sil <+311>: mov BYTE PTR [rdi+rdx*1-0x1],sil <+316>: ret

This is much shorter than the code for memcpy, and there is one big difference: there is a threshold, __x86_rep_stosb_threshold, above which rep stosb is used, but no upper threshold.

Use of rep stosb: __x86_rep_stosb_threshold = 0x800 = 2048 bytes. This means that as soon as a block would need more than 64 ymm-register stores (2048 / 32), the memset implementation switches to the rep stosb instruction.

Benchmark 3: strlen

If, like many C developers, you're used to passing around zero-terminated strings, your code may call strlen a bunch of times, so this function had better be fast. We saw a way to write a strlen function with the repne scasb instruction.
But there are other ways to write a strlen function using vectorization, either in a 64-bit general-purpose register or with the SIMD extensions.

To find a null byte in a quadword, we can define a helper macro which takes two parameters:

- A destination register whose bytes will have their sign bit set only if the corresponding byte in the source register is null.
- A source register in which we'll look for the null bytes.

%macro find_zero 2 mov %1, %2 not %2 sub %1, [_find_zero_one_vec] and %1, %2 and %1, [_find_zero_sign_mask] %endmacro

section .rodata _find_zero_one_vec: dq 0x0101010101010101 _find_zero_sign_mask: dq 0x8080808080808080This macro uses a bit trick documented here to find a byte in a quadword. We will use it in the strlen_movq implementation. We cannot use it to code a repne scasq implementation because scasq can only stop on a 64-bit equality. generic#include <stddef.h>

size_t attribute((optimize("O1"))) strlen_dummy(const char s) { unsigned int size = 0; while ((s++) != 0) size++; return size; } scasbsection .text ; size_t strlen_scasb(rdi: const char *s) strlen_scasb: xor eax, eax ; rax = 0 mov rcx, -1 ; rcx = -1 repnz scasb ; for(; rcx != 0 and ZF == false; rcx--) ; cmp rax, *(rdi++) not rcx ; before this insn rcx = - (len(rdi) + 2) dec rcx ; after this insn rcx = ~(- (len(rdi) + 2)) - 1 ; = -(- (len(rdi) + 2)) - 1 - 1 ; = len(rdi) mov rax, rcx ; rax = len(rdi) ret movb; size_t strlen_movb(rdi: const char *s) strlen_movb: xor eax, eax ; rax = 0 .loop: movzx rdx, byte [rdi + rax] test edx, edx jz .end add rax, 1 jmp .loop .end: ret movq; size_t strlen_movq(rdi: const char *s) strlen_movq: xor eax, eax ; rax = s mov rdx, [rdi] ; rdx = *s find_zero rcx, rdx jnz .end mov rax, rdi and rax, -8 ; rax = align(s, 8) sub rax, rdi .loop: add rax, 0x8 ; rax += 8 find_zero rcx, rdx ; Put 1 in sign bits of rcx bytes where rdx bytes are 0 mov rdx, [rdi + rax] ; rdx = *rsi jz .loop ; zero byte found ? 
sub rax, 8 .end: bsf rcx, rcx ; find right-most set sign bit shr rcx, 3 ; divide by 8 = byte index add rax, rcx ; rax += rcx ret

avx; size_t strlen_avx(rdi: const char *s) strlen_avx: xor eax, eax ; rax = 0 vpxor xmm0, xmm0 ; xmm0 = 0 vmovdqu xmm1, [rdi] ; xmm1 = *(rdi) vpcmpeqb xmm1, xmm0 ; xmm1 = byte_mask(i => xmm1[i] == xmm0[i]) vpmovmskb edx, xmm1 ; edx = bitmask(i => sign(xmm1[i])) test edx, edx ; if(edx != 0) jnz .end ; goto .end mov rax, rdi and rax, -16 ; align rdi on 16 bytes sub rax, rdi .loop: add rax, 0x10 ; rax += 16 vmovdqa xmm1, [rdi + rax] ; xmm1 = *(rdi) vpcmpeqb xmm1, xmm0 ; xmm1 = byte_mask(i => xmm1[i] == xmm0[i]) vpmovmskb edx, xmm1 ; edx = bitmask(i => sign(xmm1[i])) test edx, edx ; if(edx == 0) jz .loop ; goto .loop .end: bsf edx, edx ; edx = index of the first set bit in edx add rax, rdx ; rax += rdx ret

avx2; size_t strlen_avx2(rdi: const char *s) strlen_avx2: xor eax, eax ; rax = 0 vpxor ymm0, ymm0 ; ymm0 = 0 vmovdqu ymm1, [rdi] ; ymm1 = *(rdi) vpcmpeqb ymm1, ymm0 ; ymm1 = byte_mask(i => ymm1[i] == ymm0[i]) vpmovmskb edx, ymm1 ; edx = bitmask(i => sign(ymm1[i])) test edx, edx ; if(edx != 0) jnz .end ; goto .end mov rax, rdi and rax, -32 ; align rdi on 32 bytes sub rax, rdi .loop: add rax, 0x20 ; rax += 32 vmovdqa ymm1, [rdi + rax] ; ymm1 = *(rdi) vpcmpeqb ymm1, ymm0 ; ymm1 = byte_mask(i => ymm1[i] == ymm0[i]) vpmovmskb edx, ymm1 ; edx = bitmask(i => sign(ymm1[i])) test edx, edx ; if(edx == 0) jz .loop ; goto .loop .end: bsf edx, edx ; edx = index of the first set bit in edx add rax, rdx ; rax += rdx vzeroupper ret

sse2; size_t strlen_sse2(rdi: const char *s) strlen_sse2: xor eax, eax ; rax = 0 pxor xmm0, xmm0 ; xmm0 = 0 movdqu xmm1, [rdi] ; xmm1 = (rdi) pcmpeqb xmm1, xmm0 ; xmm1 = byte_mask(i => xmm1[i] == xmm0[i]) pmovmskb edx, xmm1 ; edx = bitmask(i => sign(xmm1[i])) test edx, edx ; if(edx != 0) jnz .end ; goto .end mov rax, rdi and rax, -16 ; align rdi on 16 bytes sub rax, rdi .loop: add rax, 0x10 ; rax += 16 movdqa xmm1, [rdi +
rax] ; xmm1 = (rdi) pcmpeqb xmm1, xmm0 ; xmm1 = byte_mask(i => xmm1[i] == xmm0[i]) pmovmskb edx, xmm1 ; edx = bitmask(i => sign(xmm1[i])) test edx, edx ; if(edx == 0) jz .loop ; goto .loop .end: bsf edx, edx ; edx = index of the first set bit in edx add rax, rdx ; rax += rdx retInfoLike for the memcpy implementation, without the attribute((optimize("O1"))), gcc would replace the call to our custom function by a call to strlen.Here are the results of the benchmarks on my computer:Benchmark of strlen implementationsAs we can see, the repne scasb instruction is by far the worst way to write a strlen function, while the standard strlen function outperforms every vectorized implementation.We can again have a look at the glibc implementation of strlen for AVX2:__strlen_avx2<+0>: endbr64 <+4>: mov eax,edi <+6>: mov rdx,rdi <+9>: vpxor xmm0,xmm0,xmm0 <+13>: and eax,0xfff <+18>: cmp eax,0xfe0 <+23>: ja 0x7ffff7f29f90 <__strlen_avx2+336> <+29>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi] <+33>: vpmovmskb eax,ymm1 <+37>: test eax,eax <+39>: je 0x7ffff7f29ec0 <__strlen_avx2+128> <+41>: tzcnt eax,eax <+45>: vzeroupper <+48>: ret <+49>: data16 cs nop WORD PTR [rax+rax1+0x0] <+60>: nop DWORD PTR [rax+0x0] <+64>: tzcnt eax,eax <+68>: sub edi,edx <+70>: inc edi <+72>: add eax,edi <+74>: vzeroupper <+77>: ret <+78>: xchg ax,ax <+80>: tzcnt eax,eax <+84>: sub edi,edx <+86>: add edi,0x21 <+89>: add eax,edi <+91>: vzeroupper <+94>: ret <+95>: nop <+96>: tzcnt eax,eax <+100>: sub edi,edx <+102>: add edi,0x41 <+105>: add eax,edi <+107>: vzeroupper <+110>: ret <+111>: nop <+112>: tzcnt eax,eax <+116>: sub edi,edx <+118>: add edi,0x61 <+121>: add eax,edi <+123>: vzeroupper <+126>: ret <+127>: nop <+128>: or rdi,0x1f <+132>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x1] <+137>: vpmovmskb eax,ymm1 <+141>: test eax,eax <+143>: jne 0x7ffff7f29e80 <__strlen_avx2+64> <+145>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x21] <+150>: vpmovmskb eax,ymm1 <+154>: test eax,eax <+156>: jne 0x7ffff7f29e90 
<__strlen_avx2+80> <+158>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x41] <+163>: vpmovmskb eax,ymm1 <+167>: test eax,eax <+169>: jne 0x7ffff7f29ea0 <__strlen_avx2+96> <+171>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x61] <+176>: vpmovmskb eax,ymm1 <+180>: test eax,eax <+182>: jne 0x7ffff7f29eb0 <__strlen_avx2+112> <+184>: inc rdi <+187>: or rdi,0x7f <+191>: nop <+192>: vmovdqa ymm1,YMMWORD PTR [rdi+0x1] <+197>: vpminub ymm2,ymm1,YMMWORD PTR [rdi+0x21] <+202>: vmovdqa ymm3,YMMWORD PTR [rdi+0x41] <+207>: vpminub ymm4,ymm3,YMMWORD PTR [rdi+0x61] <+212>: vpminub ymm5,ymm4,ymm2 <+216>: vpcmpeqb ymm5,ymm0,ymm5 <+220>: vpmovmskb ecx,ymm5 <+224>: sub rdi,0xffffffffffffff80 <+228>: test ecx,ecx <+230>: je 0x7ffff7f29f00 <__strlen_avx2+192> <+232>: vpcmpeqb ymm1,ymm0,ymm1 <+236>: vpmovmskb eax,ymm1 <+240>: sub rdi,rdx <+243>: test eax,eax <+245>: jne 0x7ffff7f29f70 <__strlen_avx2+304> <+247>: vpcmpeqb ymm2,ymm0,ymm2 <+251>: vpmovmskb eax,ymm2 <+255>: test eax,eax <+257>: jne 0x7ffff7f29f80 <__strlen_avx2+320> <+259>: vpcmpeqb ymm3,ymm0,ymm3 <+263>: vpmovmskb eax,ymm3 <+267>: shl rcx,0x20 <+271>: or rax,rcx <+274>: tzcnt rax,rax <+279>: sub rdi,0x3f <+283>: add rax,rdi <+286>: vzeroupper <+289>: ret <+290>: data16 cs nop WORD PTR [rax+rax1+0x0] <+301>: nop DWORD PTR [rax] <+304>: tzcnt eax,eax <+308>: sub rdi,0x7f <+312>: add rax,rdi <+315>: vzeroupper <+318>: ret <+319>: nop <+320>: tzcnt eax,eax <+324>: sub rdi,0x5f <+328>: add rax,rdi <+331>: vzeroupper <+334>: ret <+335>: nop <+336>: or rdi,0x1f <+340>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi-0x1f] <+345>: vpmovmskb eax,ymm1 <+349>: sarx eax,eax,edx <+354>: test eax,eax <+356>: je 0x7ffff7f29ec4 <__strlen_avx2+132> <+362>: tzcnt eax,eax <+366>: vzeroupper <+369>: retAfter aligning rdi on 16 bytes, __strlen_avx2 performs an unrolled loop to find the null byte. 
In the loop body, the function loads 4 * 32 = 128 bytes each time, reduces them using the vpminub (Vector Packed MIN Unsigned Byte) instruction and compares the result to zero using the vpcmpeqb (Vector Packed CoMPare EQual Byte) instruction.Note that in order to find the null byte position in the register, the __strlen_avx2 function uses the tzcnt (Count Trailing Zeros) instruction instead of the bsf (Bit Scan Forward) instruction.These two instructions are very similar.According to the “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 2: Instruction Set Reference 4.3”:TZCNT counts the number of trailing least significant zero bits in source operand (second operand) and returns the result in destination operand (first operand). TZCNT is an extension of the BSF instruction. The key difference between TZCNT and BSF instruction is that TZCNT provides operand size as output when source operand is zero while in the case of BSF instruction, if source operand is zero, the content of destination operand are undefined. On processors that do not support TZCNT, the instruction byte encoding is executed as BSF.Benchmark 4: memcmpWe can benchmark the rep cmps instruction by writing several implementations of the memcmp function.This function can also be implemented using vpcmpestri. generic#include <stddef.h> int memcmp_dummy(const void *s1, const void *s2, size_t n) { int diff; for (int i = 0; i < n; i++) { diff = *((char *)s1++) - *((char *)s2++); if (diff != 0) return diff < 0 ? 
-1 : 1; } return 0; }

cmpsb; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_cmpsb: mov rcx, rdx ; rcx = n xor eax, eax ; Set return value to zero xor edx, edx repe cmpsb ; for(; rcx != 0 and ZF == true; rcx--) ; cmp *(rsi++), *(rdi++) setb al ; if(ZF == false and CF == true) al = 1 seta dl ; if(ZF == false and CF == false) dl = 1 sub eax, edx ; return al - dl ret

movb; int memcmp_movb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_movb: xor eax, eax ; rax = 0 xor ecx, ecx ; rcx = 0 .loop: movzx r8d, byte [rdi+rcx] movzx r9d, byte [rsi+rcx] add rcx, 1 cmp r8b, r9b jne .end sub rdx, 1 jnz .loop .end: mov edx, 0 ; rdx = 0 (mov, unlike xor, leaves the flags of the last cmp intact) setb al ; if(ZF == false and CF == true) al = 1 seta dl ; if(ZF == false and CF == false) dl = 1 sub eax, edx ; return al - dl ret

cmpsq_unaligned; int memcmp_cmpsq_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_cmpsq_unaligned: lea rcx, [rdx + 0x7] ; rcx = n + 7 and rdx, (8 - 1) ; rdx = n % 8 shr rcx, 3 ; rcx = (n + 7) / 8 xor eax, eax ; rax = 0 repe cmpsq ; for(; rcx != 0 and ZF == true; rcx--) ; cmp *(rsi++), *(rdi++) je .exit ; If no difference was found return mov r8, [rdi - 0x8] ; Read the last (unaligned) quadword of s1 mov r9, [rsi - 0x8] ; Read the last (unaligned) quadword of s2 test rcx, rcx ; if(rcx != 0) jnz .cmp ; goto .cmp shl rdx, 3 ; rdx = 8 * (n % 8) jz .cmp ; if(rdx == 0) goto .cmp bzhi r8, r8, rdx ; Keep only the low 8 * (n % 8) bits (the valid bytes) of r8 bzhi r9, r9, rdx ; Keep only the low 8 * (n % 8) bits (the valid bytes) of r9 .cmp: bswap r8 ; Convert r8 to big-endian for lexical comparison bswap r9 ; Convert r9 to big-endian for lexical comparison cmp r8, r9 ; Lexical comparison of quadwords seta al ; if (r8 > r9) al = 1 setb cl ; if (r8 < r9) cl = 1 sub eax, ecx ; return eax - ecx .exit: ret

cmpsq; int memcmp_cmpsq(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_cmpsq: xor eax, eax ; rax = 0 mov rcx, rdx ; rcx = n shr rcx, 3 ; rcx = n / 8 cmpsq ; Compare first quadword (increments
rsi and rdi) jne .end ; If quadwords differ jump to .end mov rax, rsi ; rax = s2 and rax, -8 ; rax = align(s2, 8) sub rax, rsi ; rax = - (s2 - align(s2, 8)) add rdx, 8 ; rdx = n - 8 sub rdx, rax ; rdx = n - 8 + (s2 - align(s2, 8)) add rsi, rax ; rsi = s2 - (s2 - align(s2, 8)) add rdi, rax ; rdi = s1 - (s2 - align(s2, 8)) xor eax, eax ; rax = 0 repe cmpsq ; for(; rcx != 0 and ZF == true; rcx += 8) ; cmp *(rsi++), *(rdi++) je .exit .end: mov r8, [rdi - 0x8] ; Read the last quadword of s1 mov r9, [rsi - 0x8] ; Read the last quadword of s2 test rcx, rcx ; if(rcx != 0) jnz .cmp ; goto .cmp and rdx, (8 - 1) ; rdx = (n - 8 + (s2 - align(s2, 8))) % 8 mov rcx, 8 ; rcx = 8 sub rcx, rdx ; rcx -= rdx shl rcx, 3 ; rcx *= 8 shl r8, cl ; r8 <<= cl shl r9, cl ; r9 <<= cl .cmp: xor ecx, ecx ; rcx = 0 bswap r8 ; Convert r8 to big-endian for lexical comparison bswap r9 ; Convert r9 to big-endian for lexical comparison cmp r8, r9 ; Lexical comparison of quadwords seta al ; if (r8 > r9) al = 1 setb cl ; if (r8 < r9) cl = 1 sub eax, ecx ; return eax - ecx .exit: ret avx; int memcmp_avx(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_avx: xor eax, eax ; rax = 0 test rdx, rdx ; if(n == 0) jz .exit ; return 0 vmovdqu xmm2, [rdi] ; xmm2 = *(rsi) vmovdqu xmm3, [rsi] ; xmm3 = *(rdi) vpcmpeqb xmm0, xmm2, xmm3 ; xmm0 = byte_mask(i => xmm2[i] == xmm3[i]) vpmovmskb r8d, xmm0 ; Create a mask from the most significant bit of each byte cmp r8d, 0xFFFF ; if(xmm2 != xmm3) jne .end ; goto .end lea rcx, [rdi + 15] and rcx, -16 ; rcx = align((src + 15), 16) sub rcx, rdi ; rcx = align((src + 15), 16) - src sub rdx, rcx ; rdx = src + n - align((src + 15), 16) .loop: vmovdqu xmm3, [rsi + rcx] ; xmm3 = rsi[rcx] vmovdqa xmm2, [rdi + rcx] ; xmm2 = rdi[rcx] add rcx, 0x10 vpcmpeqb xmm0, xmm2, xmm3 ; xmm0 = byte_mask(i => xmm2[i] == xmm3[i]) vpmovmskb r8d, xmm0 ; Create a mask from the most significant bit of each byte cmp r8d, 0xFFFF ; if(xmm2 != xmm3) jne .end ; goto .end cmp rdx, 0x10 jb 
.end ; if(rdx < 16) goto .end sub rdx, 0x10 jmp .loop .end: vpcmpgtb xmm1, xmm2, xmm3 ; xmm1 = byte_mask(i => xmm2[i] > xmm3[i]) vpmovmskb r9d, xmm1 ; Create a mask from the most significant bit of each byte not r8w ; r8w = bitmask(i => xmm2[i] != xmm3[i]) test r8w, r8w jz .exit bsf ecx, r8d ; ecx = index of the first differing byte in xmm2/xmm3 cmp ecx, edx jae .exit

shr r9d, cl						; Find the corresponding bit in r9d	
xor edx, edx
test r9d, 0x1					; If it is set
setnz al						;	 return 1
setz dl							; else
sub eax, edx					;	 return -1

.exit: ret

avx2; int memcmp_avx2(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_avx2: xor eax, eax ; rax = 0 test rdx, rdx ; if(n == 0) jz .exit ; return 0 vmovdqu ymm2, [rdi] ; ymm2 = *(rdi) vmovdqu ymm3, [rsi] ; ymm3 = *(rsi) vpcmpeqb ymm0, ymm2, ymm3 ; ymm0 = byte_mask(i => ymm2[i] == ymm3[i]) vpmovmskb r8d, ymm0 ; Create a mask from the most significant bit of each byte cmp r8d, 0xFFFFFFFF ; if(ymm2 != ymm3) jne .end ; goto .end lea rcx, [rdi + 31] and rcx, -32 ; rcx = align((src + 31), 32) sub rcx, rdi ; rcx = align((src + 31), 32) - src sub rdx, rcx ; rdx = src + n - align((src + 31), 32) .loop: vmovdqu ymm3, [rsi + rcx] ; ymm3 = rsi[rcx] vmovdqa ymm2, [rdi + rcx] ; ymm2 = rdi[rcx] add rcx, 0x20 ; rcx += 32 vpcmpeqb ymm0, ymm2, ymm3 ; ymm0 = byte_mask(i => ymm2[i] == ymm3[i]) vpmovmskb r8d, ymm0 ; Create a mask from the most significant bit of each byte cmp r8d, 0xFFFFFFFF ; if(ymm2 != ymm3) jne .end ; goto .end cmp rdx, 0x20 ; if(rdx < 32) jb .end ; goto .end sub rdx, 0x20 ; rdx -= 32 jmp .loop .end: vpcmpgtb ymm1, ymm2, ymm3 ; ymm1 = byte_mask(i => ymm2[i] > ymm3[i]) vpmovmskb r9d, ymm1 ; Create a mask from the most significant bit of each byte not r8d ; r8d = bitmask(i => ymm2[i] != ymm3[i]) test r8d, r8d jz .exit bsf ecx, r8d ; ecx = index of the first differing byte in ymm2/ymm3 cmp ecx, edx jae .exit shr r9d, cl ; Find the corresponding bit in r9d

xor edx, edx
test r9d, 0x1				; If it is set
setnz al					;	 return 1
setz dl						; else
sub eax, edx				;	 return -1

.exit: vzeroupper ret vpcmpestri_unaligned; int memcmp_vpcmpestri_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_vpcmpestri_unaligned: xor r10d, r10d mov rax, rdx ; rax = n test rdx, rdx ; if(n == 0) jz .exit ; return n vmovdqu xmm2, [rdi + r10] ; xmm2 = rdi[r10] vmovdqu xmm3, [rsi + r10] ; xmm3 = rsi[r10] .loop: ; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string ; If all byte are the same rcx = 16 vpcmpestri xmm2, xmm3, BYTEWISE_CMP test cx, 0x10 ; if(rcx != 16) jz .end ; break add r10, 0x10 ; r10 += 10 vmovdqu xmm2, [rdi + r10] ; xmm2 = rdi[r10] vmovdqu xmm3, [rsi + r10] ; xmm3 = rsi[r10] sub rdx, 0x10 ; rdx -= 16 ja .loop ; if(rdx > 0) goto .loop xor eax, eax ; rax = 0 ret ; return .end: xor eax, eax ; rax = 0 cmp rcx, rdx ; if(index >= rdx) jae .exit ; return; xor edx, edx ; rdx = 0 add r10, rcx ; r10 += index mov r8b, [rdi + r10] ; r8b = rdi[r10] mov r9b, [rsi + r10] ; r9b = rsi[r10] cmp r8b, r9b ; if(r8b > r9b) seta al ; return 1 setb dl ; else sub eax, edx ; return -1 .exit: ret vpcmpestri; int memcmp_vpcmpestri(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_vpcmpestri: xor r10d, r10d mov rax, rdx ; rax = n test rdx, rdx ; if(n == 0) jz .exit ; return 0 vmovdqu xmm2, [rdi] ; xmm2 = *rdi vmovdqu xmm3, [rsi] ; xmm3 = *rsi

; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx
; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string
; If all byte are the same rcx = 16
vpcmpestri xmm2, xmm3, BYTEWISE_CMP 
test cx, 0x10				; if(rcx != 16)
jz .end						; 	break
lea r10, [rdi + 15]         ; r10 = s1 + 15
and r10, -16                ; r10 = align(s1 + 15, 16)
sub r10, rdi                ; r10 = align(s1 + 15, 16) - s1
sub rdx, r10                ; rdx = n - (align(s1 + 15, 16) - s1)
vmovdqa xmm2, [rdi + r10]			; xmm2 = rdi[r10]
vmovdqu xmm3, [rsi + r10]			; xmm3 = rsi[r10]

.loop: ; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string ; If all byte are the same rcx = 16 vpcmpestri xmm2, xmm3, BYTEWISE_CMP test cx, 0x10 ; if(rcx != 16) jz .end ; break

add r10, 0x10				; r10 += 16

vmovdqa xmm2, [rdi + r10]			; xmm2 = rdi[r10]
vmovdqu xmm3, [rsi + r10]			; xmm3 = rsi[r10]

sub rdx, 0x10				; rdx -= 16
ja .loop					; if(rdx > 0) goto .loop
xor eax, eax				; rax = 0
ret							; return

.end: xor eax, eax ; rax = 0 cmp rcx, rdx ; if(index >= rdx) jae .exit ; return; xor edx, edx ; rdx = 0 add r10, rcx mov r8b, [rdi + r10] ; r8b = rdi[r10] mov r9b, [rsi + r10] ; r9b = rsi[r10] cmp r8b, r9b ; if(r8b > r9b) seta al ; return 1 setb dl ; else sub eax, edx ; return -1 .exit: ret

Info: Note that even with -O3, the compiler couldn't replace our custom C function with a call to memcmp. Indeed, memcmp only guarantees the sign of the return value when the data differ. But we may want specifically -1, 0, or 1 to be returned, and the compiler can't assume otherwise.

Benchmark of memcmp implementations

The repe cmpsb instruction is almost as bad as the repne scasb instruction. But this time we could write an implementation using repe cmpsq that performs almost as well as the vectorized string and AVX extensions on my computer.

Tip: When you are in a situation where you could use a cmpsb or cmpsq instruction, you should always prefer making comparisons with the widest registers available.

Here is the disassembled code of __memcmp_avx2_movbe:

__memcmp_avx2_movbe<+0>: endbr64 <+4>: cmp rdx,0x20 <+8>: jb 0x7ffff7f26c80 <__memcmp_avx2_movbe+736> <+14>: vmovdqu ymm1,YMMWORD PTR [rsi] <+18>: vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi] <+22>: vpmovmskb eax,ymm1 <+26>: inc eax <+28>: jne 0x7ffff7f26a80 <__memcmp_avx2_movbe+224> <+34>: cmp rdx,0x40 <+38>: jbe 0x7ffff7f26c04 <__memcmp_avx2_movbe+612> <+44>: vmovdqu ymm2,YMMWORD PTR [rsi+0x20] <+49>: vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20] <+54>: vpmovmskb eax,ymm2 <+58>: inc eax <+60>: jne 0x7ffff7f26aa0 <__memcmp_avx2_movbe+256> <+66>: cmp rdx,0x80 <+73>: jbe 0x7ffff7f26bf0 <__memcmp_avx2_movbe+592> <+79>: vmovdqu ymm3,YMMWORD PTR [rsi+0x40] <+84>: vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40] <+89>: vpmovmskb eax,ymm3 <+93>: inc eax <+95>: jne 0x7ffff7f26ac0 <__memcmp_avx2_movbe+288> <+101>: vmovdqu ymm4,YMMWORD PTR [rsi+0x60] <+106>: vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60] <+111>: vpmovmskb ecx,ymm4
<+115>: inc ecx <+117>: jne 0x7ffff7f26afb <__memcmp_avx2_movbe+347> <+123>: cmp rdx,0x100 <+130>: ja 0x7ffff7f26b10 <__memcmp_avx2_movbe+368> <+136>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1-0x80] <+142>: vmovdqu ymm2,YMMWORD PTR [rsi+rdx1-0x60] <+148>: lea rdi,[rdi+rdx1-0x80] <+153>: lea rsi,[rsi+rdx1-0x80] <+158>: vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi] <+162>: vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20] <+167>: vmovdqu ymm3,YMMWORD PTR [rsi+0x40] <+172>: vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40] <+177>: vmovdqu ymm4,YMMWORD PTR [rsi+0x60] <+182>: vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60] <+187>: vpand ymm5,ymm2,ymm1 <+191>: vpand ymm6,ymm4,ymm3 <+195>: vpand ymm7,ymm6,ymm5 <+199>: vpmovmskb ecx,ymm7 <+203>: inc ecx <+205>: jne 0x7ffff7f26ae3 <__memcmp_avx2_movbe+323> <+207>: vzeroupper <+210>: ret <+211>: data16 cs nop WORD PTR [rax+rax1+0x0] <+222>: xchg ax,ax <+224>: tzcnt eax,eax <+228>: movzx ecx,BYTE PTR [rsi+rax1] <+232>: movzx eax,BYTE PTR [rdi+rax1] <+236>: sub eax,ecx <+238>: vzeroupper <+241>: ret <+242>: data16 cs nop WORD PTR [rax+rax1+0x0] <+253>: nop DWORD PTR [rax] <+256>: tzcnt eax,eax <+260>: movzx ecx,BYTE PTR [rsi+rax1+0x20] <+265>: movzx eax,BYTE PTR [rdi+rax1+0x20] <+270>: sub eax,ecx <+272>: vzeroupper <+275>: ret <+276>: data16 cs nop WORD PTR [rax+rax1+0x0] <+287>: nop <+288>: tzcnt eax,eax <+292>: movzx ecx,BYTE PTR [rsi+rax1+0x40] <+297>: movzx eax,BYTE PTR [rdi+rax1+0x40] <+302>: sub eax,ecx <+304>: vzeroupper <+307>: ret <+308>: data16 cs nop WORD PTR [rax+rax1+0x0] <+319>: nop <+320>: add rsi,rdi <+323>: vpmovmskb eax,ymm1 <+327>: inc eax <+329>: jne 0x7ffff7f26a80 <__memcmp_avx2_movbe+224> <+331>: vpmovmskb eax,ymm2 <+335>: inc eax <+337>: jne 0x7ffff7f26aa0 <__memcmp_avx2_movbe+256> <+339>: vpmovmskb eax,ymm3 <+343>: inc eax <+345>: jne 0x7ffff7f26ac0 <__memcmp_avx2_movbe+288> <+347>: tzcnt ecx,ecx <+351>: movzx eax,BYTE PTR [rdi+rcx1+0x60] <+356>: movzx ecx,BYTE PTR [rsi+rcx1+0x60] <+361>: sub eax,ecx <+363>: vzeroupper <+366>: ret <+367>: 
nop <+368>: lea rdx,[rdi+rdx1-0x80] <+373>: sub rsi,rdi <+376>: and rdi,0xffffffffffffffe0 <+380>: sub rdi,0xffffffffffffff80 <+384>: vmovdqu ymm1,YMMWORD PTR [rsi+rdi1] <+389>: vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi] <+393>: vmovdqu ymm2,YMMWORD PTR [rsi+rdi1+0x20] <+399>: vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20] <+404>: vmovdqu ymm3,YMMWORD PTR [rsi+rdi1+0x40] <+410>: vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40] <+415>: vmovdqu ymm4,YMMWORD PTR [rsi+rdi1+0x60] <+421>: vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60] <+426>: vpand ymm5,ymm2,ymm1 <+430>: vpand ymm6,ymm4,ymm3 <+434>: vpand ymm7,ymm6,ymm5 <+438>: vpmovmskb ecx,ymm7 <+442>: inc ecx <+444>: jne 0x7ffff7f26ae0 <__memcmp_avx2_movbe+320> <+446>: sub rdi,0xffffffffffffff80 <+450>: cmp rdi,rdx <+453>: jb 0x7ffff7f26b20 <__memcmp_avx2_movbe+384> <+455>: sub rdi,rdx <+458>: cmp edi,0x60 <+461>: jae 0x7ffff7f26bd0 <__memcmp_avx2_movbe+560> <+463>: vmovdqu ymm3,YMMWORD PTR [rsi+rdx1+0x40] <+469>: cmp edi,0x40 <+472>: jae 0x7ffff7f26bc0 <__memcmp_avx2_movbe+544> <+474>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1] <+479>: vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdx] <+483>: vmovdqu ymm2,YMMWORD PTR [rsi+rdx1+0x20] <+489>: vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdx+0x20] <+494>: vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdx+0x40] <+499>: vmovdqu ymm4,YMMWORD PTR [rsi+rdx1+0x60] <+505>: vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdx+0x60] <+510>: vpand ymm5,ymm2,ymm1 <+514>: vpand ymm6,ymm4,ymm3 <+518>: vpand ymm7,ymm6,ymm5 <+522>: vpmovmskb ecx,ymm7 <+526>: mov rdi,rdx <+529>: inc ecx <+531>: jne 0x7ffff7f26ae0 <__memcmp_avx2_movbe+320> <+537>: vzeroupper <+540>: ret <+541>: nop DWORD PTR [rax] <+544>: vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdx+0x40] <+549>: vpmovmskb eax,ymm3 <+553>: inc eax <+555>: jne 0x7ffff7f26c20 <__memcmp_avx2_movbe+640> <+557>: nop DWORD PTR [rax] <+560>: vmovdqu ymm4,YMMWORD PTR [rsi+rdx1+0x60] <+566>: vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdx+0x60] <+571>: vpmovmskb eax,ymm4 <+575>: inc eax <+577>: jne 0x7ffff7f26c24 <__memcmp_avx2_movbe+644> <+579>: 
vzeroupper <+582>: ret <+583>: nop WORD PTR [rax+rax1+0x0] <+592>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1-0x40] <+598>: vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi+rdx1-0x40] <+604>: vpmovmskb eax,ymm1 <+608>: inc eax <+610>: jne 0x7ffff7f26c40 <__memcmp_avx2_movbe+672> <+612>: vmovdqu ymm1,YMMWORD PTR [rsi+rdx1-0x20] <+618>: vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi+rdx1-0x20] <+624>: vpmovmskb eax,ymm1 <+628>: inc eax <+630>: jne 0x7ffff7f26c60 <__memcmp_avx2_movbe+704> <+632>: vzeroupper <+635>: ret <+636>: nop DWORD PTR [rax+0x0] <+640>: sub rdx,0x20 <+644>: tzcnt eax,eax <+648>: add rax,rdx <+651>: movzx ecx,BYTE PTR [rsi+rax1+0x60] <+656>: movzx eax,BYTE PTR [rax+0x60] <+660>: sub eax,ecx <+662>: vzeroupper <+665>: ret <+666>: nop WORD PTR [rax+rax1+0x0] <+672>: tzcnt eax,eax <+676>: add eax,edx <+678>: movzx ecx,BYTE PTR [rsi+rax1-0x40] <+683>: movzx eax,BYTE PTR [rdi+rax1-0x40] <+688>: sub eax,ecx <+690>: vzeroupper <+693>: ret <+694>: cs nop WORD PTR [rax+rax1+0x0] <+704>: tzcnt eax,eax <+708>: add eax,edx <+710>: movzx ecx,BYTE PTR [rsi+rax1-0x20] <+715>: movzx eax,BYTE PTR [rdi+rax1-0x20] <+720>: sub eax,ecx <+722>: vzeroupper <+725>: ret <+726>: cs nop WORD PTR [rax+rax1+0x0] <+736>: cmp edx,0x1 <+739>: jbe 0x7ffff7f26d00 <__memcmp_avx2_movbe+864> <+741>: mov eax,edi <+743>: or eax,esi <+745>: and eax,0xfff <+750>: cmp eax,0xfe0 <+755>: jg 0x7ffff7f26cc0 <__memcmp_avx2_movbe+800> <+757>: vmovdqu ymm2,YMMWORD PTR [rsi] <+761>: vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi] <+765>: vpmovmskb eax,ymm2 <+769>: inc eax <+771>: bzhi edx,eax,edx <+776>: jne 0x7ffff7f26a80 <__memcmp_avx2_movbe+224> <+782>: xor eax,eax <+784>: vzeroupper <+787>: ret <+788>: data16 cs nop WORD PTR [rax+rax1+0x0] <+799>: nop <+800>: cmp edx,0x10 <+803>: jae 0x7ffff7f26d43 <__memcmp_avx2_movbe+931> <+805>: cmp edx,0x8 <+808>: jae 0x7ffff7f26d20 <__memcmp_avx2_movbe+896> <+810>: cmp edx,0x4 <+813>: jb 0x7ffff7f26d80 <__memcmp_avx2_movbe+992> <+819>: movbe eax,DWORD PTR [rdi] <+823>: movbe ecx,DWORD PTR [rsi] 
<+827>: shl rax,0x20 <+831>: shl rcx,0x20 <+835>: movbe edi,DWORD PTR [rdi+rdx1-0x4] <+841>: movbe esi,DWORD PTR [rsi+rdx1-0x4] <+847>: or rax,rdi <+850>: or rcx,rsi <+853>: sub rax,rcx <+856>: jne 0x7ffff7f26d10 <__memcmp_avx2_movbe+880> <+858>: ret <+859>: nop DWORD PTR [rax+rax1+0x0] <+864>: jb 0x7ffff7f26d16 <__memcmp_avx2_movbe+886> <+866>: movzx ecx,BYTE PTR [rsi] <+869>: movzx eax,BYTE PTR [rdi] <+872>: sub eax,ecx <+874>: ret <+875>: nop DWORD PTR [rax+rax1+0x0] <+880>: sbb eax,eax <+882>: or eax,0x1 <+885>: ret <+886>: xor eax,eax <+888>: ret <+889>: nop DWORD PTR [rax+0x0] <+896>: movbe rax,QWORD PTR [rdi] <+901>: movbe rcx,QWORD PTR [rsi] <+906>: sub rax,rcx <+909>: jne 0x7ffff7f26d10 <__memcmp_avx2_movbe+880> <+911>: movbe rax,QWORD PTR [rdi+rdx1-0x8] <+918>: movbe rcx,QWORD PTR [rsi+rdx1-0x8] <+925>: sub rax,rcx <+928>: jne 0x7ffff7f26d10 <__memcmp_avx2_movbe+880> <+930>: ret <+931>: vmovdqu xmm2,XMMWORD PTR [rsi] <+935>: vpcmpeqb xmm2,xmm2,XMMWORD PTR [rdi] <+939>: vpmovmskb eax,xmm2 <+943>: sub eax,0xffff <+948>: jne 0x7ffff7f26a80 <__memcmp_avx2_movbe+224> <+954>: vmovdqu xmm2,XMMWORD PTR [rsi+rdx1-0x10] <+960>: lea rdi,[rdi+rdx1-0x10] <+965>: lea rsi,[rsi+rdx1-0x10] <+970>: vpcmpeqb xmm2,xmm2,XMMWORD PTR [rdi] <+974>: vpmovmskb eax,xmm2 <+978>: sub eax,0xffff <+983>: jne 0x7ffff7f26a80 <__memcmp_avx2_movbe+224> <+989>: ret <+990>: xchg ax,ax <+992>: movzx eax,WORD PTR [rdi] <+995>: movzx ecx,WORD PTR [rsi] <+998>: bswap eax <+1000>: bswap ecx <+1002>: shr eax,1 <+1004>: shr ecx,1 <+1006>: movzx edi,BYTE PTR [rdi+rdx1-0x1] <+1011>: movzx esi,BYTE PTR [rsi+rdx*1-0x1] <+1016>: or eax,edi <+1018>: or ecx,esi <+1020>: sub eax,ecx <+1022>: ret…aaaand there is no use of the repe cmps instruction 😥The _memcmp_avx2_movbe function only uses an unrolled loop of avx2 instructions.Benchmark 5: strchrI chose the strchr function to demonstrate the use of the lodsb instruction because we need to make two checks on the read byte, which makes it unsuitable for 
scasb. But as lodsb has no side effects and does not set the flags, we can't use a repetition prefix, which makes it very similar to a plain movb. That's why the implementations using lodsb and movb, as well as lodsq and movq, are factorized into two macros, strchr_byte and strchr_quad.

generic

char *strchr_dummy(const char *s, int c) {
    do {
        if (*s == c)
            return (char *)s;
    } while (*(s++) != '\0');
    return 0;
}

byte

%macro strchr_byte 1
	xchg rdi, rsi
%%loop:
	%1
	cmp dil, al
	je %%end
	test al, al
	jnz %%loop
	xor rax, rax
	ret
%%end:
	lea rax, [rsi - 1]
	ret
%endmacro

; char* strchr_lodsb(rdi: const char* s, rsi: int c)
strchr_lodsb:
	strchr_byte lodsb

%macro load_byte_movzx 0
	movzx eax, byte [rsi]
	add rsi, 1
%endmacro

; char* strchr_movb(rdi: const char* s, rsi: int c)
strchr_movb:
	strchr_byte load_byte_movzx

quad

%macro strchr_quad 1
	movzx eax, sil          ; eax = sil
	mul qword [%%one_mask]  ; rax *= 0x0101010101010101
	mov rsi, rdi            ; rsi = s
	mov rdi, rax            ; rdi = broadcast byte
%%loop:
	%1
	mov rcx, rax
	find_zero r8, rcx
	xor rax, rdi
	find_zero r9, rax
	bsf rdx, r9
	mov ecx, edx
	shr ecx, 3
	test r8, r8
	jnz %%end
	test r9, r9
	jz %%loop
	lea rax, [rsi+rcx-8]
	ret
%%end:
	test rcx, rcx
	jz %%notfound
	bsf rax, r8
	cmp eax, edx
	jl %%notfound
	lea rax, [rsi+rcx-8]
	ret
%%notfound:
	xor eax, eax
	ret
%%one_mask: dq 0x0101010101010101
%endmacro
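The quadword variants lean on a bit trick: broadcast the searched character to all eight bytes of a register, XOR it with the loaded quadword so matching bytes become 0x00, and then detect zero bytes. Here is a C sketch of what that amounts to, using the classic SWAR zero-byte test that the article's find_zero macro (defined earlier) presumably implements. The name strchr_swar and the padded-buffer assumption are mine, not the article's:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Non-zero iff some byte of v is 0x00 (exact, no false positives). */
static uint64_t has_zero_byte(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Quadword-at-a-time strchr sketch. Assumes s may be read in 8-byte
   chunks (e.g. a padded buffer); a production version must also respect
   page boundaries. */
static const char *strchr_swar(const char *s, int c) {
    uint64_t pattern = (uint8_t)c * 0x0101010101010101ULL; /* broadcast c */
    for (;;) {
        uint64_t word;
        memcpy(&word, s, 8);                            /* one quadword load */
        uint64_t end   = has_zero_byte(word);           /* any NUL byte?     */
        uint64_t match = has_zero_byte(word ^ pattern); /* any c byte?       */
        if (end | match) {
            /* The hit is within these 8 bytes; resolve bytewise. */
            for (int i = 0; i < 8; i++) {
                if (s[i] == (char)c) return s + i;
                if (s[i] == '\0')    return NULL;
            }
        }
        s += 8;
    }
}
```

Checking the c-match before the NUL-check also gives the standard behavior of strchr(s, '\0'), which returns a pointer to the terminator.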

; char* strchr_lodsq(rdi: const char* s, rsi: int c)
strchr_lodsq:
	strchr_quad lodsq

%macro load_quad_mov 0
	mov rax, [rsi]
	add rsi, 8
%endmacro

; char* strchr_movq(rdi: const char* s, rsi: int c)
strchr_movq:
	strchr_quad load_quad_mov

avx

; char* strchr_avx(rdi: const char *s, rsi: int c)
strchr_avx:
	vpxor xmm0, xmm0, xmm0      ; xmm0 = 0

; Broadcast the sil byte to all bytes of xmm1
movzx ecx, sil				; Copy sil into ecx with zero-extend
vmovd xmm1, ecx				; Copy the low dword (ecx) into xmm1
vpshufb xmm1, xmm1, xmm0	; Broadcast the first byte of xmm1 to all xmm1

vmovdqu xmm2, [rdi]         ; xmm2 = *(rdi)

vpcmpeqb xmm3, xmm1, xmm2	; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
vpmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

vpcmpeqb xmm3, xmm0, xmm2	; xmm3 = byte_mask(i => xmm2[i] == 0)
vpmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

test r9d, r9d               ; if(r9d != 0)
jnz .found                  ;    goto .found
test r8d, r8d		        ; if(r8d != 0)
jnz .notfound			    ; 	goto .notfound

and rdi, -16		        ; align rdi on 16 bytes

.loop:
	add rdi, 0x10               ; rdi += 16
	vmovdqa xmm2, [rdi]         ; xmm2 = *(rdi)

vpcmpeqb xmm3, xmm1, xmm2	; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
vpmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

vpcmpeqb xmm3, xmm0, xmm2	; xmm3 = byte_mask(i => xmm2[i] == 0)
vpmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

test r9d, r9d               ; if(r9d != 0)
jnz .found                  ;    goto .found
test r8d, r8d		        ; if(r8d == 0)
jz .loop			        ; 	goto .loop

.notfound:
	xor eax, eax                ; return NULL
	ret
.found:
	bsf r9d, r9d                ; r9d = index of the first set bit in r9d
	test r8d, r8d               ; if(r8d == 0)
	jz .noend                   ;     goto .noend
	bsf r8d, r8d                ; r8d = index of the first set bit in r8d
	cmp r8d, r9d                ; if(r8d < r9d)
	jb .notfound                ;     goto .notfound
.noend:
	mov rax, rdi                ; rax = rdi
	add rax, r9                 ; rax += r9
	ret

avx2

; char* strchr_avx2(rdi: const char *s, rsi: int c)
strchr_avx2:
	vpxor ymm0, ymm0, ymm0      ; ymm0 = 0

; Broadcast the sil byte to all bytes of ymm1
movzx ecx, sil				; Copy sil into ecx with zero-extend
vmovd xmm1, ecx				; Copy the low dword (ecx) into xmm1
vpbroadcastb ymm1, xmm1	    ; Broadcast the first byte of ymm1 to all ymm1

vmovdqu ymm2, [rdi]         ; ymm2 = *(rdi)

vpcmpeqb ymm3, ymm1, ymm2	; ymm3 = byte_mask(i => ymm2[i] == ymm1[i])
vpmovmskb r9d, ymm3	        ; r9d = bitmask(i => sign(ymm3[i]))

vpcmpeqb ymm3, ymm0, ymm2	; ymm3 = byte_mask(i => ymm2[i] == 0)
vpmovmskb r8d, ymm3	        ; r8d = bitmask(i => sign(ymm3[i]))

test r9d, r9d               ; if(r9d != 0)
jnz .found                  ;    goto .found
test r8d, r8d		        ; if(r8d != 0)
jnz .notfound			    ; 	goto .notfound

and rdi, -32		        ; align rdi on 32 bytes

.loop:
	add rdi, 0x20               ; rdi += 32
	vmovdqa ymm2, [rdi]         ; ymm2 = *(rdi)

vpcmpeqb ymm3, ymm1, ymm2	; ymm3 = byte_mask(i => ymm2[i] == ymm1[i])
vpmovmskb r9d, ymm3	        ; r9d = bitmask(i => sign(ymm3[i]))

vpcmpeqb ymm3, ymm0, ymm2	; ymm3 = byte_mask(i => ymm2[i] == 0)
vpmovmskb r8d, ymm3	        ; r8d = bitmask(i => sign(ymm3[i]))

test r9d, r9d               ; if(r9d != 0)
jnz .found                  ;    goto .found
test r8d, r8d		        ; if(r8d == 0)
jz .loop			        ; 	goto .loop

.notfound:
	xor eax, eax                ; return NULL
	ret
.found:
	bsf r9d, r9d                ; r9d = index of the first set bit in r9d
	test r8d, r8d               ; if(r8d == 0)
	jz .noend                   ;     goto .noend
	bsf r8d, r8d                ; r8d = index of the first set bit in r8d
	cmp r8d, r9d                ; if(r8d < r9d)
	jb .notfound                ;     goto .notfound
.noend:
	mov rax, rdi                ; rax = rdi
	add rax, r9                 ; rax += r9
	vzeroupper
	ret

sse2

; char* strchr_sse2(rdi: const char *s, rsi: int c)
strchr_sse2:
	pxor xmm0, xmm0             ; xmm0 = 0

; Broadcast the sil byte to all bytes of xmm1
movzx ecx, sil				; Copy sil into ecx with zero-extend
movd xmm1, ecx				; Copy the low dword (ecx) into xmm1
pshufb xmm1, xmm0	    ; Broadcast the first byte of xmm1 to all xmm1

movdqu xmm2, [rdi]         ; xmm2 = *(rdi)

movdqa xmm3, xmm2
pcmpeqb xmm3, xmm1	        ; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
pmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

movdqa xmm3, xmm2
pcmpeqb xmm3, xmm0	        ; xmm3 = byte_mask(i => xmm2[i] == 0)
pmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

test r9d, r9d               ; if(r9d != 0)
jnz .found                  ;    goto .found
test r8d, r8d		        ; if(r8d != 0)
jnz .notfound			    ; 	goto .notfound

and rdi, -16		        ; align rdi on 16 bytes

.loop:
	add rdi, 0x10               ; rdi += 16
	movdqa xmm2, [rdi]          ; xmm2 = *(rdi)

movdqa xmm3, xmm2
pcmpeqb xmm3, xmm1      	; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
pmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

movdqa xmm3, xmm2
pcmpeqb xmm3, xmm0	        ; xmm3 = byte_mask(i => xmm2[i] == 0)
pmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

test r9d, r9d               ; if(r9d != 0)
jnz .found                  ;    goto .found
test r8d, r8d		        ; if(r8d == 0)
jz .loop			        ; 	goto .loop

.notfound:
	xor eax, eax                ; return NULL
	ret
.found:
	bsf r9d, r9d                ; r9d = index of the first set bit in r9d
	test r8d, r8d               ; if(r8d == 0)
	jz .noend                   ;     goto .noend
	bsf r8d, r8d                ; r8d = index of the first set bit in r8d
	cmp r8d, r9d                ; if(r8d < r9d)
	jb .notfound                ;     goto .notfound
.noend:
	mov rax, rdi                ; rax = rdi
	add rax, r9                 ; rax += r9
	ret

Note

Even with -O3, gcc did not replace my strchr implementation with a call to the standard strchr. It actually generated a pretty bad implementation (maybe this could be fixed by feeding it a more contrived implementation).

The implementation is similar to the strlen implementation, especially the 64-bit one. Indeed, we still need to find the end of the string, but we also need to find a given byte. As we know how to find a 0x00 byte in a quadword using the find_zero macro, we just need to XOR the quadword with the broadcast byte to nullify all the matching bytes (a XOR a = 0) and then find the 0x00 bytes.

Here are the results of the benchmark on my computer:

Benchmark of strchr implementations

The lods instruction is always slower than the mov alternatives, so… just don't use it.

Benchmark 6: iota

I wanted to end this benchmark of the string instructions with another benchmark of the stos instruction. We saw that this instruction is quite fast with a rep prefix, so let's see how it compares to a mov when used in a loop without a prefix:

generic

#include <stddef.h>

void iota_dummy(unsigned char *s, size_t n) {
    for (size_t i = 0; i < n; i++)
        s[i] = i;
}

byte

%macro iota_byte 1
	xor eax, eax
	mov rcx, rsi
	test rcx, rcx
	jz %%exit
%%loop:
	%1
	add rax, 0x1
	sub rcx, 0x1
	jnz %%loop
%%exit:
	ret
%endmacro

; void iota_stosb(rdi: unsigned char s[.n], rsi: size_t n);
iota_stosb:
	iota_byte stosb

%macro store_movb 0
	mov [rdi], al
	add rdi, 0x1
%endmacro

; void iota_movb(rdi: unsigned char s[.n], rsi: size_t n);
iota_movb:
	iota_byte store_movb

quad

%macro iota_quad 1
	mov r8, 0x0706050403020100
	mov r9, 0x0808080808080808
	mov rax, r8
	cmp rsi, 8
	jb %%end
%%loop:
	%1
	add rax, r9
	test al, al
	cmovz rax, r8
	sub rsi, 0x8
	cmp rsi, 0x8
	jae %%loop
%%end:
	ret
%endmacro
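To see why iota_quad needs the test al, al / cmovz pair, note that adding 0x0808080808080808 is not a per-byte addition: when the low byte passes 0xF8 the carry would propagate into the next lane, so the counter must be reset to its initial value at the wrap. A C sketch of the same scheme (the function name iota_quad_c is mine; the byte-wise tail loop is an assumption, since the asm version simply stops before the last partial quadword):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fill s[i] = (unsigned char)i, eight bytes per store: start from
   0x0706050403020100 and add 8 to every byte lane each iteration. */
static void iota_quad_c(unsigned char *s, size_t n) {
    const uint64_t init = 0x0706050403020100ULL;
    const uint64_t step = 0x0808080808080808ULL;
    uint64_t v = init;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        memcpy(s + i, &v, 8);      /* one quadword store */
        v += step;
        if ((unsigned char)v == 0) /* low byte wrapped: 0xF8 + 8 carries */
            v = init;              /* reset, mirroring cmovz rax, r8 */
    }
    for (; i < n; i++)             /* handle the 1..7 byte tail */
        s[i] = (unsigned char)i;
}
```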

; void iota_stosq(rdi: unsigned char s[.n], rsi: size_t n);
iota_stosq:
	iota_quad stosq

; void iota_movq(rdi: unsigned char s[.n], rsi: size_t n);
iota_movq:
	iota_quad store_movq

avx

; void iota_avx(rdi: void s[.n], rsi: size_t n);
iota_avx:
	cmp rsi, 0x10
	jb .exit
	vmovdqa xmm0, [.init_mask]
	vmovdqa xmm1, [.add_mask]
.loop:
	vmovdqu [rdi], xmm0         ; Copy the 16 bytes of xmm0 to s
	vpaddb xmm0, xmm0, xmm1
	add rdi, 0x10               ; Advance rdi by 16 bytes
	sub rsi, 0x10               ; n -= 16
	cmp rsi, 0x10
	jae .loop
.exit:
	ret
align 16
.init_mask: db 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF
.add_mask: times 16 db 0x10

avx2

; void iota_avx2(rdi: void s[.n], rsi: size_t n);
iota_avx2:
	shr rsi, 5                  ; Divide n by 32
	jz .exit
	vmovdqa ymm0, [.init_mask]
	vmovdqa ymm1, [.add_mask]
.loop:
	vmovdqu [rdi], ymm0         ; Copy the 32 bytes of ymm0 to s
	vpaddb ymm0, ymm0, ymm1
	add rdi, 0x20               ; Advance rdi by 32 bytes
	sub rsi, 1                  ; One fewer 32-byte block to go
	jnz .loop
.exit:
	ret
align 32
.init_mask: db 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
            db 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F
.add_mask: times 32 db 0x20

The benchmark result on my computer:

Benchmark of iota implementations

As we can see, there is no benefit to using the stos instruction without a prefix, as it will generally perform worse than an equivalent mov instruction.

This corroborates the claim of "Optimizing subroutines in assembly language":

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to the LOOP instruction and to JECXZ on some processors.

Conclusion

We've seen a lot in this article.
Now, what should you remember from all of this to write better assembly code? Here are some general guidelines to keep in mind when using string instructions:

- Use rep stosq to initialize large blocks of memory.
- Use rep movsq to copy large blocks of memory (more than 512 bytes) that fit in your cache (~7 MiB); for larger blocks, you should use prefetching mechanisms, AVX extensions, unrolled loops, and non-temporal stores.
- Do not use string instructions without a repeat prefix, as they will generally be slower than their classical alternatives.
- Do not use scas, cmps, or lods, as you can always come up with more efficient (AVX) versions.

String instructions are very portable, as they're present on all x86 processors (back to the 8086). However, most of the time they perform slower than the alternatives using the widest available registers. Therefore, except for rep stosq and rep movsq, you shouldn't use them unless you're optimizing for size.
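The first two guidelines can be sketched from C with GCC/Clang extended inline asm. These helpers (fill_quads and copy_quads are hypothetical names, x86-64 only) assume n is a multiple of 8; a real memset/memcpy must also handle the 1..7 byte tail:

```c
#include <stddef.h>

/* rep stosq: store the quadword in rax, rcx times, at [rdi]. */
static void fill_quads(void *dst, unsigned long long pattern, size_t n) {
    size_t quads = n / 8;
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(quads)
                     : "a"(pattern)
                     : "memory");
}

/* rep movsq: copy rcx quadwords from [rsi] to [rdi]. */
static void copy_quads(void *dst, const void *src, size_t n) {
    size_t quads = n / 8;
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(quads)
                     :
                     : "memory");
}
```

The "+D", "+S", and "+c" constraints pin the pointers and the count to rdi, rsi, and rcx, exactly the registers the string instructions implicitly use.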
