Feature or enhancement
Proposal:
Python's regex prefix scanning loops advance one byte at a time to find the first character of a literal prefix. For `SIZEOF_SRE_CHAR == 1`, `memchr` is a drop-in replacement that uses SIMD, skipping, say, 32 bytes per iteration in the negative (no-match) case.
The two loops are in Modules/_sre/sre_lib.h (lines 1756-1759 and 1789-1792 at commit 7ce737e):

```c
while (*ptr != c) {
    if (++ptr >= end)
        return 0;
}
```

```c
while (*ptr++ != c) {
    if (ptr >= end)
        return 0;
}
```
Similar issues/PRs: gh-145797 (`memchr` in `str.split`), building on gh-57343 and gh-57345.
Proposed patch:

```diff
--- a/Modules/_sre/sre_lib.h
+++ b/Modules/_sre/sre_lib.h
@@ -1753,10 +1753,19 @@
     end = (SRE_CHAR *)state->end;
     state->must_advance = 0;
     while (ptr < end) {
+#if SIZEOF_SRE_CHAR == 1
+        {
+            SRE_CHAR *found = memchr(ptr, c, end - ptr);
+            if (!found)
+                return 0;
+            ptr = found;
+        }
+#else
         while (*ptr != c) {
             if (++ptr >= end)
                 return 0;
         }
+#endif
         TRACE(("|%p|%p|SEARCH LITERAL\n", pattern, ptr));
         state->start = ptr;
@@ -1786,10 +1795,19 @@
     while (ptr < end) {
         SRE_CHAR c = (SRE_CHAR) prefix[0];
+#if SIZEOF_SRE_CHAR == 1
+        {
+            SRE_CHAR *found = memchr(ptr, c, end - ptr);
+            if (!found)
+                return 0;
+            ptr = found + 1;
+        }
+#else
         while (*ptr++ != c) {
             if (ptr >= end)
                 return 0;
         }
+#endif
         if (ptr >= end)
             return 0;
```
I ran into this while profiling a log parser (pretty much grepping for error and warning strings) that applies a handful of literal-prefix regexes to each line of a ~5 MB log file. With the patch, end-to-end runtime drops by ~14%. A similar use case in the same application is scanning a binary executable/library for strings matching a pattern like `/common/prefix/(foo|bar|baz)`, where `memchr` would help even more in finding the `/` anchors.

Microbenchmark on my M4 MacBook (without PGO):
Single-char prefix `re.compile(r"X(A|B)")`, match at position N against `aaa...aXA`:

| match position | baseline | patched | speedup |
|---------------:|---------:|--------:|--------:|
| 0 | 55 ns | 58 ns | ~flat |
| 50 | 70 ns | 53 ns | 1.3x |
| 100 | 83 ns | 55 ns | 1.5x |
| 500 | 217 ns | 62 ns | 3.5x |
| 1,000 | 326 ns | 70 ns | 4.7x |
| 10,000 | 2,379 ns | 219 ns | 10.9x |
| 100,000 | 23,348 ns | 1,576 ns | 14.8x |
Multi-char prefix `re.compile(r"XXXX(A|B)")`, match at position N against `aaaa...aXXXXA`:

| match position | baseline | patched | speedup |
|---------------:|---------:|--------:|--------:|
| 0 | 57 ns | 56 ns | ~flat |
| 100 | 71 ns | 58 ns | 1.2x |
| 500 | 142 ns | 65 ns | 2.2x |
| 1,000 | 221 ns | 72 ns | 3.1x |
| 10,000 | 1,652 ns | 225 ns | 7.3x |
| 100,000 | 15,952 ns | 1,621 ns | 9.8x |
The reason the second benchmark does not improve as much as the first is that Apple Clang appears to have unrolled the second loop 4x, while the first loop is not unrolled at all. Neither loop was auto-vectorized.
Synthetic benchmark:

```python
"""Synthetic benchmark for SRE prefix scanning at various string lengths."""
import re
import time

def bench(pat, text, iterations):
    # Warm-up.
    for _ in range(1000):
        pat.search(text)
    t0 = time.perf_counter()
    for _ in range(iterations):
        pat.search(text)
    dt = time.perf_counter() - t0
    ns_per_call = dt / iterations * 1e9
    print(f" {ns_per_call:8.1f} ns/call")

# Single-char prefix: X(A|B) has prefix "X" (len=1)
pat1 = re.compile(r"X(A|B)")
print("Single-char prefix: X(A|B), match at position N")
for n in [0, 5, 10, 50, 100, 500, 1000, 10000, 100000]:
    text = "a" * n + "XA"
    print(f"  match at pos {n:>6d} (len={len(text):>6d})", end="")
    bench(pat1, text, max(100000, 1000000 // max(n, 1)))

# Multi-char prefix: XXXX(A|B) has prefix "XXXX" (len=4)
pat4 = re.compile(r"XXXX(A|B)")
print()
print("Multi-char prefix: XXXX(A|B), match at position N")
for n in [0, 5, 10, 50, 100, 500, 1000, 10000, 100000]:
    text = "a" * n + "XXXXA"
    print(f"  match at pos {n:>6d} (len={len(text):>6d})", end="")
    bench(pat4, text, max(100000, 1000000 // max(n, 1)))
```
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs