Feature or enhancement
Proposal:
Python's regex prefix scanning loops advance one byte at a time to find the first character of a literal prefix. For `SIZEOF_SRE_CHAR == 1`, `memchr` is a drop-in replacement that uses SIMD, skipping, say, 32 bytes per iteration in the negative (no-match) case.
The two loops are in Modules/_sre/sre_lib.h (lines 1756-1759 and 1789-1792 at commit 7ce737e):

```c
while (*ptr != c) {
    if (++ptr >= end)
        return 0;
}
```

```c
while (*ptr++ != c) {
    if (ptr >= end)
        return 0;
}
```
Similar issues/PRs: gh-145797 (`memchr` in `str.split`), building on gh-57343 and gh-57345.
Proposed patch:

```diff
--- a/Modules/_sre/sre_lib.h
+++ b/Modules/_sre/sre_lib.h
@@ -1753,10 +1753,19 @@
     end = (SRE_CHAR *)state->end;
     state->must_advance = 0;
     while (ptr < end) {
+#if SIZEOF_SRE_CHAR == 1
+        {
+            SRE_CHAR *found = memchr(ptr, c, end - ptr);
+            if (!found)
+                return 0;
+            ptr = found;
+        }
+#else
         while (*ptr != c) {
             if (++ptr >= end)
                 return 0;
         }
+#endif
         TRACE(("|%p|%p|SEARCH LITERAL\n", pattern, ptr));
         state->start = ptr;
@@ -1786,10 +1795,19 @@
     while (ptr < end) {
         SRE_CHAR c = (SRE_CHAR) prefix[0];
+#if SIZEOF_SRE_CHAR == 1
+        {
+            SRE_CHAR *found = memchr(ptr, c, end - ptr);
+            if (!found)
+                return 0;
+            ptr = found + 1;
+        }
+#else
         while (*ptr++ != c) {
             if (ptr >= end)
                 return 0;
         }
+#endif
         if (ptr >= end)
             return 0;
```
I ran into this while profiling a log parser (pretty much grepping for error and warning strings) that applies a handful of literal-prefix regexes to each line of a ~5 MB log file. With the patch, end-to-end runtime drops by ~14%. A similar use case in the same application is scanning a binary executable/library for strings matching a pattern like `/common/prefix/(foo|bar|baz)`, where `memchr` would help even more in finding the `/` anchors.

Microbenchmark on my M4 MacBook (without PGO):
Single-char prefix `re.compile(r"X(A|B)")`, match at position N against `aaa...aXA`:

| match position | baseline | patched | speedup |
|---------------:|---------:|--------:|--------:|
| 0 | 55 ns | 58 ns | ~flat |
| 50 | 70 ns | 53 ns | 1.3x |
| 100 | 83 ns | 55 ns | 1.5x |
| 500 | 217 ns | 62 ns | 3.5x |
| 1,000 | 326 ns | 70 ns | 4.7x |
| 10,000 | 2,379 ns | 219 ns | 10.9x |
| 100,000 | 23,348 ns | 1,576 ns | 14.8x |
Multi-char prefix `re.compile(r"XXXX(A|B)")`, match at position N against `aaaa...aXXXXA`:

| match position | baseline | patched | speedup |
|---------------:|---------:|--------:|--------:|
| 0 | 57 ns | 56 ns | ~flat |
| 100 | 71 ns | 58 ns | 1.2x |
| 500 | 142 ns | 65 ns | 2.2x |
| 1,000 | 221 ns | 72 ns | 3.1x |
| 10,000 | 1,652 ns | 225 ns | 7.3x |
| 100,000 | 15,952 ns | 1,621 ns | 9.8x |
The reason the second benchmark does not improve as much as the first is that Apple Clang appears to have unrolled the second loop 4x, while the first loop is not unrolled at all. Neither loop was auto-vectorized.
Synthetic benchmark:

```python
"""Synthetic benchmark for SRE prefix scanning at various string lengths."""
import re
import time

def bench(pat, text, iterations):
    # Warm-up.
    for _ in range(1000):
        pat.search(text)
    t0 = time.perf_counter()
    for _ in range(iterations):
        pat.search(text)
    dt = time.perf_counter() - t0
    ns_per_call = dt / iterations * 1e9
    print(f" {ns_per_call:8.1f} ns/call")

# Single-char prefix: X(A|B) has prefix "X" (len=1)
pat1 = re.compile(r"X(A|B)")
print("Single-char prefix: X(A|B), match at position N")
for n in [0, 5, 10, 50, 100, 500, 1000, 10000, 100000]:
    text = "a" * n + "XA"
    print(f"  match at pos {n:>6d} (len={len(text):>6d})", end="")
    bench(pat1, text, max(100000, 1000000 // max(n, 1)))

# Multi-char prefix: XXXX(A|B) has prefix "XXXX" (len=4)
pat4 = re.compile(r"XXXX(A|B)")
print()
print("Multi-char prefix: XXXX(A|B), match at position N")
for n in [0, 5, 10, 50, 100, 500, 1000, 10000, 100000]:
    text = "a" * n + "XXXXA"
    print(f"  match at pos {n:>6d} (len={len(text):>6d})", end="")
    bench(pat4, text, max(100000, 1000000 // max(n, 1)))
```
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs