
"If computers become too powerful, we can organize them into committees. That will finish them off" (c) unknown author
Introduction
Most developers have had to deal with regular expressions one way or another. My first acquaintance was with the implementation of regexp in STL
std::regexp
. Most often, regular expressions are used in checking input data, something like checking the correctness of the user-entered URL, IPv4 address, IPv6 address, telephone number, and at the same time, the speed of the regex operation does not greatly affect the response time of the application. But what if you have to test hundreds, thousands, or even tens of thousands of rules, all against constantly changing sets of input data in real time? In this situation, you don't just need a fast algorithm, you need the best one, you need a champion!
Selecting competition participants
The selection criteria for participants are quite simple:
availability of source codes
compatibility according to the rules themselves, at least at a basic level
code written in C/C++
Yes, I’m a C++ fan, so I don’t consider regexp in languages other than C/C++, sorry
So,
googling
Using Yandex we find the following applicants:
Assigned name
Language
A link to the repository
stdlib
C++
-
tiny-regex-c
C
https://github.com/kokke/tiny-regex-c
ximtech
C
https://github.com/ximtech/Regex
boost
C++
https://github.com/boostorg/regex.git
hyperscan
С
https://github.com/intel/hyperscan
google-re2
C++
Methodology
Now we need to decide how we will check the participants.
There is a pre-compiled list of regular expressions that is the same for all participants.
There is a pre-compiled list of data sets that is the same for all participants.
To align the time measurements, we will iteratively check each (of
NR
) rule
IR
once per
ID
iterations of data sets (from
ND
), i.e. the number of checks for each participant will be:
SUM = IR * NR * ID * ND
To reduce the mutual influence of participants on each other, we will collect separate applications for each participant.
To maintain the relevance of the project, the source code of the participant is cloned from his repository, from the master branch or the tolerant main branch.
If the participant is a library, then its assembly is carried out.
The actual code that performs the check consists of a class with two abstract methods - one for preparing rules (including their compilation) and one for directly checking the compiled rules on the data set.
We will run the races on a virtual machine in the cloud
Great race
The code is provided in the repository https://github.com/shvit/regexp_checker
For those who want to run the race on their own computer, you will need to install dependencies. For Ubuntu 20.04 (also suitable for
ubuntu-latest
):
sudo apt install build-essential cmake ragel git libboost-all-dev
git clone https://github.com/abseil/abseil-cpp abseil-cpp && cd abseil-cpp && mkdir -p bin && cd bin && cmake -DCMAKE_CXX_STANDARD=17 -DCMAKE_POSITION_INDEPENDENT_CODE=ON .. && cmake --build . && sudo cmake --install .
Now we clone the race onto our computer and run it
git clone https://github.com/shvit/regexp_checker
cd regexp_checker
make check
As a check, instead of unit tests, all participants produced the same number of matches - according to
2'800'000
.
Checks were carried out with the following parameters:
NR=10, IR=20'000, ND=5, ID=10
. Total number of checks per participant
SUM=10'000'000
.
Conclusion
With the current composition of participants, the undisputed champion is the project
hyperscan
, but also pleased with the participant who took second place -
boost
.
It should also be noted that projects
tiny-regex-c
And
ximtech
support a reduced set of regexp elements, which is why the list of rules had to be significantly reduced. Project
ximtech
is a development of the project
tiny-regex-c
. Also in the project
tiny-regex-c
there are bugs and also it does not allow compiling more than one rule due to its architecture.
To use the project
google-re2
Libraries will need to be installed
Abseil
and also a non-trivial linking of the finished project will be required.
In conclusion, I would like to note that the project has
hyperscan
the ability to apply rules on streams, which in some cases can be decisive.
If anyone knows other potential participants in the race, please suggest in the comments.