h a l f b a k e r yRIFHMAO (Rolling in flour, halfbaking my ass off)
add, search, annotate, link, view, overview, recent, by name, random
news, help, about, links, report a problem
browse anonymously,
or get an account
and write.
register,
|
|
|
Sometimes I want to grep a command line output with maybe
five lines of input, and sometimes I want to grep gigabytes
of data against a file full of patterns. Proposed is a "--just-
in-time-compiler" command line option in grep where certain
costs are paid up front in order to optimize throughput
of
larger data sets. Without the flag, the existing version runs
to avoid compiling costs.
Please log in.
If you're not logged in,
you can see what this page
looks like, but you will
not be able to add anything.
Annotation:
|
|
The compiling overhead isn't that high ... use shell script, shirley ? |
|
|
// Already done in existing grep // Are you sure about that? No doubt it parses the text search string into some intermediate binary form that is more efficient as the grep code loops through the files, but I seriously doubt that grep writes native assembly code optimized for the regular expression you just entered or writes C++ code and calls the C++ compiler. |
|
|
When I searched for regular expression compilers I found lots of references that turn out to be compiling to an intermediate form. My quick search found one academic paper about an actual compiler for regular expressions, but it wasn't clear if there had ever been a usable implementation. I suspect alot of people have started such projects but I don't expect you would be that highly compensated if you managed to finish a useful version. |
|
|
It's hard to guess how much optimization would actually be achievable. For a simple search criteria, say seaching for [aeiou][rstln], uncompiled code would might have a loop to compares the current data from the file to each of the 5 characters it is looking for. Compiled code for this purpose would probably be written to be space innefficient, so the loop would be unrolled into 5 comparisons to immediate values. I could easily see code execution time for the comparison cut by a factor of 2 to 4. With a really smart compiler you might be able to take advantage of multi-core processors to optimize it even more, and if it's aware of the cache size, if the unrolled code wouldn't fit in cache, it might compile an actual loop rather than an unrolled loop to avoid cache misses. |
|
|
Of course for most search terms that won't make a difference because the limiting factor is the hard drive, so the CPU does all the procesing while waiting for the drive. If you do have a whole file full of patterns you're searching against, maybe you could see some benefit. |
|
| |