# s4h (search)
s4h is a word search script, using i18n-like word compression. no extra modules, under 50 lines.
inspired by this chost.
## usage
clone the repo OR download this (6.5MB) as words.json, and put it in the same dir as the script.

then you can run the script with:

```
python search.py "query"
```

instead of query, put anything you want to decompress/search. it accepts only [A-Za-z0-9\s] and filters out the rest in the clean() function.
yes, it picks the first word alphabetically, so you're gonna have to do some tricks unless it's a really specific word.
## examples
```
; basic search
$ python search.py a26m
antidisestablishmentarianism

; when a word is not found
$ python search.py thisisnotaword
NOTFOUND

; support for several words
$ python search.py "sev3l word com7ns"
several word combinations

; support for zeroes
$ python search.py v0id2
video

; in case a clean query cannot be made
$ python search.py 0
void

; non-ascii (sorta) characters are ignored
$ python search.py auth,.,.11m
authoritarianism
```
there are some edge cases (like 0000 returning NOTFOUND instead of void), but don't worry. submit a PR if you have a nice-looking quick fix.
fun fact: the longest word in the dict is "dichlorodiphenyltrichloroethane", and it can be derived with a query of `31`.
## how it works
well, it's simpler than it looks. first it uses the argparse module to get the query and load it into a variable named `query`. after that it runs a few functions to get to the result.
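something along these lines, as a sketch (the exact names in search.py may differ):

```python
import argparse

# get the compressed query from the command line
parser = argparse.ArgumentParser(description="s4h word search")
parser.add_argument("query", help='compressed query, e.g. "sev3l word com7ns"')
query = parser.parse_args().query
```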
### clean()

- makes the initial query lowercase, also strips all trailing spaces.
- removes anything that isn't a-z0-9 or whitespace.
- lazily checks if it's empty or 0, and returns "void" in that case.
- else returns the clean query.
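a minimal sketch of what that could look like, assuming a regex-based filter (not necessarily the exact code in search.py):

```python
import re

def clean(query):
    # lowercase and strip surrounding whitespace
    query = query.lower().strip()
    # keep only a-z, 0-9 and whitespace, drop everything else
    query = re.sub(r"[^a-z0-9\s]", "", query)
    # lazy fallback: an empty or "0" query becomes "void"
    if not query or query == "0":
        return "void"
    return query
```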
### transform()

- splits the query into blocks, based on type (alpha/digit/space).
- if a block is numeric, transforms it into `\w{num}`, where num is the number.
- connects everything back together.
- compresses whitespace blocks into single spaces using a clever hack.
- returns the regexed query.
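one way to do the block splitting is itertools.groupby over a character-type key; this is just a sketch of the idea, the actual script may slice the blocks differently:

```python
from itertools import groupby

def _ctype(ch):
    # classify a character as digit, space or alpha
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "alpha"

def transform(query):
    parts = []
    # walk over consecutive blocks of the same character type
    for kind, block in groupby(query, key=_ctype):
        block = "".join(block)
        if kind == "digit":
            # a number n becomes "any n word characters"
            parts.append(r"\w{" + block + "}")
        elif kind == "space":
            # collapse a whitespace run into a single space
            parts.append(" ")
        else:
            parts.append(block)
    return "".join(parts)
```

so "sev3l word com7ns" becomes the pattern "sev\w{3}l word com\w{7}ns".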
### search()

- splits the query into word patterns.
- loads the word dictionary from a file.
- builds a pattern to match each word (`^word$`).
- gets the matched word list for each pattern.
- picks the first one alphabetically.
- if no matches have been made, the word is "NOTFOUND".
- returns the final expanded result.
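a rough sketch of that lookup, assuming words.json is a flat JSON list of lowercase words (the real file layout may differ):

```python
import json
import re

def search(pattern, words_path="words.json"):
    # load the word dictionary; assumed here to be a flat list of words
    with open(words_path) as fh:
        words = json.load(fh)
    result = []
    # the transformed query holds one regex per word, separated by spaces
    for word_pattern in pattern.split(" "):
        rx = re.compile("^" + word_pattern + "$")
        matches = sorted(w for w in words if rx.match(w))
        # first alphabetical match wins, NOTFOUND when nothing matches
        result.append(matches[0] if matches else "NOTFOUND")
    return " ".join(result)
```

chaining clean(), transform() and search() in that order gives the whole pipeline.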
## notes
i feel like performance can be improved, but it's not like i care. it loads alright for something that goes through 370K words: 500-600ms for a single word, although it slows down a bit on larger sentences.

try `"31 is 1n am4gl1 lo1g wor1 b3use im 1 v5nx who h1t1s j1v1 sc2pt pos2ng 2out pro3ms on 2host vi1 my un1x sys2m"` as proof, 2 seconds on my machine. yes, i added "voidlynx" into the repo's dictionary, why wouldn't i.
as said before, do feel free to drop some performance-improving PRs, as long as they don't make the code unreadable.