i18n-like text engine script thingy.
 
Reth af973f2d0b
initial
2 months ago
LICENSE initial 2 months ago
README.md initial 2 months ago
search.py initial 2 months ago
words.json initial 2 months ago

README.md

s4h (search)

s4h is a word search script, using i18n-like word compression. no extra modules, under 50 lines.

inspired by this chost.

usage

clone the repo OR download this (6.5MB) as words.json and put it in the same dir as the script.
then you can run the script using:

python search.py "query"

replace query with anything you want to decompress/search. only [A-Za-z0-9\s] is accepted; the clean() function filters out the rest.

yes, it picks the first word alphabetically, so you're gonna have to do some tricks unless it's a really specific word.

examples

; basic search
$ python search.py a26m
antidisestablishmentarianism

; when a word is not found
$ python search.py thisisnotaword
NOTFOUND

; support for several words
$ python search.py "sev3l word com7ns"
several word combinations

; support for zeroes
$ python search.py v0id2
video

; in case a clean query cannot be made
$ python search.py 0
void

; characters outside [A-Za-z0-9\s] (punctuation and the like) are ignored
$ python search.py auth,.,.11m
authoritarianism

there are some edge cases (like 0000 returning NOTFOUND instead of void), but don't worry. submit a PR if you have a nice-looking quick fix.

fun fact: the longest word in the dict is "dichlorodiphenyltrichloroethane", and can be derived with a query of 31.

how it works

well, it's simpler than it looks. first it uses the argparse module to read the query into a variable named query. after that, a few functions take it to the result.
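a minimal sketch of the argparse step described above. the function name parse_query and the help strings are assumptions for illustration, not taken from search.py; only the single positional query argument comes from the usage section.

```python
import argparse

def parse_query(argv=None):
    # hypothetical helper: one positional argument, as in `python search.py "query"`
    parser = argparse.ArgumentParser(description="expand i18n-style compressed words")
    parser.add_argument("query", help="text to decompress/search")
    # argv=None makes argparse read sys.argv[1:]; passing a list helps testing
    return parser.parse_args(argv).query

query = parse_query(["sev3l word com7ns"])
print(query)
```

quoting matters on the command line: without the quotes, the shell would pass three separate arguments and argparse would reject the extra ones.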

clean()

  • makes the initial query lowercase and strips all trailing spaces.
  • removes anything that isn't a-z, 0-9 or whitespace.
  • lazily checks if the result is empty or 0, and returns "void" in that case.
  • otherwise returns the clean query.
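the steps above can be sketched roughly like this. it is a reconstruction from the bullet list, not the actual code in search.py; using str.strip() (which also removes leading whitespace) and re.sub for the filter are assumptions.

```python
import re

def clean(query):
    # lowercase and strip surrounding whitespace (assumption: strip() rather than rstrip())
    query = query.lower().strip()
    # drop everything that is not a-z, 0-9 or whitespace
    query = re.sub(r"[^a-z0-9\s]", "", query)
    # the "lazy" check: only an empty string or a lone "0" becomes "void"
    if query in ("", "0"):
        return "void"
    return query

print(clean("auth,.,.11m"))
print(clean("0"))
```

because the check is an exact comparison, a query like 0000 passes through unchanged in this sketch, which lines up with the edge case mentioned in the examples section.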

transform()

  • splits the query into blocks, based on type (alpha/digit/space).
  • if it's numeric - transform into \w{num}, where num is the number.
  • connects everything back.
  • compresses whitespace blocks into single spaces using a clever hack.
  • returns the regexed query.
  • splits the query into word patterns.
  • loads the word dictionary from a file.
  • makes a pattern to match each word (^word$).
  • gets the matched word list for each pattern.
  • picks the first one alphabetically.
  • if no matches have been made, the word is "NOTFOUND".
  • returns the final expanded result.
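a rough sketch of the list above, split into two steps: building the regex, then looking words up. the names lookup and words are hypothetical, the "clever hack" for whitespace is replaced by a plain branch, and the dictionary is passed in as a list because the layout of words.json isn't described here.

```python
import re

def transform(query):
    # split into runs of letters, digits or whitespace
    blocks = re.findall(r"[a-z]+|[0-9]+|\s+", query)
    parts = []
    for block in blocks:
        if block.isdigit():
            # a number n becomes "exactly n word characters"
            parts.append(r"\w{%d}" % int(block))
        elif block.isspace():
            # any whitespace run collapses to a single space
            parts.append(" ")
        else:
            parts.append(block)
    return "".join(parts)

def lookup(pattern, words):
    # hypothetical helper: expand each space-separated word pattern
    result = []
    for word_pattern in pattern.split(" "):
        regex = re.compile("^" + word_pattern + "$")
        matches = sorted(w for w in words if regex.match(w))
        # first match alphabetically, or NOTFOUND when nothing matches
        result.append(matches[0] if matches else "NOTFOUND")
    return " ".join(result)

# tiny stand-in for the real dictionary
words = ["video", "vivid", "several", "word", "combinations"]
print(transform("sev3l word com7ns"))
print(lookup(transform("sev3l word com7ns"), words))
```

with the stand-in list, v0id2 turns into v\w{0}id\w{2}, which only video can match; against the full dictionary the result depends on which matching word sorts first.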

notes

i feel like performance can be improved, but it's not like i care. it runs alright for something that goes through 370K words: about 500-600ms for 1 word, although it slows down a bit on large sentences.

try "31 is 1n am4gl1 lo1g wor1 b3use im 1 v5nx who h1t1s j1v1 sc2pt pos2ng 2out pro3ms on 2host vi1 my un1x sys2m" as proof, 2 seconds on my machine. yes, i added "voidlynx" into the repo's dictionary, why wouldn't i.

as said before, do feel free to drop some performance-improving PRs, as long as they don't make the code unreadable.
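one possible direction, as an untested idea rather than a measured fix: every pattern built from letters and \w{n} has a fixed length, so the dictionary could be bucketed by word length once and each pattern checked only against its own bucket. all names below (bucket_by_length, pattern_length, first_match) are hypothetical and not part of search.py.

```python
import re
from collections import defaultdict

def bucket_by_length(words):
    # group words by length; sorting first keeps each bucket alphabetical
    buckets = defaultdict(list)
    for word in sorted(words):
        buckets[len(word)].append(word)
    return buckets

def pattern_length(word_pattern):
    # literal letters count as 1 each, \w{n} counts as n
    total = 0
    for count, letters in re.findall(r"\\w\{(\d+)\}|([a-z]+)", word_pattern):
        total += int(count) if count else len(letters)
    return total

def first_match(word_pattern, buckets):
    regex = re.compile(word_pattern)
    # only scan words whose length can possibly match
    for word in buckets.get(pattern_length(word_pattern), []):
        if regex.fullmatch(word):
            return word
    return "NOTFOUND"

buckets = bucket_by_length(["video", "vivid", "audio", "several"])
print(first_match(r"v\w{0}id\w{2}", buckets))
```

whether this actually helps depends on how the time splits between loading the 6.5MB file and matching, which isn't measured here.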