Collecting Data for Query Suggestions
"a poor person's approach"
1. Crawl the web
Below, replace SITE.TLD (twice) with the site you will crawl. Replace NAME with your name to announce your crawler properly. Only if you are evil, add -e robots=off (which makes wget ignore robots.txt). Kill the process once you have enough pages.
wget --timeout=9 --wait=2 --random-wait --level=inf --html-extension \
--recursive --span-hosts --domains=SITE.TLD --no-clobber --tries=2 \
--user-agent='NAME' --restrict-file-names=windows \
--reject=jpg,js,css,png,gif,doc,docx,jpeg,pdf,mp3,avi,mpeg,txt,ico \
--no-verbose --no-check-certificate \
http://SITE.TLD
2. Get anchor text
find . -name "*.htm*" -type f -exec cat \{\} \; \
| ./anchors1_linebreaks.pl | ./anchors2_extract.pl \
| ./anchors3_replace.pl | sort -f >anchors.txt
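To give an idea of what this step produces, here is a minimal sketch of anchor text extraction using only grep and sed. It is not a substitute for the anchors*.pl scripts, whose exact behavior is not described here; the function name extract_anchors is hypothetical, and the sketch assumes each anchor sits on a single line and contains no nested tags.

```shell
# Hypothetical helper, not part of this repository.
# Prints the text of every <a ...>text</a> element found on stdin,
# one anchor text per line. Anchors that span several lines or that
# contain nested tags are skipped.
extract_anchors() {
  grep -o -i '<a[^>]*>[^<][^<]*</a>' |
    sed -e 's/<[^>]*>//g'
}

# Demo on a single line of HTML
printf '<p><a href="/a">About us</a> and <A HREF="/c">Contact</A></p>\n' |
  extract_anchors
```

The demo prints About us and Contact on separate lines.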
3. Score texts (count and normalize score by length)
cat anchors.txt | ./anchors5_count.pl | ./anchors6_clean.pl \
| ./anchors7_adultfilter.pl | ./anchors8_add.pl | sort -r -n \
>anchors_count.txt
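The counting part of this step can be sketched as follows. This is only an illustration under stated assumptions: the function name count_anchors is hypothetical, the tab-separated "count, then text" output format is inferred from the grep pattern in step 4, and the length normalization, cleaning and adult filtering done by the real scripts are left out.

```shell
# Hypothetical helper, not part of this repository.
# Reads one anchor text per line, counts duplicates case-insensitively,
# and prints "count<TAB>text" with the most frequent texts first.
count_anchors() {
  awk '{ n[tolower($0)]++ }
       END { for (t in n) printf "%d\t%s\n", n[t], t }' |
    sort -r -n
}

# Demo
printf 'Home\nhome\nAbout us\nHOME\nabout us\nContact\n' | count_anchors
```

In the demo, home comes first with a count of 3, followed by about us with 2 and contact with 1.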
4. Test locally
grep -i -P '\ts' anchors_count.txt | more
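The grep above lists the texts that start with "s". To try other prefixes without retyping the pattern, a small wrapper can help. This is a sketch only: the function name suggest and the limit of 10 results are arbitrary choices, and it assumes anchors_count.txt has two tab-separated columns (score, then text).

```shell
# Hypothetical helper, not part of this repository.
# Usage: suggest PREFIX [FILE]
# Prints up to 10 lines of FILE whose second tab-separated column
# starts with PREFIX, ignoring case. FILE defaults to anchors_count.txt.
suggest() {
  prefix=$1
  file=${2:-anchors_count.txt}
  awk -F '\t' -v p="$prefix" \
    'index(tolower($2), tolower(p)) == 1' "$file" | head -n 10
}

# Demo on a small temporary file
demo=$(mktemp)
printf '9\tsearch engine\n4\tSitemap\n2\tabout\n' >"$demo"
suggest s "$demo"
rm -f "$demo"
```

Because the file is already sorted by score, the first lines returned are the highest-scoring suggestions for that prefix.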
5. Run the suggestions engine
java -jar target/searsiasuggest.jar -f anchors_count.txt