The Crapicity Machine
The javascriptification of the Web means – among other, perhaps worse problems – that in today's commercial internet, many pages (as in: their HTML source) are entirely empty, except perhaps for some lame order to switch on Javascript if you're lucky. That's sad, if only because things like the Parallelweb (that could, for instance, translate the entire Web to Swebian) will be very boring on such pages.
There's a simple measure for how much pages suck by such a standard: the ratio of extractable text to the total size of the document. That's what I call crapicity (or c7y for short; and yes, you can of course construct examples where it'll be grossly misleading). Using BeautifulSoup, it's easily computed:
from bs4 import BeautifulSoup

def compute_crapicity(doc):
    """returns the crapicity of html in doc.

    doc really should be a str -- but if len() and BeautifulSoup() return
    something sensible with it, you can get away with something else, too.
    """
    parsed = BeautifulSoup(doc, "html.parser")
    content_length = max(len(parsed.text), 1)
    return len(doc)/content_length
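If you don't have BeautifulSoup at hand, a rough approximation of the same ratio can be had from the standard library's html.parser. This is just a sketch to show the idea – it is not part of c7yops, and it will disagree with BeautifulSoup on malformed HTML:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, roughly like BeautifulSoup's .text."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def rough_crapicity(doc):
    extractor = TextExtractor()
    extractor.feed(doc)
    text_length = max(len("".join(extractor.chunks)), 1)
    return len(doc)/text_length

# 38 bytes of HTML, 5 characters of text: crapicity 7.6
print(rough_crapicity("<html><body><p>Hello</p></body></html>"))
```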
The program distributed here is built around this function; there's a command line interface, a SQLite-based result storage, and a web interface.
Installation
c7yops is a one-module program – you can run it in place or put it somewhere into your path:
$ sudo apt install git python3-bs4 python3-requests # or your distro's equivalent
$ git clone https://codeberg.org/AnselmF/crapicity.git
$ cd crapicity
$ chmod +x c7yops
$ ./c7yops compute <some html file>
# perhaps
$ sudo cp c7yops /usr/local/bin
Usage
c7yops has a few subcommands; in the synopses below, {x} means zero or more of x, [x] means zero or one of x, and a|b means a or b.
compute {filename|URL}
computes the crapicity of files or remote resources, given as local file names or URLs on the command line, and prints the results. Note that these results are not recorded in the database.
update {URL}
computes the crapicity of documents at URLs and records these results in the database.
dump [pattern]
dumps URLs and their crapicities from the database. Optionally give a pattern for SQL LIKE (i.e., % matches zero or more characters) to only dump matching entries.
serve port prefix css-url
runs a web server at localhost:port with the root page exposed at prefix that lets people rate web pages and review previous results; css-url points to an external stylesheet (pass "" for none). See below.
Example
$ curl -o tmp.html https://codeberg.org/AnselmF/crapicity
$ c7yops compute tmp.html https://codeberg.org/AnselmF/crapicity
tmp.html: 18.32
https://codeberg.org/AnselmF/crapicity: 18.32
$ c7yops update https://gnu.org https://microsoft.com
https://gnu.org: 4.29
https://microsoft.com: 20.69
$ c7yops dump
https://gnu.org: 4.29
https://microsoft.com: 20.69
$ c7yops update http://blog.tfiu.de
http://blog.tfiu.de: 3.38
$ c7yops dump
http://blog.tfiu.de: 3.38
https://gnu.org: 4.29
https://microsoft.com: 20.69
$ c7yops serve 1070 "" https://blog.tfiu.de/media/2023/c7y.css
# now point your browser to http://localhost:1070
Running a crapicity Server
c7yops has a built-in web server; in principle, you could make it bind to "" (see the argument to ThreadingHTTPServer in do_serve) and then expose it to the net; I'd be reasonably relaxed doing that.
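For reference, the binding works as in this generic stdlib sketch (this is not c7yops' actual do_serve, and the handler is made up):

```python
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

# The first element of the address tuple decides the exposure:
# "localhost" only serves local clients, "" listens on all interfaces.
server = ThreadingHTTPServer(("", 8099), Handler)
# server.serve_forever()  # blocks until server.shutdown()
```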
However, the way it's really supposed to run is somewhere within a larger site; for instance, in some nginx site you would say:
location /c7y {
proxy_pass http://localhost:8099;
}
and then run:
c7yops serve 8099 /c7y ""
After that, http://<site>/c7y should show the (unstyled) interface.
There is some CSS coming with c7y that's really intended to be customised and served externally. Copy c7y.css to some directory that is statically served from the hosting server and pass the (URL) path to that to your c7yops serve call, for instance:
c7yops serve 8099 /c7y /media/2023/c7y.css
To make this a little less haphazard, kill the c7yops server again and create a new user to run the server; the fewer privileges it has the better, but it needs to be able to write to its home. If you're liberal, do:
$ sudo adduser --disabled-login c7yscorer
Become that user and clone the crapicity repo into its home:
$ sudo -u c7yscorer bash
$ cd
$ git clone https://codeberg.org/AnselmF/crapicity.git
If you have already collected a few scores as yourself, you can copy the existing database with something like:
$ mkdir -p .local/share/c7y
$ cp <your private .local/share/c7y/scores.db> .local/share/c7y/scores.db
(where, in reality, the second command will have to be a bit more complicated, because your new user cannot look into other users' homes).
Now create a startup script. With sysvinit, something like the following should work (adapt SCRIPT_ARGS and SCRIPT_LOCATION as necessary for your machine):
#!/bin/sh
### BEGIN INIT INFO
# Provides: crapicity
# Required-Start: $local_fs $network
# Required-Stop: $local_fs
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Crapicity server
# Description: A web server for scoring and publishing crapicity scores
### END INIT INFO
SCRIPT_LOCATION=/home/c7yscorer/crapicity/c7yops
SCRIPT_ARGS="serve 8099 /c7y /media/2023/c7y.css"
RUNAS=c7yscorer  # don't call this UID -- that's a readonly variable in bash
DAEMONOPTS="--pidfile /run/c7yops.pid --oknodo --exec /usr/bin/python3"

do_start() {
    start-stop-daemon --start $DAEMONOPTS \
        --make-pidfile --background --chuid $RUNAS \
        --output /var/log/c7yserver.log \
        -- $SCRIPT_LOCATION $SCRIPT_ARGS
}

do_stop() {
    start-stop-daemon --stop $DAEMONOPTS \
        --remove-pidfile
}

case "$1" in
    start)
        do_start
        ;;
    stop)
        do_stop
        ;;
    restart)
        do_stop
        do_start
        ;;
    *)
        echo "USAGE: $0 start|stop|restart"
        ;;
esac
exit 0
A systemd unit file should be correspondingly simpler. After (assuming a Debian box) sudo update-rc.d crapicity defaults and sudo service crapicity start, you should be good to go.
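Such a unit might look like this (an untested sketch; adjust User, paths, and arguments just as for the init script above):

```ini
# /etc/systemd/system/crapicity.service
[Unit]
Description=Crapicity server
After=network.target

[Service]
User=c7yscorer
ExecStart=/usr/bin/python3 /home/c7yscorer/crapicity/c7yops serve 8099 /c7y /media/2023/c7y.css
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With systemd, you would then enable and start it with sudo systemctl enable --now crapicity instead of the update-rc.d/service pair.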