[WIP] improve general speed when wildcard rules are set #41

Manually merged
twann merged 9 commits from wildcards_speed into main 6 months ago
twann commented 6 months ago
Owner

When a lot of filter lists are added and some wildcard rules are set, it can take hours to update the hosts file (#39). If you have 10 wildcard rules and 1 million blocked domains, TBlock currently has to repeat the same operation 10 * 1,000,000 = 10,000,000 times (see 9503966eec)!
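
To make that cost concrete, here is a minimal sketch of a naive check that tests every blocked domain against every wildcard rule with re.match(). It is not TBlock's actual code: the function name naive_filter, the sample domains, and the use of fnmatch.translate to turn a wildcard into a regex are assumptions made for illustration only.

```python
import re
from fnmatch import translate


def naive_filter(domains, wildcards):
    """Hypothetical helper: drop domains matching any wildcard, counting checks."""
    # Turn each shell-style wildcard into a compiled regular expression.
    regexes = [re.compile(translate(w)) for w in wildcards]
    checks = 0
    kept = []
    for domain in domains:
        matched = False
        for rx in regexes:
            checks += 1  # one re.match() call per (domain, wildcard) pair
            if rx.match(domain):
                matched = True
        if not matched:
            kept.append(domain)
    return kept, checks


# Made-up sample data, not taken from any real filter list.
domains = ["ads.example.com", "cdn.example.com", "tracker.example.org"]
wildcards = ["*.example.com", "*.example.net"]
kept, checks = naive_filter(domains, wildcards)
```

The number of checks is always len(domains) * len(wildcards), so the work grows with the product of the two sizes.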

TODO:

  • [x] Insert all rules without checking, then run a SQL query afterwards to remove wildcard matches
  • [x] Use sqlite directly to match wildcards instead of re.match()
  • [ ] Prevent the user from subscribing to filter lists that are already included in other subscribed filter lists.
  • [x] When updating the hosts file, do not write directly to disk, as it is terribly slow.
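
The first two items could be sketched roughly as follows with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name rules, the column name domain, and the helper remove_wildcard_matches are hypothetical and do not reflect TBlock's real schema. It relies on SQLite's GLOB operator, which understands * and ? and is case-sensitive.

```python
import sqlite3


def remove_wildcard_matches(conn, patterns):
    """Hypothetical helper: delete rows whose domain matches a wildcard pattern."""
    cur = conn.cursor()
    deleted = 0
    for pattern in patterns:
        # One DELETE per wildcard; SQLite does the matching, not Python.
        cur.execute("DELETE FROM rules WHERE domain GLOB ?", (pattern,))
        deleted += cur.rowcount
    conn.commit()
    return deleted


# Assumed schema and made-up sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (domain TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO rules VALUES (?)",
    [("ads.example.com",), ("cdn.example.com",), ("tracker.example.org",)],
)

# Rules were inserted without any check; wildcard matches are removed afterwards.
deleted = remove_wildcard_matches(conn, ["*.example.com"])
remaining = [row[0] for row in conn.execute("SELECT domain FROM rules ORDER BY domain")]
```

With this shape, Python issues one statement per wildcard rule instead of one re.match() call per (rule, domain) pair.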

EDIT: the solution for updating the hosts file seems to be to alternate between the two. Once 10,000 domains are stored in memory, TBlock writes them to the hosts file and starts again with the next 10,000 rules. That way, it is faster, and it doesn't take too much memory.
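
The batching idea described in the EDIT can be sketched like this. It is a minimal illustration, not TBlock's implementation: the function name write_hosts, the 0.0.0.0 redirect address, and the sample domains are assumptions; only the batch size of 10,000 comes from the comment above.

```python
import io

BATCH_SIZE = 10_000  # batch size quoted in the comment above


def write_hosts(domains, out, batch_size=BATCH_SIZE, ip="0.0.0.0"):
    """Hypothetical helper: write hosts entries in batches, return the flush count."""
    batch = []
    flushes = 0
    for domain in domains:
        batch.append(f"{ip} {domain}\n")
        if len(batch) >= batch_size:
            out.writelines(batch)  # one write per full batch
            batch.clear()          # memory use stays bounded by batch_size
            flushes += 1
    if batch:
        out.writelines(batch)      # write the final, partial batch
        flushes += 1
    return flushes


# Small demonstration with an in-memory buffer and a tiny batch size.
buf = io.StringIO()
flushes = write_hosts((f"host{i}.example" for i in range(25)), buf, batch_size=10)
lines = buf.getvalue().splitlines()
```

In real use, out would be an open file object for the hosts file rather than an in-memory buffer.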

twann added this to the (deleted) milestone 6 months ago
twann self-assigned this 6 months ago
twann added 1 commit 6 months ago
twann added 2 commits 6 months ago
Poster
Owner

In case anyone wants to test, the main issue should be solved by now. However, since it has not passed any tests yet, I recommend testing it in a VM and not on your personal device.

twann added this to the (deleted) milestone 6 months ago

So with this pull request:

==> Updating database from v2.1.0 to v2.1.1
==> Creating new database

That took only about five minutes, which is fine I guess.

However, when using -a:

:: You are about to allow the following domains:

  go.duolingo.com

:: Are you sure to continue ? [y/n] y
==> Adding rules
 [✓] Adding rule for domain: go.duolingo.com
 [✓] Checking allowed domains: 478/478
==> Updating hosts file
 [✓] Retrieving rules: 27211129/27211129 (100%)
 [✓] Writing new hosts file

This took around 55 minutes in total (between 45 and 50 minutes for the "Checking allowed domains" part). That is faster than it was before, but I can imagine that I might have a few thousand allowed domains in a few years...

Oh, also: when I use -U, for some reason it restarts "Checking allowed domains" after it reaches the 478 domains?

 [✓] Fetching oisd-extra (8.0 KB): 100%
 [✓] Fetching oisd-full (6.2 MB): 100%
 [✓] Fetching oisd-nsfw (2.5 MB): 100%
==> Updating filter list: 1hosts-lite
 [✓] Cleaning rules cache: 2
 [i] Filter list syntax is: list
 [✓] Checking allowed domains: 478/4780950): 0%
 [✓] Checking allowed domains: 478/4780950): 0%
 [|] Checking allowed domains: 249/4780950): 0%
 [✓] Checking allowed domains: 478/478
 [✓] Checking allowed domains: 478/4780950): 0%
 [-] Checking allowed domains: 143/4780950): 0%
 [|] Checking allowed domains: 204/478
 [-] Checking allowed domains: 221/478

(The boxes without the ✓ are there because I pressed the Enter key.)
It's been three hours since I started that process and I feel like it won't finish.

Edit: 1hosts-lite has 100950 entries and not 4780950, so at the very least the output is broken.

twann added 1 commit 6 months ago
twann added 1 commit 6 months ago
Poster
Owner

@schrmh Yeah, thanks for the feedback. It should be fixed by 1963bf1297 and 1a1081d419.
This PR needs some more work before a stable version can be released.

twann added 1 commit 6 months ago
twann added 2 commits 6 months ago
Poster
Owner

@schrmh I just tried, and with 1,434,830 domains and 336 allowed domains, the "Checking allowed domains" step takes less than 40 seconds.
However, it is clear that the more rules you have, the slower the operation will be.

EDIT: I just checked #39 again, and it seems like you are using the decloudflare blocklist. This list is known to REALLY slow the program down, which is why it is marked "not recommended for daily use" (source: https://tblock.codeberg.page/repository/?filter=decloudflare)*. If you still want to subscribe to it, then you should expect TBlock to be slow, even with the latest changes that this PR will bring.


* JS is required for this link to work, but you can also check by running tblock -I decloudflare

twann added 1 commit 6 months ago
twann added this to the 2.2.0 milestone 6 months ago
twann merged commit cea452d0ea into main manually 6 months ago
twann deleted branch wildcards_speed 5 months ago
The pull request has been manually merged as cea452d0ea.
Reference: tblock/tblock#41