[BUG] tblock -a becomes incredibly slower the more filters are enabled #39

Closed
opened 5 months ago by schrmh · 16 comments

Before opening:

  • [x] This issue is not duplicated

Platform and software version:

  • I am running TBlock version: v2.1.0
  • My local version of the filter list repository is: v220505
  • Check the platform you are running:
    • [x] GNU/Linux, Linux
    • [ ] Android (using Termux)
    • [ ] macOS
    • [ ] Microsoft Windows
    • [ ] *BSD
    • [ ] Other: please specify here
  • Check the installation method you used to install TBlock:
    • [ ] Python Index Package (pip)
    • [x] AUR
    • [ ] Fedora COPR
    • [ ] Ubuntu PPA
    • [ ] Homebrew
    • [ ] Scoop
    • [ ] Windows installer (EXE file)
    • [ ] Makefile (using the latest release)
    • [ ] Makefile (using the main branch)
    • [ ] Other: please specify here

Filter lists:

1hosts-lite
1hosts-pro
1hosts-mini
1hosts-xtra
abpvn
ad-domains-filter-list
adaway
add.2o7net
add.dead
add.risk
add.spam
adguard-cname
adguard-dns
adguard-tracking-servers
barbblock
badd-boyz-hosts
blackbook
ddgtrackerradar
easylist
easyprivacy
energized-spark
energized-blugo
energized-blu
energized-basic
energized-porn
energized-ultimate
energized-unified
exodus-privacy
fanboy-annoyance
google-adservice-hosts
hblock
mpvs-hosts
nocoin
notracking
peter-lowe
pornhosts
shady-hosts
stevenblack-hosts
stevenblack-hosts-full
tblock-base
twann-anti-ip-loggers
twann-list
stevenblack-hosts-fakenews
stevenblack-hosts-gambling
stevenblack-hosts-porn
stevenblack-hosts-social
tblock-experimental
tblock-security
twann-anti-conspiracist
blocklistproject-abuse
blocklistproject-ads
blocklistproject-crypto
blocklistproject-drugs
blocklistproject-everything
blocklistproject-facebook
blocklistproject-fraud
blocklistproject-gambling
blocklistproject-malware
blocklistproject-phishing
blocklistproject-piracy
blocklistproject-porn
blocklistproject-ramsomware
blocklistproject-redirect
blocklistproject-scam
blocklistproject-tiktok
blocklistproject-torrent
blocklistproject-tracking
decloudflare
divested
gameindustry-android-full
gameindustry-android-mini
gameindustry-gaming-full
gameindustry-gaming-mini
goodbyeads
lightswitch05-ads-tracking
narsil-hosts-google
narsil-hosts-mozilla
narsil-hosts-microsoft
narsil-hosts-spymicrosoft
oisd-basic
oisd-extra
oisd-full
oisd-nsfw

Describe the issue:

With all those filters enabled (over 27 million domains blocked), it now takes hours until a single domain gets allowed again when I use tblock -a (with around 50 filters blocking around 1.2 million domains, it took a few minutes).


Steps to reproduce:

  • This issue is reproducible: y
  1. Enable a bunch of filters
  2. Use tblock -a

More details:

Often, after using tblock -a, I find a few more domains I want to allow (especially when some website is still not working), so it's especially annoying that the rule-retrieval process takes that much time on every run.
Also, does it really need to retrieve the rules each time?

schrmh changed title from [BUG] tblock -a gets incredibly slow the more filters are enabled to [BUG] tblock -a becomes incredibly slower the more filters are enabled 5 months ago
Owner

Yeah the way rules work definitely needs improvement. I'll work on it ASAP. Thanks for reporting :)

twann self-assigned this 5 months ago
twann added this to the (deleted) milestone 5 months ago
Owner

Yeah, I'm checking it right now, but this might be a lot of work. Just wanted to check, what is the slowest process? Adding the -a rule or updating the hosts file after that? (or both?)

Owner

> Also does it really need to retrieve rules each time?

If the issue comes from the size of the database and the number of rules, then maybe creating another database (for user rules only) would be an idea. Or at least a new table in the current database.

However, if the issue is that it takes too much time to update the hosts file after allowing a domain, the only thing I can do is provide a way to set a rule without automatically updating the hosts file afterwards.
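For illustration, a minimal sketch of the "new table for user rules" idea; the table name, columns, and database path below are assumptions, not TBlock's actual schema:

```python
import sqlite3

# Hypothetical layout: keep the handful of user-defined rules in their
# own small table, separate from the millions of imported filter rules.
db = sqlite3.connect("storage.sqlite")  # path is an assumption
db.execute("""
    CREATE TABLE IF NOT EXISTS user_rules (
        domain TEXT PRIMARY KEY,
        policy TEXT NOT NULL,  -- e.g. 'ALLOW' or 'BLOCK'
        ip     TEXT
    )
""")
# Allowing a domain then only touches this small table; the imported
# rules would only need to be merged in when the hosts file is rebuilt.
db.execute("INSERT OR REPLACE INTO user_rules VALUES (?, ?, ?)",
           ("example.com", "ALLOW", None))
db.commit()
```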

twann added this to the (deleted) project 5 months ago
Poster

> Just wanted to check, what is the slowest process? Adding the -a rule or updating the hosts file after that? (or both?)

The updating of the hosts file, the part where "Retrieving rules:" is visible.

Poster

> However, if the issue is that it takes too much time to update the hosts file after allowing a domain, the only thing I can do is provide a way to set a rule without automatically updating the hosts file after.

Does this mean that new rules (for allowed domains) don't take effect unless the file is manually updated?

Hm... what is limiting the speed of the rule-retrieval process?
I can't really suggest anything, since I haven't worked with SQLite databases, but I see that the process barely uses any resources on my system, and I would like to have an option to trade resources (including keeping things in RAM for a while) for speed.

Owner

Yeah, in fact it needs to rewrite the hosts file every time you allow a domain, in order to remove that domain from it. That's the slow part. I'm searching for an idea right now... :/


Good afternoon @schrmh @twann

There is a small problem in your filter list.

You are subscribing to lists that are already contained in others, even ones from the same developer.

Let me explain: 1hosts-xtra contains all of the other 1hosts lists.
divested includes a bunch of the ones you have: adaway, some from blocklistproject or lightswitch05, etc.
oisd-full contains oisd-basic.
gameindustry-android-full contains the mini version.
stevenblack-hosts-full likewise.
gameindustry-gaming-full contains the mini version; the same goes for the energized lists.
And so on and so forth.

There are also some huge hosts lists that include a lot of others, such as:
https://github.com/infazide/blockingjunkie
https://github.com/notracking/hosts-blocklists
https://github.com/ShadowWhisperer/BlockLists
https://oooo.b-cdn.net/blahdns/blahdns_hosts.txt and https://oooo.b-cdn.net/blahdns/lite_host.txt (sources: https://raw.githubusercontent.com/ookangzheng/blahdns/master/hosts/source.txt)
https://github.com/bongochong/CombinedPrivacyBlockLists
https://github.com/hagezi/dns-blocklists (lists for Pegasus)
https://www.reddit.com/r/pihole/comments/opmrgw/nsopegasus_blocklist/

That is, the lists repeat each other, and resources are spent deleting duplicate or triplicate entries and writing them over and over again. That can slow things down, cause problems, and increase resource consumption.

A hug

Poster

@gallegonovato Interesting, but I don't think this matters when rules are updated. I mean, it doesn't resync the filter lists; it just updates the SQLite database, which contains no duplicates.
Or am I wrong here?

Also, the decloudflare filter list contains over 21.6 million domains by itself (I have a bit over 27.2 million in my hosts file). If I only had that one active, it would still take many hours.

Though there might be a general speed problem in tblock? I used hyperfine to benchmark something simple (catting the content of my hosts file compared to tblock -l, since that output is kinda similar). Sure, tblock has to be slower because of the database, but it is 580 times slower:
[benchmark screenshot]. Maybe that can be improved, e.g. by using different database operations?

Poster

Okay, I took a small look at the source code and removed the things I won't need if I only ever add a valid domain as an exception.
It is way faster now. It took 7 minutes (several of those spent in the updating step, before the retrieving-rules text is visible, so maybe something can be improved there as well?) in contrast to several hours.

git diff

```diff
diff --git a/tblock/hosts.py b/tblock/hosts.py
index 2d06d1d..4bc2d29 100644
--- a/tblock/hosts.py
+++ b/tblock/hosts.py
@@ -105,12 +105,7 @@ def update_hosts(quiet: bool = False, do_not_prompt: bool = True) -> None:
                 percent = int(count * 100 / total_rules)
                 if not quiet:
                     print(f" {loading_icon(count)} {__msg} {count}/{total_rules} ({percent}%)", end="\r")
-                for a in allow_rules:
-                    if re.match(re.compile(a[0].replace(".", "\\.").replace("*", ".*")), row[0]):
-                        invalid = True
-                        break
-                    else:
-                        invalid = False
+                invalid = False
                 if not invalid:
                     if row[1] == RulePolicy.BLOCK:
                         w.write(f"{Var.DEFAULT_IP}\t\t{row[0]}\n")

So it's essentially that second loop with the regex, introduced in 9503966eec, that makes it slow.
I only get 189 domains when I execute SELECT domain FROM rules WHERE domain LIKE '%*%'; in sqlitebrowser, and it seems like almost all of those wildcard rules come from adguard-dns, plus two each from easylist and easyprivacy.
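As an aside (this is not the fix that was eventually merged): a common way to speed up exactly this pattern is to compile all wildcard rules into a single alternation regex once, outside the per-domain loop, instead of recompiling each pattern for every row as the code above does. A minimal sketch, with the wildcard list hard-coded as a stand-in for the allow_rules query result:

```python
import re

# Stand-in for the ~189 wildcard rules fetched from the database.
wildcards = ["201*.myhard.com", "8*.tianya.cn", "a*.chajiaotong.com"]

# Escape regex metacharacters, turn '*' back into '.*', and join all
# patterns into one alternation that is compiled a single time.
combined = re.compile("|".join(
    "(?:%s)" % re.escape(w).replace(r"\*", ".*") for w in wildcards
))

def shadowed(domain: str) -> bool:
    # One regex call per domain instead of a Python loop over all patterns.
    return combined.match(domain) is not None

print(shadowed("8chan.tianya.cn"))  # True
print(shadowed("example.com"))      # False
```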


Good afternoon:

First of all, I would like to say that my programming level is not very high, but I do understand hosts lists.

First you tell your program to download a hosts source, and I don't know whether it starts to clean the hosts right after that. Then it continues downloading the second hosts source, modifying the hosts, etc.

Many of the GitHub sources are modified when the creator uploads them, or have a bot doing the job. Sometimes it fails at that point, and those URLs already get a lot of traffic anyway.

Energized, for example, usually makes (or used to make) modifications to their hosts almost every hour.

But the question is: take Energized. If you look at https://github.com/EnergizedProtection/block#package-sources, you will see that Unified contains Core List + Core Porn List + Ultimate and Porn. That is to say, you download more than 20M entries for the list that includes all the Energized lists, and then you tell your program to download Energized Ultimate, 15M more. As you can see, Unified contains Ultimate and the other Energized lists, so you are stressing the program by downloading everything, checking for duplicates, deleting those duplicates, and creating the hosts file.

The same with using everything from blocklistproject: instead of downloading the lists one by one, there is https://github.com/blocklistproject/Lists#main-lists, and as you can see in the table, there is one called Everything, which already includes all the other lists except beta. I remember that one because it was a suggestion of mine.

gameindustry: you use the full versions, which already include the mini versions. You use 1hosts-xtra; if I remember correctly from being in contact with the creator of 1hosts, the xtra list already includes the previous ones. stevenblack-hosts-full and stevenblack-hosts...

And there is a lot more like that.

But I'll go further: why download the adaway hosts at all, when you are already downloading them via 1hosts, energized, stevenblack, hblock, divested, etc.? It's in the sources they use. I still remember stevenblack telling me that it wastes downloads and processing when he is already using all of AdAway, and I even suggested to him that he and AdAway work together to improve both hosts lists...

oisd-full already includes basic. But the basic list is also used in other hosts lists that you download first. To say nothing of everything that oisd includes: https://oisd.nl/includedlists/full

And https://divested.dev/hosts already includes some of blocklistproject, etc.

All this, counting how often the lists are updated, or how often you update them: there are millions of entries to process and download.

And to eliminate the triplicates and the rest, the program surely checks again whether those hosts lists have changed. If they have, it starts over, and then creates the final hosts file.

A hug

Owner

Yeah, TBlock automatically clears duplicates. However, something that could be useful for speed would be to prevent the user from subscribing to filter lists that are already included in other subscribed filter lists.

I think what is really slowing the program down is the wildcard rules introduced in 9503966eec; you are right, @schrmh!

I will have to think of a better way to support wildcards. In the meantime, if you have any idea, it would be much appreciated :)

PS: I'm sorry if I'm not really active, but I have tons of things going on right now, so I don't have a lot of time :/

Poster

@twann Yeah, the problem is that I have barely any Python knowledge. It's essentially the re.match part that makes it slow.
I tried to make it faster by moving the replace calls into the SQL query (dunno if that makes sense speed-wise when there are many wildcards, but it was just the easiest way to get the unnecessary replace operations out of the loop) and added the row_factory thing, because allow_rules only has one column, so we don't need that weird structure with tuples (https://stackoverflow.com/questions/2854011/get-a-list-of-field-values-from-pythons-sqlite3-not-tuples-representing-rows#), but it barely had any effect on speed:

```diff
diff --git a/tblock/hosts.py b/tblock/hosts.py
index 2d06d1d..887a094 100644
--- a/tblock/hosts.py
+++ b/tblock/hosts.py
@@ -95,7 +95,8 @@ def update_hosts(quiet: bool = False, do_not_prompt: bool = True) -> None:
             db = sqlite3.connect(Path.DATABASE)
             cursor = db.cursor()
             rules = cursor.execute("SELECT domain, policy, ip FROM rules").fetchall()
-            allow_rules = cursor.execute("SELECT domain FROM rules WHERE domain LIKE '%*%';").fetchall()
+            cursor.row_factory = lambda cursor, row: row[0]
+            allow_rules = cursor.execute("SELECT REPLACE(REPLACE(domain,'.','\.'),'*','.*') FROM rules WHERE domain LIKE '%*%';").fetchall()
             total_rules = len(rules)
             count = 0
             invalid = False
@@ -106,7 +107,8 @@ def update_hosts(quiet: bool = False, do_not_prompt: bool = True) -> None:
                 if not quiet:
                     print(f" {loading_icon(count)} {__msg} {count}/{total_rules} ({percent}%)", end="\r")
                 for a in allow_rules:
-                    if re.match(re.compile(a[0].replace(".", "\\.").replace("*", ".*")), row[0]):
+                    regex = re.compile(a)
+                    if regex.match(row[0]):
                         invalid = True
                         break
                     else:
```

I think we either need to find something better for the re.match part (btw., what even were the issues with wildcards that were fixed in that commit? Why is the variable named allow_rules when it also contains wildcards of blocked domains — actually, in my case, only blocked domains — instead of e.g. wildcard_rules? And why do the rules need to be checked against the wildcard rules?) or speed up or replace the loops somehow.

What I found that could be tried:

  • Filter within the SQLite SELECT query for the rules variable and do not return matching rules; that might make allow_rules, its separate query, and the re.match loop unnecessary? (It's the same field anyway, and it seems like a rule is simply skipped when it matches a wildcard; see the sketch after this list.)
  • Using NumPy arrays (though I read partly conflicting information: sometimes that arrays are faster than lists and NumPy should be used, sometimes that there is overhead because NumPy converts to C, so lists end up faster).
  • Using C (e.g. Cython, Pybind11) for that part
  • Using multiprocessing
  • Asking on StackOverflow or some other platform
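For the first idea in the list above, a hedged sketch of what filtering within the SELECT could look like. SQLite's GLOB operator natively treats * as a wildcard, so the shadowed rows can be excluded by a correlated subquery inside the database; the schema is taken from the diffs in this thread, and the database path is an assumption:

```python
import sqlite3

db = sqlite3.connect("storage.sqlite")  # path is an assumption
# Return only rules not covered by any wildcard rule; the matching
# happens inside SQLite instead of a Python loop with re.match.
rules = db.execute("""
    SELECT domain, policy, ip FROM rules AS r
    WHERE NOT EXISTS (
        SELECT 1 FROM rules AS w
        WHERE w.domain LIKE '%*%'
          AND r.domain GLOB w.domain
          AND r.domain != w.domain
    )
""").fetchall()
```

One semantic difference to keep in mind: GLOB matches the whole string, while the re.match call above only anchors at the start of the domain.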

Links to stuff I stumbled upon:
https://stackoverflow.com/questions/6595759/selecting-elements-in-numpy-array-using-regular-expressions
https://stackoverflow.com/questions/40593444/what-is-the-quickest-way-to-iterate-through-a-numpy-array
https://www.geeksforgeeks.org/how-to-convert-a-list-and-tuple-into-numpy-arrays/
https://stackoverflow.com/questions/39032200/converting-python-list-to-numpy-array-inplace
https://stackoverflow.com/questions/10929724/which-is-the-most-efficient-way-to-iterate-through-a-list-in-python
https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3/42747503#42747503
https://stackoverflow.com/questions/3640359/regular-expressions-search-in-list
https://stackoverflow.com/questions/60263953/how-to-speed-up-for-loop-in-python3

Guess it could easily be asked on StackOverflow. The problem isn't too complex and could be described like this:
You basically have a structure like this (I think it's a list of tuples?) with millions of domains, read from an SQLite database field:
[('hbbtv-track.prosieben.de', 'A', None), ('aek6ohsh.xhamster.com', 'A', None), ('cnt3.xhamster.com', 'A', None), ...]
And the same field also contains entries like [('201*.myhard.com',), ('8*.tianya.cn',), ('a*.chajiaotong.com',)].
You may convert the wildcard entries to ['201*.myhard.com', '8*.tianya.cn', 'a*.chajiaotong.com', ...], which can be converted to ['201.*\.myhard\.com', '8.*\.tianya\.cn', 'a.*\.chajiaotong\.com', ...] by using e.g. .replace(".", "\\.").replace("*", ".*") (or by using REPLACE in the SELECT query).
The problem is that you want to loop over the long list with millions of entries and check, for each entry, whether it matches any of the wildcard entries in the second list. re.match is currently used for that, but it is way too slow and can take multiple hours. Without the re.match operation it only takes a few minutes.
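For what it's worth, the hand-rolled .replace() conversion could also be done with the standard library's fnmatch.translate(), which escapes all regex metacharacters properly; a small sketch (note that the translated pattern is anchored at both ends, which is stricter than a bare re.match):

```python
import fnmatch
import re

pattern = fnmatch.translate("8*.tianya.cn")
# Yields something like '(?s:8.*\\.tianya\\.cn)\\Z' on recent Python.
rx = re.compile(pattern)
print(bool(rx.match("8x.tianya.cn")))          # True
print(bool(rx.match("8x.tianya.cn.example")))  # False: full match required
```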

If you want, you can also link to this issue, but I don't know if StackOverflow likes that.

Owner

The ideal would be not to have the loop that checks whether a domain matches a wildcard rule inside another loop: that's what is really slowing down the process.
So I had an idea last night: instead of checking every domain while inserting the rules into the database, maybe the program could insert them all without checking and, when it's done, simply run an SQL query to select the domains that match a wildcard rule (and remove them).

If you have 10 wildcard rules and 10 million blocked domains, it currently needs to perform the same operation 10 × 10,000,000 = 100,000,000 times! With this new way, it would only need to do it 10 times.

Don't know why I didn't think of that in the first place 🤔
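For illustration, that idea could lean on SQLite's GLOB operator, which natively understands * wildcards; a sketch with an assumed database path, not the actual PR code:

```python
import sqlite3

db = sqlite3.connect("storage.sqlite")  # path is an assumption
cur = db.cursor()
wildcards = cur.execute(
    "SELECT domain FROM rules WHERE domain LIKE '%*%'").fetchall()
# One DELETE per wildcard rule (e.g. 10 queries in total), instead of
# one regex check per (wildcard, domain) pair in Python.
for (pattern,) in wildcards:
    cur.execute(
        "DELETE FROM rules WHERE domain GLOB ? AND domain != ?",
        (pattern, pattern))  # keep the wildcard rule itself
db.commit()
```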

twann added this to the (deleted) milestone 4 months ago
Owner

So I opened a PR (#41) and I'll start working on that ASAP

Owner

In case you are interested, I think I fixed the issue. However, I haven't tested it much yet, so the release is not yet available. But you can see it or test it here: #41

twann added this to the (deleted) milestone 4 months ago
twann added this to the v2.2.0 milestone 4 months ago
Owner

So, I guess it's time to close this issue. It took hours when the issue was opened, and now it only takes a few minutes, thanks to b78f60ab43.

Thanks for the fixes @schrmh :) The next release should be out soon

twann closed this issue 3 months ago