Add an index and crawler so APIs don't need to be called on every search #1

Open
opened 2022-07-04 01:34:09 +00:00 by MoralCode · 7 comments
Owner
No description provided.
Author
Owner

While the goal here is more "finding repositories" rather than "searching the code itself", these seem to be interesting summaries of the landscape and the various ways to do it:

https://github.blog/2021-12-15-a-brief-history-of-code-search-at-github/

https://swtch.com/~rsc/regexp/regexp4.html


Hello! 👋

Is the goal searching for repositories or repository contents?

If it's searching for repositories, then [ForgeFlux](https://forgeflux.org) (I'm one of the devs) is building a [federated forge crawler](https://starchart.forgeflux.org). If you are interested, we could work together on implementing indexing :)

Author
Owner

@realaravinth The intent is to have this be essentially a global repository search that I (optimistically) hope might eventually replace GitHub search, to help lessen their monopoly on repository discoverability. It was pretty much inspired by the https://giveupgithub.org campaign that the Software Freedom Conservancy has started (there's more info on their mailing list).

In looking at some of the articles I linked above, I think indexing every repo ever with full-code search is going to take way too much storage, cost, time, and infrastructure that I do not have. At most I'd probably only want to index certain files that have documentation intended for humans, to help people find a project if they enter something like a description of how it works, for example.

That said, even within the scope of repository name/description search, this particular repo is still a super dumb proof of concept that was hacked together in less than a day. It pretty much just queries the APIs of a set of repo hosts that I hardcoded in, since it turns out that tons of free/public repo hosts use the same Gitea/Gogs/GitLab APIs.

So yeah, I'd love to contribute what I can to your project! I hope that, at the very least, the code in this repo serves as a reasonably decent list of public Git hosting sites that could be a starting point for crawling at the speed of free API access, for testing and such.

Let me know what I can do to help! (Also, I'm sure the GiveUpGithub mailing list would love to learn about your project, since it's pretty much exactly the concept that inspired this hacky mess.)


> The intent is to have this be essentially a global repository search that I (optimistically) hope might eventually replace GitHub search to help lessen their monopoly on repository discoverability.

I feel the same way. Enhanced discoverability on GitHub is often brushed off as a myth, but I strongly feel that it is not the case.

In addition to crawling and indexing repositories, Starchart will also create sitemaps so that they can be easily crawled and indexed by search engines. In essence, Starchart will implement all necessary features to improve the discoverability and visibility of projects on indie forges, for both bots and humans alike.
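To make the sitemap idea concrete, here is a minimal sketch of generating a sitemap from a list of crawled repository URLs. This is purely illustrative (the function name, `repos` list, and URL are hypothetical, not Starchart's actual implementation), but it follows the standard sitemaps.org schema:

```python
# Illustrative sketch: build a sitemap.xml listing crawled repository
# URLs so search engines can index them. Not Starchart's real code.
import xml.etree.ElementTree as ET

def make_sitemap(repo_urls):
    """Return a sitemap XML string with one <url> entry per repository."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in repo_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical crawled repo on an indie forge:
repos = ["https://git.example.org/alice/project"]
print(make_sitemap(repos))
```

A crawler would regenerate this file after each crawl and serve it at a well-known path so search-engine bots pick up new repositories automatically.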

But indexing selective file content is out of scope of the project. I understand the motivation, but it could be expensive to run.

I am unfamiliar with plugin architectures, but I am curious about them. If you are going to implement file indexing within ForgeFind, do you think it could be loaded as a plugin as well?

Author
Owner

> But indexing selective file content is out of scope of the project. I understand the motivation, but it could be expensive to run.

Agreed, I mostly mentioned it as a "this could be cool" thing, but it's definitely not required.

> I am unfamiliar with plugin architecture, but I am curious about plugins. If you are going to implement file indexing within ForgeFind, do you think it can be loaded as a plugin as well?

Honestly, given how much time I have to put towards this, ForgeFind is pretty much a toy project/proof of concept for the general idea of cross-forge repository search. While I have an image in my head of what I think a cross-forge search service should be, I think your/ForgeFlux's Starchart project is already way ahead of what I could hope to achieve here.

While not directly relevant, here's a possibly useful tidbit regarding the design of a webscraper/crawler that may be more useful to you than it is to ForgeFind:
I have previously done some work with large-scale (non-federated) webscraping/data ingest to help map COVID-19 vaccination locations in the US (https://github.com/CAVaccineInventory/vaccine-feed-ingest/). They have a pretty robust pipeline that's broken into fetch, parse, normalize, and load steps (see their wiki). There were probably many reasons for this, including being able to re-parse the original source data if they ever needed to rebuild their DB from scratch, and minimizing load on the places being scraped, presumably among others.
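The fetch → parse → normalize → load split described above could be sketched roughly like this for a forge crawler. Everything here is hypothetical (the function names, the Gitea-style `{"data": [...]}` response shape, and the dict-based index are assumptions for illustration, not vaccine-feed-ingest's or Starchart's real code):

```python
# Hypothetical sketch of a fetch/parse/normalize/load crawler pipeline.
# Keeping the raw payload from the fetch stage is what allows the DB
# to be rebuilt from scratch without re-hitting the forges.
import json

def fetch(raw_api_response: str) -> str:
    # A real crawler would do an HTTP GET here; the raw body is kept
    # verbatim so later stages can be re-run against it.
    return raw_api_response

def parse(raw: str) -> list:
    # Turn the stored raw payload into structured records
    # (assuming a Gitea-style {"data": [...]} search response).
    return json.loads(raw)["data"]

def normalize(records: list) -> list:
    # Map forge-specific fields onto one cross-forge schema.
    return [{"name": r["name"], "description": r.get("description", "")}
            for r in records]

def load(records: list, index: dict) -> None:
    # Write normalized records into whatever search index is used.
    for r in records:
        index[r["name"]] = r

raw = json.dumps({"data": [{"name": "forgefind",
                            "description": "cross-forge search"}]})
index = {}
load(normalize(parse(fetch(raw))), index)
```

Because each stage only consumes the previous stage's output, a contributor can work on (say) a new forge's `parse` step without touching fetching or indexing.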


> …ForgeFind is pretty much a toy project/proof of concept for the general idea of cross-forge repository search. While I have an image in my head of what I think a cross-forge search service should be…

As is Starchart :D

> I have previously done some work with large-scale (non-federated) webscraping/data ingest to help map COVID-19 vaccination locations in the US (https://github.com/CAVaccineInventory/vaccine-feed-ingest/). They have a pretty robust pipeline that's broken into fetch, parse, normalize, and load steps (see their wiki). There were probably many reasons for this, including being able to re-parse the original source data if they ever needed to rebuild their DB from scratch, and minimizing load on the places being scraped, presumably among others.

Thanks for sharing! After reading [the pipeline](https://github.com/CAVaccineInventory/vaccine-feed-ingest/wiki/Runner-pipeline-stages), it makes sense to store the raw data too, but I'm concerned about the costs associated with it (I self-host software on a very modest setup out of my bedroom, so I'm tunnel-visioned when it comes to computational costs). I could, however, create an optional adaptor that stores raw data too. What do you think?

Author
Owner

> I could, however, create an optional adaptor that stores raw data too. What do you think?

Since you need to download the raw data (API calls, etc.) and parse it anyway, I think it might make more sense to just have an option to store the data somewhere, so each instance can decide whether, and how far back, to store the raw API results as a recovery/backup. IMO the data needs of VaccinateTheStates are probably somewhat different from Starchart's. Overall, I think having a pipeline-structured sequence that data goes through in order to get crawled might help separate the steps out and make it easier for new contributors to understand and make contributions.

Although, that said, you probably already have a particular way of scraping, so please don't feel like you need to re-write anything if the current system works fine. At this stage, time is probably better spent adding code that makes Starchart grow/become more usable.
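The "optional adaptor" idea discussed above, where each instance decides whether to keep raw API responses, could be sketched as a crawl function that takes an optional storage callback. All names here (`crawl`, `fetch_raw`, `store_raw`) are hypothetical, purely to illustrate the opt-in design:

```python
# Hypothetical sketch: raw-data storage as an optional, pluggable hook.
# Instances on modest hardware pass store_raw=None and pay no storage
# cost; others plug in a callback to keep raw responses for recovery.
import json
from typing import Callable, Optional

def crawl(fetch_raw: Callable[[], str],
          store_raw: Optional[Callable[[str], None]] = None) -> list:
    raw = fetch_raw()             # e.g. the body of a forge API response
    if store_raw is not None:     # instance opted in to keeping raw data
        store_raw(raw)
    return json.loads(raw)        # parsed records continue down the pipeline

# An instance that keeps raw responses as a backup:
saved = []
repos = crawl(lambda: '[{"name": "demo"}]', saved.append)

# An instance that skips raw storage entirely:
repos_no_store = crawl(lambda: '[{"name": "demo"}]')
```

Because the hook is just a callable, "where" and "how far back" to store (disk, object storage, rotating window) stays entirely an instance-level decision, which matches the self-hosting concern raised earlier.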

Reference: MoralCode/ForgeFind#1