What's the best way to search the fediverse?

Yingwu@lemmy.dbzer0.com · 5 months ago

What's the best way to search the fediverse?

Jupiter Rowland · 5 months ago

It’s technologically impossible for any search to cover all of the Fediverse. Like, absolutely 100% of it.

That’s because it’s technologically impossible for anything in or outside the Fediverse to be aware of the full extent of the Fediverse and know all its instances, all its actors, all its (public) content in real time.

It would only be possible if there was a fully centralised search engine. And that search engine had been hard-coded into all Fediverse server apps for years so that even instances that haven’t been upgraded in two or three years know it.

If Joe Übergeek spun up his own personal CherryPick or (streams) or Forte instance or whatever on his own Raspi, that instance would immediately have to announce its existence to that centralised search engine. Otherwise, the search engine wouldn’t have any way of knowing this new instance exists. If Joe Übergeek sent his first test post into the void because he has no connections yet, it would immediately have to be pushed to that search engine. And if Joe Übergeek decided to turn off ActivityPub on his (streams) channel, his instance would immediately have to notify the search engine which would immediately have to list that channel as formerly but no longer available.

Now imagine such a search being decentralised, e.g. built into Fediverse server apps like Mastodon or Lemmy. In this case, all server apps would have to know all instances out there with Fediverse-wide search. And immediately so.

Imagine Mastodon had such search built-in. Imagine Alice started up her own personal Mastodon instance with this search at 10:30. Imagine Bob installed his own personal (streams) instance from source at 10:31.

In order for the search on Alice’s Mastodon instance to actually cover 100% of the Fediverse, it would require Bob’s (streams) instance to push all necessary information to it. In order for this to work, Bob’s (streams) instance would have to know of the existence of Alice’s Mastodon instance from the moment it’s installed.
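The timing problem can be sketched in a few lines of Python. Everything here is hypothetical and only for illustration: the class names, the host names and the notion of a "shipped list" are assumptions, not how Mastodon or (streams) actually work.

```python
from dataclasses import dataclass, field


@dataclass
class SearchInstance:
    """Hypothetical instance with Fediverse-wide search built in."""
    host: str
    known_hosts: set = field(default_factory=set)

    def receive_announcement(self, host: str) -> None:
        # The search can only ever index hosts that announced themselves.
        self.known_hosts.add(host)


@dataclass
class NewInstance:
    """Hypothetical freshly installed instance without search."""
    host: str
    shipped_search_list: list  # snapshot delivered with the code

    def announce(self) -> None:
        # A new instance can only announce itself to what its list contains.
        for search in self.shipped_search_list:
            search.receive_announcement(self.host)


# 10:30, Alice starts her search-equipped instance.
alice = SearchInstance("alice.example")
# An older search instance that made it into the shipped list earlier.
older = SearchInstance("older-search.example")

# 10:31, Bob installs from code whose list was written before 10:30.
bob = NewInstance("bob.example", shipped_search_list=[older])
bob.announce()

print("older knows bob:", "bob.example" in older.known_hosts)
print("alice knows bob:", "bob.example" in alice.known_hosts)
```

In this toy model, Alice’s search stays blind to Bob unless the list in Bob’s code was updated in the one minute between the two installs.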

This couldn’t be done via any form of discovery, for where would (streams) go look for search instances?

So an automatically generated list of search instances would be necessary. It would have to be delivered with the code upon installation.

This means that Alice’s Mastodon instance would have to add itself to the list of search instances in the streams repository (https://codeberg.org/streams/streams) via a pull request and then immediately merge that PR into both dev and release, bypassing dev for the latter, all without Mike Macgirvin’s permission. Only then would Bob’s new (streams) instance know about Alice’s less-than-a-minute-old Mastodon instance with search the very moment that Bob installs it, and only then would it know that it has to report everything that happens on it in public to Alice’s Mastodon instance with built-in Fediverse search.

Whenever someone spins up a new instance that has Fediverse search built in, this would cause a PR in the code repositories of all Fediverse server applications that adds this instance to the initial list of search instances, and it’d cause that PR to immediately be merged into all active branches without the maintainers’ consent. And each shutdown of an instance with Fediverse search would cause another PR and another automated merge because that instance would have to be removed from the initial list of search instances.

I guess it should be obvious what an outlandish idea this is.

LovableSidekick@lemmy.world · 5 months ago

What if search itself were a federated function? Although I’m a software dev I really don’t know much about the mechanics of large-scale search engines such as Google, but I know their server farms somehow share the load of performing searches and maintaining whatever database they maintain to optimize searching. Seems like the fediverse could do search in a similar way. I’m just saying your critique of the idea, although well thought out, seems like a critique of a particular strategy. It’s not obvious to me that the very idea of federated search is outlandish.

Jupiter Rowland · 5 months ago

Still, the issue would be to find all instances of all Fediverse server applications.

I mean, the idea was to cover the whole Fediverse with that search. Literally everything.

Like, imagine I spin up my own instance of Forte on a home server to try it out and see if it already works.

How’s a Fediverse search engine supposed to know about my brand-new Forte instance? Clairvoyance? Hah. A crawler? Yeah, right, as if any crawler out there was fast enough to discover a brand-new instance of something that doesn’t have a running instance at all yet. At least not beyond enclosed, experimental instances detached from the rest of the Fediverse.

I mean, instead of Forte, I could also install what Forte was forked from, namely something colloquially referred to as (streams). Something that intentionally doesn’t have a name, doesn’t have a brand identity, doesn’t have a unified server identifier. Unlike Mastodon whose instances all identify as “mastodon” and Lemmy whose instances all identify as “lemmy” and Hubzilla whose instances all identify as “hubzilla”, (streams) instances don’t all identify the same. That field is customisable. And it has been customised for as long as (streams) has been around. You can’t reliably crawl (streams) instances. Instead of “streams”, they can identify as “y” (because Y is not X) or “get ready to rumbly” (public instance actually) or “bunny of doom” or “diversi spiritus”.

In fact, crawlers would have to be able to identify any kind of Fediverse server software. Even if someone has only just forked something, a crawler would have to be able to recognise it as Fediverse server software. If you hard-code server identifiers into the crawler, that list will be out of date as soon as someone decides to fork Mastodon or Misskey or Firefish or Sharkey or whatever again. And, as mentioned above, you can’t crawl (streams) instances by identifier.

It simply is impossible to discover and index the whole Fediverse by crawling, Google-style. And if a Fediverse search engine can’t discover a (streams) instance that identifies as “y”, it can’t index the content coming from the man who created (streams) and Forte and still occasionally develops both. The man who created the oldest still existing Fediverse project, Friendica, as well as the Swiss Army knife of the Fediverse, Hubzilla, and the very concept of nomadic identity. One of the most competent and experienced Fediverse devs ever. A crawler couldn’t find him.

Still, the search engine needs to know all Fediverse instances, right?

Well, if crawling fails, and crawling does fail, there’s only one way to achieve that: Each instance would have to announce its presence to anything that’s supposed to be able to search the Fediverse.

But in order to be able to do that, each instance would have to know everything that can search the Fediverse. And all instances of it. Every single one of them.

And if it shall announce its existence when it spins up for the first time, it will have to know all these search instances immediately before spinning up.

How can it possibly know them all before even going online itself?

Two options. Either a centralised list of all search instances that’s updated as soon as a new one is spun up.

But you said, “federated.” As in not centralised.

Or the list would have to be built into the source code as it’s being git pulled from the code repository. In fact, the list would have to be git pulled from the code repository immediately before the server spins up so that it’s up-to-date when the server spins up. This would mean that the whole server software would have to be updated before start-up.

Of course, each Fediverse server software project that’s started from scratch would have to implement this list, otherwise its instances couldn’t be found.

But how is this list supposed to be kept up-to-date?

I mean, let’s suppose what has been spun up here is something that has Fediverse search built in. It itself would have to be added to this list so that other new instances can announce themselves to this new instance, so that it can find them and index their content.

So how is this new search-equipped instance supposed to be added to the list of search instances?

Shall it add itself to the list by manipulating the production code of all Fediverse server applications that have Fediverse search built in? Past the maintainers and without their consent?

Perfect search that covers 100% of the Fediverse has to rely on lists of some kind, that’s clear. The Fediverse changes too quickly to be crawlable. It’s too diverse to be crawlable. And it has server software which itself is inherently uncrawlable because it’s undiscoverable by design.

But such lists can’t always be kept up to date either.

LovableSidekick@lemmy.world · 5 months ago

Thanks for putting so much time and thought into the discussion. All the problems you talk about exist for every search engine in actual use today. For example, publishing a site on a brand new domain has the exact problem you’re describing with spinning up a new Forte instance. There can be a 24-hr lag before DNS can reliably find the site. Perfect search is an aspirational goal. The realistic goal is to satisfy most needs. No matter how many words you throw at it, I don’t think federated search is an outlandish idea at all.

Jupiter Rowland · 5 months ago

I’m not even only talking about a 24-hour lag. I’m talking about parts of the Fediverse never being discovered at all. After all, the Fediverse doesn’t have a centralised DNS of its own in which all Fediverse instances, and only those, are registered, and where a search crawler could simply look them up.

Even if someone developed a Web search crawler much like the Google Bot, something that crawls the entire WWW looking for Fediverse instances, how is it supposed to tell Fediverse instances from websites that aren’t Fediverse instances?

I bet the first two proposals for solutions wouldn’t work with (streams).

The first proposal would probably be to go for the instance type, like “mastodon” or “lemmy” or “mbin” or “akkoma” or “misskey” or whatever. This, however, would require valid instance types to be manually added to some kind of config file where the search crawler could look them up. This, in turn, would only work if this list was constantly kept complete and up to date.

This means: Whenever someone launches a new project, the identifier of this project will have to be added to the list. Whenever someone forks something into a new project, ditto. Now let the devs of the crawler have as little time as the Plume devs or as the sole Firefish dev early this year, and the list of Fediverse instance types will spend months outdated with new projects missing, and the crawler won’t recognise the instances of these new projects as Fediverse projects.

Oh, and it wouldn’t work with (streams) at all. See, (streams) is intentionally without a name, without a brand identity and even without a unified, pre-defined, fixed instance type. It isn’t like all instances identify as “streams” or “(streams)”. Some identify as “streams”, but many others have unique types. The crawler wouldn’t know these identifiers as valid Fediverse instance types (how is that crawler supposed to know that “bunny of doom” is a Fediverse identifier?), and thus, it wouldn’t be able to identify (streams) instances as Fediverse instances.
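As a rough Python sketch of that first proposal: the allowlist, the function name and the sample data below are assumptions made up for illustration, not any real crawler’s config.

```python
# Hypothetical, manually maintained allowlist of instance types.
# "streams" is included on purpose to show that it doesn't help much.
KNOWN_SOFTWARE = {
    "mastodon", "lemmy", "mbin", "akkoma", "misskey", "hubzilla", "streams",
}


def looks_like_fediverse(software_name: str) -> bool:
    """Return True only if the reported instance type is on the allowlist."""
    return software_name.strip().lower() in KNOWN_SOFTWARE


# Instance types as servers might report them; the last three are
# free-text names mentioned in this thread.
reported = ["mastodon", "Lemmy", "streams", "y", "bunny of doom", "diversi spiritus"]

missed = [name for name in reported if not looks_like_fediverse(name)]
print(missed)
```

Every instance that ends up in `missed` is one the crawler would silently skip, and an admin can move their instance into that list at any time simply by editing a text field.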

Now you could say that (streams) is so tiny that it wouldn’t hurt to sweep it under the rug. Nobody would notice.

But that’d be exactly the problem. One of the (streams) users is the guy who created (streams) and everything before it all the way back to Mistpark in 2010, the one man who developed more Fediverse protocols and server applications than anyone, the man who invented nomadic identity and magic single sign-on: Mike Macgirvin. He is on one out of only two instances that identify as “y” (because Y is not X).

He is one of the few people in the Fediverse who actually post about what’s possible in the Fediverse that goes way beyond Mastodon. Not only possible, but readily available right now. He started advertising (streams) in the wake of the mass-migration of Twitter users to Mastodon. And if his most recent creation, Forte, manages to take off, he’ll probably advertise that. If (streams) wasn’t caught by crawlers, nobody would read his advertisement except those who already follow him, and I guess half of them already know his creations and what they can do.

Hard-coding the custom identifiers of (streams) instances into the list is a stupid idea, too. The instance type is not defined upon installation in a config file. It’s an admin-side free-text field that can be changed anytime with no consequences for connections, just because the admin feels like it.

Okay, so here’s the second proposal: Go for nodeinfo. The problem this time: Mike has also intentionally removed almost all nodeinfo code from (streams). He didn’t want (streams) to participate in that eternal rat race between Fediverse projects and Fediverse instances for the best stats on FediDB, Fediverse Observer and The Federation. In fact, (streams) is entirely absent from all three. This, too, is intentional.
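A minimal Python sketch of the nodeinfo approach, working on already-fetched response bodies rather than the network. The `software.name` field is part of the NodeInfo schema; the function name, the sample document and the use of `None` to stand for "no usable nodeinfo" are assumptions for illustration. What a (streams) instance really answers on that path is not modelled here.

```python
import json
from typing import Optional


def software_name(nodeinfo_body: Optional[str]) -> Optional[str]:
    """Return software.name from a NodeInfo document, or None if unusable."""
    if nodeinfo_body is None:
        return None  # the instance offered no nodeinfo document at all
    try:
        doc = json.loads(nodeinfo_body)
    except json.JSONDecodeError:
        return None  # something answered, but it wasn't JSON
    software = doc.get("software") if isinstance(doc, dict) else None
    if not isinstance(software, dict):
        return None
    name = software.get("name")
    return name if isinstance(name, str) else None


# Hypothetical sample of what a nodeinfo-publishing server might return.
sample = json.dumps({
    "version": "2.0",
    "software": {"name": "lemmy", "version": "0.19.3"},
    "protocols": ["activitypub"],
})

with_nodeinfo = software_name(sample)
without_nodeinfo = software_name(None)
print(with_nodeinfo, without_nodeinfo)
```

A crawler built this way has nothing to go on for any instance where the result is `None`, which is exactly the situation described above.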

If anyone has a better idea, I’m all ears.