Earlier I was looking for a collection of posts about Proton Mail and the whole controversy with the CEO. I opened one post, and the Lemmy instance the search engine suggested was lemmy.zip, even though the community and the poster were both from lemmy.world. That made me ask myself a bunch of questions. Reference link

Note: I used DuckDuckGo.

Here are some questions I have:

  • How does the search engine decide which instance to link you to, given that it could in theory show every instance for the same post?
  • Could you get a page where all the results are the same post, just on different instances?
  • Do you think that could deter new people from finding out about Lemmy through search results?
  • How can an instance make itself more visible in the search results (for exposure)?
  • I did not get any results from Lemmy clients such as vger.app; the only results were direct instances. Will this always be the case?

I remember learning about search engines a while back, but I don’t know how relevant that information is any more: crawlers, the idea that the more a website is linked from other websites the higher it ranks in the search results, and the whole robots.txt thing.

I know that if I wanted to search for something specific in Lemmy I could just use its own search function, but what about people who ask a general question that happens to be answered in a Lemmy post? I wanted to know how exposed we are, or will be, to people who don’t yet know about Lemmy.

  • Federation is a weird one with search engines. Each instance is indexed by search engines directly (if the admins allow it in robots.txt), so the web crawler indexes that Lemmy instance like any other site. It used to be just this, so a crawler would come across the same content on multiple instances and also find a bunch of backlinks to those other instances. The search engine would then classify the entirety of the fediverse as an SEO hack and ignore it. This issue has since been resolved: posts now include a special HTML tag that tells the web crawler where the original content came from (I assume the instance that manages the content, so the community’s instance).

    What this means is that each instance is individually competing in the search results. When a crawler discovers lemmy.world content through the lemmy.zip instance, it knows the content came from lemmy.world and thus rates the lemmy.zip content, and by extension the post you were looking for, as though it were a lemmy.world page. (That is my assumption, since search providers don’t say how their algorithms work.)

    It’s a shame how this works, as it means that each instance has to outperform the competition individually instead of being able to work as a collective. Ideally the fediverse would have a single domain that search engines can be told is the content origin, and thus the fediverse would be able to compete as a collective.
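
To make the duplicate problem concrete, here is a minimal sketch of how copies of one post could be collapsed into a single result once each copy declares where the original lives. The URLs and the `group_by_canonical` helper are made up for illustration, and real search engines treat the canonical tag as a hint with undisclosed ranking logic on top, so this is only the general idea.

```python
from collections import defaultdict


def group_by_canonical(pages):
    """Group crawled page URLs by the canonical URL each page declares."""
    groups = defaultdict(list)
    for page in pages:
        groups[page["canonical"]].append(page["url"])
    return dict(groups)


# Hypothetical crawl results: the same post seen on two instances,
# plus one unrelated post. All URLs are invented for this example.
crawled_pages = [
    {"url": "https://lemmy.zip/post/111", "canonical": "https://lemmy.world/post/222"},
    {"url": "https://lemmy.world/post/222", "canonical": "https://lemmy.world/post/222"},
    {"url": "https://lemmy.zip/post/333", "canonical": "https://lemmy.zip/post/333"},
]

for canonical, copies in group_by_canonical(crawled_pages).items():
    print(canonical, "<-", copies)
```

Two of the three crawled pages end up under one canonical URL, which is the "one source" a search engine would then have to pick and rank.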

    • JustAnotherKay@lemmy.world · English · 3 upvotes · 1 day ago

      Ideally the Fediverse would have a single domain…

      Educate me a bit. Would it be possible to create a unique low-level domain, like www, for the fediverse? Perhaps with fed.service.instance/data as the structure?

  • catloaf@lemm.ee · English · 19 upvotes · 2 days ago

    Search engines don’t treat Lemmy specially. They index the pages just like any other site. If it’s discoverable through the crawling process, it’ll be indexed.

  • gigachad · English · 14 upvotes · 2 days ago

    Instances that opt out of being found by search engines are not shown. Instance admins can configure their robots.txt by adding lemmy-search. All other instances can theoretically be found. I think their priority depends on the laws of SEO (Search Engine Optimization). This probably means that a post on myownlemmy1337 that is federated with lemmy.world will be found as a post on lemmy.world.

    So, if Lemmy were very famous, I guess it would be possible to get page after page of the same result from different instances. However, search engines usually have a way to exclude “similar” results.

    For Voyager, it may be that they do not want to be found; I don’t know about this, though. You could add site:vger.app to your search query to test this.
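
For the robots.txt part, here is a rough sketch of how a well-behaved crawler decides whether it may fetch a page, using Python's standard `urllib.robotparser`. The robots.txt body, the `example.instance` host, and the `can_crawl` helper are assumptions for illustration. I am also assuming `lemmy-search` is used as a user-agent name, and I have not checked what any real instance actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
robots_txt = """\
User-agent: lemmy-search
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_body, user_agent, url):
    """Return True if robots_body permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


post_url = "https://example.instance/post/123"  # made-up URL
print("lemmy-search allowed:", can_crawl(robots_txt, "lemmy-search", post_url))
print("DuckDuckBot allowed:", can_crawl(robots_txt, "DuckDuckBot", post_url))
```

Note that robots.txt is only a request: it keeps polite crawlers away, but nothing enforces it.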

  • nucleative@lemmy.world · English · 6 upvotes · 2 days ago

    I think most search engines are not optimized for this. I’m sure it’s changing but might take some time.

    Google has historically penalized duplicate content and selected one source as canonical, usually whichever domain is the most authoritative. When it comes to Lemmy, whichever instance hosts the community should probably be the canonical source.

    • Rimu@piefed.social · English · 3 upvotes · 2 days ago

      Every post has a <link rel="canonical" href="https://lemmy.instance/whatever"> tag on it which links to the version of the post on the author’s instance.
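
If you want to check this yourself, here is a minimal sketch that pulls the canonical link out of a page using only Python's standard library. The sample HTML is a made-up stand-in for a real post page (it reuses the placeholder URL from the tag above), and `find_canonical` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may contain several space-separated tokens
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Made-up page, standing in for a post viewed on a non-home instance.
sample_page = """
<html><head>
<title>Some post</title>
<link rel="canonical" href="https://lemmy.instance/whatever">
</head><body>...</body></html>
"""

print(find_canonical(sample_page))
```

To try it on a live post, save the page source from your browser and pass its contents to `find_canonical`.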