• Scott
    9 months ago

    I’m just trying to get my hands on some faster hardware. https://groq.com has been able to do some crazy shit, hitting 500 tokens/sec on their LPUs.

    • Lojcs@lemm.ee
      9 months ago

      What kind of a website is that? It’s super slow and doesn’t work without WebAssembly. Do you really need that for a simple interface?

      • Scott
        9 months ago

        It’s not about their frontend. They are running custom LPUs that can process LLM tokens at 500 tokens/sec, which is insanely impressive.

        For reference, with a max size of 2k tokens, my dual Xeon Silver 4114 CPUs take 2-3 minutes.
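
        To put those two numbers side by side, here is a rough back-of-envelope sketch. It assumes the 2-3 minutes is the time to generate the full 2k tokens, which is my reading of the comparison and not something measured here:

```python
# Back-of-envelope comparison using only the figures quoted in this thread.
# Assumption: "2-3 minutes" is the time to produce the full 2k tokens.
TOKENS = 2_000             # "max size of 2k tokens"
LPU_TOKENS_PER_SEC = 500   # throughput quoted for the LPUs

lpu_seconds = TOKENS / LPU_TOKENS_PER_SEC

cpu_seconds_range = (2 * 60, 3 * 60)  # 2 to 3 minutes on the dual Xeon box
cpu_tokens_per_sec = [TOKENS / s for s in cpu_seconds_range]
speedup = [s / lpu_seconds for s in cpu_seconds_range]

print(f"LPU: {lpu_seconds:.0f} s for {TOKENS} tokens")
print(f"CPU: {cpu_tokens_per_sec[1]:.1f} to {cpu_tokens_per_sec[0]:.1f} tokens/sec")
print(f"Speedup: {speedup[0]:.0f}x to {speedup[1]:.0f}x")
```

        So under that assumption the same output takes about 4 seconds instead of minutes.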

        • Lojcs@lemm.ee
          9 months ago

          No, I got what you meant, but that site is weird if it’s not doing anything on its own.

        • Finadil@lemmy.world
          9 months ago

          Is that with an fp16 model? Don’t be scared to try even a 4-bit quantization; you’d be surprised at how little is lost and how much quicker it is.
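
          A rough sketch of why that helps on the memory side. The 7B parameter count is just a hypothetical example, not the model being discussed, and the estimate covers weights only (no KV cache, no per-block quantization scales):

```python
# Weights-only memory estimate: parameters * bits per weight / 8 bytes.
# The 7B model size below is a hypothetical example, not from the thread.
def weight_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

fp16_gb = weight_gb(N_PARAMS, 16)
int4_gb = weight_gb(N_PARAMS, 4)

print(f"fp16:  {fp16_gb:.1f} GB")
print(f"4-bit: {int4_gb:.1f} GB ({fp16_gb / int4_gb:.0f}x smaller)")
```

          Less data to stream from RAM per token is a big part of the speedup on CPU, though actual gains depend on the runtime.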

        • Amaltheamannen@lemmy.ml
          9 months ago

          Aren’t those the ones that cost $2000 per 250 MB of memory? Meaning you’d need about 350 of them to load any half-decent model.

          • Scott
            9 months ago

            Not sure how they are doing it, but it was actually $20k, not $2k, for 250 MB of memory on the card. I suspect the models are probably cached in system memory.
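
            For scale, a quick sketch of what the numbers in this thread add up to. It takes the 350-card and 250 MB figures at face value and assumes the whole model would have to sit in on-card memory, which may not be how they actually do it:

```python
# Arithmetic on the figures quoted in this thread, taken at face value.
# Assumption: the whole model has to fit in on-card memory.
CARD_MEMORY_MB = 250     # memory per card, as quoted
CARDS = 350              # card count suggested in the parent comment
PRICE_LOW_USD = 2_000    # per-card price from the parent comment
PRICE_HIGH_USD = 20_000  # corrected per-card price

total_memory_gb = CARDS * CARD_MEMORY_MB / 1000
cost_low = CARDS * PRICE_LOW_USD
cost_high = CARDS * PRICE_HIGH_USD

print(f"Total on-card memory: {total_memory_gb} GB")
print(f"Cost at $2k/card:  ${cost_low:,}")
print(f"Cost at $20k/card: ${cost_high:,}")
```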