• notfromhere@lemmy.one (OP)
    1 year ago

    Wow that’s crazy. Would it be possible to offload the KV cache onto system RAM and keep model weights in VRAM or would that just slow everything down too much? I guess that’s kind of what llama.cpp does with GPU offload of layers? I’m still trying to figure out how this stuff actually works.

    • actually-a-cat
      1 year ago

      That’s what llama.cpp and kobold.cpp do: the KV cache is the last thing that gets offloaded, so you can offload the weights to the GPU and keep the cache in system RAM. Neither supports SuperHOT right now, though.

      MQA (multi-query attention) models like Falcon-40B or MPT are going to be better for large context lengths. Their KV cache is tiny, so even blown up 16x it’s not a problem.
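
To put rough numbers on the "tiny KV cache" point, here is a back-of-the-envelope sketch. The layer count, head count, and head size are hypothetical values picked for illustration, not the real config of Falcon, MPT, or any other model, and it assumes an fp16 cache (2 bytes per element) for a single sequence.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Leading 2 = one tensor for keys plus one for values, per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape, chosen only for illustration
N_LAYERS, N_HEADS, HEAD_DIM = 60, 64, 128

for ctx in (2048, 16 * 2048):
    # Standard multi-head attention: every head has its own K/V
    mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, ctx)
    # Multi-query attention: all heads share a single K/V head
    mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, ctx)
    print(f"ctx={ctx:>6}: MHA {mha / 2**30:.2f} GiB, MQA {mqa / 2**20:.0f} MiB")
```

With these made-up dimensions, the multi-head cache is 3.75 GiB at a 2048-token context while the multi-query cache is 60 MiB. Both scale linearly with context length, so at 16x the MQA cache is still under 1 GiB.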