• @31337
    link
    222 days ago

    Likely transformers now (I think SD3 uses a ViT for text encoding, and ViTs are currently one of the best model architectures for image classification).