West Coast of USA, late 2000s to early 2010s, yes, the thick squared dark eyeglass frames were popular. Every time I see photos of these folks, I’m reminded of a couple people I know IRL as well as folks I know professionally who still prefer the thicker frames. Personally, I’ve always needed a very heavy prescription, and so I’ve always looked for the thinnest frames, but it really was a trend a decade ago.
Look, I get your perspective, but zooming out, there's context that nobody's mentioning, and the thread has deteriorated into name-calling instead of looking for insight.
In theory, a training pass needs only a single read-through of the input data, and we know of existing systems that achieve that, from well-trodden n-gram models to wholly hypothetical large Lempel-Ziv models. Viewed that way, most modern training methods are extremely wasteful: Transformers, Mamba, RWKV, etc. are trading time for space to try to make relatively small models, and it's an expensive tradeoff.
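To make the single-pass point concrete, here's a minimal sketch (not anybody's production code) of training a bigram count model: one loop over the corpus, counting, and you're done. The toy corpus is made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Single pass over the corpus: count bigrams, nothing else."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev) from the counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

corpus = "the cat sat on the mat".split()
model = train_bigram(corpus)
print(prob(model, "the", "cat"))  # 0.5: "the" is followed by "cat" once and "mat" once
```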
From that perspective, we should expect somebody to eventually demonstrate that the Transformers paradigm sucks. Mamba and RWKV are good examples of modifying old ideas about RNNs to take advantage of GPUs, but they're still stuck on the idea that having a GPU perform lots of gradient descent is good. If you want to critique something, critique the gradient worship!
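For contrast, here's an equally tiny sketch of the gradient-descent loop that Transformers, Mamba, and RWKV all build on; the toy linear-regression problem, learning rate, and epoch count are invented for illustration. Unlike the counting model above, it has to sweep the same data over and over.

```python
import numpy as np

# Toy linear-regression problem; the data, lr, and epoch count are made-up illustration values.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(4)
lr = 0.1
for epoch in range(100):                     # many full passes over the same data...
    grad = 2 * x.T @ (x @ w - y) / len(y)    # ...each one recomputing a gradient of the squared error
    w -= lr * grad                           # ...to nudge the weights a little

print(np.round(w, 2))                        # drifts toward true_w, eventually
```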
I swear, it’s like whenever Chinese folks do anything, the rest of the blogosphere goes into a panic. I’m not going to insult anybody directly, but I’m so fucking tired of mathlessness.
Also, point of order: Meta open-sourced Llama so that their employees would stop using BitTorrent to leak it! Not to “keep the rabble quiet” but to appease their own developers.