transformer 6
- Do Transformers Need Three Projections?: Sharing QKV Projections to Halve the KV Cache
- Stanford CME295: Lecture 9 - Recap & Current Trends
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
- Stanford CME295: Lecture 2 - Transformer-Based Models & Tricks
- Stanford CME295: Lecture 1 - Transformer Basics
- Stanford CME295: Lecture 0 - Transformer Overview