Even a scaled-up model cannot cross the compute-efficient frontier.

The scaling curves plot validation loss.

Neural scaling laws: error rates (loss) scale as power laws with compute, model size, and dataset size, largely independent of model architecture. Can we drive the loss to zero?
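A minimal sketch of how such a law can be written down, assuming a Chinchilla-style parametric form (the constants below are the published Hoffmann et al. 2022 fits, used purely as illustrative values, not something stated in the talk):

```python
import numpy as np

def scaling_law_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted validation loss L(N, D) = E + A/N**alpha + B/D**beta,
    where N = parameters, D = training tokens, E = irreducible loss."""
    return E + A / N**alpha + B / D**beta

# Example: a 70B-parameter model trained on 1.4T tokens.
print(scaling_law_loss(N=70e9, D=1.4e12))   # ~1.94, never below E = 1.69
```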

The same laws appear in video and image models.

LLMs are autoregressive models.

Theoretical results guide experiments, saving compute time.

How LLMs work

During training we know the next token, so we can define a loss function to guide learning.
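A minimal sketch of what "knowing the next value" means in practice (illustrative, not the talk's code): inputs and targets are the same sequence shifted by one position.

```python
tokens = [17, 42, 7, 99, 3]   # hypothetical token ids for one sentence
inputs  = tokens[:-1]         # the model sees these
targets = tokens[1:]          # and must predict these (the known "next value")
for x, y in zip(inputs, targets):
    print(f"given {x} -> predict {y}")
```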

L1 (mean absolute error) is one possible loss function.

Cross-entropy loss (the negative log of the probability assigned to the correct token). Why is cross-entropy used instead of L1?
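A small numerical comparison (my own example): the model outputs a probability distribution over the vocabulary; cross-entropy is the negative log of the probability given to the correct token, while L1 would measure absolute error against the one-hot target.

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])   # model's distribution over a 3-word vocab
correct = 1                         # index of the true next token
one_hot = np.eye(3)[correct]

cross_entropy = -np.log(probs[correct])    # ~0.357
l1_loss = np.abs(probs - one_hot).sum()    # ~0.6
print(cross_entropy, l1_loss)

# Cross-entropy blows up as the correct token's probability -> 0, giving strong
# gradients exactly where the model is badly wrong; L1 stays bounded.
```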

Next words are not always unambiguous. Because of this, natural language has an inherent entropy, so LLMs cannot drive the cross-entropy loss to zero.
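A tiny numerical illustration of why that entropy is an irreducible floor (my own example, not from the talk): even a model that matches the true distribution exactly pays the entropy of that distribution.

```python
import numpy as np

true_p = np.array([0.5, 0.5])    # two equally likely continuations
perfect_q = true_p               # a perfect model reproduces the true distribution
cross_entropy = -(true_p * np.log(perfect_q)).sum()
print(cross_entropy)             # ln 2 ~ 0.693: the irreducible floor, not zero
```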

Manifolds?

Example: the MNIST dataset of digit images lives in a high-dimensional data space (each 28x28 image is a point in 784-dimensional space).
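A rough sketch of the idea (my own illustration): treat each image as a point in the ambient space and check how few PCA directions capture most of the variance, hinting that the data lies near a much lower-dimensional manifold. I use sklearn's built-in 8x8 digits as a stand-in for MNIST so the snippet runs offline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # shape (1797, 64): 8x8 digit images
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print("ambient dimension:", X.shape[1])
print("dims for 95% of variance:", int(np.searchsorted(cum, 0.95)) + 1)
```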

16:03 Similar concepts group together.

Density of the manifold: the average distance between points, or the size of neighbourhoods.
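A rough sketch of how that density could be measured (my own illustration, not the talk's method): the average nearest-neighbour distance shrinks as more points fill the same manifold.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_nn_distance(X):
    """Average distance from each point to its nearest neighbour."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)   # first neighbour is the point itself
    dists, _ = nn.kneighbors(X)
    return dists[:, 1].mean()

rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    X = rng.random((n, 2))                        # points on a 2-D manifold
    print(n, mean_nn_distance(X))                 # neighbourhoods shrink as n grows
```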

Manifold hypothesis (data points in a high-dimensional space lie near a lower-dimensional manifold) and its connection to scaling laws.

Knowing the data manifold helps predict scaling. This is called:

Resolution-limited scaling

Cross-entropy loss should scale according to the intrinsic dimension of the data manifold.
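A sketch of the resolution-limited prediction, assuming the Sharma and Kaplan (2020) form where the scaling exponent is roughly 4/d for intrinsic dimension d (an assumption I am adding for illustration, not a formula quoted from the talk):

```python
def predicted_exponent(d):
    """Predicted power-law exponent alpha ~ 4/d for data of intrinsic dimension d."""
    return 4.0 / d

def predicted_loss(N, d, L0=1.0, irreducible=0.0):
    """Illustrative loss curve L(N) = irreducible + L0 * N**(-4/d)."""
    return irreducible + L0 * N ** (-predicted_exponent(d))

for d in (4, 16, 64):
    print(d, predicted_exponent(d), predicted_loss(1e9, d))
    # higher intrinsic dimension -> smaller exponent -> slower scaling
```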

19:24 Intrinsic dimension of natural language.
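One common way to estimate intrinsic dimension (a sketch of the two-NN estimator of Facco et al. 2017, shown as an illustration; the talk's exact method may differ) would be applied to token or sentence embeddings of text.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_dimension(X):
    """Two-NN intrinsic-dimension estimate from the ratio of each point's
    2nd- to 1st-nearest-neighbour distance."""
    nn = NearestNeighbors(n_neighbors=3).fit(X)
    d, _ = nn.kneighbors(X)             # d[:, 0] is the point itself (distance 0)
    mu = d[:, 2] / d[:, 1]              # ratio of 2nd to 1st neighbour distance
    return len(mu) / np.log(mu).sum()   # maximum-likelihood estimate of d

# Sanity check on synthetic data with known dimension:
rng = np.random.default_rng(0)
print(two_nn_dimension(rng.random((5000, 3))))   # should come out close to 3
```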