WEBVTT

00:00.000 --> 00:12.240
So yeah, for those of you who are not familiar with vLLM: vLLM basically started as an open-

00:12.240 --> 00:20.080
source project from a couple of students at Berkeley, and it exploded very quickly

00:20.080 --> 00:25.280
because it offered a really nice and efficient implementation of inference for modern

00:25.280 --> 00:33.040
LLMs, and over time it became one of these very large and vibrant communities where

00:33.040 --> 00:38.000
PhD students, researchers, engineers, and also big tech companies are contributing

00:38.000 --> 00:44.000
very actively, trying to develop an inference engine which is going to serve these modern

00:44.000 --> 00:50.880
LLMs as efficiently as possible. The cool part about vLLM is the fact that it runs on and supports

00:50.960 --> 00:56.640
various hardware: you can run it on NVIDIA GPUs, AMD GPUs, Google TPUs, AWS

00:56.640 --> 01:02.000
Neuron, Intel Gaudi, and so on. Usually, for most of these hardware vendors,

01:02.000 --> 01:07.520
the teams from these companies are helping out, trying to make vLLM run

01:07.520 --> 01:13.840
as efficiently as possible on their hardware. On top of that, vLLM supports running all the

01:13.840 --> 01:20.320
popular modern LLMs these days, and usually model vendors are working with vLLM behind

01:20.320 --> 01:24.480
the scenes to try to enable day-zero support for their models when they are about

01:24.480 --> 01:33.040
to be released. So vLLM offers a ton of optimizations by default, and you will get

01:33.040 --> 01:37.760
pretty good performance already if you just deploy your model with the standard configuration.

01:37.760 --> 01:41.840
However, there is always the traditional question of trying to figure out

01:41.840 --> 01:46.960
whether we can squeeze out more performance, and this is where quantization and speculative decoding

01:47.040 --> 01:52.080
became very attractive topics for us, because they offer an additional pathway, completely

01:52.080 --> 01:57.680
orthogonal to all the optimizations that already exist in vLLM, to accelerate inference with these

01:57.680 --> 02:04.240
very, very large models. So just to bring everyone onto the same page, one slide on what

02:04.240 --> 02:08.640
quantization is all about. Quantization is basically a process through which we reduce

02:08.640 --> 02:13.280
the number of bits that we use to represent either weights or activations of the model.

02:14.000 --> 02:19.040
What I mean by this is that if we take a neural network and we plot all of the weights in the

02:19.040 --> 02:24.720
network, we're going to get this normal, Gaussian-like curve. What's going to happen is that

02:24.720 --> 02:29.600
the weights are going to be distributed across this curve, and they can take any possible value here.

02:29.600 --> 02:34.000
Given that this is full precision, and by full precision these days we usually mean that

02:35.200 --> 02:41.040
the weights are represented in the bfloat16 format, which has become the standard these days,

02:41.120 --> 02:46.240
we have very high granularity, so we can represent very, very small differences between the

02:46.240 --> 02:50.880
weights. Through the process of quantization, what we're trying to do now is take

02:50.880 --> 02:55.040
these weights and put them into discrete buckets, where each of these buckets

02:55.040 --> 02:59.840
represents one single value that we can represent in this new, quantized range.
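To make the bucket picture concrete, here is a toy sketch of symmetric "absmax" quantization to the INT8 grid (illustrative only; production quantization kernels in vLLM are far more involved). Two weights that were distinguishable in full precision land in the same bucket:

```python
def quantize_absmax_int8(weights):
    """Map floats onto the integer grid [-127, 127] with one
    per-tensor scale (symmetric 'absmax' quantization)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # discrete buckets
    dq = [v * scale for v in q]               # dequantized values
    return q, dq, scale

# 0.401 and 0.402 differ in full precision, but the quantized grid
# has steps of ~0.01 here, so both round into the same bucket.
weights = [0.401, 0.402, -1.27, 0.75]
q, dq, scale = quantize_absmax_int8(weights)
print(q, dq)
```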

03:01.040 --> 03:07.600
Because this quantized range has much, much lower granularity,

03:07.600 --> 03:11.840
this means that during this process we're going to have to shift the weights a bit.

03:11.840 --> 03:16.480
For example, if two weights were very close to each other in the original full-precision regime,

03:16.480 --> 03:21.520
most likely we're not going to be able to represent their difference in the quantized range,

03:21.520 --> 03:25.440
so we'll have to put them in the same bucket. Some weights we'll have to move to the

03:25.440 --> 03:31.040
left or to the right, and so on. All of this is to say that quantization is not a

03:31.040 --> 03:35.760
lossless process. This means that there will be some loss of precision, and the entire

03:35.760 --> 03:40.640
game, the entire research here, is about figuring out algorithms which are going to enable us

03:40.640 --> 03:46.240
to do this while preserving the accuracy of the model as much as possible, because we don't

03:46.240 --> 03:53.920
want to destroy our model in order to get some efficiency gains. So let's take a look at where

03:53.920 --> 03:58.400
quantization fits inside the server, and then based on that we'll try to figure out which quantization

03:58.400 --> 04:03.600
schemes to use in which situation. If we look at our server, at the

04:04.560 --> 04:10.640
very bottom of this pyramid we have CPU memory. This is our main memory: it's

04:10.640 --> 04:17.200
very large, but it's also very, very slow. On top of that we have our GPUs. GPUs have

04:17.200 --> 04:21.600
something called high-bandwidth memory, or HBM, and this is what you see when you run

04:21.600 --> 04:26.800
nvidia-smi, for example. The main characteristic of this layer is that it's much smaller,

04:26.800 --> 04:31.840
but also much faster, than the CPU memory. And then at the very top we have GPU

04:31.840 --> 04:36.800
SRAM and tensor cores. This is the smallest part of the memory, but it's also the fastest one,

04:36.800 --> 04:40.480
and this is basically the part where the computation happens. This is where the matrix-matrix

04:40.480 --> 04:44.880
multiplications actually happen, and this is how we do inference with our model.

04:45.520 --> 04:49.840
So let's look at what's happening when we do one forward pass through

04:49.840 --> 04:53.520
our model: we want to put some tokens into the model and generate an answer.

04:54.240 --> 04:58.320
The first part is that we have to load the model: we have to move it from the CPU memory

04:58.960 --> 05:03.440
to the HBM, and this is something that in most normal scenarios we do only once.

05:03.440 --> 05:09.840
We load the model into the GPU memory, and then, n times, for

05:09.840 --> 05:14.240
every single operator in the network, we're going to take one weight matrix,

05:14.240 --> 05:20.640
load it from HBM into the tensor cores, do some computation with it, write back the results, and repeat.

05:20.640 --> 05:26.880
This is something we're going to do for every single linear operator in the network,

05:26.960 --> 05:31.760
and this is the sequence of operations that we're

05:31.760 --> 05:35.760
going to repeat many, many times, for every single token, for every single layer in the network.

05:37.360 --> 05:42.080
The reason why I'm specifically calling out loading a torch.nn.Linear here is that,

05:42.800 --> 05:49.040
even though LLMs these days have many different types of layers, linear layers, or linear operators,

05:49.040 --> 05:54.240
are the costly pieces. If we were to plot where we spend time during the forward

05:54.240 --> 06:00.000
pass, all of the linear operators, that is, the self-attention and the MLP parts,

06:00.000 --> 06:04.000
would take the majority of the time, and they are the main target for optimization in general

06:04.000 --> 06:09.920
here. So these are the two main components and candidates that we actually want to accelerate

06:09.920 --> 06:16.560
through quantization. The first part is acceleration of the loading. Given that the

06:16.560 --> 06:21.360
loading part happens when we're transferring the weights from HBM into the tensor cores,

06:22.240 --> 06:27.200
there is nothing we can do on the hardware level, because all GPUs come with a specific bandwidth;

06:27.200 --> 06:31.760
there's nothing we can change there. The only thing that we can do is

06:31.760 --> 06:36.800
reduce the number of bits to load, and those bits are the bits of our weights. So that's the first

06:36.800 --> 06:40.800
quantization scheme that we are working with, called weight-only quantization.

06:41.520 --> 06:45.920
The second part is the computational part, which actually happens inside the tensor cores. This is

06:45.920 --> 06:50.320
where the matrix-matrix multiplication operations happen, inside the tensor cores,

06:51.040 --> 06:55.920
so the only way to accelerate that is to find faster tensor cores. If we take a look at the

06:57.040 --> 07:03.840
technical specifications of an H100 GPU, we'll see that by default, because most

07:03.840 --> 07:08.800
models are in bfloat16 format, we are operating in bfloat16 tensor cores, which

07:08.800 --> 07:15.040
give around 2,000 teraFLOPS. And if we look in the spec sheet, there are also two more tensor

07:15.040 --> 07:20.960
cores, called FP8 and INT8 tensor cores, which offer two times more FLOPS per unit of time.

07:20.960 --> 07:25.520
So the main goal of this part is to push all of the matrix-matrix multiplications

07:25.520 --> 07:30.400
to happen inside the FP8 or INT8 tensor cores instead of bfloat16, because then we get

07:30.400 --> 07:35.440
two times more FLOPS per unit of time. The way to do this is to have both operands of

07:35.440 --> 07:40.720
the matrix-matrix multiplication quantized to either FP8 or INT8. And this is where the second

07:40.800 --> 07:47.280
scheme comes in, which is called weight-and-activation quantization. These are

07:47.280 --> 07:52.720
the two main paradigms in the quantization field. Whenever we want to quantize and

07:52.720 --> 07:58.320
optimize a model, we have to pick one of these two, and that's going to

07:58.320 --> 08:03.120
depend on the exact scenario that we're interested in; we'll see later on how to choose.

08:03.120 --> 08:07.920
Now I'm going to very quickly go through some code samples just to

08:07.920 --> 08:12.880
showcase how relatively simple this is at this point in time. As part of the vLLM

08:12.880 --> 08:17.120
project, we have a library called LLM Compressor, a library that implements

08:17.120 --> 08:21.520
a lot of state-of-the-art quantization algorithms which you can just use from the

08:21.520 --> 08:26.400
Python interface. And it's relatively simple to apply: you just load your model

08:26.400 --> 08:31.840
the standard Hugging Face Transformers way, loading the model with .from_pretrained.

08:31.840 --> 08:35.840
Then we instantiate the quantization modifier. We say: okay, we want to target all

08:35.840 --> 08:39.920
linear layers, because as I mentioned before, those are the costly pieces of the inference.

08:39.920 --> 08:44.960
And the scheme that we want to use for quantization is, let's say in this specific case, the FP8 dynamic

08:44.960 --> 08:48.960
scheme. We call the one-shot method, we provide the model and the recipe, and that's it.
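As a rough sketch of what that slide's code looks like with LLM Compressor (the model name is just an example, import paths may differ between library versions, and actually running this needs a GPU and a model download, so treat it as illustrative rather than canonical):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 "dynamic" scheme: weights are quantized ahead of time, activation
# scales are computed on the fly at runtime, so no calibration data is
# needed.
recipe = QuantizationModifier(
    targets="Linear",        # target all linear layers...
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],      # ...except the final LM head
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-8B-Instruct-FP8-dynamic")
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-FP8-dynamic")
```

The saved directory can then be served directly, for example with vllm serve.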

08:50.000 --> 08:55.280
This is a very simple pipeline where we don't use any calibration data. So here we come to

08:55.280 --> 08:59.280
the second choice that we have to make when we're doing quantization, and that's

08:59.280 --> 09:03.200
the choice of whether we're going to do quantization with or without a calibration dataset.

09:03.360 --> 09:07.760
A calibration dataset is basically a set of tokens that we pass through our model

09:07.760 --> 09:12.800
to simulate the forward pass through the model and see how the model responds to it.

09:12.800 --> 09:18.720
LLMs these days have one major problem, and that problem is outliers inside the

09:18.720 --> 09:24.160
activations. Outliers in activations are a phenomenon where some specific

09:24.160 --> 09:29.760
channels have order-of-magnitude higher values than the other activations in the same

09:30.240 --> 09:34.400
layer. And this makes quantization very hard, because if we have one extremely large value and a lot

09:34.400 --> 09:39.120
of very small values, and we quantize on top of that, this very large value is going to push

09:39.120 --> 09:43.440
all these small ones to zero. And if we do that, we're going to destroy the accuracy.
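The effect is easy to see in a toy sketch (made-up numbers, per-tensor absmax INT8): one outlier channel sets the quantization step, and every small activation rounds straight to zero:

```python
def fake_quantize_int8(xs):
    """Quantize to the INT8 grid with one per-tensor absmax scale,
    then dequantize, to see what survives the round trip."""
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) * scale for x in xs]

# One outlier (60.0) forces a step size of 60/127 ~= 0.47, so the
# small activations all collapse into the zero bucket.
acts = [0.02, -0.01, 0.03, 60.0]
dq = fake_quantize_int8(acts)
print(dq)
```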

09:44.080 --> 09:51.120
So we have to address this, and the path to get there is to take some

09:51.120 --> 09:55.920
calibration dataset, pass it through the model, and see where these outliers appear in the model,

09:55.920 --> 10:00.640
and then based on that apply some tricks. This requires some kind of calibration

10:00.640 --> 10:06.880
dataset, and the example here shows how relatively easy it is to do

10:06.880 --> 10:10.480
quantization with LLM Compressor, now just accounting for the fact that we want to have a

10:10.480 --> 10:15.120
calibration dataset as well. First we load the model the standard way, as we load

10:15.120 --> 10:20.240
any Hugging Face Transformers model. Then, based on many years of research,

10:20.240 --> 10:24.480
we have already prepared a proper calibration dataset. Usually, if you have a model

10:24.480 --> 10:29.920
fine-tuned for a specific task, you want to pick a calibration dataset from that task so that you

10:29.920 --> 10:34.880
have in-distribution tokens. If you're just quantizing a general model, then this should

10:34.880 --> 10:39.920
be a good enough dataset to start with. We just tokenize it, and then we create a

10:39.920 --> 10:45.120
quantization recipe. Here I'm trying to show you one nice feature of LLM Compressor:

10:45.120 --> 10:51.360
you can combine multiple different quantization algorithms together. Here I'm

10:51.360 --> 10:57.280
throwing in SmoothQuant, which is an algorithm to deal with the outliers in the activations,

10:58.080 --> 11:02.480
and then the standard quantization modifier. I want all linear layers, but now I have changed

11:02.480 --> 11:08.000
the quantization scheme to quantize the weights to 8 bits and the activations to 8 bits. And I can also

11:08.000 --> 11:12.800
ignore some specific layers if I really want. One common practice is to skip quantization of the

11:12.800 --> 11:18.960
final LM head, because quantizing it usually has a significant impact on the accuracy of the model.

11:19.520 --> 11:24.160
Then we call the one-shot method, we provide the model and recipe, and, contrary to the previous example,

11:24.160 --> 11:28.160
now, given that we have a calibration dataset, we also have to provide the dataset and the number

11:28.160 --> 11:33.680
of calibration samples. What this is going to do is pass the calibration tokens through

11:33.680 --> 11:38.800
the model, look at how the activations behave and where the outliers are, and then based on that apply

11:38.800 --> 11:44.000
SmoothQuant, run the linear-layer quantization, and provide us with a quantized model.
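Schematically, the calibrated pipeline from the slide looks something like this with LLM Compressor (model and dataset names are examples drawn from the library's documentation; APIs may differ across versions, and running it needs a GPU and downloads):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Two modifiers composed in one recipe: SmoothQuant migrates activation
# outliers into the weights, then GPTQ quantizes weights and activations
# to 8 bits (W8A8), skipping the accuracy-sensitive LM head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# With calibration data we must also say which dataset to use and how
# many samples to pass through the model.
oneshot(
    model=model,
    dataset="open_platypus",   # example calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Llama-3.1-8B-Instruct-W8A8")
```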

11:44.320 --> 11:50.480
This pathway is computationally more expensive, right, because you have to have some access to

11:50.480 --> 11:55.280
GPUs to do the forward pass. The previous one was relatively easy because it did not

11:55.280 --> 11:59.440
depend on any calibration dataset, so we were able to do that even on a CPU.

11:59.440 --> 12:03.600
LLM Compressor has a couple of nice built-in features to deal with this:

12:03.600 --> 12:08.400
whether you have a single GPU or multiple GPUs, it can do sharding, CPU offloading, and all these

12:08.400 --> 12:16.080
modern tricks to deal with that. So, given that, as I mentioned before, we are quantizing

12:16.080 --> 12:21.040
a model, and this is a lossy process, the first question that we should ask ourselves

12:21.040 --> 12:26.080
is how well we are recovering the accuracy of the original model. Here, we have

12:26.080 --> 12:31.280
a pretty long line of research where we're trying to show how to do quantization

12:31.280 --> 12:36.080
in different scenarios for different models, and so on, but here I've just

12:36.080 --> 12:42.000
cherry-picked one example to show what this accuracy recovery looks like in practice overall.

12:42.000 --> 12:47.920
So what we're looking at is the entire family of DeepSeek-R1 distilled models.

12:47.920 --> 12:52.240
From Llama and Qwen, DeepSeek took these models and fine-tuned them for

12:52.240 --> 12:57.200
reasoning tasks. Then we are looking at their average performance across AIME, MATH-500,

12:57.200 --> 13:01.760
and GPQA Diamond, which are the standard reasoning tasks for this. And we are looking

13:01.760 --> 13:08.160
at four models: BF16 is the unquantized baseline, represented by the gray bar; FP8, INT8, and

13:08.160 --> 13:13.200
INT4, represented by the blue, green, and yellow colors, are the different quantization

13:13.200 --> 13:18.720
schemes that we apply here. And the most important point to take away from this is that the

13:20.160 --> 13:26.080
blue and green bars here are usually very, very close, in most cases almost indistinguishable,

13:26.160 --> 13:32.720
from the gray bars, which means that FP8 and INT8 quantization, if done and calibrated properly,

13:32.720 --> 13:38.880
should always yield accuracy recovery in the range of 95% to 100% of the full baseline model.

13:40.000 --> 13:44.960
One important note in addition to that: INT4 quantization is usually a little bit

13:44.960 --> 13:50.000
trickier in this specific regime, more specifically for the smallest models in the family.

13:50.000 --> 13:54.800
As you can see for Qwen 1.5B and Llama 8B, we do see slightly higher drops here,

13:54.800 --> 13:59.840
but usually, if this process is also done properly, these drops should never go below 90% recovery.

13:59.840 --> 14:04.480
At least, we have done a ton of research on this, produced hundreds of models on the Hugging Face Hub,

14:04.480 --> 14:10.080
and we still haven't found a single use case where the accuracy recovery would go below 90%.

14:10.080 --> 14:13.600
So if that happens then something should be calibrated

14:13.600 --> 14:18.880
slightly better on the quantization side. All this time I've just been talking about Llama,

14:18.880 --> 14:25.920
but all these techniques apply to vision-language models or any other

14:25.920 --> 14:29.920
architecture that you find; it's just that Llamas these days are the most representative use case.

14:30.720 --> 14:36.560
So that was about accuracy. Given that we are compressing a model, we usually expect some

14:36.560 --> 14:41.120
gains on the speedup side, because now we have a smaller model and we want to run inference with it.

14:42.000 --> 14:49.120
So here what we're looking at is how the inter-token latency changes with respect to the number

14:49.120 --> 14:55.520
of queries per second, and we are looking at a Llama 8B model served on a single A6000 GPU for

14:55.520 --> 15:00.560
a specific document-generation use case. And we are looking at three different models: BF16,

15:00.560 --> 15:06.480
which is the unquantized baseline; an INT8 weight-and-activation quantized model; and an

15:06.480 --> 15:10.720
INT4 weight-only quantized model. There are two interesting things to take away from

15:10.720 --> 15:16.720
this graph. The first one is that if our server is being hit with fewer than four queries per second,

15:16.720 --> 15:22.480
we're going to be in this regime here, where the best choice with respect to the smallest latency

15:22.480 --> 15:28.480
that we want to get from the model would be obtained by doing INT4 weight-only quantization.

15:28.480 --> 15:34.640
And this is because in this specific use case we are not bounded by compute. We don't

15:34.640 --> 15:40.000
have many requests coming to our model, therefore our GPUs are going to be idle most of the time.

15:40.000 --> 15:44.880
So what's going to happen here is that the main gain that we can get is by optimizing the

15:44.880 --> 15:53.360
weight-loading part of the pipeline. Then the next part here becomes important: when we hit

15:53.360 --> 15:57.360
four queries per second, what starts happening is that

15:57.360 --> 16:02.480
weight-and-activation quantization starts becoming a better choice than weight-only quantization.

16:02.480 --> 16:08.080
This is because at this point we have large enough inputs to keep our GPUs busy doing matrix-

16:08.080 --> 16:12.800
matrix multiplications, which means that now we need to optimize the computational part. This was

16:12.800 --> 16:18.160
the second part of the inference pipeline. And we can optimize that by quantizing both

16:18.160 --> 16:23.120
operands of the matmul, which means we are leveraging the lower-precision tensor cores.

16:24.000 --> 16:29.040
And then if we push even more, what's going to happen is that at some point

16:29.040 --> 16:34.400
we're going to get into the heavily compute-bound regime. Our inputs are going to be extremely large,

16:34.400 --> 16:39.520
and weight-only quantized models are going to become a worse choice than just deploying the

16:39.520 --> 16:44.480
unquantized model, because in practice we're going to be heavily bounded by compute,

16:44.480 --> 16:48.960
and the time needed to load the weights is going to be almost zero relative to the time that we

16:48.960 --> 16:53.920
spend in the tensor cores doing matmuls. So you have to be really careful when you're deploying

16:53.920 --> 17:00.320
models, in order to figure out where on this graph your deployment lies. And this depends on

17:00.320 --> 17:05.360
all of the factors in the game, like the model size, the GPU that you have, and the requests

17:05.360 --> 17:12.000
that your server is receiving. In order to automate this, because

17:12.000 --> 17:15.920
we had to do this many, many times, we developed a library. It's also part of the vLLM project;

17:15.920 --> 17:20.960
it's called GuideLLM. You can just serve your model, simulate real-world workloads,

17:20.960 --> 17:25.040
see how the model behaves in different scenarios, and then based on that plot something

17:25.040 --> 17:30.640
like this and see which quantization scheme is best for you. Given that we have five more minutes,

17:30.640 --> 17:36.320
I'm probably going to wrap up here with just a very short part regarding speculative

17:36.320 --> 17:42.720
decoding. So, speculative decoding is a relatively new technique. Quantization was a lossy process,

17:42.720 --> 17:48.720
because we are moving the weights around, we are shifting the weights. Speculative decoding, in contrast,

17:48.800 --> 17:54.560
is a lossless acceleration technique, which means that we are not changing the model, and the

17:54.560 --> 18:01.040
text that we get at the end of the speculative decoding process is guaranteed to be the

18:01.040 --> 18:06.240
same text that we would get without it. The main caveat here is that we have to train an additional

18:06.240 --> 18:12.480
model, called the speculator model, which we're going to serve alongside our original model. This speculator

18:12.480 --> 18:18.160
model is a model which is an order of magnitude smaller than the very large model that we

18:18.160 --> 18:22.240
are originally interested in, and this model is something we're going to run many, many times,

18:22.240 --> 18:26.960
trying to produce three to five tokens at a time. Then we're going to take our larger model,

18:26.960 --> 18:31.520
in this case called the verifier, just to verify the outputs of this smaller model, and then

18:31.520 --> 18:36.400
the larger model is going to say: oh, I agree with three out of five tokens, these are the tokens

18:36.400 --> 18:41.040
that I would produce if I were in the decoding phase, so I'm going to accept these three, I'm going to

18:41.040 --> 18:47.040
reject these two, and then go again. In this way, if our speculator model has been trained properly,

18:47.920 --> 18:53.680
we get a scheme where we are generating multiple tokens at a time with the larger model,

18:53.680 --> 18:59.200
for the cost of running a couple of forward passes through this very, very small model.
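The draft-then-verify loop can be sketched with toy stand-in "models" (plain functions over a fixed story; nothing here resembles the real vLLM implementation). The verifier keeps the agreeing prefix of the draft and always contributes one token of its own:

```python
def speculative_step(prefix, draft, verify, k=2):
    # 1) the cheap draft model proposes k tokens
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) the large model checks them; stop at the first disagreement
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if verify(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # 3) the verifier always adds one token of its own, so even total
    #    disagreement still makes progress
    accepted.append(verify(ctx))
    return accepted

# Toy greedy "models": the next word comes from a fixed story.
STORY = "once upon a time there lived a dragon".split()
def next_word(ctx):
    return STORY[len(ctx)] if len(ctx) < len(STORY) else "<eos>"

# A draft model that sometimes speculates wrongly ("scary" is rejected).
bad_draft = lambda ctx: "scary" if len(ctx) == 5 else next_word(ctx)

out = ["once", "upon"]
while out[-1] != "<eos>":
    out += speculative_step(out, bad_draft, next_word, k=2)
print(" ".join(out))
```

Even with one wrong speculation along the way, the output is identical to what the verifier alone would have produced, which is exactly the lossless property.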

18:59.200 --> 19:04.240
I tried to illustrate how it works in the figure here: in the first pass we have the prompt,

19:04.240 --> 19:12.640
"once upon". The large model starts the inference and produces "a", and then we take this "once upon a"

19:12.800 --> 19:19.200
and we give it to the speculator model, and we say: okay, please speculate what the next

19:19.200 --> 19:23.600
two tokens should be it's going to say time there and then we're going to take all these

19:23.600 --> 19:27.520
and we're going to put them through the verifier and say: hey, do you agree with these tokens?

19:27.520 --> 19:32.400
Are these the tokens that you would produce? In this specific case the model said: yes, I would

19:32.400 --> 19:38.000
produce these. So we got three tokens for the cost of generating just one token, plus the small amount of

19:38.000 --> 19:43.200
time we spent in the speculator model. Now we have generated three tokens; we pass them again

19:43.200 --> 19:48.880
to the speculator and say: generate the next two tokens. The speculator model generates

19:48.880 --> 19:53.680
these two; the verifier in this case does not agree with "scary", so it's just going to discard

19:53.680 --> 19:59.360
the "scary" part, take the first one, which it does agree with, and then combine them

19:59.360 --> 20:05.120
and produce an x one and this entire process is great but there is one important part is that

20:05.600 --> 20:10.160
quantization was relatively easy to apply; we take a model, we just quantize it in one shot,

20:10.160 --> 20:14.400
and we have a model which is faster. Here, the main cost that we have to pay is that we have to train the

20:14.400 --> 20:18.960
speculator model: we have to take the large model, generate some dataset,

20:18.960 --> 20:22.800
and then train the smaller speculator model. There are different techniques for training them

20:23.360 --> 20:28.160
to mimic the distribution of the larger model. So this is the computationally

20:28.160 --> 20:33.040
expensive part, but it does allow us to get lossless speedups. For this purpose

20:33.040 --> 20:37.280
we also developed a library in the vLLM project, called Speculators, where you can train these

20:37.280 --> 20:41.920
models on your own datasets, or you can just go to the Hugging Face Hub and pick up some of the models

20:41.920 --> 20:46.080
that we have already released there. Given that we don't have time, I'm just going to skip through

20:46.080 --> 20:51.280
the results. The cool part is that we can get speedups of anywhere from two to five

20:51.280 --> 20:56.320
x, depending on the model size and the quality of the speculator model, but I'm going to skip this so

20:56.320 --> 21:01.440
we can maybe take some questions. So yeah, links: all of the libraries are part of the vLLM

21:01.440 --> 21:05.840
project, open source; you can just play with them, and there are standard examples. If you

21:05.840 --> 21:09.920
don't want to do any of this and you just want to get high-quality quantized or high-quality

21:09.920 --> 21:14.640
speculator models, you can go to Red Hat's Hugging Face Hub. We are releasing new models on a daily

21:14.640 --> 21:18.960
basis there. There are more than 500 compressed models which have already been

21:18.960 --> 21:23.280
validated, so you can be sure, and you can see what the accuracy recovery was across

21:23.280 --> 21:28.800
different use cases. You can just download them and play with them. Yeah, thanks a lot for your time.

21:31.840 --> 21:53.600
Yeah, perfect, okay. So the question is how to tweak the recipe. So, the recipe:

21:53.600 --> 21:58.160
it really depends on the quantization algorithm that you want to use. Every quantization algorithm

21:58.160 --> 22:03.360
has different knobs to tweak. Usually what you would do would be like a standard

22:03.360 --> 22:08.160
training loop: you would take some development set, some small subset that you

22:08.160 --> 22:13.200
know you're not testing on, and then you would just run a couple of quantization runs with

22:13.200 --> 22:17.840
different hyperparameters, like tuning a training process with different hyperparameters,

22:17.840 --> 22:22.720
and then you would evaluate on the development split, pick the best

22:22.720 --> 22:27.280
one, and go to the test split. Usually there is some guidance, depending on which

22:27.280 --> 22:33.840
quantization algorithm you use, on how to tune some specific pieces. For example, GPTQ, which tries

22:33.840 --> 22:38.880
to approximate the inverse Hessians, has a dampening term and so on,

22:38.880 --> 22:44.480
which has to be picked on a model-by-model basis. So there is no general rule

22:44.480 --> 22:48.240
of thumb that says you should set this like this. There is just some rule of thumb

22:48.240 --> 22:54.000
on how many tokens you can use for calibration, like 500 to 1K; beyond that point you start

22:54.160 --> 22:59.120
seeing negligible improvement in the end results. But apart from that, everything

22:59.120 --> 23:16.880
is very specific to the algorithm that you use. There was a question here? Yes, yes: you always run them

23:16.880 --> 23:33.200
in parallel. Yeah, so the question is how to run speculators in vLLM. Basically, in vLLM we have

23:33.200 --> 23:38.560
support where you don't have two instances of vLLM; it's a single instance, but inside that

23:38.560 --> 23:44.400
single instance you have a speculator model and your large model. And if you do TP=2, as you said,

23:44.400 --> 23:49.680
for example, in this case the tensor parallelism refers to your original

23:49.680 --> 23:53.600
model; the speculator will still run on its own, so it's completely independent. Because

23:53.600 --> 23:58.400
it's a really small model, you don't get any gains by splitting it or doing any fancy

23:58.400 --> 24:03.120
sharding with it. So basically your entire model is still running in the same way; you just

24:03.120 --> 24:09.840
attach one more, smaller model, which just runs in parallel. And it's as simple as vllm serve:

24:10.000 --> 24:14.400
you take a speculator model, you give it to vllm serve, and based on that it's going to

24:14.400 --> 24:20.960
do everything automatically for you. You don't have to do any custom stuff.
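For reference, serving a model together with a trained speculator is a single command. A sketch (the model name is an example, the speculator path is a placeholder, and the exact speculative-decoding flags have changed across vLLM versions, so check the current docs):

```shell
# One vLLM instance hosts both models: tensor parallelism shards only
# the large verifier; the small speculator runs unsharded alongside it.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --speculative-config '{"model": "<your-speculator-model>", "num_speculative_tokens": 3}'
```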

24:20.960 --> 24:37.920
Yes, the question over there. So the question is regarding the validation of quantized models:

24:38.320 --> 24:43.040
are we in danger of overfitting, for example, to some specific dataset? Yes, that's a really

24:43.040 --> 24:49.120
great question, and it's still an ongoing problem, because usually whenever we quantize

24:49.120 --> 24:53.760
a model, we open the model card on Hugging Face and we see: okay,

24:53.760 --> 24:59.600
Mistral published these ten evaluation benchmarks, so our main task is: okay, let's try to

24:59.600 --> 25:05.120
do quantization and recover on these benchmarks that they proposed. And that's the standard

25:05.120 --> 25:09.840
setup that we usually do. Maybe there is something that we are missing along the way, so we are always

25:09.840 --> 25:15.600
trying to add new benchmarks to the mix, but we still haven't found a single benchmark where

25:15.600 --> 25:22.000
the entire story that we have, about accuracy recovery being above 90%, fails. So yeah, we have

25:22.000 --> 25:26.160
a paper where we did more than a million evaluations across many different benchmarks. It's called

25:26.160 --> 25:32.160
"Give Me BF16 or Give Me Death? Accuracy-Performance Trade-Offs in LLM Quantization", and it basically

25:32.240 --> 25:38.000
presents a large-scale study: we took every single benchmark that exists out there,

25:38.000 --> 25:44.640
from Arena-Hard to coding to the Hugging Face leaderboards, v1 and v2, and so on, and we still

25:44.640 --> 25:51.520
haven't been able to find a single benchmark where this fails. Yeah.

25:52.160 --> 26:11.280
So the question is about the relation with llama.cpp. Yeah.

26:11.360 --> 26:27.680
Okay, yeah. So vLLM supports llama.cpp models in general, I think, but I'm not exactly sure

26:27.680 --> 26:33.360
about support for the quantized ones, because llama.cpp has its own way of doing quantization,

26:33.360 --> 26:38.960
like Q4, Q3, and all these different schemes, and they're all doing weight-only quantization as far

26:39.920 --> 26:44.720
as I'm aware. I think that vLLM should support them, but I'm not sure that that's a

26:44.720 --> 26:50.320
well-tested path at all. I think there is some way to run llama.cpp models, but I don't think

26:50.320 --> 26:54.960
it's super performant, at least at this point. I know that there are some people working on it;

26:56.160 --> 27:01.920
specifically, in Red Hat there is a new team which is supposed to bring llama.cpp to be a first-

27:01.920 --> 27:08.480
class citizen of vLLM, but I'm not really in touch with that pipeline. We mostly

27:08.560 --> 27:12.240
run vLLM on GPUs; that's kind of the main thing that we focus on.

27:27.920 --> 27:29.920
great

