WEBVTT

00:00.000 --> 00:12.000
And so welcome to my session, introducing the Kubernetes checkpoint restore working group.

00:12.000 --> 00:14.000
Please be quiet.

00:14.000 --> 00:16.000
Thank you.

00:16.000 --> 00:24.000
So this talk will be, as the title suggests, about the Kubernetes checkpoint restore working group.

00:24.000 --> 00:27.000
Also, I only have 10 minutes.

00:27.000 --> 00:30.000
And there's no time for questions.

00:30.000 --> 00:34.000
So if you have any questions, please catch me later somewhere.

00:34.000 --> 00:38.000
And then we can talk about it.

00:38.000 --> 00:44.000
So why did we create the Kubernetes checkpoint restore working group?

00:44.000 --> 00:45.000
So there's a bit of history.

00:45.000 --> 00:49.000
So I've personally been working on checkpoint restore for, I don't know,

00:49.000 --> 00:51.000
a long time, 16 years probably.

00:51.000 --> 00:57.000
And for the last six years we've been focusing on trying to get checkpoint restore working in Kubernetes.

00:57.000 --> 01:04.000
And the reasons why we want this, the use cases, that's an interesting side story, kind of.

01:04.000 --> 01:12.000
If you look at the last 25 years of papers written about checkpoint restore, also outside of the context of containers,

01:12.000 --> 01:14.000
it's always the same; it hasn't changed.

01:14.000 --> 01:16.000
If you read a paper from 20 years ago,

01:16.000 --> 01:22.000
the reasons why people want to have this are probably still the same.

01:22.000 --> 01:25.000
So this is mainly: you want to have fault tolerance.

01:25.000 --> 01:27.000
You have a workload running.

01:27.000 --> 01:32.000
And for whatever reason you want to protect it: if some of the resources in your cluster are

01:32.000 --> 01:39.000
failing, you want to have a point from which you can continue running your workload without starting from the beginning.

01:39.000 --> 01:44.000
Another use case, which has come up more and more in the last years,

01:44.000 --> 01:50.000
is fast startup time. Today this is mainly used with Java applications.

01:50.000 --> 01:52.000
So you initialize a Java application.

01:52.000 --> 01:57.000
You take a checkpoint, and later you restore it from the checkpoint, and it will start

01:57.000 --> 01:58.000
much faster.

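(A minimal sketch of this fast-startup workflow, using OpenJDK's CRaC project as one example implementation; the talk doesn't name a specific tool, and `app.jar` and `/tmp/cr` are placeholder names. A CRaC-enabled JDK build is required, and CRIU does the actual checkpointing underneath.)

```shell
# Start the application with a checkpoint directory configured
# (requires a CRaC-enabled JDK build; app.jar is a placeholder):
java -XX:CRaCCheckpointTo=/tmp/cr -jar app.jar &
JVM_PID=$!

# Once initialization/warmup is done, snapshot the running JVM:
jcmd "$JVM_PID" JDK.checkpoint

# Later, start from the snapshot instead of from scratch,
# skipping all the initialization work:
java -XX:CRaCRestoreFrom=/tmp/cr
```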
01:58.000 --> 02:06.000
And the thing why people have become really interested in the last years is better resource utilization.

02:06.000 --> 02:09.000
So you have a cluster, Kubernetes or not.

02:09.000 --> 02:12.000
And you want to redistribute your workload.

02:12.000 --> 02:19.000
You want to rebalance your workload, without stopping the workload and without losing any progress.

02:19.000 --> 02:25.000
And so this is also a reason why you would maybe like to use checkpoint restore: to migrate your workloads.

02:25.000 --> 02:31.000
And especially the resource utilization part has become interesting in the last years with GPUs.

02:31.000 --> 02:37.000
GPUs, because GPUs are really expensive and everybody wants to run something on them.

02:37.000 --> 02:48.000
And so people are really interested in finding a solution for how to better utilize existing resources and existing GPUs.

02:48.000 --> 02:54.000
And that's why we have seen much more interest in checkpoint restore related topics.

02:55.000 --> 02:58.000
In the beginning, I did this mainly all alone.

02:58.000 --> 03:02.000
And then some time later, I don't think he's here,

03:02.000 --> 03:04.000
he slowly joined the effort.

03:04.000 --> 03:06.000
And I think Victoria's over there.

03:06.000 --> 03:08.000
She also joined the effort.

03:08.000 --> 03:12.000
And so we in a small group started to work on this.

03:12.000 --> 03:19.000
And in 2022, we had the first thing which was merged into Kubernetes: container checkpointing and restore.

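(A minimal sketch of how that merged feature is used, assuming the alpha "Forensic Container Checkpointing" feature from KEP-2008, available since Kubernetes 1.25; it needs the ContainerCheckpoint feature gate enabled on the kubelet and a CRIU-capable runtime such as CRI-O. The namespace, pod, and container names are placeholders.)

```shell
# Checkpoint a single container via the kubelet API, run on the node
# itself (plus whatever client credentials your kubelet requires):
curl -sk -X POST \
  "https://localhost:10250/checkpoint/default/counters/counter"

# On success the kubelet writes a checkpoint tar archive under
# /var/lib/kubelet/checkpoints/
```

The resulting archive can later be converted into an image and restored by a checkpoint-aware runtime.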
03:19.000 --> 03:27.000
But the problem was, even though it did work for us at the time, it was all happening in a small group, and most discussions were private.

03:27.000 --> 03:35.000
And we moved along pretty well, but it was a really small group, and it was difficult for people from the outside

03:35.000 --> 03:40.000
to understand what we did and why we did it.

03:40.000 --> 03:44.000
So in the last one and a half years,

03:44.000 --> 03:55.000
the working group idea came up. Last year, here in Boston, after my talk, people started to talk to us about why we do the things we do, and whether we couldn't do this differently.

03:55.000 --> 04:04.000
And then, all through the first half of 2025, people were reaching out to us, asking what's happening here and whether they could also join this effort.

04:04.000 --> 04:06.000
And so we thought

04:06.000 --> 04:13.000
maybe it's time to make this something official, and the Kubernetes community has this construct called a working group.

04:13.000 --> 04:21.000
So there are special interest groups, which live for a long time, and there are working groups, which are more short-lived and try to solve a certain problem.

04:21.000 --> 04:31.000
And so we thought, yeah, well, maybe creating a working group and having more people join this effort would be the better approach.

04:31.000 --> 04:37.000
So what we did: in May 2025 we started to fill out the Kubernetes paperwork.

04:37.000 --> 04:42.000
It's a pull request where you describe what you want to do and why you want to do it.

04:42.000 --> 04:52.000
And then, just half a year later, this pull request was merged, and we were able to officially say that a Kubernetes checkpoint restore working group was established.

04:52.000 --> 05:02.000
We then did all the Kubernetes infrastructure things we need for the working group: mailing list, Zoom calls, Slack channels, calendars.

05:02.000 --> 05:17.000
And in December 2025 we actually had a first meeting, and in that first meeting we introduced what we think are the open topics this working group could solve in the context of Kubernetes.

05:18.000 --> 05:35.000
And so at this point, at the meeting, we have maybe 20 to 25 people who are interested in the topic, all working together to find some way we could get more support into Kubernetes and extend what we currently have.

05:36.000 --> 05:44.000
So it was pretty clear early on that the one thing we want to focus on next is pod checkpointing and restore.

05:44.000 --> 05:54.000
So currently we have just containers, but Kubernetes is more about pods, and there are many resources which are connected to a pod.

05:54.000 --> 06:02.000
So to make this feature useful, it seems important that we have it connected to a pod.

06:02.000 --> 06:13.000
Pod checkpointing and restore is what we're currently working on, and this happens in the usual Kubernetes way: we have a Kubernetes Enhancement Proposal, which is a large document where we describe everything.

06:13.000 --> 06:33.000
How we want to do it, the design, the architecture, and all those things. In parallel to this theoretical description we also have an implementation, a proof of concept, where we try to implement what we describe, and this is mainly for us to see if the ideas that we have on paper actually work in real life.

06:33.000 --> 06:41.000
And important for us is that the proof of concept doesn't have to be the final solution. So if we see, in our, I don't know,

06:41.000 --> 06:54.000
testing or designing of it, that it doesn't work out, the proof of concept is just what the name says: it's just a concept. And what we want to do to bring pod checkpoint restore to Kubernetes is an iterative approach.

06:54.000 --> 07:09.000
And the main reason is to make it easy for reviewers. I think this is like in any other open source project: if we would bring everything at once, we would need forever to design and code it, and people would have a really hard time reviewing it. And that's why we want to

07:09.000 --> 07:30.000
do it in small steps and first have just some initial functionality in Kubernetes. It will not, from the beginning, support all pod-attached resources, like volumes or shared memory; this will all happen in later steps.

07:30.000 --> 07:46.000
And one of the goals of this pod checkpoint restore is that at some point maybe we would like to see a scheduler integration, so that if resources are getting low somewhere, the scheduler can make decisions to automatically migrate a pod from one node to another node.

07:46.000 --> 07:56.000
So who's currently working on it? I already mentioned Victoria, who is one of the chairs, and another of the chairs, Peter, is not here.

07:56.000 --> 08:09.000
I know that other people who are joining this effort in the working group are also here, so please say hi. But in the end, it's everybody who's interested in the topic.

08:09.000 --> 08:20.000
We welcome you to join us and help us move this forward in Kubernetes, and hopefully we can see it implemented pretty soon.

08:20.000 --> 08:33.000
So when do we meet? Currently we meet on Thursdays at 6 p.m. European time, and anytime on Slack and on the mailing list. And my time is up, and that's it. Thank you for listening.

