WEBVTT

00:00.000 --> 00:10.160
OK, and that's going to be our last talk of the day.

00:10.160 --> 00:15.160
Talking about hosting WordPress at scale at EPFL.

00:15.160 --> 00:16.160
Hello.

00:16.160 --> 00:17.160
I'm Dominic.

00:17.160 --> 00:24.840
I'm very humbled to be here, after all these speakers who came before me.

00:24.960 --> 00:28.640
I've been coming to FOSDEM for 10 years or so,

00:28.640 --> 00:31.320
and this is the very first time I'm presenting.

00:31.320 --> 00:34.240
So, I apologize in advance.

00:34.240 --> 00:41.960
Thank you very much.

00:41.960 --> 00:46.840
I'm here to present the work of my team this year.

00:46.840 --> 00:49.960
Some of them are in attendance.

00:49.960 --> 00:54.120
And without further ado, let me tell you all about our alma mater,

00:54.120 --> 00:58.040
EPFL, École Polytechnique Fédérale de Lausanne,

00:58.040 --> 01:04.920
the Swiss Federal Polytechnic School on Lake Geneva.

01:04.920 --> 01:07.000
This is a rather big place.

01:07.000 --> 01:11.520
We have thousands of students, 14,000 plus students,

01:11.520 --> 01:15.880
about 300 academic staff, meaning as many labs,

01:15.880 --> 01:19.920
like autonomous research entities, 6K employees,

01:19.920 --> 01:23.800
and people from all over the world.

01:23.800 --> 01:25.280
So, there's a positive side

01:25.280 --> 01:27.600
to that. Of course, we have data centers.

01:27.600 --> 01:31.440
So, maybe some of you run things on the cloud,

01:31.440 --> 01:34.400
the public cloud, who does?

01:34.400 --> 01:39.600
Ah, so, a number of you. Who has an on-premise setup,

01:39.600 --> 01:41.320
like we do?

01:41.320 --> 01:42.680
All right.

01:42.680 --> 01:47.800
So, apologies to people who are on the cloud side,

01:47.800 --> 01:50.320
because some of the things I'm going to talk about,

01:50.320 --> 01:56.680
are specific to our position, which is to have on-premise data centers

01:56.680 --> 01:59.200
that have VMs and NFS.

01:59.200 --> 02:02.280
We are a team of eight employees and five interns,

02:02.280 --> 02:06.000
and, of course, we don't do just WordPress.

02:06.000 --> 02:09.520
So, I'm going to speak today about serving at scale,

02:09.520 --> 02:12.040
which is, I think, our main contribution,

02:12.040 --> 02:13.760
operating at scale.

02:13.760 --> 02:15.880
I think this one you will find of interest,

02:15.880 --> 02:20.000
if you want to run a cluster of WordPresses as well,

02:20.000 --> 02:23.640
and developing at scale, but there's a little trick here,

02:23.640 --> 02:27.320
a little catch that I will mention later.

02:27.320 --> 02:29.080
So for that, we are going to use tools

02:29.080 --> 02:31.560
that people are probably very familiar with.

02:31.560 --> 02:35.400
We are going to use containers, pods, PHP-FPM.

02:35.400 --> 02:39.200
I expect NGINX to be a little more debatable,

02:39.200 --> 02:41.480
and in fact, we probably will have to change that

02:41.480 --> 02:45.240
in short order, and, of course, an off-premise cache.

02:45.240 --> 02:48.680
Even though we have our own cloud,

02:48.680 --> 02:53.480
we do recommend the use of a cache;

02:53.480 --> 02:57.360
I'll get back to that point later.

02:57.360 --> 02:59.800
Earlier, we figured first that we wanted WordPress.

02:59.800 --> 03:02.000
So, we were users, before 2009,

03:02.000 --> 03:04.600
we were users of something called Jahia.

03:04.600 --> 03:08.760
It's another CMS, and we needed to replace that.

03:08.760 --> 03:11.600
And also, we wanted to federate all the, you know,

03:11.600 --> 03:15.200
mylab.epfl.ch sites under a single domain name,

03:15.200 --> 03:19.160
and we figured, this poses a bit of an access control headache.

03:19.160 --> 03:22.880
Why not have a galaxy of WordPresses,

03:22.880 --> 03:28.600
all living under a particular segment of the URL space?

03:28.600 --> 03:31.840
And we did the math and came up with hundreds of containers.

03:31.840 --> 03:37.320
Sorry, of WordPresses, and the first iteration of that was tough.

03:37.320 --> 03:43.320
So, we actually went for a WordPress of the 2000s style,

03:43.320 --> 03:48.000
you know, like, you upload your PHP to some commercial

03:48.000 --> 03:52.880
hosting provider, except in our case, the hosting provider was us.

03:52.880 --> 03:57.240
And, whether you want to run WordPress or something else,

03:57.240 --> 04:00.720
is kind of up to you.

04:00.720 --> 04:03.280
A bit, because we didn't know any better back then,

04:03.280 --> 04:05.920
and we had actually outsourced the operations.

04:05.920 --> 04:09.520
We were not, say, a DevOps team; we were a development team.

04:09.520 --> 04:14.800
So, that's me about two decades earlier than that.

04:14.800 --> 04:20.080
And, that was not really at scale.

04:20.080 --> 04:23.200
We used Apache, and as you might know,

04:23.200 --> 04:27.040
this is a bad choice for something of that scale,

04:27.040 --> 04:31.120
because basically you are always paying full price for everything

04:31.120 --> 04:35.200
on every query, whether it be the homepage

04:35.200 --> 04:39.480
or a humble JavaScript file somewhere.

04:39.480 --> 04:41.240
And we ran that stuff over NFS,

04:41.240 --> 04:42.560
because NFS is what we have.

04:42.560 --> 04:44.600
So, we used NFS as the hosting provider.

04:44.600 --> 04:47.240
And, that is a bit of a disaster.

04:47.240 --> 04:51.800
Because every time WordPress starts up, for every single click

04:51.800 --> 04:56.040
on every single page, you have to pay a thousand times,

04:56.040 --> 04:58.600
the RTT, the round-trip time, to your filer.

04:58.600 --> 05:01.880
Because none of the code is on the server,

05:01.880 --> 05:03.120
and it doesn't know any better.

05:03.120 --> 05:06.200
Even if there is a cache, which there is,

05:06.200 --> 05:08.800
you have to revalidate it.

05:08.880 --> 05:12.320
So, in 2025, which is this year,

05:12.320 --> 05:14.480
we had to do it again, and do it right.

05:14.480 --> 05:17.840
And, we went for a much more modern, much more,

05:17.840 --> 05:21.600
I will say, standard architecture.

05:21.600 --> 05:24.480
So, we're going to zoom in on that bit here,

05:24.480 --> 05:30.800
and later on that bit here, and finally, this one with a demo.

05:30.800 --> 05:33.360
So, you want a nice cache if you are going to be serious with

05:33.360 --> 05:36.560
serving at scale. You want something that turns your

05:36.640 --> 05:40.880
four terabytes of data served daily, which is what a day

05:40.880 --> 05:44.000
of serving looks like at epfl.ch,

05:44.000 --> 05:48.800
into just one terabyte served by your architecture.

05:48.800 --> 05:51.680
So, that's a nice error, and it pays for itself.

05:55.360 --> 05:58.640
And, let me jump right to the chase,

05:58.640 --> 06:02.560
as regards the serving infrastructure. How many pods do you think

06:02.560 --> 06:09.840
we needed for 800 WordPresses and one terabyte of traffic per day?

06:09.840 --> 06:13.520
Well, the answer is right there, how many colors can you see here?

06:13.520 --> 06:16.720
Two, and they are not actually, you know,

06:16.720 --> 06:19.120
screaming; they are humming, as they should.

06:19.120 --> 06:22.000
We could actually make do with just one.

06:22.000 --> 06:27.760
if we had to, you know, but we have two for obvious, you know,

06:27.760 --> 06:32.560
resiliency reasons. So, how does that work?

06:32.560 --> 06:36.560
Well, first, we decided to change from Apache to a non-blocking

06:36.560 --> 06:41.760
web server, something that lets the heavy duty

06:41.760 --> 06:47.520
job of running WordPress happen somewhere else, in a PHP-FPM side,

06:47.520 --> 06:50.640
a sidecar, I would say, or separate container,

06:50.640 --> 06:54.160
and NGINX is able to serve hundreds of thousands of,

06:54.160 --> 06:56.480
or tens of thousands of simultaneous connections,

06:56.480 --> 07:00.400
out of a single process, which is more efficient.

07:00.400 --> 07:03.520
And, obviously, we had to turn our general purpose

07:03.520 --> 07:09.840
hosting approach into something in which the PHP code goes into the container image.

07:09.840 --> 07:14.240
Right, so it's harder to backdoor, and now you have a real

07:14.240 --> 07:17.600
FS cache with metadata interaction, meaning that you

07:17.600 --> 07:23.440
revalidate the cache of PHP in milliseconds.

07:23.440 --> 07:29.840
So, first thing that went away was our internal secondary cache,

07:29.840 --> 07:34.720
which was called Varnish, so out with it went a lot of our

07:34.720 --> 07:38.640
operational complexity, which is what we think.

07:38.640 --> 07:44.160
Now, NGINX has to be entrusted with making it all work.

07:44.240 --> 07:47.920
It knows about all sites, all URLs, and all variables,

07:47.920 --> 07:52.480
like, there are no two WordPresses that have the same username

07:52.480 --> 07:58.960
or password in the MySQL or, later, MariaDB databases, right?

07:58.960 --> 08:02.320
It has to serve, as I mentioned, on a first-line basis,

08:02.320 --> 08:05.280
like the small files or the things that are in NFS,

08:05.280 --> 08:09.440
because some of them stay there, like the uploads from people who want to publish,

08:09.440 --> 08:12.080
you know, images and movies.

08:12.160 --> 08:17.440
And the rest goes to PHP-FPM, and that means that every single

08:17.440 --> 08:19.680
WordPress query is started fresh.

08:19.680 --> 08:25.040
And I want to spend a little time exploring how this is even possible.

08:26.080 --> 08:30.560
PHP has this design oversight which turned into a killer feature.

08:30.560 --> 08:33.760
It forgets everything, every time.

08:34.960 --> 08:38.320
It starts from a blank slate: all constants, all variables,

08:38.320 --> 08:41.600
all state, all code needs to be reloaded,

08:41.600 --> 08:44.400
or at least pretend to, in a cache.

08:44.400 --> 08:48.240
I'm not exactly sure how this is made efficient, but it is.

08:48.240 --> 08:52.000
It involves something called shared memory and a Zend cache,

08:52.000 --> 08:56.160
from that Zend company, which is in the business of open source.

08:57.680 --> 09:02.080
We are effectively turning the entire WordPress into a function as a service.

09:03.040 --> 09:08.800
So I'm very tempted to claim that this would be very tricky to achieve

09:08.800 --> 09:11.520
in any other programming language than PHP.

09:14.400 --> 09:15.760
So we are a small team.

09:15.760 --> 09:19.760
We need to be operating at scale with all these hundreds of WordPresses.

09:21.040 --> 09:24.880
And therefore we wrote, and we offer, this is our flagship contribution,

09:24.880 --> 09:30.560
the WordPress operator, something that does the heavy lifting of operating

09:31.200 --> 09:35.760
all the scary Kubernetes objects, and NFS mkdirs, and the PHP code,

09:37.040 --> 09:42.400
all a breeze, all with Kopf, one of the Kubernetes

09:42.400 --> 09:47.600
operator frameworks for Python, and in a fully EPFL-agnostic way,

09:47.600 --> 09:49.520
meaning that you can start using it right now.

09:51.360 --> 09:53.760
I would like to take the time for a demo of this.

09:55.600 --> 09:59.440
So if you are connected to our cluster, sorry,

10:00.080 --> 10:02.880
this is the part where I get nervous and mess something up.

10:02.880 --> 10:05.760
Hopefully I don't. You can say,

10:05.760 --> 10:08.560
kubectl get wordpresses.

10:13.280 --> 10:18.160
I suppose so, better.

10:21.600 --> 10:23.920
So there's a number of them, but this is the test instance,

10:25.280 --> 10:28.320
but you could also do that in prod with another kubeconfig,

10:28.320 --> 10:31.920
and get all the 800 WordPresses that I mentioned.

10:33.760 --> 10:37.120
But there's someone in attendance who doesn't like me working on the production

10:37.120 --> 10:40.000
instance, so we're doing it on the test instance, if you will.

10:42.960 --> 10:48.000
So right now, you can see that I don't have anything up my sleeve,

10:48.720 --> 10:52.000
and there's a 404 at this URL.

10:52.080 --> 10:57.920
But we wrote this manager, this back-end app, to create

10:57.920 --> 11:03.920
WordPresses and delete them, and I took the liberty of squeezing a few

11:03.920 --> 11:08.080
seconds out of the demo by prefilling it, so that I only have to

11:08.080 --> 11:12.240
click Create here, and that should create a Kubernetes object,

11:13.920 --> 11:17.440
that is going to make the operator operate right there.

11:18.000 --> 11:23.440
So it's already creating, it has observed the event, create WordPress site,

11:24.240 --> 11:28.160
and it's doing things like creating a secret, a user,

11:28.160 --> 11:32.080
a grant; it looks to be delegating work to another operator, which is the

11:32.080 --> 11:38.320
MariaDB operator. Then some PHP runs, some of it has warnings,

11:38.320 --> 11:41.440
we could squash some of those warnings, some others we couldn't.

11:41.440 --> 11:43.440
Ah, too bad.

11:43.520 --> 11:54.480
And once the operator is done with the number of plugins to be installed, it says succeeded,

11:54.480 --> 11:59.840
well, that's great. If we go back to that 404 page right here,

12:00.560 --> 12:03.280
we can see that something new is coming.

12:14.080 --> 12:18.160
So did we automate ourselves out of a job? Well, sort of.

12:18.160 --> 12:21.680
We actually got to focus on the fun part, on the important part.

12:22.400 --> 12:28.000
So as you saw, we wrote a back-office, sorry, not back-end, back-office

12:28.000 --> 12:34.320
app that helps us manage the fleet, right? So you'd probably have to write that

12:34.320 --> 12:36.160
should you want to adopt the WordPress operator.

12:36.800 --> 12:45.680
And we fixed our wrong 22 power 22 problem, which was crashing production every day,

12:45.680 --> 12:50.480
by moving the cron workload out into its own dedicated set of pods.

12:50.480 --> 12:54.480
That too, you might want to look into, if you're out to host WordPress at scale.

12:56.560 --> 13:01.440
And we developed at some scale and would like to increase the scale.

13:01.440 --> 13:05.520
And that's a bit of the reason we're here. We would very much like

13:06.240 --> 13:11.120
to get in touch with you outside in the hallway, right after the presentation.

13:11.760 --> 13:15.840
And one way we did that for ourselves, for starters, was to write a development kit.

13:17.120 --> 13:21.680
If you type these three commands, sorry, two commands, one URL,

13:21.680 --> 13:25.440
in your browser, sorry, in your terminal, and then in your browser, right now,

13:25.440 --> 13:29.600
you get a fully working WordPress on your workstation, Mac or Linux.

13:30.400 --> 13:37.040
So is this really for immediate consumption? Well, it depends.

13:37.040 --> 13:42.240
We have this flagship contribution, which is the operator, and we do believe it's ready right now.

13:43.440 --> 13:48.640
Some of the things, like the integration with NGINX, are coupled with the way we do things,

13:48.640 --> 13:52.720
which might be even more controversial than NGINX itself.

13:52.720 --> 13:57.200
We use Ansible, but we don't use Argo CD; I hear that maybe we should.

13:58.000 --> 14:04.480
So we would need to work to provide this as an independent piece of infrastructure,

14:04.480 --> 14:08.320
and we would very much like to share this story,

14:08.320 --> 14:13.600
should we find someone here or on the internet who is interested in it.

14:14.240 --> 14:19.360
Obviously, if you run our WordPress, you'll have a red-and-white WordPress,

14:19.360 --> 14:21.760
which you don't want, so you'll have to rewrite some parts.

14:22.640 --> 14:24.880
The theme would be a prime candidate, and some plugins;

14:24.880 --> 14:28.720
it depends, like, you're probably not interested in the large menus of EPFL.

14:29.520 --> 14:34.320
And some of them, like the push gateway, the thing that lets us, on the basis of WP-Cron,

14:34.320 --> 14:40.560
extract metrics on an hourly basis, like, instead of every three minutes, like, I don't know,

14:40.560 --> 14:46.000
the number of pages per language, well, you can probably use that right away as well.

14:47.520 --> 14:48.400
Thank you very much.

14:52.640 --> 15:16.720
Thank you very much for this, so how would an update of the WordPress application propagate to all the instances?

15:17.440 --> 15:21.040
That's a great question. Should I repeat it or not?

15:22.000 --> 15:28.560
I think I'm happy. All right, so when we upgrade WordPress, most of the time it works,

15:29.280 --> 15:36.000
meaning that we just build another image, we actually have a homemade continuous integration system

15:36.000 --> 15:42.960
that relies on Tekton pipelines, and most of the time, the website that pops up in the morning,

15:42.960 --> 15:48.320
sorry, on the test instance, works; it doesn't have any so-called migrations, any changes

15:48.320 --> 15:54.720
required to the database to perform the update. When that is not the case, we obviously want to block

15:54.720 --> 15:59.680
the production, I'm sorry, block that image from going to production, which is fairly easy,

15:59.680 --> 16:04.640
because as we use Ansible, it just doesn't happen automatically. As a team leader,

16:04.640 --> 16:08.320
I want someone with both hands on the keyboard before production changes.

16:09.680 --> 16:15.440
And the upshot in that case is that we will have to run a campaign of WP-CLI commands that

16:15.440 --> 16:23.600
will run the update, and we will do that as part of a rollout that is organized on a special basis.

16:23.600 --> 16:28.320
On this part, we did not automate ourselves out of a job.

16:32.960 --> 16:40.160
Any other questions? Thank you. Thank you very much.

