WEBVTT

00:00.000 --> 00:14.400
I had no idea I'd really be doing something like a keynote talk.

00:14.400 --> 00:22.480
I'd like to talk about clarity, licensing, and metadata for Nix packages

00:22.480 --> 00:24.640
and why it's a problem.

00:24.640 --> 00:29.640
I maintain a bunch of tools in this space that some of you may know about, including something called

00:29.640 --> 00:32.640
ScanCode.

00:32.640 --> 00:37.080
I dabble a bit in standards, like something called Package URL, which is used

00:37.080 --> 00:42.920
to identify packages in SBOMs (and standards, we need more of them). I'm a co-founder of

00:42.920 --> 00:49.880
SPDX, and I'm also a core contributor to CycloneDX.

00:49.880 --> 00:57.880
So I'll try to avoid making the room run late, since we're starting late, and I'll try to finish

00:57.880 --> 00:58.880
early.

00:58.880 --> 01:07.560
The problem is that Nix package metadata, licenses in particular, is a bit of a mess.

01:07.560 --> 01:13.560
It's not your fault, or not entirely your fault, maybe a bit, but it's also because it's hard,

01:13.560 --> 01:20.560
because upstream is also sometimes pretty brain damaged.

01:20.560 --> 01:29.160
If you look at, say, this license on this page, and you

01:29.160 --> 01:37.000
squint, it says MIT, but it also says, you know, permission is not granted, right?

01:37.000 --> 01:44.000
That's some funky stuff, and I have a whole collection of these; I call these fubo licenses,

01:44.000 --> 01:47.000
and this is just a small slice of it.

01:47.000 --> 01:53.200
It's easy for us to be fooled by that, because, you know, it says MIT, and in some cases even

01:53.200 --> 01:57.200
GitHub says MIT, and it looks legit.

01:57.200 --> 02:02.440
So that's the problem, and it's not entirely our fault: it's a problem of the tools,

02:02.440 --> 02:06.960
and it's a problem of upstream also, which is sometimes pretty brain damaged.

02:06.960 --> 02:12.800
In the end, fixing this is useful: each time you have a license issue, you're putting a bit of

02:12.800 --> 02:19.800
friction in using packages, and licenses are somewhat important: open source exists because

02:19.800 --> 02:23.120
of packages being under an open source license, right?

02:23.120 --> 02:27.520
You remove the open source license, there's no open source anymore.

02:27.520 --> 02:36.520
So it's useful for things like emerging regulations such as the CRA, and really for

02:36.520 --> 02:42.560
any kind of responsible reuse: whether it's in a corporation, in an organization, or in

02:42.560 --> 02:46.840
an open source project, you want to know about the license.

02:46.840 --> 02:49.040
So what can we do?

02:49.040 --> 02:50.480
What does it take to fix the problem?

02:50.480 --> 02:58.520
So we've started a small project called Nix Clarity, which is supported by a program funded

02:58.520 --> 03:06.480
by the European Union called Fediversity, and the goal is to help fix the mess.

03:06.480 --> 03:11.040
Hopefully also help upstream, because the great thing about Nix is that a lot of the code

03:11.040 --> 03:16.240
is pristine upstream, and ideally, we don't want to fix it for Nix.

03:16.240 --> 03:21.920
We want to fix it for everyone, so in a thousand years from now, open source license clarity

03:21.920 --> 03:26.080
will no longer be a problem.

03:26.080 --> 03:31.040
So the plan is to use Package URL to help standardize the case where

03:31.040 --> 03:37.760
we have vendored code, which may not have been externalized or could be copy-pasted, automate the detection

03:37.760 --> 03:45.600
with open source tools, and eventually deploy that on or for Nix; that's not up to me to decide,

03:45.600 --> 03:52.160
but ideally, whenever you publish a package on Nixpkgs, you get feedback.

03:52.160 --> 03:57.340
If the foundation or the community wants to block on funky, weird, missing, or proprietary

03:57.340 --> 04:02.420
licenses, more power to you as a group; that's not for me to decide.

04:02.420 --> 04:10.540
So, a quick word about purls: have any of you heard about Package URLs?

04:10.540 --> 04:12.540
Good, so I need to talk a bit about it.

04:12.540 --> 04:21.780
It's a small string; it's a standard to identify your package in an SBOM or elsewhere.

04:21.780 --> 04:27.460
It's useful and you need to know about it because it's been merged into the CVE schema for

04:27.460 --> 04:38.380
vulnerabilities in late October, so eventually you can go from a scan of a codebase straight to a

04:38.380 --> 04:41.500
lookup in the vulnerability database without much friction.

04:41.500 --> 04:46.020
So that's really useful for that.
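
NOTE
Editor's sketch, not from the talk: a minimal illustration of how a Package URL string breaks down into parts, assuming the general pkg:type/namespace/name@version?qualifiers#subpath layout. The helper name parse_purl is hypothetical, validation is omitted, and no Nix-specific purl type is assumed because the talk describes that work as ongoing.
```python
from urllib.parse import parse_qsl, unquote
def parse_purl(purl):
    """Split a Package URL into components (simplified sketch, no validation)."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError("a Package URL starts with 'pkg:'")
    # Peel the optional '#subpath' and '?qualifiers' off the right-hand side.
    rest, _, subpath = rest.partition("#")
    rest, _, qualifiers = rest.partition("?")
    # The version, when present, follows the last '@'.
    path, at, version = rest.rpartition("@")
    if not at:
        path, version = rest, ""
    # What remains is 'type/namespace/name'; the namespace is optional.
    segments = path.strip("/").split("/")
    if len(segments) < 2:
        raise ValueError("a Package URL needs at least a type and a name")
    return {
        "type": segments[0].lower(),
        "namespace": unquote("/".join(segments[1:-1])),
        "name": unquote(segments[-1]),
        "version": unquote(version),
        "qualifiers": dict(parse_qsl(qualifiers)),
        "subpath": unquote(subpath),
    }
print(parse_purl("pkg:pypi/django@1.11.1"))
print(parse_purl("pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?packaging=sources"))
```
Because the same short string carries type, name, and version, a scanner and a vulnerability database can be joined on it directly, which is the lookup the speaker describes.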

04:46.020 --> 04:50.220
Somebody has to tell me if I'm running late, because I can go on about that for hours. It was

04:50.220 --> 04:58.460
recently, on the 6th of December, standardized as an Ecma standard, and it's on its way to

04:58.460 --> 05:02.980
being an ISO standard, which is really interesting for a small string like that.

05:02.980 --> 05:10.820
But you know, it's important for companies to also have this standard, to ensure these are

05:10.820 --> 05:13.900
stable and usable.

05:13.900 --> 05:19.740
We are working, by the way, specifically on supporting Package URLs for Nix packages, which

05:19.740 --> 05:25.180
has specific challenges because it's Nix, and because we have derivations and a lot of

05:25.180 --> 05:30.420
hashes to deal with.

05:30.420 --> 05:37.860
It's used a bit everywhere: most if not all of the tools that scan for origin, license, and

05:37.860 --> 05:46.700
vulnerabilities use Package URL, so the standardization basically came after adoption.

05:46.820 --> 05:56.140
Databases and CVEs are using it, as are GitHub, all these companies, and most open source foundations.

05:56.140 --> 06:03.740
So it's hopefully useful, it's not perfect, I like to say it's less bad than other approaches

06:03.740 --> 06:07.380
that have existed before, so that's just a small plug for it.

06:07.380 --> 06:11.580
So now what are we going to do for Nix?

06:11.620 --> 06:16.380
Identifying vendored or copied code is one thing, and we also want to curate and validate

06:16.380 --> 06:17.820
all the licenses.

06:17.820 --> 06:33.580
You may not know how licenses are actually tracked in Nix itself, but say

06:33.580 --> 06:40.260
you search like that, I had a page up and, well, there's a big, oh, there you go,

06:40.340 --> 06:42.340
it's at the top.

06:42.340 --> 06:47.260
You have a big Nix file that lists each and every possible license that could be

06:47.260 --> 06:48.660
used in a package.

06:48.660 --> 06:55.900
It doesn't scale because you can have something like the Linux kernel which may not be

06:55.900 --> 07:01.980
strictly available as a Nix package, but you may have a package that has a combination

07:01.980 --> 07:05.940
of licenses, license exceptions, these kinds of things.

07:05.940 --> 07:09.940
In the current model, you would have to expand this file forever to list all the possible

07:09.940 --> 07:10.940
combinations.

07:10.940 --> 07:15.940
So that's one problem.

07:15.940 --> 07:19.740
So we want to validate all the licenses, but there's probably going to be work to do

07:19.740 --> 07:27.340
specifically to support license expressions in the Nix language.

07:27.340 --> 07:31.900
And I don't know how to do that. I know how to write Python, I can be dangerous

07:31.900 --> 07:42.420
in C and a bit of C++, but I don't know Nix well enough to do something there.

07:42.420 --> 07:47.540
The problem at scale is also that it's very large: we're talking about tens and tens of

07:47.540 --> 07:52.300
terabytes of code.

07:52.300 --> 07:56.860
We want to clarify the licenses, using ScanCode and MatchCode to do things. ScanCode

07:56.860 --> 07:57.860
detects licenses.

07:57.860 --> 08:05.140
It's basically a stupid, dumb diff between many license texts and the license mentions

08:05.140 --> 08:11.100
in the code, except it's not stupid when you need to do that billions and billions of

08:11.100 --> 08:16.740
times, so there's a few tricks to make it fast.
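
NOTE
Editor's sketch, not from the talk: one way to picture a diff between a reference license text and text found in code, using Python's standard difflib on word tokens. This is a stand-in for illustration and is not ScanCode's implementation; the helper names words and similarity are hypothetical.
```python
import difflib
import re
def words(text):
    """Lowercase word tokens, so formatting and punctuation are ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())
def similarity(reference, candidate):
    """Ratio between 0 and 1 of matching words between two texts."""
    matcher = difflib.SequenceMatcher(None, words(reference), words(candidate), autojunk=False)
    return matcher.ratio()
REFERENCE = "Permission is hereby granted, free of charge, to any person obtaining a copy"
# Same words with different line breaks and indentation.
REFLOWED = "Permission is hereby granted,\n   free of charge, to any person\n   obtaining a copy"
# One word changed, which reverses the meaning of the grant.
ALTERED = "Permission is not granted, free of charge, to any person obtaining a copy"
print(similarity(REFERENCE, REFLOWED))
print(similarity(REFERENCE, ALTERED))
```
A single changed word still leaves a high similarity score, which is why a text that looks like MIT at a glance needs an exact comparison before it is reported as MIT.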

08:16.740 --> 08:23.740
It always requires new licenses; especially with AI, there's a lot of very bad innovation

08:23.740 --> 08:29.500
going on: everybody wants to invent great new licenses, which are almost open source, but

08:29.500 --> 08:30.700
not exactly.

08:30.700 --> 08:38.940
You've seen the cases: just about every week we find a new license, typically from

08:38.940 --> 08:44.500
AI-related projects, which are really problematic.

08:44.500 --> 08:49.860
So being able to detect the license is one thing; the other thing is being able to detect

08:49.860 --> 08:53.780
vendored or copied code. That's where MatchCode comes in; it's basically a database of

08:53.780 --> 08:55.780
hashes.

08:55.780 --> 09:02.580
Again, it's very simple in essence, but it can be a bit tricky if you're trying to efficiently find

09:02.580 --> 09:08.620
matches to code that was copied and has been modified.
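
NOTE
Editor's sketch, not from the talk: the simplest form of a database of hashes, mapping a file's SHA-1 digest to the packages known to contain that exact content. Package names, paths, and helper names are hypothetical, and this is not MatchCode's implementation. Exact hashing only finds unmodified copies; matching modified copies needs approximate techniques that are not shown here.
```python
import hashlib
def fingerprint(content):
    """SHA-1 hex digest of a file's bytes."""
    return hashlib.sha1(content).hexdigest()
def build_index(known_files):
    """Map each fingerprint to the (package, path) origins with that exact content."""
    index = {}
    for package, path, content in known_files:
        index.setdefault(fingerprint(content), []).append((package, path))
    return index
def find_origins(index, content):
    """Return the known origins of a file, or an empty list if it is unknown."""
    return index.get(fingerprint(content), [])
# Hypothetical reference data: files from an already-identified upstream package.
KNOWN = [
    ("example-crypto@1.0", "src/aes.c", b"int aes_encrypt(void) { return 0; }\n"),
    ("example-crypto@1.0", "src/sha.c", b"int sha_update(void) { return 0; }\n"),
]
INDEX = build_index(KNOWN)
# A file found under a vendor directory of some other package being scanned.
vendored = b"int aes_encrypt(void) { return 0; }\n"
print(find_origins(INDEX, vendored))
print(find_origins(INDEX, vendored + b"// local patch\n"))
```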

09:08.620 --> 09:14.700
There are a few tricks for that also. And the problem here, when we're talking about generative

09:14.700 --> 09:19.820
AI: I have a tool and I can prove that,

09:19.820 --> 09:29.340
scientifically, about 20% of the time you can easily prompt LLMs to actually

09:29.340 --> 09:34.900
spit out verbatim copies of the source code that was used to train them.

09:34.900 --> 09:41.420
So AI companies collectively used our source code to train their models.

09:41.420 --> 09:45.820
A model is essentially an index that can memorize and spit out everything. Whenever you have

09:45.860 --> 09:51.140
that, it's a great copyright infringement machine: it's fast, and of course there's no

09:51.140 --> 09:56.820
attribution, no notice that comes with it.

09:56.820 --> 10:03.500
We're going to have a lot of bugs to fix; again, Nix is big, Nixpkgs is big. We have

10:03.500 --> 10:11.620
some not-secret, open source machine learning tools to spot incorrect license

10:11.660 --> 10:19.300
detections, to help fix the bugs, and we're also going to need a way

10:19.300 --> 10:26.340
to avoid being a drain on package maintainers at Nix and eventually upstream.

10:26.340 --> 10:32.860
I don't know any solution other than doing it one at a time with humans in the loop. I hate

10:32.860 --> 10:41.340
bots, that's not the way, which means we cannot fix problems upstream without

10:41.340 --> 10:46.140
massively involving the community.

10:46.140 --> 10:50.620
So we talked about license expressions for Nix. Right now the solution for licenses

10:50.620 --> 10:56.180
in Nix doesn't scale: if you really want to have curated licenses at the scale of the

10:56.180 --> 11:02.500
current Nixpkgs, the table is going to have to be 100 times bigger, which is going to

11:02.500 --> 11:09.300
be a challenge even for pull request merging and this kind of thing.

11:09.300 --> 11:17.820
It's SPDX, not SPX, sorry. We need a way to manage and deal with SPDX expressions, which

11:17.820 --> 11:25.380
are combinations, where you just keep the individual licenses and you can say, oh, this is

11:25.380 --> 11:32.820
GPL and BSD, as opposed to, say, here's the BSD, here's the GPL. Today the approach in Nix is

11:32.820 --> 11:40.900
to store GPL, BSD, GPL and BSD, GPL or BSD, GPL and MIT, and all these combinations, which, again, cannot

11:40.900 --> 11:41.900
scale.

11:41.900 --> 11:49.420
I have Python code for that; we need help to bring that to fruition in Nix.
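
NOTE
Editor's sketch, not the speaker's Python code: a minimal illustration of why keeping a table of individual licenses and composing them with AND, OR, and WITH scales better than listing every combination. The identifiers are real SPDX ones except Made-Up-1.0, which is a deliberately invalid example; the helper names are hypothetical and the tokenizer ignores grouping with parentheses.
```python
import re
from itertools import combinations
# A small sample of individual SPDX identifiers; a real table would hold the full list.
LICENSES = {"MIT", "BSD-3-Clause", "GPL-2.0-only", "Apache-2.0"}
EXCEPTIONS = {"Classpath-exception-2.0"}
OPERATORS = {"AND", "OR", "WITH"}
def license_keys(expression):
    """Return the license and exception identifiers used in an expression."""
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    return {token for token in tokens if token not in OPERATORS}
def unknown_keys(expression):
    """Identifiers that are missing from the table of individual licenses."""
    return license_keys(expression) - LICENSES - EXCEPTIONS
# Listing every two-license AND/OR combination up front grows quadratically.
pair_entries = 2 * len(list(combinations(sorted(LICENSES), 2)))
print(pair_entries)
print(sorted(license_keys("(GPL-2.0-only WITH Classpath-exception-2.0) OR MIT")))
print(unknown_keys("MIT AND Made-Up-1.0"))
```
With expressions, the table only grows when a genuinely new license appears, and any combination can still be validated against it.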

11:49.420 --> 11:55.460
We also need to find a way, when you have packages containing other packages, to make sure we

11:55.460 --> 12:01.740
have correct origins. It goes a bit beyond licenses: say you have, in a Rust

12:01.740 --> 12:10.740
crate, a vendored copy of OpenSSL, and you only know about the top-level Rust package,

12:10.740 --> 12:14.580
and you forgot that you have a vulnerable copy of OpenSSL: that's a problem.

12:14.580 --> 12:19.380
This may go undetected when you actually build nix packages, because you may not

12:19.420 --> 12:24.500
devendor the OpenSSL copy systematically, and the OpenSSL copy may or may not have

12:24.500 --> 12:29.340
been patched, for all you know, and that's a problem. So it's really interesting and

12:29.340 --> 12:33.700
important, not only for licenses, but also for security. Eventually it's also an issue

12:33.700 --> 12:36.700
for upstream, right?

12:36.700 --> 12:44.220
We want to do that on as big a slice as possible of the Nix packages, and rinse

12:44.260 --> 12:50.780
and repeat. Again, the goal is not for us to own that, but to empower the package

12:50.780 --> 12:55.140
maintainers to have this information, and for the Nix community and the Nix

12:55.140 --> 13:00.180
foundation to own this stuff, if possible.

13:00.180 --> 13:02.620
And in the end, we use that for other ecosystems.

13:02.620 --> 13:10.300
We have a running project in this space with the maintainers of Log4j.

13:10.340 --> 13:14.780
You cannot make a presentation on security issues, and open source without talking about

13:14.780 --> 13:22.060
Log4j and Log4Shell, and what we're doing is working together to fix the problem in

13:22.060 --> 13:27.380
Maven, to find hidden copies of Log4j.

13:27.380 --> 13:31.860
We're doing something on Rust also. Debian is in need of love, and suffers a lot from these

13:31.940 --> 13:35.940
licensing issues, when a package is uploaded in a...

13:42.740 --> 13:51.060
They lose track, and the metadata usually drifts, so if we can do good for Nix

13:51.060 --> 13:55.140
and for Debian, that would be awesome.

13:55.140 --> 13:56.740
That's it.

13:56.820 --> 14:03.140
We are, as I said, a public-benefit nonprofit foundation based in Brussels, recently formed.

14:03.140 --> 14:10.420
We've been blessed to receive support from the German government, large US corporations,

14:10.420 --> 14:16.420
and a lot from the European Union through the NGI program, in particular with NLnet.

14:16.500 --> 14:27.540
We live and survive as a charity, and I just want to say that we're trying to help,

14:27.540 --> 14:31.940
so if you have questions, and that's pretty much it.

14:39.780 --> 14:40.740
Any questions?

14:44.420 --> 14:45.060
Yes, go ahead.

14:46.580 --> 14:58.740
So, the question is, how do you, how do you handle the less common licenses?

15:03.140 --> 15:08.740
So, the detection in ScanCode, as I said, is stupid.

15:09.700 --> 15:16.340
You take all the variations, the known variations of MIT, which are bona fide MIT,

15:16.340 --> 15:21.540
the original correct license text, that's agreed upon, because there's no real

15:21.540 --> 15:29.860
original version for MIT, and all the known bad variations, and you're trying to match that exactly,

15:29.860 --> 15:34.260
using a diff. When I said using a diff, we actually use a modified version of diff,

15:34.340 --> 15:42.580
which works on big vectors, but it doesn't matter. If there's a variation, like a one-word difference,

15:42.580 --> 15:49.140
it'll say there's no exact match. As a matter of fact, we have also indexed these

15:49.860 --> 15:55.380
fubo licenses, so we eventually detect them as proprietary licenses, but it's really string matching

15:55.620 --> 16:04.260
at the word level, and the goal is to have as big a database of examples of licenses as possible,

16:04.260 --> 16:12.500
at the moment about 40,000, and the more we have of these examples of bad and good licenses,

16:12.500 --> 16:20.180
the faster we are at detecting, because we build automatons, and we use a search algorithm

16:20.180 --> 16:27.460
called Aho-Corasick, which is typically used for detecting viruses. And in most cases, we get very

16:27.460 --> 16:34.100
efficient, fast, exact detection, detecting the exact sequence of words, ignoring formatting,

16:34.100 --> 16:41.780
and all that kind of stuff. So it's really string matching on a large scale. The one thing to understand

16:41.860 --> 16:51.540
is that, say, given an index like Google's, a query, ignoring the new AI mode, was limited to 32

16:51.540 --> 16:59.300
words, and the index is petabytes. In our case, a codebase or large codebase could be a

16:59.300 --> 17:09.700
couple of gigabytes, but the codebase is the query. The index is 10, 20, 30 megabytes; the codebase is

17:09.780 --> 17:15.780
the query. It's an interesting and less common problem: how do you deal with gigabyte-size queries

17:15.780 --> 17:22.340
on a small index? I'm going to see, hopefully people can hear me with this microphone,

17:22.980 --> 17:26.180
I apologize. No, that's okay, that's the sign to tell me it's over.
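
NOTE
Editor's sketch for the answer above, not ScanCode's implementation: a small word-level Aho-Corasick automaton built from a couple of license text fragments, then run once over the scanned text so that the text plays the role of the query. Rule names and function names are hypothetical.
```python
import re
from collections import deque
def words(text):
    """Lowercase word tokens, so formatting and punctuation are ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())
def build_automaton(rules):
    """Build a word-level Aho-Corasick automaton from {rule_name: rule_text}."""
    goto, fail, out = [{}], [0], [[]]
    for name, text in rules.items():
        state = 0
        for word in words(text):
            if word not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][word] = len(goto) - 1
            state = goto[state][word]
        out[state].append(name)
    # Breadth-first pass: set failure links and inherit the outputs they lead to.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for word, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and word not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(word, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out
def scan(automaton, text):
    """Return (rule_name, index_of_last_matched_word) for every exact match."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for position, word in enumerate(words(text)):
        while state and word not in goto[state]:
            state = fail[state]
        state = goto[state].get(word, 0)
        matches.extend((name, position) for name in out[state])
    return matches
RULES = {
    "mit-grant": "Permission is hereby granted, free of charge",
    "no-grant": "Permission is not granted",
}
AUTOMATON = build_automaton(RULES)
print(scan(AUTOMATON, "/* Permission is hereby\n * granted, free of charge, to any person */"))
print(scan(AUTOMATON, "Permission is NOT granted"))
```
The automaton is built once from the small index of rules, and each scanned word is then handled with a few dictionary lookups, which is what makes a very large query against a small index practical.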

17:27.860 --> 17:31.460
Philip, I'm going to thank you very much. We can take one more question.

17:33.700 --> 17:37.700
I apologize for not introducing you ahead of the talk, we're still figuring out the microphone

17:37.700 --> 17:44.580
situation a bit. So we're going to see if using the mic to ask questions that way

17:44.580 --> 17:47.860
will work, so you don't have to repeat the questions. Are there any further questions?

17:48.900 --> 17:50.500
Please raise your hand if you have a question.

17:52.420 --> 17:56.100
Somebody can think of a question, right? Can be interesting. Okay, so go ahead then.

17:56.100 --> 18:11.060
No, I didn't hear the question, sorry. Sorry, I did lose my phone coming. Can you hear me?

18:13.060 --> 18:21.380
Is it working? Yes. I'll shout, okay. I was wondering, is there one specific place for working

18:21.380 --> 18:25.780
on this project? Because there's a lot of links on the board. Yes, that's a good point.

18:25.780 --> 18:33.460
So there's one specific place, which is a board of issues we track in the AboutCode organization.

18:33.460 --> 18:39.700
I'll make sure I put the link on the FOSDEM talk page. Thank you for that. I forgot about it.

18:41.540 --> 18:45.380
All right, thank you very much. Thank you.

18:51.380 --> 18:53.380
Thank you.

