'Is This Knowledge Of Illegal Activity?': Josh Hawley Torches Meta For 'Pirating' Books To Train AI

Forbes Breaking News

During a Senate Judiciary Committee hearing on Wednesday, Sen. Josh Hawley (R-MO) shows chat logs in which Meta employees allegedly expressed doubts about pirating materials to train artificial intelligence models.

Transcript

00:00Thank you very much, Professor. Thanks for being here. Thanks again to all of our witnesses.

00:03We're going to now have seven-minute rounds of questioning, and we'll see if we can fit in

00:10maybe a couple of rounds, just depending on the time that we have. I'll start, and then we'll go

00:13to the ranking member and any other members who arrive in that time. Professor Viswanathan,

00:19let me just start with you, if I could, and let's see if we can just drill down on some of the

00:22specifics here. Mr. Baldacci mentioned in his opening statement that AI could just feed dictionaries

00:29into their platforms in order to train them. They don't do that. They prefer published works,

00:34fully formed works. Why is that? Can you give us an insight into that? That's absolutely right.

00:39They learn syntax, structure. They learn how we learn language, right? When you learn language,

00:45you just don't learn words. You don't memorize words. You don't memorize notes when you learn music.

00:49You learn structure and syntax, and the point that Professor Lee is making is correct. They need

00:55large data sets. More is better to learn predictive language models. However, more is not everything.

01:06It's not pirated works. So let me just ask this. You said that they are not buying the books. They're

01:14not buying Mr. Baldacci's book or anybody's book who's sitting up here, anybody in the audience.

01:18They're getting them. They're stealing them. They're pirating them from somewhere. If they're not

01:23buying the books, they're not stealing them out of libraries, where are they getting them?

01:27These large repositories of materials that are available online, there are many. Some are licit,

01:33some are not licit. The pirate websites in particular are not licit. So if you need a lot

01:38of material, you go out and you scoop up all that material that you can find. But you don't go to pirate

01:44websites to get that material if what you want to do is legal. None of these works are licensed.

01:49None of these works are licensed. No author has been compensated to date.

01:56So how do they go to these, let's call them shadow libraries, to get the works illegally? They've

02:03already, by the time they go to the shadow library, the works there are already stolen, right? They've

02:07already stolen Mr. Baldacci's book, Professor Lee's book, everybody's, your books. They've stolen them.

02:12How do they actually, when they go to the shadow library, how do they get them? I mean,

02:16how does the AI company then take possession of the particular work?

02:21There's a process called torrenting, and I will not trouble you all with the details of torrenting,

02:26but essentially huge amounts of data stream to you and you get them. At the same time,

02:32you can send them out. That's called seeding. You can send them out at the same time.

02:36Uploading and downloading exist at the same time. This is a peer-to-peer process. So not only are you

02:42taking in these pirated materials, you are also distributing them. The violation of copyright law

02:48exists at the reproduction of these works, at the making available of them by the pirate libraries,

02:54the dissemination of them, and your dissemination, Gen AI company, of them as well.

03:00So they're both taking the works and distributing them as well in this thing you call it, kind of like

03:05Napster, this thing that you call torrenting. Let me ask you this. I mean, that's not,

03:09is torrenting legal? That's not legal, is it? Torrenting can be legal, but in this case it is

03:13not. And in this particular case, this is benefiting the torrent. Now, I agree with Judge Alsup,

03:21who said, if you're taking it from pirate libraries, no way. That is not acceptable, right? Part of what

we're seeing here, Judge Chhabria said, well, it's not helping the pirate websites. Well, yes, it is.

03:33The pirate websites, there's one in particular called Anna's Archive. They actually put on their

03:38website, hey, Gen AI companies, come train on us. We'll do some data swaps, or you know what? You

03:44can make us a donation, too. This is directly helping the pirate websites thrive, flourish,

03:50proliferate. Let me ask you this. Have there been any, to your knowledge, any criminal enforcements

03:54against these torrenting platforms? Yes, there have been attempts to. Again, it's like a game of

03:59whack-a-mole. You get one, you knock it down, it pops up again in some jurisdiction that you don't

04:04have control over. What's the key to a criminal enforcement? You know, civil versus criminal in

04:09this context, when do we have a criminal case against torrenting? What's the key to that?

04:15Okay, this is a really important point. What's criminal here? Criminal copyright liability has

04:21two prongs to it. Prong one is you have to do it willfully, and prong two is you have to do it for

04:27commercial advantage or gain. We clearly know that prong two is met. This is for commercial advantage or

04:31gain. I don't think Meta is doing this out of the goodness of its heart. Prong one, willful means

04:36you need to know that what you were doing is illegal. There's lots and lots of evidence now,

particularly from the Kadrey v. Meta case, that shows that they knew this was illegal. They even had to

04:48ask all the way up the chain of command to Mark Zuckerberg and say, hey, is this okay? And he said,

04:52yes, it did. Yes, it's okay. So not only did he do it knowing it was illegal, he did it knowingly,

he did it willfully, intentionally, and whether or not he knew what statute made it illegal doesn't

05:02matter. For this to be willful, you have to know that what you're doing is wrong, and this meets

05:07that prong. So this is, in fact, amounting to what you might call criminal copyright liability.

05:13Mr. Pritt, let me just ask you about this, about the willful aspect, and let's talk about Meta in

particular, since Professor Viswanathan just mentioned Meta. They're one of the biggest monopolists

05:23in the world, and one of the biggest AI companies now in the world, if not the biggest. So let's just

talk about them for a second. Meta uses torrents to acquire pirated data for its Llama model. Is that

05:35right? Correct. How much data would you estimate that Meta has torrented? It's illegally downloaded

05:44and also then shared in this peer-to-peer scheme. It has pirated well over 200 terabytes

05:52of copyrighted material from multiple, I don't call them shadow libraries because they're not

05:59libraries, but illicit criminal enterprises. And how much has it paid the copyright holders for

06:06these works that it's used, to your knowledge? Nothing. If nothing, zero. So billions of works,

06:14billions of books like Mr. Baldacci's, zero payment. If Meta were to pay, do you have any idea what the

06:21cost might be? I mean, did they ever, to your knowledge and your discovery, did they ever explore

06:25paying? I mean, is there any sense of how much this might have cost them? Early on, they explored

06:31licensing. They assigned two individuals part-time to attempt to license, and they decided it would take

06:40too long, for example. And that's when they turned to piracy. At the time, they had, public documents show

06:48that certainly tens of millions, if not hundreds of millions, had been contemplated for licensing at

06:55that time. Okay, so let's just think about this. Hundreds of millions of dollars, that's the value,

07:00maybe sort of the base, the bare value of the works that they've used, like the works that you all have

07:06written on this panel. Hundreds of millions, and they paid zero of that. So let's just drill down a little

07:12further. Did Meta know what they were doing was wrong? Do you, do you, Mr. Pritt, believe in the

07:18evidence you've seen, that there's any evidence to suggest that Meta's employees knew what they were

07:23doing is illegal? The documents that have become public clearly show that. Let's just look at a few

07:27of these documents. I'm going to show you a few things, and I'll ask you to help me interpret them

07:31to make sure that we get them right. Let's start here with a Meta employee, a Meta engineer working

on their AI project, Eleonora Presani. She says, I don't think we should use pirated material. This

07:42is in a chat with other Meta employees. I don't think we should use pirated material. I really need

07:47to draw a line there. She goes on, I feel that using pirated material should be beyond our ethical

threshold. Sci-Hub, ResearchGate, LibGen are basically like Pirate Bay or something like that.

07:57They are distributing content that is protected by copyright, and they're infringing it. How do you

08:03read this, Mr. Pritt? Does this look like knowledge to you?

08:06That's certainly what we've argued in the case. Let's look at another Meta employee.

Here is Nisha Deo in the same chat. She replies and says, it's the piracy and us knowing and being

accomplices that's the issue. This is a Meta engineer working on their AI project. It's the piracy and us

08:32being knowing accomplices. That's the issue. Let's look at another one.

08:41Here is the response that another Meta engineer in the same chat gave. Well, we want to buy books and be

08:50nice, open people here, but however, to make it happen and not letting the bad guys win, that's the

08:57beat China argument. We need to make a case fast and cut some corners here and there. We need to

09:04cut some corners here and there. Mr. Pritt, what are we looking at here? I mean, is this knowledge of

09:09illegal activity? When they refer to bad guys, I think they're actually referring to OpenAI and other

09:14AI competitors. But yes, this is certainly one of the many documents that show that they knew these

09:22were pirated websites that contained copyrighted materials and they were taking them for free.

09:26So here we have it in black and white. Don't believe me. Read the evidence. These are Meta's

09:33own engineers, Meta's own employees saying they know what they're doing is ethically wrong, illegal,

09:41likely to subject them to legal liability, and they're doing it anyway because they need the money.

09:47There's a lot more here. We'll come back to this. I want to give Senator Durbin a chance to ask questions.

09:53Senator.
