What If We Held ChatGPT to the Same Standard as Claudine Gay?

The problem with generative AI is plagiarism, not copyright.

If you squint and tilt your head, you can see some similarities in the blurry shapes that are Harvard and OpenAI. Each is a leading institution for building minds, whether real or artificial—Harvard educates smart humans, while OpenAI engineers smart machines—and each has been forced in recent days to stare down a common allegation. Namely, that they are represented by intellectual thieves.

Last month, the conservative activist Christopher Rufo and the journalist Christopher Brunet accused then–Harvard President Claudine Gay of having copied short passages without attribution in her dissertation. Gay later admitted to “instances in my academic writings where some material duplicated other scholars’ language, without proper attribution,” for which she requested corrections. Some two weeks later, The New York Times sued Microsoft and OpenAI, alleging that the companies’ chatbots violated copyright law by using human writing to train generative-AI models without the newsroom’s permission.

The two cases share common ground, yet many of the responses to them could not be more different. Typical academic standards for plagiarism, including Harvard’s, deem unattributed paraphrasing or lackluster citations a grave offense, and Gay—still dealing with the fallout from her widely criticized congressional testimony and a wave of racist comments—eventually resigned from her position. (I should note that I graduated from Harvard, before Gay became president of the university.) Meanwhile, the Times’ and similar lawsuits, many legal experts say, are likely to fail, because the legal standard for copyright infringement generally permits using protected texts for “transformative” purposes that are substantially new. Perhaps that includes training AI models, which work by ingesting huge amounts of written text and reproducing its patterns, content, and information. AI companies have acknowledged, and defended, using human work to train their programs. (OpenAI has said the Times’ case is “without merit.” Microsoft did not immediately respond to a request for comment.)

Read: Artists are losing the war against AI

There is a difference, obviously, between a prominent university leader and a prominent chatbot. But the overlap between the two situations is meaningful, demanding clarity on what constitutes stealing, proper credit, and integrity. While they provide useful heuristics for judging academic work and generative AI, neither plagiarism nor copyright is an intrinsic standard—both are shortcuts for adjudicating originality. Considering the two together reveals that, beneath the political motives and slighted egos, the real debate is over the degree of transparency and honesty that society expects from powerful people and institutions, and how to hold them accountable.

There is some cognitive dissonance at play between the controversies. The most prominent people chastising Gay for scholarly plagiarism—which Harvard defines as drawing “any idea or any language from someone else without adequately crediting that source”—have not declared war against generative AI’s idea-harvesting. One of Gay’s harshest critics, the billionaire Bill Ackman, recently said that “AI is the ultimate plagiarist.” But he also made a substantial investment in Alphabet last year—because, Ackman said at the time, he believes the company will be a “dominant player” in the field, partially due to its “enormous amounts of access” to customer data that he suggested could be used, legally, as AI training material. Brunet, who helped bring forth the initial plagiarism accusations against Gay, uses ChatGPT-written summaries of his own work with zeal. (Neither Ackman nor Brunet responded to requests for comment.)

For his part, Rufo, the conservative activist who helped spearhead the campaign to remove Gay, has taken issue with generative AI, although his complaints are mired in the culture wars—that the technology is becoming too “woke.” Reached via email, Rufo did not comment on the notion that AI is stealing intellectual property, and said only that “there is an important commonality between Claudine Gay and ChatGPT: neither are reliable sources for academic work.”

At the same time, Gay’s defenders have argued that the faults in her work amount to neglect and sloppy citations, not malice or fraud, and suggested that common standards for plagiarism should be updated with some of the leniency of copyright law. Some of her advocates are among the fiercest critics calling generative AI theft.

Regardless of your position, the debate over Gay’s resignation is about values, not actions—not about whether Gay reused materials without attribution, but about how consequential doing so was. It is a debate over the definition and punishment of varying degrees of theft. Even if a court rules that training an AI model on a book without the author’s permission is “transformative,” that doesn’t negate that the model was trained on a book without the author’s permission, and that the model could automate book-writing altogether. Perhaps, instead of framing the battle between artists and chatbots around copyright, it is time to apply Harvard’s plagiarism standard to generative AI.

Read: These 183,000 books are fueling the biggest fight in publishing and tech

The very same accusations leveled against Gay, if applied to ChatGPT or any other large language model, would almost certainly find the technology guilty of mind-boggling levels of plagiarism. As the NYU law professor Christopher Sprigman recently noted, “Copyright leaves us free to copy facts and even bits of expression necessary to accurately report facts,” because sharing facts and context benefits the public. Anti-plagiarism rules, he wrote, “take the opposite approach, acting as if the first person to put a fact on paper has a moral claim to it powerful enough to bring down serious punishments for uncredited use.”

These rules exist to give authors due credit and prevent readers from being duped, Sprigman reasons. Chatbots violate both at an unfathomable scale, paraphrasing and replicating authors’ work on infinite demand and on infinite repeat. Language- and image-generating AI programs alike have been known to almost exactly reproduce sentences and images in their training data, although OpenAI says the problem is “rare.” Whether those reproductions, even if verbatim, run afoul of U.S. copyright law will be litigated; that they would constitute plagiarism if found in the dissertation of a university’s president is beyond doubt. AI companies frequently say that their chatbots merely learn from copyrighted material, as children do—but the technology’s core function is to reproduce without consent or citation, meaning that this silicon form of “learning” still constitutes plagiarism. One might argue that allowing chatbots to repurpose facts is as socially beneficial as allowing humans to do so. But unlike a graduate student toiling away, chatbots threaten to put their uncited sources out of business—and, unlike a self-respecting academic, journalist, or any human, chatbots are equally confident about right and wrong information while being unable to distinguish between the two.

Reframing current generative-AI models as plagiarism machines—not just software that helps students plagiarize, but software that plagiarizes just by running—would not demand shunning or legislating them out of existence; nor would it negate the programs’ incredible potential to aid all sorts of work. But this reframing would clarify the underlying value that copyright law is an imperfect mechanism for addressing: It is wrong to take and profit from others’ work without giving credit. In the case of generative AI, which has the potential to create billions of dollars of revenue at authors’ expense, the remedy might involve not only citation but also compensation. Just because plagiarism is not illegal does not make it acceptable in all contexts.

Last month, OpenAI simultaneously stated that it is “impossible to train today’s leading AI models without using copyrighted materials,” and that the company believes it has not violated any laws in such training. This should be taken not as a favorable illustration of the leniency of copyright statutes permitting technological innovation, but as an unabashed admission of guilt for plagiarizing. Now it is up to the public to deliver an appropriate sentence.
