The internet is user-generated content. We’re all making things and contributing whether publicly or “privately”. With our names or without.
“AI” is years of our data being scraped, packaged and sold back to us.
ChatGPT and the other Large Language Models (LLMs) are little more than someone who is incredibly well-read with perfect recall. They’ve taken the internet’s data and packaged it and put a neat little plain language front-end on it for us to interact with.
The reason ChatGPT can write such good fanfiction is because it scraped 32 billion words from AO3. And that was in 2019. So there’s likely even more fanfic in the large language model today. If you look at this and say, “but it’s only fanfiction, who cares?” Would it be acceptable to other writers?
It’s abhorrent that a program which purports to support a community of writers has based at least 32 billion words of its program on the writing of a community that did not consent to have their work used.
…
Writing fic is not stealing, but taking fic and using it to develop a dataset, and then offering that dataset to the public without having gotten permission from literally anyone is ethically gross.
How Bots Like ChatGPT Have Stolen Fanfiction, and What It Means
What if your entire history of writing that you had publicly posted to the Internet was scooped up and used without your permission for another company to make money from?
Well, that is likely the case, as Kevin Schaul, Szu Yu Chen and Nitasha Tiku, writing for The Washington Post, have researched and reported.
To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT)
See the websites that make AI bots like ChatGPT sound so smart
What about social media?
Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.
So while your posts to social media may not be in ChatGPT, they’re certainly going to be included in Meta/Facebook’s own product. And given their long history of scooping up any and all data, it’s certainly far more extensive.
What if you have ever written on a blog powered by WordPress, Tumblr, Blogspot or Live Journal? Then you’re included too.
My own site is included in the data set at rank 1,953,276.

If you write on the web, you’re likely there too. You can search through the data by scrolling to the bottom of The Washington Post’s article: See the websites that make AI bots like ChatGPT sound so smart.
As with any story that talks about data, there’s a section at the end describing how the Post came to this data and the 15.1 million unique domains included in this dataset.
How do you feel about your writing being included in these gigantic data sets and being used to build products?
