Blogging in the Age of Internet-Scale AI Training

These days, large language models are being trained on enormous corpora. I don’t have any figures about how much exactly, but these things ingest a lot of text. And it seems likely that they will continue to ingest more and more text as time goes on. Perhaps in the near future an LLM will be trained on all text on the open Internet. In particular, an LLM might be trained on this very blog!

Many text-creators are understandably worried about this. If an LLM ingests all my writing, it will learn what my kind of writing is like. And if it learns that, it will be able to generate new text that is just like the text that I myself have previously generated. And since nothing would distinguish that generated text from text I had written myself, that means that I will be made obsolete as a blogger. I could continue to write, of course, but it will be impossible to keep up with the automated just-like-my-blog text generator.

Now, let’s imagine that “the big one” is being trained right now – the larger-than-large language model that will subsume all text hitherto generated by humans and will exert an overwhelming influence on all text generated thenceforth.

What is a human text-creator to do?

One reaction to this scenario would be to jealously guard our remaining text. Suppose I have some novel insight or expertise. Openly publishing this knowledge would amount to feeding the beast, tossing my text into the gaping maw of the technological horror that has been created to replace me. So instead of openly publishing it, maybe I will communicate my thoughts in a closed fashion. Perhaps I could write them down by hand and mail them to a small circle of people I know personally. This would disseminate my knowledge, but in a way that keeps it beyond the reach of the big-time mondo LLM. If enough people pursued this approach, it could lead to a rise in closed communities of esoteric knowledge.

There is something noble about small communities forming to resist centralized text generation. But there are also some obvious problems. Since these communities would have to forgo modern computer technology, communication would be inconvenient, slow, and unreliable. This approach would also lead to a general atmosphere of distrust and paranoia, totally antithetical to the dream of the open Internet. After all, if I don’t want my text out in the open, I have to make sure that the only people who see it are people who I know won’t share it.

But there is a grander problem. Though their precise functioning is obscure, LLMs are ultimately reflections of the text on which they have been trained. The text they produce will be statistically similar to the text they have seen. And so if I choose to keep my text esoteric and inaccessible, then the text generated by the big-time mondo LLM – which will come online regardless of what I do – will reflect text other than mine. And since the big LLM will, in this scenario, exert an overwhelming influence on all future text, that means giving up any chance to personally influence the direction of that text.
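To make the "statistically similar" point concrete, here is a toy sketch. It is emphatically not how a real LLM works: it is a bigram counter that only remembers which word most often follows which. The function names and the two tiny corpora are made up for illustration. Still, it shows the basic mechanic: text that is withheld from training has no effect on the output, and text that is included can tip the balance.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for document in corpus:
        words = document.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def most_likely_next(follows, word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    counts = follows.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical corpora: what everyone else wrote, and what I wrote.
others = ["the future is automated", "the future is automated"]
mine = ["the future is open", "the future is open", "the future is open"]

without_me = train_bigrams(others)        # I kept my text to myself
with_me = train_bigrams(others + mine)    # I published openly

print(most_likely_next(without_me, "is"))  # automated
print(most_likely_next(with_me, "is"))     # open
```

In the first model my preferred continuation simply does not exist. In the second, three sentences of mine outvote two of theirs. Real models are vastly more complicated, but the direction of the effect is the one the argument relies on.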

Put another way, if my goal is to influence the world via text, even in a small way, then it is in my long-term interest to get as much of my text as possible ingested by as many LLMs as possible. If there really is an inflection point past which all text will be pushed in a certain direction, then I want to have some say in what direction it is, and the only way to do that is to get my text ingested.