Part of the machine

Washington Post: “Inside the secret list of websites that make AI like ChatGPT sound so smart.” Clickbait headline aside (spoiler alert: it’s not a secret list, and “AI like ChatGPT” here means most large language models, though the actual training dataset for ChatGPT is secret), it’s interesting and informative to pull back the curtain on projects like C4 and the sources of their data, which in this case include this blog.

What’s interesting is that this sort of use of my blog has been permitted for over twenty years by the license on the blog itself. On January 2, 2003, I posted “Licensed to blog,” which announced that I was applying a Creative Commons license (at the time, there was only one!) to my content. I later refined that to a CC BY-NC-SA license, meaning you are welcome to use my content for non-commercial purposes, provided you credit me and share any modifications to the content under the same license. At some point, probably when I changed to my blog’s current theme, the license was accidentally dropped from the template. I’ve re-added it today.

I wonder what would happen if you tried to enforce the sharealike clause in the CC license against the C4 project and its makers. I have to imagine that a corpus of data drawn from 100 million sources, each with its own potentially conflicting license, must pose an IP law nightmare. As of a few years ago, the Creative Commons team itself felt that there were significant open questions about how the law applied in this use case.