Shakespeare, Dickens, and Dante are about to join forces with obscure Czech math textbooks and Welsh pocket dictionaries to teach AI how to write poetry, solve quadratic equations, and possibly tell you how to ask for directions in Welsh—all thanks to Harvard’s Institutional Data Initiative.
Microsoft and OpenAI have funded the initiative.
The executive director of this endeavour, Greg Leppert, said the project “levelled the playing field,” adding that the dataset is “rigorously reviewed,” which presumably means someone checked to ensure the Bard really was dead.
This isn’t just for Silicon Valley’s finest. No, it’s for “the general public,” including your friendly neighbourhood AI hobbyist and the evil mastermind building a robot army.
Leppert compares the dataset’s potential to Linux, the beloved open-source operating system. Of course, much like Linux, any success will require additional resources, expertise, and a sprinkle of magic from those same deep-pocketed corporations the initiative is designed to challenge.
The books were scanned as part of the Google Books project. So, for anyone nostalgic about the early 2000s, it’s like a digital time capsule from when Google’s ambition to scan every book seemed quirky rather than quietly dystopian.
Leppert is optimistic about the potential uses for this treasure trove, suggesting it could help train AI models for everyone from garage start-ups to corporate behemoths. Just imagine: the same dataset could be used to build an AI that drafts heartfelt love sonnets and one that optimises your online shopping habits.
Of course, while some will hail this as a revolutionary leap forward in democratising AI, others might see it as a subtle way to ensure that any upstart with a dream and a few terabytes of server space can now compete in the race to build the next ChatGPT—though they’ll still need plenty of extra data to make a dent in the market.