
Image via Getty / Futurism
As The Washington Post reports, a team of more than 20 researchers from MIT, Cornell University, the University of Toronto, and other institutions has trained a large language model using only openly licensed or public-domain data, providing a blueprint for developing the technology ethically.
But, as the creators readily admit, this is far from easy.
As they describe in a peer-reviewed paper published this week, they quickly discovered that the real bottleneck wasn't computing power, but human labor.
That's because, as WaPo explains, the text in the dataset they put together (which they call the Common Pile v0.1) had to be manually cleaned and reformatted to make it suitable for AI training. There was also plenty of extra work involved in double-checking the copyright status of all that data, since many works online are improperly licensed.
"This isn't a thing where you can just scale up the resources that you have available," like access to more computer chips and fancy web crawlers, researcher Stella Biderman told WaPo. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard."
Nevertheless, Biderman and her colleagues got the job done.
After the slog of creating the Common Pile, they used the guilt-free dataset to train an LLM with seven billion parameters. The result? A respectable AI that can rival industry models like Meta's Llama 1 and Llama 2 7B. That's impressive, though these are versions that were released more than two years ago, which is practically a lifetime in the AI race.
Then again, this was pulled off by a more or less ragtag team rather than a company with billions of dollars in resources, so it had to compensate by being scrappy. One particularly resourceful find was a trove of more than 130,000 English-language books in the Library of Congress.
Copyright remains one of the biggest moral and legal issues facing AI. Leaders like OpenAI and Google devoured unfathomable amounts of data scraped from across the web to get where they are, hoovering up everything from news articles to content as invasive as social media posts. Meta has been sued by authors who claim it illegally used more than 7 million pirated copyrighted books to train its AI.
The tech industry has defended its gluttonous data demands by arguing that it all counts as fair use, and moreover that it would be existentially "impossible" to develop the technology without hoovering up everyone's content for free.
This latest work stands as a rebuttal to the Silicon Valley approach, though it doesn't do away with every moral qualm. It's still a large language model, a technology poised to destroy jobs, and not everyone whose work ends up in the public domain will be happy to see it regurgitated by an AI, assuming, of course, they're not late artists whose copyrights have long since expired.
And even if AI companies were bound to use works only under license or with payment, a big assumption, the fact remains that as long as these companies endure, copyright holders will face enormous pressure to allow AI training.
Biderman herself harbors no fantasies that companies like OpenAI will suddenly turn over a new leaf and become paragons of ethical data sourcing. But she hopes her work will at least push them to stop hiding what they use to train their AI models.
"Even partial transparency has great social value and moderate scientific value," she told WaPo.