training datasets AI News & Updates

EleutherAI Creates Massive Licensed Dataset to Train Competitive AI Models Without Copyright Issues

EleutherAI released The Common Pile v0.1, an 8-terabyte dataset of openly licensed and public-domain text developed over two years with multiple partners. EleutherAI used the dataset to train two AI models that it reports perform comparably to models trained on unlicensed, copyrighted data, addressing legal concerns about AI training practices.