Bad Bunny, Taylor Swift Among Artists in 21 Million Tracks Secretly Used to Train AI
by Editor's Pick
on Saturday
An investigation by The Atlantic has exposed the immense scale of data scraping fueling generative AI music platforms. Led by staff writer Alex Reisner, the report details the discovery of four datasets containing roughly 21.2 million tracks used for training models. The largest single archive holds 12 million songs, while another contains 9 million. These records allow rights holders to verify whether developers ingested their work to build services that can simulate human performances. Searchable databases confirm the inclusion of tracks from prominent artists such as Taylor Swift, Bad Bunny, Billie Eilish and Nirvana.
The disclosure arrives at a critical moment for the music industry as it combats unauthorized AI-generated content. Generative AI companies frequently rely on fair use defenses and argue that training models on existing media does not harm the original market. The newly exposed datasets weaken this stance by showing the exact copyrighted material required to output commercially viable clones. Streaming services like Spotify and Deezer have already struggled to manage the influx of artificial audio, with Deezer reporting that nearly half of its daily uploads are AI-generated.
These findings directly affect high-profile legal actions against tech companies. Universal Music Group and Sony Music Entertainment are currently engaged in a massive copyright infringement lawsuit against the AI platform Suno. The labels recently asked a federal court to add more than 61,000 sound recordings to their suit after identifying their property within the training data. Suno previously admitted to showing its program tens of millions of instances of different recordings to build its service.
READ ALL ABOUT IT AT HYPEBEAST