CA AB 2013 Disclosure (Text to Speech)

Datasets Used for Suno Text-to-Speech Generative AI Models

This document provides information for California consumers, pursuant to California Civil Code Section 3110 et seq., on how we, Suno, Inc. and our affiliates and subsidiaries (“Suno”), use data to train Suno’s generative artificial intelligence services.

Dataset sources: Suno’s text-to-speech generative AI models (e.g., Bark) are trained on publicly available audio files and related metadata accessible on third-party websites on the open internet. Suno abides by all paywalls, password protections, and the like. In particular, Suno does not create login credentials for the purpose of obtaining training data from credentials-walled websites.

Intended purpose: Suno uses the collected data to train its text-to-speech generative AI models, which are intended to create novel speech and other sounds from text prompts.

Data points, including counts and types: Suno’s training data consists of millions of public audio files and corresponding textual metadata that help teach the models what speech and other sounds sound like.

Inclusion of public domain or IP-protected data: Suno trains its models on datasets that may be from the public domain and/or that may be subject to intellectual property protection.

Acquisition of data: As described above, Suno’s models were developed with publicly available data, obtained in a manner that abided by all paywalls, password protections, and the like.

Inclusion of personal information or aggregate consumer information: Suno’s training datasets primarily consist of publicly available materials, which are not personal information as defined in subdivision (v) of California Civil Code Section 1798.140. Suno does not knowingly include in its training dataset aggregate consumer information as defined in subdivision (b) of California Civil Code Section 1798.140.

Cleaning, processing, and other modification to datasets: Suno undertook processing steps to associate audio files with their related metadata. Prior to training, Suno also organizes, cleans, filters, and processes the collected datasets to remove junk or other low-quality data and improve their usefulness for model training.

Data collection time period: Suno has been collecting data to train its text-to-speech generative AI models since Fall 2022.

Dates the datasets were first used: Suno began collecting the datasets in Spring 2023 and began using them for model development shortly thereafter.

Use of Synthetic Data: Suno did not use synthetic data in the development of text-to-speech models.