
Data Critique

The Million Song Dataset includes both basic and detailed information on each of its songs, which allows for a comprehensive analysis of trends and patterns. The basic information covers the song title, the artist name, the genre, the year of release, the location, and the popularity. Beyond that, it provides detailed musical features such as the start times and confidence values of bars, beats, and tatums, as well as the mode, key, loudness, tempo, time signature, and fades. The dataset comprises 1,000,000 songs/files, 273 GB of data, 44,745 unique artists, 7,643 unique terms (The Echo Nest tags), 2,321 unique MusicBrainz tags, 43,943 artists with at least one term, 2,201,916 asymmetric similarity relationships, 515,576 dated tracks starting from 1922, and 18,196 identified cover songs. The songs can be categorized and grouped by attributes such as genre, popularity, and year of release.

 

All of the songs are contemporary, ranging in date from the early 20th century to close to the present. The Echo Nest API was used to find one million popular tracks, along with audio features and metadata about the songs such as location, length, and release date. Those collecting the data chose artists in several ways. Researchers downloaded songs from the most ‘familiar’ artists by getting the top 200 terms from The Echo Nest, using each term as a descriptor to find 100 artists, and then downloading as many of those artists’ songs as possible. They also drew songs and artists from the CAL500 dataset, and further expanded their search to include ‘extreme’ songs from The Echo Nest search parameters. This allowed the researchers to find songs with varying levels of energy and tempo.

 

Because the dataset is so large, we can identify rare patterns that might not appear in a smaller one, and we can check which patterns hold consistently across the data, which gives our research more depth. Combining the technical information with the basic chronological and geographical data, we can analyze how a particular factor affects the musical features or trace the overall development of music. For example, we can show which musical techniques or features were popular in different time periods or locations, and how a given technique has evolved over time. Such research could help us characterize different genres and the musical trends of a particular era. Within the contemporary period, the dataset lets us analyze the evolution of music in relation to popular taste, since we can discern which artists, genres, and song styles rise and decline in popularity. We can also look for trends in song structure, such as whether certain time signatures become more widely used over the years or whether songs collectively become shorter or longer.
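
As a rough illustration of the kind of trend analysis described above, the sketch below averages song duration by decade. It is only a minimal example under stated assumptions: the field names "year" and "duration" are hypothetical stand-ins for whatever the loaded records actually use, a year of 0 or a missing year is assumed to mean the release date is unknown, and the sample records are made up rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_duration_by_decade(songs):
    """Return {decade: mean duration in seconds} for songs with a known year."""
    by_decade = defaultdict(list)
    for song in songs:
        year = song.get("year")
        if not year:  # assumption: None or 0 means the year is unknown
            continue
        decade = (year // 10) * 10
        by_decade[decade].append(song["duration"])
    # Sort by decade so the result reads chronologically.
    return {decade: mean(values) for decade, values in sorted(by_decade.items())}


# Made-up records for illustration only, not values from the dataset.
duration_sample = [
    {"year": 1965, "duration": 150.0},
    {"year": 1968, "duration": 170.0},
    {"year": 1994, "duration": 240.0},
    {"year": 0, "duration": 200.0},  # undated track, skipped
]

print(mean_duration_by_decade(duration_sample))
```

The same grouping could be applied to time signature or tempo by swapping the field that is collected for each decade.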

 

Our dataset contains a great deal of quantifiable data, yet much qualitative information is left out. For example, although popularity is recorded, we cannot know exactly how audiences reacted to a song, especially the emotions they felt when listening to it. Similarly, although we have a lot of technical musical information, we do not know the melodies of the songs. We also do not have the actual audio to work with; instead, we have to rely on the metadata and features that The Echo Nest computed. Although many useful features are provided, this limits our ability to test other applications. The dataset is also limited in time frame, covering only about a century of music. There are further limitations arising from how the data was collected, particularly for certain variables. For example, one variable is “Artist hotness,” which is supposed to measure how popular an artist currently is. This measure is relatively subjective, and the results can vary depending on how it is determined. Was it based on the number of plays the artist received on a streaming service? If so, the choice of streaming service and the demographics of its users would heavily influence the results. Additionally, some variables are not recorded for every song. For example, only about half of the songs in the dataset have a release location listed. The dataset also omits other information, such as the revenue each song made for the record company that produced it.
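
Before relying on any one variable, it can help to measure how often it is actually filled in. The sketch below is one possible way to do that, not a description of how the dataset itself is distributed: the field names are hypothetical, the sample records are made up, and we assume that missing values appear as None, an empty string, or NaN.

```python
import math


def is_missing(value):
    """Assumption: None, empty strings, and NaN all mean 'not recorded'."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def field_coverage(songs, fields):
    """Return {field: fraction of songs that have a recorded value}."""
    total = len(songs)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(not is_missing(song.get(field)) for song in songs) / total
        for field in fields
    }


# Made-up records for illustration only, not values from the dataset.
coverage_sample = [
    {"title": "A", "location": "Detroit, MI", "artist_hotness": 0.6},
    {"title": "B", "location": "", "artist_hotness": float("nan")},
    {"title": "C", "location": None, "artist_hotness": 0.4},
    {"title": "D", "location": "London", "artist_hotness": 0.5},
]

print(field_coverage(coverage_sample, ["title", "location", "artist_hotness"]))
```

A low coverage value for a field such as location would signal that any geographic conclusions rest on only part of the data.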
