Authors: Natnael Mekonnen, Sonya Lew, and Daniel Park
Last semester, in spring 2020, we were hit with the COVID-19 pandemic. For the safety of our friends, family, and ourselves, we were all forced to social distance, quarantine, and take our academic classes online. With online classes came a lack of face-to-face interaction with our classmates, TAs, and professors. Being removed from this interaction made learning and academics more challenging. A variety of challenges and tragedies, such as social deprivation, negative impacts on mental health, unemployment, and the loss of loved ones or of our own health, have caused many people to suffer.
During such trying times, the three of us reflected on what helped us get through the past few months. Music came to the forefront. Many different songs entered and exited our playlists throughout the pandemic. We turned to some artists for comfort, others for energy, and many just for fun. We realized that throughout the year, though our taste in music didn't change much, the moods of songs we listened to most were diverse.
So we wondered if this rang true for other people as well. Did our listening patterns change throughout the year? And how much of that was a result of COVID? Maybe other people are also using music to help cope with the pandemic.
We figured that this was likely true, so we set out to determine whether there is a correlation between the progression of the pandemic and the traits of the songs being played (e.g. danceability, tempo, energy) to help cope with these trying times.
In our research, we found psychology articles that explain music's power to help cope with stressful events and COVID in particular. We also found that music preferences change depending on a person's environment. So did people actually put these medical findings into practice?
In this exploration, we will compare these traits with COVID data to find out whether the pandemic affected the traits of the music we listen to.
To explore the pandemic's effect on music, we chose to compare COVID data in the US with Spotify's weekly top 10 list since January 2020. We focus on the US because its data is more reliable, accessible, and robust. The COVID data comes from The Atlantic's COVID Tracking Project, which has updated its data every day with reports from 50 states, 5 territories, and the District of Columbia. The Spotify top 10 comes from Spotify Charts on a weekly basis, starting from January 2020. The charts do not provide detailed data on their songs, so we extract more information on each song by querying the Spotify API.
We imported the necessary libraries to conduct our exploratory analysis: pandas, matplotlib, numpy, seaborn, and statsmodels.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
The Atlantic provides COVID data as a csv file, so we downloaded the file to our repository and stored the data in a pandas DataFrame, which allows us to manipulate the data as needed.
covid_data = pd.read_csv('national-history.csv')
covid_data.head()
The COVID data is detailed, but since this project is about its relationship with music, we only care about the number of COVID tests administered and how many of those tests were positive, so we can drop the other columns from the table.
if 'death' in covid_data.columns:
    covid_data = covid_data.drop(['death','deathIncrease','inIcuCumulative','inIcuCurrently', 'hospitalizedIncrease','hospitalizedCurrently','hospitalizedCumulative','onVentilatorCumulative','onVentilatorCurrently','recovered','states'], axis=1)
We know that Spotify groups top 10 song data on a weekly basis. Each week begins on a Friday, so our next step is to prepare the COVID data to make this dataframe easy to merge with the Spotify dataframe. The Atlantic dataset contains daily reports on COVID statistics, so we want to group data into weeks that coincide with the Spotify data's weeks.
To do this, we must create bins that denote the weekly intervals and then cut the COVID data into these bins. To be more clear, the bins denote the start and end dates of each week specified by Spotify. Pandas will traverse through the COVID data set and group data from days that fall under the corresponding week bin. Once each day of data is assigned to its appropriate week/bin, we group each week's data by adding all the stats from each week and reducing the daily data into weekly data as seen below.
# Create a custom interval index for grouping the data, which starts at the beginning of the year
# on a Friday and continues weekly
i = pd.to_datetime('01/10/2020')
bins = []
while i < pd.to_datetime('12/05/2020'):
    temp = i + pd.Timedelta('7 days')
    bins.append((i,temp))
    i = temp
bins = pd.IntervalIndex.from_tuples(bins)
# Convert the date column to a pandas datetime format so it can be used in the cut
covid_data['date'] = pd.to_datetime(covid_data['date'])
# Using the interval index created above, create a new column week which has the week interval of the data
covid_data['week'] = pd.cut(covid_data['date'], bins)
# Now that every row has a week interval, they will be grouped with the total for each column
grouped_covid = covid_data.groupby(['week']).sum(numeric_only=True)
# Number the weeks to easily identify
grouped_covid['week_num'] = list(range(1,len(bins)+1))
# Reorder the variables/column names of the table
column_names = ['week_num','positiveIncrease','negativeIncrease','positive','negative','totalTestResultsIncrease','totalTestResults']
grouped_covid = grouped_covid.reindex(columns=column_names).reset_index()
grouped_covid.head()
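To make the binning step concrete, here is a minimal sketch on toy data. The `toy` DataFrame and its `cases` column are made up for illustration, and `pd.interval_range` is shown only as an assumed alternative to the loop above, not as what the notebook uses. It also shows one detail worth knowing: the intervals are right-closed by default, so a date that falls exactly on a bin's left edge is not placed in that bin.

```python
import pandas as pd

start = pd.Timestamp('2020-01-10')   # a Friday
end = pd.Timestamp('2020-12-05')

# Same construction as the loop above: consecutive 7-day (left, right] intervals.
loop_bins = []
left = start
while left < end:
    right = left + pd.Timedelta('7 days')
    loop_bins.append((left, right))
    left = right
loop_bins = pd.IntervalIndex.from_tuples(loop_bins)

# Alternative construction with interval_range (also right-closed by default).
range_bins = pd.interval_range(start=start, periods=len(loop_bins), freq='7D')
same_edges = (list(zip(loop_bins.left, loop_bins.right))
              == list(zip(range_bins.left, range_bins.right)))

# Toy daily data (hypothetical values): one "case" per day for 15 days.
toy = pd.DataFrame({
    'date': pd.date_range('2020-01-10', '2020-01-24', freq='D'),
    'cases': 1,
})
toy['week'] = pd.cut(toy['date'], loop_bins)

# 2020-01-10 sits on the open left edge of the first bin, so it gets no week.
unassigned_days = int(toy['week'].isna().sum())

# Reduce daily rows to weekly totals, keeping only bins that contain data.
weekly = toy.groupby('week', observed=True)['cases'].sum()
print(weekly)
```

Because of the right-closed edges, a Spotify `start_date` that falls on a Friday lands in the bin that ends on that Friday, which may or may not be the alignment you want.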
As mentioned above, Spotify provides its top 10 song list for each week in 2020 as a csv file. However, since that data only has the track name, artist, streams, and URL as columns, we used the Spotify developer API to get more detailed information on the songs and stored it under data10.csv. The data from the API adds several more metrics to the songs, including length, popularity, danceability, and more.
The script to run this process is under spotify_data_extraction.py. It works by traversing the list of csv files in top200_data, opening each csv file, pulling the song id from the URL column, calling the API to return track information for the specified song in an array, and storing the array for each song in a list. It then creates a dataframe from this 2-D list to organize the information and converts the dataframe to a csv file (data10.csv).
We decided to create a separate .py file for the extractor script so we wouldn't have to run it every time we opened this Jupyter notebook.
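The extractor script itself is not shown here, so the following is only a sketch of the "pull the song id from the URL column" step. It assumes the chart URLs have the form `https://open.spotify.com/track/<id>`; the helper name `track_id_from_url` and the example ID are hypothetical.

```python
from urllib.parse import urlparse


def track_id_from_url(url: str) -> str:
    """Return the last path segment of a track URL (hypothetical helper)."""
    path = urlparse(url).path              # e.g. '/track/<id>', query string excluded
    return path.rstrip('/').split('/')[-1]


# Placeholder ID for illustration only, not a real track.
example_url = 'https://open.spotify.com/track/EXAMPLEID123?si=abc'
example_id = track_id_from_url(example_url)
print(example_id)
```

The resulting ID is what would then be passed to the API call that returns the track's details.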
spotify_data = pd.read_csv('data.csv')
spotify_data.head()
Then, we assigned each week to its corresponding week bin like we did to the COVID data.
spotify_data['start_date'] = pd.to_datetime(spotify_data['start_date'])
# Using the same interval index created for the covid_data, create a new column week which has the week interval
spotify_data['week'] = pd.cut(spotify_data['start_date'], bins)
# Now that week is included, we can reindex with only the columns we need
spotify_data = spotify_data.reindex(columns=['week','length', 'popularity',
'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness',
'loudness', 'speechiness', 'tempo', 'time_signature'])
Now that every row has a week interval, they will be grouped with the mean for each column. The following table will contain Spotify data during the pandemic.
grouped_spotify = spotify_data.groupby(['week']).mean()
grouped_spotify.head()
For those interested in the definition of each variable in the table above, here are Spotify's definitions.
At this point both the grouped_covid data and the grouped_spotify data have been grouped with the same time interval, week. Hence, we merge the dataframes so we can analyze them together.
merged = grouped_covid.merge(grouped_spotify, left_on='week', right_on='week')[:46]
merged.tail()
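As a small illustration of what the merge does, the sketch below joins two toy tables on a shared key. The values and the `week_num` key are made up for the example; the notebook merges on the `week` interval instead. It shows that the default inner merge silently drops weeks that appear in only one table, and how the `indicator` option can surface them.

```python
import pandas as pd

# Toy weekly tables with partially overlapping weeks (hypothetical values).
covid = pd.DataFrame({'week_num': [1, 2, 3], 'positiveIncrease': [5, 9, 14]})
songs = pd.DataFrame({'week_num': [2, 3, 4], 'danceability': [0.71, 0.69, 0.73]})

# Default inner merge: only weeks present in both tables survive.
# validate='one_to_one' raises if a week appears more than once on either side.
inner = covid.merge(songs, on='week_num', validate='one_to_one')

# Outer merge with an indicator column to see which weeks had no partner.
outer = covid.merge(songs, on='week_num', how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']
print(inner)
print(unmatched[['week_num', '_merge']])
```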
To reiterate, we wanted to determine if there's a relationship between COVID's progression and the characteristics of top 10 songs on Spotify. So we'll first focus on finding correlations between the COVID data and the various song features from our Spotify data.
We use a correlation heat map on the merged dataframe to visualize the columns with similar linear trends. Essentially, this heat map lays out the correlation coefficients between pairings for each dataframe column.
sns.set(rc={'figure.figsize':(20,20)})
sns.heatmap(abs(merged.corr(method="spearman")), annot=True).set_title('Correlation of Spotify and COVID Data', fontsize=18)
Cells with a lighter color signify a strong correlation and cells with a darker color signify a weak one. Because we plot absolute values, the color shows the strength of a correlation but not its direction. We only care about correlations between Spotify data columns and COVID data columns. We observe strong relationships between the danceability of the top 10 songs and the weekly increase in positive COVID cases (danceability vs. positiveIncrease), as well as between danceability and the total number of positive cases (danceability vs. positive). We observe another strong correlation between the popularity of a song and the weekly increase in positive cases (popularity vs. positiveIncrease), as well as the total number of positive cases (popularity vs. positive).
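Since only the Spotify-versus-COVID pairings matter, one option is to slice the correlation matrix down to that block and rank the pairs. The sketch below does this on toy numbers (all values are invented for illustration); it keeps the signed coefficients alongside the absolute ones so the direction of each relationship stays visible.

```python
import pandas as pd

# Toy weekly data (hypothetical values).
toy = pd.DataFrame({
    'positiveIncrease': [10, 20, 40, 80, 160],            # strictly increasing
    'positive':         [10, 30, 70, 150, 310],
    'danceability':     [0.80, 0.78, 0.75, 0.70, 0.60],   # strictly decreasing
    'popularity':       [70, 90, 60, 85, 75],
})
covid_cols = ['positiveIncrease', 'positive']
song_cols = ['danceability', 'popularity']

# Spearman correlation works on ranks, so any monotonic relationship scores +/-1.
corr = toy.corr(method='spearman')

# Keep only the song-feature rows and COVID columns.
cross = corr.loc[song_cols, covid_cols]

# Rank pairings by strength; the sign is still available in `cross`.
strongest = cross.abs().stack().sort_values(ascending=False)
print(cross)
print(strongest)
```

In this toy data danceability falls every week while cases rise every week, so the signed coefficient is negative even though its absolute value is the maximum.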
Spotify defines danceability as a song's suitability for dancing. The range for this metric is 0-1. Popularity is ranked from 0-100 and is determined by how many times users played a song and how recent those plays are. Songs played a lot in the past will have a lower popularity than those played a lot now.
In this project, we will further explore these correlations.
Now that we've found a correlation between danceability and both the weekly increase in and the total number of COVID cases, we can explore these relationships further.
Since danceability ranges from 0 to 1, we must rescale positive increase in order to graph it alongside danceability.
In this case, min-max normalization is used to rescale positive increase to the range 0 to 1.
merged['stand_posInc'] = (merged.positiveIncrease-merged.positiveIncrease.min())/(merged.positiveIncrease.max()-merged.positiveIncrease.min())
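The rescaling above computes (x - min) / (max - min) for every value. As a worked example, the sketch below wraps that formula in a small helper; the function name `min_max_scale` and the sample numbers are ours, and the guard for a constant column is an added assumption rather than something the notebook needs.

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    span = values.max() - values.min()
    if span == 0:
        # A constant series has no spread; return zeros instead of dividing by 0.
        return values * 0.0
    return (values - values.min()) / span


weekly_cases = pd.Series([100, 150, 300])   # hypothetical weekly increases
scaled = min_max_scale(weekly_cases)
print(scaled.tolist())   # [0.0, 0.25, 1.0]
```

Here 150 maps to (150 - 100) / (300 - 100) = 0.25. The rescaling changes the units of the series, so slopes fitted to a rescaled column are not directly comparable to slopes fitted to a raw one.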
We then graph the progression of positive increases in cases and danceability throughout the pandemic and fit a regression line to determine the general slope/trend of both variables.
melted_merge = merged[['week_num', 'stand_posInc', 'danceability']]
melted_merge = melted_merge.melt('week_num', var_name='dance&posInc', value_name='values')
sns.relplot(data=melted_merge, x='week_num', y='values', hue='dance&posInc', kind='line',height=8, aspect=1).fig.suptitle('Positive Increase and Danceability over Time')
plt.subplots_adjust(top=0.9)
# calculate slope
x = range(0, len(merged.week_num))
posInc_reg = np.polyfit(x, merged.stand_posInc, 1)
posInc_line = np.poly1d(posInc_reg)
dance_reg = np.polyfit(x, merged.danceability, 1)
dance_line = np.poly1d(dance_reg)
print(f'Slope of trend in positive test increase: {posInc_line.c[0]}')
print(f'Slope of trend in danceability of top songs: {dance_line.c[0]}')
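For readers unfamiliar with `np.polyfit`, here is a minimal check on a line whose slope is known in advance (the numbers are invented). With degree 1 it returns the coefficients highest power first, which is why `c[0]` of the resulting `poly1d` is the slope.

```python
import numpy as np

x = np.arange(5)
y = 3.0 - 0.5 * x                   # exact line: slope -0.5, intercept 3.0

# Degree-1 least-squares fit; coefficients come back as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
line = np.poly1d([slope, intercept])

print(f'slope={slope:.3f}, intercept={intercept:.3f}, value at x=2: {line(2):.3f}')
```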
We can see that when the positive increase per week rises, the danceability of the top 10 songs decreases, most noticeably around week 10 and week 28, when COVID cases spiked significantly. Week 10 is around March, when the stay-at-home orders were enacted, and week 28 is around July, when many people went outside to celebrate the 4th of July. There appears to be a relationship between danceability and spikes in the number of positive tests, but to see whether danceability is affected in general, further exploration is needed.
Since popularity has a significant correlation with a positive increase in cases as well, we'll delve into its relationship in the same manner.
x = range(0, len(merged.week_num))
pop_reg = np.polyfit(x, merged.popularity, 1)
pop_line = np.poly1d(pop_reg)
print(f'Slope of trend in positive test increase: {posInc_line.c[0]}')
print(f'Slope of trend in popularity of top songs: {pop_line.c[0]}')
merged['stand_pop'] = (merged.popularity-merged.popularity.min())/(merged.popularity.max()-merged.popularity.min())
melted_merge = merged[['week_num', 'stand_posInc', 'stand_pop']]
melted_merge = melted_merge.melt('week_num', var_name='pop&posInc', value_name='values')
sns.relplot(data=melted_merge, x='week_num', y='values', hue='pop&posInc', kind='line',height=8, aspect=1).fig.suptitle('Positive Increase and Popularity over Time')
plt.subplots_adjust(top=0.9)
Somewhat similarly to danceability, peaks in increases in cases (i.e. significant dates in the COVID timeline) coincide with dips in the popularity of top 10 songs. However, both popularity and increases in cases had positive slopes.
So far, we've established that there's a significant relationship between the weekly increase in & total number of COVID cases and the danceability and popularity of Spotify's top 10 weekly songs. We can fairly conclude there is some statistical support of the idea that COVID's progression impacts the characteristics of songs that trend on Spotify. But to further explore this relationship, let's see how listening trends in 2019 and the months leading up to the pandemic compare to trends during the pandemic.
We'll extract Spotify's Top 10 song data from 2/15/19 until 2/25/20. We define 2/25/20 as the beginning of the pandemic in the U.S. because the country saw its first positive test results during this week.
We start by running the spotify_data_extraction script on the csv files that contain the top 10 Spotify songs during the mentioned timeframe. The script returns a csv file that compiles all the track info for these songs, which we use to create a DataFrame.
pre_spotify = pd.read_csv('pre_data.csv')
pre_spotify.sort_values(by=['start_date'], inplace=True)
Then we create bins denoting the start and end date intervals of each week the Spotify songs trended.
# We want to analyze Spotify data from February 2019 to February 2020 to compare to Spotify data from when the first tests were administered
i = pd.to_datetime('02/14/2019')
pre_bins = []
while i < pd.to_datetime('02/18/2020'):
    temp = i + pd.Timedelta('7 days')
    pre_bins.append((i,temp))
    i = temp
pre_bins = pd.IntervalIndex.from_tuples(pre_bins)
After categorizing each song by the week it trended, we average the characteristics of the songs for each week. In other words, for each week, we find the average danceability, popularity, etc. of all the songs from that week and reduce the DataFrame to a table of weekly averages.
pre_spotify['start_date'] = pd.to_datetime(pre_spotify['start_date'])
# Using the interval index created above, create a new column week which has the week interval of the data
pre_spotify['week'] = pd.cut(pre_spotify['start_date'], pre_bins)
# Now that every row has a week interval, they will be grouped with the mean for each column
pre_spotify = pre_spotify.groupby(['week']).mean().reset_index()
# Number the weeks to easily identify
pre_spotify['week_num'] = list(range(1,len(pre_bins)+1))
grouped_pre = pre_spotify[:52]
pre_spotify = grouped_pre.groupby(['week_num']).mean().reset_index()
pre_spotify.head()
How do the trends in danceability and popularity of top Spotify songs compare to these trends pre-COVID? Let's find out. Let's talk about danceability first. We create a relational plot that compares the danceability of top 10 songs throughout February 2019 to February 2020 to those throughout the pandemic this year.
grouped_spotify['week_num'] = list(range(1,len(grouped_spotify)+1))
temp1 = grouped_pre
temp2 = grouped_spotify
temp1["before_after"] = np.array("Before COVID")
temp2["before_after"] = np.array("During COVID")
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two tables the same way
all_spotify = pd.concat([temp1, temp2])
sns.relplot(y="danceability", x="week_num", hue="before_after", data=all_spotify)
The danceability of top 10 Spotify songs pre-COVID is shown in blue and that of songs during the pandemic is shown in orange. There appears to be a more noticeable decline in the danceability of top 10 songs over the year before the pandemic than during the pandemic.
For a more concrete understanding of their differences, let's look at their regressions.
Below, we plot the danceability of songs before COVID and during COVID on separate graphs and draw regression lines for both.
Seaborn's lmplot plots the data points and draws the regression line, but it doesn't report the actual equation of the line. However, numpy allows us to calculate it ourselves.
fig = sns.lmplot(y="danceability", x="week_num", data=pre_spotify);
fig = fig.fig
fig.suptitle("Danceability of Spotify's Top 10 Songs Weekly Before COVID")
fig.subplots_adjust(top=0.9)
fig = sns.lmplot(y="danceability", x="week_num", data=grouped_spotify);
fig = fig.fig
fig.suptitle("Danceability of Spotify's Top 10 Songs Weekly During COVID")
fig.subplots_adjust(top=0.9)
x = range(0, len(pre_spotify.week_num))
pre_reg = np.polyfit(x, pre_spotify.danceability, 1)
pre_line = np.poly1d(pre_reg)
x = range(0, len(grouped_spotify.week_num))
during_reg = np.polyfit(x, grouped_spotify.danceability, 1)
during_line = np.poly1d(during_reg)
print('Top 10 songs on Spotify pre-COVID:')
print(f'\tSlope of trend of danceability: {pre_line.c[0]}\n\tStandard deviation of danceability: {pre_spotify.danceability.std()}\n')
print('Top 10 songs on Spotify during COVID:')
print(f'\tSlope of trend in danceability: {during_line.c[0]}\n\tStandard deviation of danceability: {grouped_spotify.danceability.std()}')
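If you also want a measure of how reliable a fitted slope is, `scipy.stats.linregress` is one option: it reports the slope together with the correlation coefficient, a p-value for the slope, and its standard error. This is an alternative to the `np.polyfit` call used above, shown here on invented numbers rather than on the notebook's data.

```python
import numpy as np
from scipy import stats

# Toy weekly danceability: a gentle downward trend plus small wiggles (hypothetical).
week_num = np.arange(1, 11)
wiggle = np.array([0.002, -0.001, 0.0, 0.001, -0.002,
                   0.002, -0.001, 0.0, 0.001, -0.002])
danceability = 0.75 - 0.004 * week_num + wiggle

fit = stats.linregress(week_num, danceability)

print(f'slope: {fit.slope:.5f} (std err {fit.stderr:.5f})')
print(f'intercept: {fit.intercept:.4f}')
print(f'r: {fit.rvalue:.3f}, p-value for slope != 0: {fit.pvalue:.2g}')
```

Comparing the standard errors of two slopes gives a rough sense of whether a difference between them is larger than their uncertainty.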
The regression lines confirm our suspicions: the danceability of top 10 songs before COVID more rapidly declined as the weeks progressed compared to during COVID. We can see that it increased right before the start of the pandemic, but in general the slope of the regression was negative.
The slope of the regression for danceability during COVID also declined, albeit less rapidly. However, danceability during this time period fluctuated a bit. We know that this fluctuation coincided with fluctuations in total number of and increases in COVID cases nationally. Although the general trend of danceability during COVID was negative, danceability during this time was more spread. The standard deviation of danceability during COVID is higher than it is before COVID.
These findings support the correlation we found between COVID and danceability in the previous section. It seems something different (i.e. COVID) is affecting listening trends this year! Let's see if this holds true for top 10 songs' popularity as well.
sns.relplot(y="popularity", x="week_num", hue="before_after", data=all_spotify)
It looks like for most weeks this year, the top 10 songs are more popular than songs from the same weeks a year before.
fig = sns.lmplot(y="popularity", x="week_num", data=pre_spotify);
fig = fig.fig
fig.suptitle("Popularity of Spotify's Top 10 Songs Weekly Before COVID")
fig.subplots_adjust(top=0.9)
fig = sns.lmplot(y="popularity", x="week_num", data=grouped_spotify);
fig = fig.fig
fig.suptitle("Popularity of Spotify's Top 10 Songs Weekly During COVID")
fig.subplots_adjust(top=0.9)
x = range(0, len(pre_spotify.week_num))
pre_reg = np.polyfit(x, pre_spotify.popularity, 1)
pre_line = np.poly1d(pre_reg)
x = range(0, len(grouped_spotify.week_num))
during_reg = np.polyfit(x, grouped_spotify.popularity, 1)
during_line = np.poly1d(during_reg)
print('Top 10 songs on Spotify pre-COVID:')
print(f'\tSlope of trend of popularity: {pre_line.c[0]}\n\tStandard deviation of popularity: {pre_spotify.popularity.std()}\n')
print('Top 10 songs on Spotify during COVID:')
print(f'\tSlope of trend in popularity: {during_line.c[0]}\n\tStandard deviation of popularity: {grouped_spotify.popularity.std()}')
For popularity, there was less of a difference in the slopes of regression lines. Both slopes slightly increased but the standard deviation for popularity during COVID was higher than before COVID. The data that falls farther from the regression during COVID aligns with the weeks in which increases in positive cases peaked.
To recap, we've established a relationship between positive increases in and total number of COVID cases vs. danceability and popularity of the top 10 songs on Spotify weekly based on their correlation coefficients, distributions, and regression lines. The strength of their correlation indicates a predictive relationship between the increase in number of positive cases and the danceability and popularity of these songs.
So let's predict how these trends will continue as the pandemic progresses.
We'll use sklearn to conduct our machine learning regression algorithm. We used this article to decide which machine learning algorithm best fits our data. Since the relationships are linear and don't involve classification, we will use a linear regression model to perform predictive analysis. We utilize sklearn's train_test_split() function to split the data into training and testing sets. The training data, as its name suggests, is used to train the model to create a linear regression that predicts the danceability of top 10 songs from the positive increase in and total number of COVID cases. The distribution of the linear regression's predictions is graphed along with the distribution of the testing set, which is the data set aside to test the accuracy of the prediction.
Visually, the prediction somewhat aligns with the actual data, but its highest peak rises much higher than the actual testing set's. To more empirically test the prediction's validity, we use statistical analysis.
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn import model_selection
X = grouped_covid.reset_index()[['positive', 'positiveIncrease']]
y = grouped_spotify[:48].danceability
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.3)
X_train['dance'] = y_train.values
lr = smf.ols(formula='dance ~ positive + positiveIncrease', data=X_train).fit()
preds_lr = lr.predict(X_test)
f, ax = plt.subplots()
# sns.distplot is deprecated; kdeplot draws the same density curves
sns.kdeplot(y_test, label="Actual", ax=ax)
sns.kdeplot(preds_lr, label="Linear Regression Predictions", ax=ax)
ax.legend()
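Comparing density curves shows whether predictions and actual values have a similar spread, but not how close each prediction is to its own target. A complementary check is to score the held-out predictions directly. The sketch below does this with sklearn on synthetic data; the data, the coefficients used to generate it, and the `random_state` value are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 48 weeks of two predictors and one target (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(48, 2))
y = 0.7 - 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.01, size=48)

# Fixing random_state makes the 70/30 split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

# Pointwise error metrics on data the model never saw during fitting.
rmse = mean_squared_error(y_test, preds) ** 0.5
r2 = r2_score(y_test, preds)
print(f'test RMSE: {rmse:.4f}, test R^2: {r2:.3f}')
```

Without a fixed `random_state`, each run draws a different split, so plots and statistics computed downstream can change from run to run.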
from scipy.stats import f as ft
# F-Test to evaluate goodness of fit
test = lr.f_test(np.identity(len(lr.params)))
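print('Model - Critical F-Value (alpha = 0.05): ' + str(ft.ppf(.95,test.df_num,test.df_denom)) + \
'\nF-Statistic: ' + str(float(np.squeeze(test.fvalue))) + '\nP-Value: ' + str(test.pvalue))
The F-statistic is greater than the critical F-value and the p-value is < 0.05, which supports that the model is significant. Note that because the restriction matrix is an identity over all parameters, this test also includes the intercept, so it jointly tests whether every coefficient, intercept included, is zero.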
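For comparison, statsmodels also stores the conventional overall F-test on every fitted OLS result: `fvalue` and `f_pvalue` test whether all slope coefficients are zero, leaving the intercept out. The sketch below reads them from a model fitted to synthetic data (the data and generating coefficients are invented) and compares the statistic against the 95% critical value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import f as f_dist

# Synthetic data with a real dependence on both predictors (hypothetical values).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'positive': rng.normal(size=40),
    'positiveIncrease': rng.normal(size=40),
})
toy['dance'] = (0.7 - 0.05 * toy['positive'] + 0.02 * toy['positiveIncrease']
                + rng.normal(0, 0.01, size=40))

fit = smf.ols('dance ~ positive + positiveIncrease', data=toy).fit()

# Critical value of the F distribution at alpha = 0.05 for this model's
# degrees of freedom (2 slopes, 40 - 3 residual degrees of freedom).
critical = f_dist.ppf(0.95, fit.df_model, fit.df_resid)

print(f'F-statistic (slopes only): {fit.fvalue:.2f}')
print(f'critical F-value: {critical:.2f}')
print(f'p-value: {fit.f_pvalue:.3g}')
```

When the target has a mean far from zero, a test that includes the intercept will tend to reject regardless of the predictors, so the slopes-only version is the more informative check of whether the COVID columns add explanatory power.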
We go on to do the same for top 10 Spotify song popularity vs increase in and total COVID cases.
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
X = grouped_covid.reset_index()[['positive', 'positiveIncrease']]
y = grouped_spotify['popularity']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.3)
X_train['pop'] = y_train.values
lr = smf.ols(formula='pop ~ positive + positiveIncrease', data=X_train).fit()
preds_lr = lr.predict(X_test)
f, ax = plt.subplots()
# sns.distplot is deprecated; kdeplot draws the same density curves
sns.kdeplot(y_test, label="Actual", ax=ax)
sns.kdeplot(preds_lr, label="Linear Regression Predictions", ax=ax)
ax.legend()
The distribution of the predictions is even closer to that of the actual testing data, which is a good sign.
# F-Test to evaluate goodness of fit
test = lr.f_test(np.identity(len(lr.params)))
print('Model - Critical F-Value (alpha = 0.05): ' + str(ft.ppf(.95,test.df_num,test.df_denom)) + \
'\nF-Statistic: ' + str(float(np.squeeze(test.fvalue))) + '\nP-Value: ' + str(test.pvalue))
Again, the F-statistic is greater than the critical F-value and the p-value is < 0.05, so this model for predicting the popularity of trending songs on Spotify using the total and the weekly increase in COVID cases as predictors is statistically significant.
After comparing the COVID data and Spotify data, we were able to find a correlation between the current status of the pandemic and Spotify listening trends, specifically between the rate at which people are catching COVID and the total number of cases on one side, and the danceability and popularity of the top 10 songs on Spotify on the other. In general, we noticed that danceability decreased and popularity increased as more COVID cases arose. Based on this, we were able to create a formula to predict how these listening trends will continue as the pandemic progresses.
Although we can't explain the reason for the relationships between COVID and top Spotify songs from our findings, our analysis can support future studies that may be more robust and more clearly explain why these relationships exist. That said, we hypothesize that danceability decreased as many Spotify users became sick or as users' moods declined as the pandemic progressed. Popularity may have increased as users spent more time at home and had more time and opportunities to listen to music.
At a wider scope, our findings support and provide more context to existing studies on how people use music to cope with stressful events in general.
from IPython.display import HTML
HTML('''<script>
$('div.output_stderr').hide();
</script>
''')