COVID's Effect on Music Trends

Authors: Natnael Mekonnen, Sonya Lew, and Daniel Park

Introduction

Last semester, in spring 2020, the COVID-19 pandemic hit. For the safety of our friends, family, and ourselves, we were all forced to social distance, quarantine, and take our classes online. With online classes came a lack of face-to-face interaction with our classmates, TAs, and professors, which made learning more challenging. Challenges and tragedies such as social isolation, declining mental health, unemployment, illness, and the loss of loved ones have caused suffering for many people.

During such trying times, the three of us reflected on what helped us get through the past few months. Music came to the forefront. Many different songs entered and exited our playlists throughout the pandemic. We turned to some artists for comfort, others for energy, and many just for fun. We realized that throughout the year, though our taste in music didn't change much, the moods of songs we listened to most were diverse.

So we wondered if this rang true for other people as well. Did listening patterns change throughout the year? And how much of that was a result of COVID? Maybe other people are also using music to help cope with the pandemic. We figured that this was likely true, so we set out to determine whether there is a correlation between the progression of the pandemic and the traits of the songs being played (e.g., danceability, tempo, energy).
In our research, we found psychology articles that explain music's power to help people cope with stressful events, and with COVID in particular. We also found that music preferences change depending on a person's environment. So did people actually put these findings into practice?

In this exploration, we compare these traits with COVID data to see whether the pandemic is associated with changes in the traits of the music we listen to.

1. Data Scraping

To explore the pandemic's effect on music, we chose to compare COVID data in the US with the weekly top 10 list on Spotify since January 2020. We focus on the US because its data is more reliable, accessible, and robust. The COVID data comes from The COVID Tracking Project at The Atlantic, which has been updating its data every day with reports from the 50 states, 5 territories, and the District of Columbia. The Spotify top 10 comes from Spotify Charts on a weekly basis, starting from January 2020. The charts provide little detail on each song, so we extract more information by querying the Spotify API.

We imported the libraries needed for our exploratory analysis: pandas, matplotlib, numpy, seaborn, and statsmodels.

In [350]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf

1.1 Load and View COVID data

The Atlantic provides the COVID data as a csv file, so we downloaded the file to our repository and stored the data in a pandas DataFrame, which lets us manipulate the data as needed.

In [351]:
covid_data = pd.read_csv('national-history.csv')
covid_data.head()
Out[351]:
date death deathIncrease inIcuCumulative inIcuCurrently hospitalizedIncrease hospitalizedCurrently hospitalizedCumulative negative negativeIncrease onVentilatorCumulative onVentilatorCurrently positive positiveIncrease recovered states totalTestResults totalTestResultsIncrease
0 12/19/2020 307831.0 2704 34949.0 21688.0 3337 113929.0 641484.0 179327206.0 1185734 3529.0 7790.0 17452905.0 201841 6882996.0 56 230325922 1725036
1 12/18/2020 305127.0 2866 34716.0 21745.0 5240 113955.0 638147.0 178141472.0 1536770 3519.0 7786.0 17251064.0 239246 6762700.0 56 228600886 2197685
2 12/17/2020 302261.0 3438 34485.0 21910.0 5133 114459.0 632907.0 176604702.0 1257526 3504.0 7847.0 17011818.0 240156 6681651.0 56 226403201 1873340
3 12/16/2020 298823.0 3448 34237.0 21946.0 4800 113278.0 627774.0 175347176.0 1200071 3488.0 7778.0 16771662.0 231653 6597661.0 56 224529861 1791968
4 12/15/2020 295375.0 2971 33958.0 21882.0 4398 112814.0 622974.0 174147105.0 1328079 3460.0 7701.0 16540009.0 189783 6490879.0 56 222737893 1781610

The COVID data is detailed, but since this project looks for its relationship with music, we only care about the number of COVID tests administered and how many of those tests were positive, so we can drop the other columns from the table.

In [352]:
if 'death' in covid_data.columns:
    covid_data = covid_data.drop(['death','deathIncrease','inIcuCumulative','inIcuCurrently', 'hospitalizedIncrease','hospitalizedCurrently','hospitalizedCumulative','onVentilatorCumulative','onVentilatorCurrently','recovered','states'], axis=1)

We know that Spotify groups its top 10 song data on a weekly basis, with each week beginning on a Friday, so our next step is to prepare the COVID data so that it merges easily with the Spotify dataframe. The Atlantic dataset contains daily reports on COVID statistics, so we group the days into weeks that coincide with the Spotify weeks.
To do this, we create bins that denote the weekly intervals and then cut the COVID data into these bins. Each bin holds the start and end dates of one week specified by Spotify. Pandas goes through the COVID data set and assigns each day to the week bin it falls in. Once each day is assigned to its bin, we sum the daily stats within each week, reducing the daily data to weekly data as seen below. Note that the sum also adds up the cumulative columns (positive, negative, totalTestResults), so their weekly values are sums of daily running totals rather than end-of-week totals.

In [353]:
# Create a custom interval index for grouping the data, starting on a Friday
# at the beginning of the year and continuing weekly
i = pd.to_datetime('01/10/2020')
bins = []
while i < pd.to_datetime('12/05/2020'):
    temp = i + pd.Timedelta('7 days')
    bins.append((i,temp))
    i = temp
bins = pd.IntervalIndex.from_tuples(bins) 
In [354]:
# Convert the date column to pandas datetime format so it can be cut into the bins
covid_data['date'] = pd.to_datetime(covid_data['date'])

# Using the interval index created above, create a new column week which has the week interval of the data
covid_data['week'] = pd.cut(covid_data['date'], bins)

# Now that every row has a week interval, they will be grouped with the total for each column 
grouped_covid = covid_data.groupby(['week']).sum()

# Number the weeks to easily identify
grouped_covid['week_num'] = list(range(1,len(bins)+1))
In [355]:
# Reorder the variables/column names of the table
column_names = ['week_num','positiveIncrease','negativeIncrease','positive','negative','totalTestResultsIncrease','totalTestResults']
grouped_covid = grouped_covid.reindex(columns=column_names).reset_index()

grouped_covid.head()
Out[355]:
week week_num positiveIncrease negativeIncrease positive negative totalTestResultsIncrease totalTestResults
0 (2020-01-10, 2020-01-17] 1 0 0 0.0 0.0 0 0
1 (2020-01-17, 2020-01-24] 2 2 0 10.0 0.0 2 5
2 (2020-01-24, 2020-01-31] 3 0 0 14.0 0.0 6 28
3 (2020-01-31, 2020-02-07] 4 3 0 23.0 0.0 8 89
4 (2020-02-07, 2020-02-14] 5 2 0 38.0 0.0 6 140
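
The binning and weekly reduction described above can be checked on a toy example. The sketch below is an illustration on made-up daily counts, not the project's data. It builds two of the same right-closed weekly bins, shows that a date exactly on a bin's left edge is left unassigned, and, as one possible alternative to summing everything, sums the daily increase while taking the maximum of the running total. The toy frame `daily` and the choice of `max` for cumulative columns are assumptions made for this example.

```python
import pandas as pd

# Two weekly bins built the same way as above; intervals are right-closed: (start, end]
start = pd.to_datetime('01/10/2020')
week = pd.Timedelta('7 days')
bins = pd.IntervalIndex.from_tuples([(start, start + week),
                                     (start + week, start + 2 * week)])

# Hypothetical daily data: one new case per day plus its running total
daily = pd.DataFrame({'date': pd.date_range('2020-01-10', '2020-01-24', freq='D')})
daily['positiveIncrease'] = 1
daily['positive'] = daily['positiveIncrease'].cumsum()

# 2020-01-10 sits on the open left edge of the first bin, so it gets no bin
daily['week'] = pd.cut(daily['date'], bins)

# Sum the daily increases, but keep the largest running total seen in each week
weekly = (daily.dropna(subset=['week'])
               .groupby('week', observed=True)
               .agg(positiveIncrease=('positiveIncrease', 'sum'),
                    positive=('positive', 'max')))
print(weekly)
```

With this choice, the weekly `positive` value stays a running total instead of growing to the sum of seven daily running totals.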

1.2 Load and View Spotify Top 10 Weekly Data

As mentioned above, Spotify provides its top 10 song list for each week in 2020 as a csv file. However, since that data only has the track name, artist, streams, and URL as columns, we used the Spotify developer API to get more detailed information on the songs and stored it under data10.csv. The data from the API adds several more metrics for each song, including length, popularity, danceability, and more.

The script to run this process is under spotify_data_extraction.py. It works by traversing the list of csv files in top200_data, opening each csv file, pulling the song id from the URL column, calling the API to return track information for the specified song in an array, and storing the array for each song in a list. It then creates a dataframe of this 2-D list to organize the information and converts the dataframe to a csv file (data10.csv).
We decided to create a separate .py file for the extractor script so we wouldn't have to run it every time we opened this Jupyter notebook.
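
The "pull the song id from the URL" step can be sketched as follows. The extraction script itself is not shown here, so this is only an illustration: the helper name `track_id_from_url` is hypothetical, and it assumes the URL column holds links whose last path segment is the track id. The example URL and id are made up.

```python
from urllib.parse import urlparse

def track_id_from_url(url):
    """Return the last path segment of a track URL, ignoring any query string."""
    path = urlparse(url).path.rstrip('/')
    return path.rsplit('/', 1)[-1]

# Hypothetical URL in the assumed ".../track/<id>" shape
example_url = 'https://open.spotify.com/track/EXAMPLEID123?si=abc'
print(track_id_from_url(example_url))
```

The id returned by such a helper is what would then be passed to the API call that fetches the track's features.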

In [356]:
spotify_data = pd.read_csv('data.csv')
spotify_data.head()
Out[356]:
name album artist release_date length popularity danceability acousticness danceability.1 energy instrumentalness liveness loudness speechiness tempo time_signature start_date end_date
0 The Box Please Excuse Me For Being Antisocial Roddy Ricch 2019-12-06 196652 90 0.896 0.1040 0.896 0.586 0.00000 0.7900 -6.687 0.0559 116.971 4 2020-01-03 2020-01-10
1 ROXANNE ROXANNE Arizona Zervas 2019-10-10 163636 88 0.621 0.0522 0.621 0.601 0.00000 0.4600 -5.616 0.1480 116.735 5 2020-01-03 2020-01-10
2 Yummy Yummy Justin Bieber 2020-01-03 210426 79 0.687 0.3660 0.687 0.514 0.00000 0.1160 -6.612 0.0897 145.921 4 2020-01-03 2020-01-10
3 Circles Hollywood's Bleeding Post Malone 2019-09-06 215280 90 0.695 0.1920 0.695 0.762 0.00244 0.0863 -3.497 0.0395 120.042 4 2020-01-03 2020-01-10
4 BOP KIRK DaBaby 2019-09-27 159714 85 0.769 0.1890 0.769 0.787 0.00000 0.1290 -3.909 0.3670 126.770 4 2020-01-03 2020-01-10

Then, we assigned each song to its corresponding week bin, as we did with the COVID data.

In [357]:
spotify_data['start_date']  = pd.to_datetime(spotify_data['start_date'])

# Using the same interval index created for the covid_data, create a new column week which has the week interval
spotify_data['week'] = pd.cut(spotify_data['start_date'], bins)

# Now that week is included, we can reindex with only the columns we need
spotify_data = spotify_data.reindex(columns=['week','length', 'popularity',
       'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 
       'loudness', 'speechiness', 'tempo', 'time_signature'])

Now that every row has a week interval, we group the rows by week and take the mean of each column. The following table contains the weekly Spotify averages during the pandemic.

In [358]:
grouped_spotify = spotify_data.groupby(['week']).mean()
grouped_spotify.head()
Out[358]:
length popularity danceability acousticness energy instrumentalness liveness loudness speechiness tempo time_signature
week
(2020-01-10, 2020-01-17] 215694.400000 76.800000 0.736567 0.251691 0.543553 0.034003 0.202063 -7.607967 0.161413 121.679600 4.033333
(2020-01-17, 2020-01-24] 197874.266667 79.166667 0.725900 0.249792 0.580533 0.030814 0.171410 -6.569833 0.160277 130.922133 4.033333
(2020-01-24, 2020-01-31] 198965.700000 81.166667 0.726300 0.250979 0.565867 0.030795 0.178673 -6.502900 0.160463 128.848500 4.033333
(2020-01-31, 2020-02-07] 192784.100000 79.600000 0.719967 0.212842 0.594967 0.022089 0.194757 -6.277200 0.137230 128.143767 4.000000
(2020-02-07, 2020-02-14] 197652.433333 78.333333 0.733367 0.259486 0.574333 0.022342 0.167743 -6.568800 0.141607 129.174133 4.033333

For those interested in the definition of each variable in the table above, here's Spotify's definitions.

1.3 Merge the Spotify and Covid Data

At this point both the grouped_covid data and the grouped_spotify data have been grouped with the same time interval, week. Hence, we merge the dataframes so we can analyze them together.

In [359]:
merged = grouped_covid.merge(grouped_spotify, left_on='week', right_on='week')[:46]
merged.tail()
Out[359]:
week week_num positiveIncrease negativeIncrease positive negative totalTestResultsIncrease totalTestResults length popularity danceability acousticness energy instrumentalness liveness loudness speechiness tempo time_signature
41 (2020-10-23, 2020-10-30] 42 548531 6442855 60987944.0 8.159270e+08 8724944 1001566752 185245.733333 89.066667 0.673300 0.245777 0.594900 0.000036 0.146427 -6.677767 0.121507 120.441200 3.900000
42 (2020-10-30, 2020-11-06] 43 755922 7010609 65678833.0 8.625697e+08 9554208 1064977639 188892.200000 91.833333 0.696067 0.258977 0.617433 0.000303 0.165863 -6.788200 0.123557 120.558833 3.966667
43 (2020-11-06, 2020-11-13] 44 971012 7293300 71671302.0 9.129258e+08 10192332 1134487741 193855.100000 91.300000 0.690100 0.241555 0.597300 0.004369 0.133317 -6.791800 0.140933 125.320600 3.966667
44 (2020-11-13, 2020-11-20] 45 1169882 8311319 79323733.0 9.680047e+08 11799518 1212343446 184234.300000 92.333333 0.689467 0.268086 0.590600 0.004340 0.155223 -7.175900 0.111603 122.797867 3.966667
45 (2020-11-20, 2020-11-27] 46 1167379 8847032 87662357.0 1.030074e+09 12496773 1300019098 181267.100000 89.600000 0.661500 0.344055 0.568900 0.004360 0.175580 -7.995867 0.099927 125.403267 3.933333

2. Data Analysis and Visualization

To reiterate, we wanted to determine if there's a relationship between COVID's progression and the characteristics of top 10 songs on Spotify. So we'll first focus on finding correlations between the COVID data and the various song features from our Spotify data.

2.1 Correlation Heat Map

We use a correlation heat map on the merged dataframe to visualize the columns with similar linear trends. Essentially, this heat map lays out the correlation coefficients between pairings for each dataframe column.

In [360]:
sns.set(rc={'figure.figsize':(20,20)})
sns.heatmap(abs(merged.corr(method="spearman")), annot=True).set_title('Correlation of Spotify and COVID Data', fontsize=18)
Out[360]:
Text(0.5, 1.0, 'Correlation of Spotify and COVID Data')

Cells with a lighter color signify a strong correlation and cells with a darker color signify a weak one. Because we plot absolute values, the map shows the strength of each correlation but not its direction. We only care about correlations between Spotify data columns and COVID data columns. We observe strong relationships between the danceability of the top 10 songs and the weekly increase in positive COVID cases (danceability vs. positiveIncrease), as well as between danceability and the total number of positive cases (danceability vs. positive). We observe another strong correlation between the popularity of a song and the positive increase (popularity vs. positiveIncrease), as well as the total number of positive cases (popularity vs. positive).
Spotify defines danceability as a song's suitability for dancing. The range for this metric is 0-1. Popularity is ranked from 0-100 and is determined by how many times users played a song and how recent those plays are. Songs played a lot in the past will have a lower popularity than those played a lot now. In this project, we will further explore these correlations.
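
Since only the COVID-versus-Spotify pairings matter, one option is to slice the correlation matrix down to that block and keep the signs. The sketch below uses a small made-up frame `toy` in place of `merged`; the column lists `covid_cols` and `song_cols` are names chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in for the merged weekly table
toy = pd.DataFrame({
    'positiveIncrease': [1, 4, 9, 16, 25, 36],
    'positive': [1, 5, 14, 30, 55, 91],
    'danceability': [0.80, 0.78, 0.75, 0.71, 0.70, 0.66],
    'popularity': [60, 65, 63, 70, 72, 75],
})

covid_cols = ['positiveIncrease', 'positive']
song_cols = ['danceability', 'popularity']

# Spearman correlations, restricted to rows = COVID columns, columns = song traits
cross = toy.corr(method='spearman').loc[covid_cols, song_cols]
print(cross)
```

Keeping the sign makes it visible at a glance whether a trait rises or falls with case counts, which the absolute-value heat map hides.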

2.2 Danceability of Top 10 Songs vs Total Cases and Increase in Cases

Now that we've found a correlation between danceability and both the increase in and total number of COVID cases, we can explore the relationships further.
Since danceability ranges from 0 to 1, we must rescale the positive increase in order to graph it alongside danceability.
In this case, min-max normalization is applied to the positive increase so that it also ranges from 0 to 1.

In [361]:
merged['stand_posInc'] = (merged.positiveIncrease-merged.positiveIncrease.min())/(merged.positiveIncrease.max()-merged.positiveIncrease.min())
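
The same rescaling is applied again later to popularity, so it could be wrapped in a small helper. This is a minimal sketch; the name `min_max` and the choice to return zeros for a constant column (where the formula would divide by zero) are assumptions made for this example.

```python
import pandas as pd

def min_max(series):
    """Rescale a numeric Series to the range [0, 1]."""
    low, high = series.min(), series.max()
    if high == low:
        # A constant column has no spread; return zeros instead of dividing by zero
        return pd.Series(0.0, index=series.index)
    return (series - low) / (high - low)

scaled = min_max(pd.Series([10, 20, 40]))
print(scaled.tolist())
```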

We then graph the progression of positive increases in cases and danceability throughout the pandemic and fit a regression line to determine the general slope/trend of both variables.

In [362]:
melted_merge = merged[['week_num', 'stand_posInc', 'danceability']]
melted_merge = melted_merge.melt('week_num', var_name='dance&posInc', value_name='values')

sns.relplot(data=melted_merge, x='week_num', y='values', hue='dance&posInc', kind='line',height=8, aspect=1).fig.suptitle('Positive Increase and Danceability over Time')

plt.subplots_adjust(top=0.9)

# calculate slope
x = range(0, len(merged.week_num))
posInc_reg = np.polyfit(x, merged.stand_posInc, 1)
posInc_line = np.poly1d(posInc_reg)

dance_reg = np.polyfit(x, merged.danceability, 1)
dance_line = np.poly1d(dance_reg)


print(f'Slope of trend in positive test increase: {posInc_line.c[0]}')
print(f'Slope of trend in danceability of top songs: {dance_line.c[0]}')
Slope of trend in positive test increase: 0.014381765835648484
Slope of trend in danceability of top songs: -0.001875937917566037

We can see that when the weekly positive increase rises, the danceability of the top 10 songs tends to fall, most visibly around week 10 and week 28, when COVID cases spiked. Week 10 falls in March, when the stay-at-home orders were enacted, and week 28 falls in July, after the Fourth of July celebrations. Danceability appears to dip when the number of positive tests spikes, but further exploration is needed to see whether danceability is affected in general.
Since popularity has a notable correlation with the positive increase in cases as well, we'll delve into its relationship in the same manner.

In [363]:
x = range(0, len(merged.week_num))
pop_reg = np.polyfit(x, merged.popularity, 1)
pop_line = np.poly1d(pop_reg)


print(f'Slope of trend in positive test increase: {posInc_line.c[0]}')
print(f'Slope of trend in popularity of top songs: {pop_line.c[0]}')


merged['stand_pop'] = (merged.popularity-merged.popularity.min())/(merged.popularity.max()-merged.popularity.min())

melted_merge = merged[['week_num', 'stand_posInc', 'stand_pop']]
melted_merge = melted_merge.melt('week_num', var_name='pop&posInc', value_name='values')

sns.relplot(data=melted_merge, x='week_num', y='values', hue='pop&posInc', kind='line',height=8, aspect=1).fig.suptitle('Positive Increase and Popularity over Time')
plt.subplots_adjust(top=0.9)
Slope of trend in positive test increase: 0.014381765835648484
Slope of trend in popularity of top songs: 0.2966923630383393

Somewhat similarly to danceability, peaks in the increase in cases (i.e., significant dates in the COVID timeline) coincide with dips in the popularity of top 10 songs. However, both popularity and the increase in cases had positive slopes overall.

2.3 Pre-COVID vs. During COVID

So far, we've seen an apparent relationship between COVID cases (both the weekly increase and the running total) and the danceability and popularity of Spotify's weekly top 10 songs. This gives some statistical support to the idea that COVID's progression is associated with the characteristics of songs that trend on Spotify. But to further explore this relationship, let's see how listening trends in 2019 and the months leading up to the pandemic compare to trends during the pandemic.

We'll extract Spotify's Top 10 song data from 2/15/19 until 2/25/20. We define 2/25/20 as the beginning of the pandemic in the U.S. because the country saw its first positive test results during this week.

We start by running the spotify_data_extraction script on the csv files that contain the top 10 Spotify songs during the mentioned timeframe. The script returns a csv file compiling the track info for these songs, which we use to create a DataFrame.

In [364]:
pre_spotify = pd.read_csv('pre_data.csv')
pre_spotify.sort_values(by=['start_date'], inplace=True)

Then we create bins denoting the start and end date intervals of each week the Spotify songs trended.

In [365]:
# We want to analyze Spotify data from mid-February 2019 to mid-February 2020 to compare with Spotify data during the pandemic
i = pd.to_datetime('02/14/2019')
pre_bins = []
while i < pd.to_datetime('02/18/2020'):
    temp = i + pd.Timedelta('7 days')
    pre_bins.append((i,temp))
    i = temp
pre_bins = pd.IntervalIndex.from_tuples(pre_bins)

After categorizing each song by the week it trended, we average the characteristics of the songs for each week. More precisely, for each week we find the average danceability, popularity, etc. of all the songs from that week, reducing the DataFrame to a table of weekly averages.

In [366]:
pre_spotify['start_date'] = pd.to_datetime(pre_spotify['start_date'])

# Using the interval index created above, create a new column week which has the week interval of the data
pre_spotify['week'] = pd.cut(pre_spotify['start_date'], pre_bins)

# Now that every row has a week interval, group by week and take the mean of each column
pre_spotify = pre_spotify.groupby(['week']).mean().reset_index()

# Number the weeks to easily identify
pre_spotify['week_num'] = list(range(1,len(pre_bins)+1))
grouped_pre = pre_spotify[:52]


pre_spotify = grouped_pre.groupby(['week_num']).mean().reset_index()
pre_spotify.head()
Out[366]:
week_num length popularity danceability acousticness danceability.1 energy instrumentalness liveness loudness speechiness tempo time_signature
0 1 193647.6 63.8 0.7991 0.25516 0.7991 0.5111 1.252100e-04 0.16664 -7.8419 0.14818 114.6845 4.0
1 2 199759.0 60.6 0.8034 0.24636 0.8034 0.5188 1.255160e-04 0.16755 -8.0770 0.14676 115.4888 4.0
2 3 197647.5 58.8 0.7705 0.26578 0.7705 0.5289 5.160000e-07 0.15575 -7.1664 0.12022 113.8648 4.0
3 4 196878.2 69.9 0.7732 0.27028 0.7732 0.5513 5.160000e-07 0.13624 -7.1182 0.12163 120.7803 4.1
4 5 194975.1 70.2 0.7690 0.22921 0.7690 0.5567 5.160000e-07 0.13684 -7.0484 0.14759 121.5846 4.1

How do the trends in danceability and popularity of top Spotify songs compare to these trends pre-COVID? Let's find out. Let's talk about danceability first. We create a relational plot that compares the danceability of top 10 songs throughout February 2019 to February 2020 to those throughout the pandemic this year.

In [367]:
grouped_spotify['week_num'] = list(range(1,len(grouped_spotify)+1))

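# Copy first so the label column is not written onto a slice or onto grouped_spotify itself
temp1 = grouped_pre.copy()
temp2 = grouped_spotify.copy()
temp1["before_after"] = "Before COVID"
temp2["before_after"] = "During COVID"

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
all_spotify = pd.concat([temp1, temp2])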

sns.relplot(y="danceability", x="week_num", hue="before_after", data=all_spotify)
Out[367]:
<seaborn.axisgrid.FacetGrid at 0x7fccbb5e2100>

The danceability of top 10 Spotify songs pre-COVID is shown in blue and the danceability during the pandemic is shown in orange. The decline in danceability appears more noticeable over the year before the pandemic than among trending songs during the pandemic.

For a more concrete understanding of their differences, let's look at their regressions.

Below, we plot the danceability of songs before COVID and during COVID on separate graphs and draw regression lines for both.
Seaborn's lmplot plots the data points and draws the regression line, but it doesn't report the actual equation of the line. However, numpy allows us to calculate it ourselves.

In [368]:
fig = sns.lmplot(y="danceability", x="week_num", data=pre_spotify);
fig = fig.fig
fig.suptitle("Danceability of Spotify's Top 10 Songs Weekly Before COVID")
fig.subplots_adjust(top=0.9)

fig = sns.lmplot(y="danceability", x="week_num", data=grouped_spotify);
fig = fig.fig
fig.suptitle("Danceability of Spotify's Top 10 Songs Weekly During COVID")
fig.subplots_adjust(top=0.9)

x = range(0, len(pre_spotify.week_num))
pre_reg = np.polyfit(x, pre_spotify.danceability, 1)
pre_line = np.poly1d(pre_reg)

x = range(0, len(grouped_spotify.week_num))
during_reg = np.polyfit(x, grouped_spotify.danceability, 1)
during_line = np.poly1d(during_reg)


print('Top 10 songs on Spotify pre-COVID:')
print(f'\tSlope of trend of danceability: {pre_line.c[0]}\n\tStandard deviation of danceability: {pre_spotify.danceability.std()}\n')

print('Top 10 songs on Spotify during COVID:')
print(f'\tSlope of trend in danceability: {during_line.c[0]}\n\tStandard deviation of danceability: {grouped_spotify.danceability.std()}')
Top 10 songs on Spotify pre-COVID:
	Slope of trend of danceability: -0.0023888329206864165
	Standard deviation of danceability: 0.05823449035812011

Top 10 songs on Spotify during COVID:
	Slope of trend in danceability: nan
	Standard deviation of danceability: 0.03651253901195181

The pre-COVID regression confirms part of our suspicion: the danceability of top 10 songs declined steadily in the year before the pandemic (a slope of about -0.0024 per week), although it rose again right before the pandemic began.
The during-COVID fit above returned nan, most likely because grouped_spotify still contains weeks with no Spotify data, which np.polyfit cannot handle. The fit over the 46 merged weeks in section 2.2 gave a slope of about -0.0019, so danceability also declined during COVID, albeit less rapidly. Danceability fluctuated during this period, and we saw that these fluctuations coincided with changes in the weekly case counts. Its overall spread, however, was smaller: the standard deviation of danceability during COVID (about 0.037) is lower than before COVID (about 0.058).
These findings are compatible with the correlation we found between COVID and danceability in the previous section, though they do not by themselves show that COVID is the cause. It seems something different (possibly COVID) is affecting listening trends this year! Let's see if this holds true for top 10 songs' popularity as well.
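
The nan slope reported above for the during-COVID fit is what one would expect if some weeks have no data. A minimal sketch of a slope calculation that skips missing weeks is shown below; the helper name `trend_slope` and the toy values are assumptions for illustration, and the result is not a recomputation of the notebook's numbers.

```python
import numpy as np

def trend_slope(values):
    """Slope of a degree-1 least-squares fit against the week index, ignoring NaN weeks."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y))
    keep = ~np.isnan(y)          # mask out weeks with no data
    slope, _intercept = np.polyfit(x[keep], y[keep], 1)
    return slope

# Toy series with one missing week; the remaining points lie on y = 1 + 2x
weekly_values = [1.0, 3.0, np.nan, 7.0, 9.0]
print(trend_slope(weekly_values))
```

Keeping the original week index for `x` (rather than renumbering after dropping rows) preserves the spacing between weeks, so a gap does not distort the slope.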

In [369]:
sns.relplot(y="popularity", x="week_num", hue="before_after", data=all_spotify)
Out[369]:
<seaborn.axisgrid.FacetGrid at 0x7fccba64c160>

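It looks like for most weeks this year, the top 10 songs are more popular than the songs from the same weeks a year before. Keep in mind that Spotify's popularity score favors recent plays, so songs that charted in 2019 would be expected to score lower today.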

In [370]:
fig = sns.lmplot(y="popularity", x="week_num", data=pre_spotify);
fig = fig.fig
fig.suptitle("Popularity of Spotify's Top 10 Songs Weekly Before COVID")
fig.subplots_adjust(top=0.9)

fig = sns.lmplot(y="popularity", x="week_num", data=grouped_spotify);
fig = fig.fig
fig.suptitle("Popularity of Spotify's Top 10 Songs Weekly During COVID")
fig.subplots_adjust(top=0.9)

x = range(0, len(pre_spotify.week_num))
pre_reg = np.polyfit(x, pre_spotify.popularity, 1)
pre_line = np.poly1d(pre_reg)

x = range(0, len(grouped_spotify.week_num))
during_reg = np.polyfit(x, grouped_spotify.popularity, 1)
during_line = np.poly1d(during_reg)


print('Top 10 songs on Spotify pre-COVID:')
print(f'\tSlope of trend of popularity: {pre_line.c[0]}\n\tStandard deviation of popularity: {pre_spotify.popularity.std()}\n')

print('Top 10 songs on Spotify during COVID:')
print(f'\tSlope of trend in popularity: {during_line.c[0]}\n\tStandard deviation of popularity: {grouped_spotify.popularity.std()}')
Top 10 songs on Spotify pre-COVID:
	Slope of trend of popularity: 0.3894262784939809
	Standard deviation of popularity: 12.617436730695024

Top 10 songs on Spotify during COVID:
	Slope of trend in popularity: nan
	Standard deviation of popularity: 6.572372436324693

For popularity, the pre-COVID slope was about 0.39 per week. The during-COVID fit again returned nan, but the fit over the merged weeks in section 2.2 gave a slope of about 0.30, so popularity trended upward in both periods at a similar rate. The standard deviation of popularity during COVID (about 6.6) was lower than before COVID (about 12.6). The data that falls farther from the regression during COVID aligns with the weeks in which increases in positive cases peaked.

To recap, we've seen a relationship between COVID cases (the weekly positive increase and the total number of cases) and the danceability and popularity of the weekly top 10 songs on Spotify, based on their correlation coefficients, distributions, and regression lines. The strength of the correlation suggests that case counts may have some predictive value for the danceability and popularity of these songs.

So let's predict how these trends will continue as the pandemic progresses.

We'll use sklearn and statsmodels to build our regression model. We used this article to decide which machine learning algorithm best fits our data. Since the relationships appear roughly linear and don't involve classification, we use a linear regression model to perform predictive analysis. We use sklearn's train_test_split() function to split the data into training and testing sets. The training data, as its name suggests, is used to fit a linear regression (with statsmodels' ols) that predicts the danceability of top 10 songs from the positive increase in and total number of COVID cases. We then plot the distribution of the model's predictions alongside the distribution of the actual testing set values. The testing set is the data set aside to check the accuracy of the prediction.

Visually, the prediction somewhat aligns with the actual data, but its highest peak rises much higher than the actual testing set's. To more empirically test the prediction's validity, we use statistical analysis.

In [371]:
from sklearn import model_selection

X = grouped_covid.reset_index()[['positive', 'positiveIncrease']]
y = grouped_spotify[:48].danceability

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.3)
# Copy before adding the target column so pandas does not warn about writing to a slice
X_train = X_train.copy()
X_train['dance'] = y_train.values


lr = smf.ols(formula='dance ~ positive + positiveIncrease', data=X_train).fit()
preds_lr = lr.predict(X_test)

# Compare the distribution of predictions with the distribution of actual test values
f, ax = plt.subplots()
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
sns.distplot(preds_lr, hist=False, label="Linear Regression Predictions", ax=ax)
Out[371]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fccba54dc10>
In [372]:
from scipy.stats import f as ft

# F-Test to evaluate goodness of fit
test = lr.f_test(np.identity(len(lr.params)))
print('Model - Calculated F-Statistic: ' + str(ft.ppf(.95,test.df_num,test.df_denom)) + \
    '\nF-Value: ' + str(test.fvalue[0][0]) + '\nP-Value: ' + str(test.pvalue))
Model - Calculated F-Statistic: 3.327654498572059
F-Value: 8773.145591210143
P-Value: 4.5048911190439053e-41
/opt/conda/lib/python3.8/site-packages/statsmodels/base/model.py:1830: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 3, but rank is 2
  warnings.warn('covariance of constraints does not have full '

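The F-value (about 8773) is far greater than the critical value at the 5% level (about 3.33, labeled "Calculated F-Statistic" in the output), and the p-value is < 0.05. Note that this test restricts every parameter, including the intercept, to zero, so a large F-value here says that the fitted model differs from one that always predicts zero; it does not by itself show that the COVID predictors add information. Under that null hypothesis, the model is statistically significant.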
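
A complementary check is to measure the prediction error on the held-out weeks directly. The sketch below swaps in plain numpy least squares in place of the statsmodels fit above and runs on synthetic data, so the numbers it prints say nothing about the real model. The coefficients, noise level, and seed are all made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 48 weeks, two predictors, a target near 0.8 with small noise
n = 48
X = rng.uniform(0, 1, size=(n, 2))
y = 0.8 - 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.002, size=n)

# 70/30 split by shuffling the row indices
order = rng.permutation(n)
n_test = int(round(0.3 * n))
test_idx, train_idx = order[:n_test], order[n_test:]

def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(features)), features])

# Ordinary least squares on the training rows only
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)
pred = add_intercept(X[test_idx]) @ coef

# Held-out error: RMSE in target units, R^2 relative to predicting the test mean
residual = y[test_idx] - pred
rmse = float(np.sqrt(np.mean(residual ** 2)))
r2 = 1.0 - float(np.sum(residual ** 2) / np.sum((y[test_idx] - y[test_idx].mean()) ** 2))
print(f'held-out RMSE: {rmse:.4f}, held-out R^2: {r2:.3f}')
```

Unlike the F-test, a held-out R^2 compares the model against simply predicting the average, which is closer to the question of whether the case counts help.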

We go on to do the same for top 10 Spotify song popularity vs increase in and total COVID cases.

In [373]:
X = grouped_covid.reset_index()[['positive', 'positiveIncrease']]
y = grouped_spotify['popularity']

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.3)
# Copy before adding the target column so pandas does not warn about writing to a slice
X_train = X_train.copy()
X_train['pop'] = y_train.values

lr = smf.ols(formula='pop ~ positive + positiveIncrease', data=X_train).fit()
preds_lr = lr.predict(X_test)

# Compare the distribution of predictions with the distribution of actual test values
f, ax = plt.subplots()
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
sns.distplot(preds_lr, hist=False, label="Linear Regression Predictions", ax=ax)
Out[373]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fccc00a0a60>

The distribution of the predictions matches the distribution of the actual testing data even more closely, which is a good sign.

In [379]:
# F-Test to evaluate goodness of fit

test = lr.f_test(np.identity(len(lr.params)))
print('Model - Calculated F-Statistic: ' + str(ft.ppf(.95,test.df_num,test.df_denom)) + \
    '\nF-Value: ' + str(test.fvalue[0][0]) + '\nP-Value: ' + str(test.pvalue))
Model - Calculated F-Statistic: 3.340385558237759
F-Value: 1608.8496949814373
P-Value: 1.264434463198663e-29
/opt/conda/lib/python3.8/site-packages/statsmodels/base/model.py:1830: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 3, but rank is 2
  warnings.warn('covariance of constraints does not have full '

Again, the F-value is greater than the critical value (labeled "Calculated F-Statistic" in the output) and the p-value is < 0.05, so this model for predicting the popularity of trending songs on Spotify, using the total and the increase in COVID cases as predictors, is statistically significant under the same test.

Conclusion

After comparing the COVID data and Spotify data, we were able to find a correlation between the current status of the pandemic and Spotify listening trends, specifically between the rate at which people are catching COVID and the total number of cases on one side, and the danceability and popularity of the top 10 songs on Spotify on the other. In general, we noticed that danceability decreased and popularity increased as more COVID cases arose. Based on this, we fit linear models that estimate these listening traits from case counts as the pandemic progresses.
Although we can't explain the reason for the relationships between COVID and top Spotify songs from our findings, our analysis can support future studies that may be more robust and more clearly explain why these relationships exist. However, we hypothesize that danceability decreased as many Spotify users became sick or as users' moods declined as the pandemic progressed. Popularity may have increased as users spent more time at home and had more time and opportunities to listen to music.
At a wider scope, our findings support and provide more context to existing studies on how people use music to cope with stressful events in general.
