1.1

Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

Spotify is used by listeners in many countries, and those listeners use playlists to collect and store songs. Playlists are created for many different situations and events, including weddings, birthdays, and hangouts with friends. At many of these events people get together to have fun and enjoy themselves, and one of the main activities is dancing. This brings us to our research question: Which variables affect danceability the most?

1.2

Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)

Our plan is to use the Spotify dataset to see which variables affect danceability the most. We will identify the numeric variables that are correlated with danceability and measure how strong each correlation is.
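The correlation screen described here could be sketched as follows. This is only an illustration on simulated data: the `toy` data frame, its values, and the built-in link between valence and danceability are assumptions made for the example, not results from the Spotify data.

```r
# Toy stand-in for the Spotify data; all values are simulated (assumption).
set.seed(1)
n <- 200
toy <- data.frame(
  valence = runif(n),
  energy  = runif(n),
  tempo   = rnorm(n, mean = 120, sd = 25),
  genre   = sample(c("pop", "rock"), n, replace = TRUE)
)
# Simulated response: depends on valence by construction.
toy$danceability <- 0.4 + 0.3 * toy$valence + rnorm(n, sd = 0.05)

# Keep only the numeric columns, then correlate each one with danceability.
numeric_cols <- toy[sapply(toy, is.numeric)]
predictors   <- setdiff(names(numeric_cols), "danceability")
cors <- sapply(predictors, function(v) {
  cor(numeric_cols[[v]], numeric_cols$danceability)
})

# Sort by absolute strength, strongest first.
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 3))
```

On the real data, `toy` would be replaced by the cleaned Spotify data frame, and the ranking would show which audio features are worth carrying into the models.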

1.3

Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

We will use a linear regression and a decision tree to address our research question.

1.4

Explain how your analysis will help the consumer of your analysis.

Our analysis will be especially helpful for DJs, as well as for people who are throwing parties and events with friends and family.

2.1

All packages used are loaded upfront so the reader knows which are required to run the script

2.2

Messages and warnings resulting from loading the package are suppressed.

2.3

Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

library(tidyverse) # loads dplyr (the %>% pipe and distinct() used for cleaning) and ggplot2 (data visualizations such as regression plots)

library(rpart) # includes the decision tree function that we are using for our analysis

library(rpart.plot) # adds the ability to plot the decision trees fitted with rpart

library(knitr) # allows us to display our dataset in a condensed format

library(kableExtra) # provides kable_styling() and scroll_box(), so wide tables can be scrolled all the way across
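One way to suppress the start-up messages from package loading is sketched below. It uses `rpart` only because that package ships with standard R installations; the same wrapper should work for the other packages listed above. In an R Markdown chunk, the knitr options `message = FALSE` and `warning = FALSE` are a common alternative.

```r
# Load a package without printing its start-up messages.
# rpart is used here as a stand-in for the full package list (assumption).
suppressPackageStartupMessages({
  library(rpart)
})

# Confirm the package is attached.
print("package:rpart" %in% search())
```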

3.1

Original source where the data was obtained is cited and, if possible, hyperlinked.

spotify <- read.csv("spotify_songs.csv")

3.2

Source data is thoroughly explained.

The source data was collected from Spotify via the spotifyr package and was used in a blog post that explored audio features by genre. It was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff on January 21, 2020. The dataset includes 32,833 records and 23 variables.

3.3

Data importing and cleaning steps are explained in the text and follow a logical process.

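Checking for NAs and replacing them with a placeholder

Only track_name, track_artist, and track_album_name contain NAs (5 each), so we replace those values with "Not_Stated" instead of dropping the rows.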

colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
spotify$track_name[is.na(spotify$track_name)] <- "Not_Stated"

spotify$track_artist[is.na(spotify$track_artist)] <- "Not_Stated"

spotify$track_album_name[is.na(spotify$track_album_name)] <- "Not_Stated"
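The same replacement can be written as a loop over the affected columns and then verified. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical), not the Spotify data.

```r
# Hypothetical toy data with missing text values.
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist A", "Artist B", NA),
  tempo        = c(120, 98, 131),
  stringsAsFactors = FALSE
)

# Replace NAs in each text column with the same placeholder used above.
text_cols <- c("track_name", "track_artist")
for (col in text_cols) {
  toy[[col]][is.na(toy[[col]])] <- "Not_Stated"
}

# Re-count NAs per column to confirm none remain.
remaining_na <- colSums(is.na(toy))
print(remaining_na)
```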

Converting duration from milliseconds to seconds

spotify$duration_ms <- spotify$duration_ms / 1000

names(spotify)[names(spotify) == "duration_ms"] <- "duration"

Removing duplicate tracks

spotify_unique <- spotify %>% distinct(track_id, .keep_all = TRUE)

We now have 28,356 records.
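The de-duplication and unit conversion can be sanity-checked on a small example. The sketch below swaps dplyr's `distinct()` for base R's `duplicated()` so it runs without extra packages; the `toy` data frame is hypothetical.

```r
# Hypothetical toy data: track "a1" appears twice.
toy <- data.frame(
  track_id    = c("a1", "b2", "a1", "c3"),
  duration_ms = c(194754, 162600, 194754, 176616)
)

# Keep the first occurrence of each track_id (base R analogue of distinct()).
toy_unique <- toy[!duplicated(toy$track_id), ]

# Convert milliseconds to seconds and rename the column by name, not position.
toy_unique$duration_ms <- toy_unique$duration_ms / 1000
names(toy_unique)[names(toy_unique) == "duration_ms"] <- "duration"

print(toy_unique)
```

Renaming by name rather than by column index keeps the step correct even if the column order changes.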

3.4

Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

# Preview the first six rows of the cleaned, de-duplicated data
spotify_preview <- spotify %>% distinct(track_id, .keep_all = TRUE) %>% head()

spotify_preview %>%
  kable(format = "html", align = "lccrr",
        caption = "Spotify Preview") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "300px")
Spotify Preview
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194.754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162.600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176.616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169.093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189.052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163.049 |

3.5

Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

We plan to build a linear regression and a decision tree using the variables in the Spotify dataset. So far, we have found only valence, speechiness, tempo, energy, and acousticness to have a significant effect on danceability.
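A consolidated summary table for the variables of concern could be built along these lines. This is a sketch on simulated values: the `toy` data frame is an assumption, and the numbers it prints say nothing about the real Spotify data.

```r
# Simulated stand-in for the variables of concern (assumption).
set.seed(2)
toy <- data.frame(
  danceability = runif(50),
  valence      = runif(50),
  speechiness  = runif(50, min = 0, max = 0.5),
  tempo        = rnorm(50, mean = 120, sd = 25),
  energy       = runif(50),
  acousticness = runif(50)
)

# One row per variable, one column per summary statistic.
summary_tbl <- data.frame(
  variable = names(toy),
  mean     = sapply(toy, mean),
  sd       = sapply(toy, sd),
  min      = sapply(toy, min),
  max      = sapply(toy, max),
  row.names = NULL
)

# Round the numeric columns for display.
summary_tbl[-1] <- round(summary_tbl[-1], 3)
print(summary_tbl)
```

The resulting data frame can be passed to `kable()` in the same way as the preview table.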

4.1

Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

We are interested in seeing what makes a song danceable. We plan to use two different models to check whether our findings can be replicated. Before running these models, we converted the duration_ms variable from milliseconds to seconds and renamed it duration.

4.2

What types of plots and tables will help you to illustrate the findings to your questions?

We intend to fit two types of model to show which variables have a significant influence on danceability. First, we will use a linear regression to measure the strength of each numeric variable's relationship with danceability. Second, we will fit a decision tree to further support our findings on which song features matter most in deciding whether a song is danceable. We also plan to produce plots to validate our findings.
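The two-model comparison could look roughly like the sketch below. It runs on simulated data in which danceability depends on valence and energy by construction, so the `toy` data frame, the coefficients, and the chosen predictors are all assumptions made for illustration.

```r
library(rpart)

# Simulated data (assumption): danceability is driven mostly by valence.
set.seed(3)
n <- 300
toy <- data.frame(
  valence = runif(n),
  energy  = runif(n),
  tempo   = rnorm(n, mean = 120, sd = 25)
)
toy$danceability <- 0.4 + 0.3 * toy$valence - 0.1 * toy$energy +
  rnorm(n, sd = 0.05)

# Model 1: linear regression; the coefficient table shows estimates and p-values.
lm_fit <- lm(danceability ~ valence + energy + tempo, data = toy)
print(summary(lm_fit)$coefficients)

# Model 2: regression tree on the same predictors.
tree_fit <- rpart(danceability ~ valence + energy + tempo,
                  data = toy, method = "anova")

# Variable importance from the tree, largest first.
print(tree_fit$variable.importance)
```

If both models rank the same features highest, that agreement supports the findings; `rpart.plot(tree_fit)` could then be used to draw the tree.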

4.3

What do you not know how to do right now that you need to learn to answer your questions?

We will need to learn how to run a linear regression model, as well as a decision tree, to answer our questions. We will also need to learn how to build graphs that display our results clearly.

4.4

Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

We will implement a linear regression and a decision tree model to answer our questions.