Project 1

Author

Ashley Ramirez

Load Library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load Dataset

spotify <- read_csv("spotifysongs.csv")
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore and clean the data

View the structure of the data

str(spotify)
spc_tbl_ [2,000 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ artist          : chr [1:2000] "Britney Spears" "blink-182" "Faith Hill" "Bon Jovi" ...
 $ song            : chr [1:2000] "Oops!...I Did It Again" "All The Small Things" "Breathe" "It's My Life" ...
 $ duration_ms     : num [1:2000] 211160 167066 250546 224493 200560 ...
 $ explicit        : logi [1:2000] FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ year            : num [1:2000] 2000 1999 1999 2000 2000 ...
 $ popularity      : num [1:2000] 77 79 66 78 65 69 86 68 75 77 ...
 $ danceability    : num [1:2000] 0.751 0.434 0.529 0.551 0.614 0.706 0.949 0.708 0.713 0.72 ...
 $ energy          : num [1:2000] 0.834 0.897 0.496 0.913 0.928 0.888 0.661 0.772 0.678 0.808 ...
 $ key             : num [1:2000] 1 0 7 0 8 2 5 7 5 6 ...
 $ loudness        : num [1:2000] -5.44 -4.92 -9.01 -4.06 -4.81 ...
 $ mode            : num [1:2000] 0 1 1 0 0 1 0 1 0 1 ...
 $ speechiness     : num [1:2000] 0.0437 0.0488 0.029 0.0466 0.0516 0.0654 0.0572 0.0322 0.102 0.0379 ...
 $ acousticness    : num [1:2000] 0.3 0.0103 0.173 0.0263 0.0408 0.119 0.0302 0.0267 0.273 0.00793 ...
 $ instrumentalness: num [1:2000] 1.77e-05 0.00 0.00 1.35e-05 1.04e-03 9.64e-05 0.00 0.00 0.00 2.93e-02 ...
 $ liveness        : num [1:2000] 0.355 0.612 0.251 0.347 0.0845 0.07 0.0454 0.467 0.149 0.0634 ...
 $ valence         : num [1:2000] 0.894 0.684 0.278 0.544 0.879 0.714 0.76 0.861 0.734 0.869 ...
 $ tempo           : num [1:2000] 95.1 148.7 136.9 120 172.7 ...
 $ genre           : chr [1:2000] "pop" "rock, pop" "pop, country" "rock, metal" ...
 - attr(*, "spec")=
  .. cols(
  ..   artist = col_character(),
  ..   song = col_character(),
  ..   duration_ms = col_double(),
  ..   explicit = col_logical(),
  ..   year = col_double(),
  ..   popularity = col_double(),
  ..   danceability = col_double(),
  ..   energy = col_double(),
  ..   key = col_double(),
  ..   loudness = col_double(),
  ..   mode = col_double(),
  ..   speechiness = col_double(),
  ..   acousticness = col_double(),
  ..   instrumentalness = col_double(),
  ..   liveness = col_double(),
  ..   valence = col_double(),
  ..   tempo = col_double(),
  ..   genre = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Look for missing values

summary(spotify)
    artist              song            duration_ms      explicit      
 Length:2000        Length:2000        Min.   :113000   Mode :logical  
 Class :character   Class :character   1st Qu.:203580   FALSE:1449     
 Mode  :character   Mode  :character   Median :223280   TRUE :551      
                                       Mean   :228748                  
                                       3rd Qu.:248133                  
                                       Max.   :484146                  
      year        popularity     danceability        energy      
 Min.   :1998   Min.   : 0.00   Min.   :0.1290   Min.   :0.0549  
 1st Qu.:2004   1st Qu.:56.00   1st Qu.:0.5810   1st Qu.:0.6220  
 Median :2010   Median :65.50   Median :0.6760   Median :0.7360  
 Mean   :2009   Mean   :59.87   Mean   :0.6674   Mean   :0.7204  
 3rd Qu.:2015   3rd Qu.:73.00   3rd Qu.:0.7640   3rd Qu.:0.8390  
 Max.   :2020   Max.   :89.00   Max.   :0.9750   Max.   :0.9990  
      key            loudness            mode         speechiness     
 Min.   : 0.000   Min.   :-20.514   Min.   :0.0000   Min.   :0.02320  
 1st Qu.: 2.000   1st Qu.: -6.490   1st Qu.:0.0000   1st Qu.:0.03960  
 Median : 6.000   Median : -5.285   Median :1.0000   Median :0.05985  
 Mean   : 5.378   Mean   : -5.512   Mean   :0.5535   Mean   :0.10357  
 3rd Qu.: 8.000   3rd Qu.: -4.168   3rd Qu.:1.0000   3rd Qu.:0.12900  
 Max.   :11.000   Max.   : -0.276   Max.   :1.0000   Max.   :0.57600  
  acousticness       instrumentalness       liveness         valence      
 Min.   :0.0000192   Min.   :0.0000000   Min.   :0.0215   Min.   :0.0381  
 1st Qu.:0.0140000   1st Qu.:0.0000000   1st Qu.:0.0881   1st Qu.:0.3867  
 Median :0.0557000   Median :0.0000000   Median :0.1240   Median :0.5575  
 Mean   :0.1289549   Mean   :0.0152260   Mean   :0.1812   Mean   :0.5517  
 3rd Qu.:0.1762500   3rd Qu.:0.0000683   3rd Qu.:0.2410   3rd Qu.:0.7300  
 Max.   :0.9760000   Max.   :0.9850000   Max.   :0.8530   Max.   :0.9730  
     tempo           genre          
 Min.   : 60.02   Length:2000       
 1st Qu.: 98.99   Class :character  
 Median :120.02   Mode  :character  
 Mean   :120.12                     
 3rd Qu.:134.27                     
 Max.   :210.85                     

Remove the NA values

spotify <- spotify %>% drop_na()
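Note that summary() only reports NA counts for numeric and logical columns, so missing values in character columns such as artist or genre would not show up there. A minimal sketch of a per-column check, using a small made-up tibble (toy is hypothetical, not the Spotify data):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real file
toy <- tibble(
  artist = c("A", NA, "C"),
  popularity = c(77, 65, NA)
)

# Count missing values in every column, character columns included
na_counts <- colSums(is.na(toy))
print(na_counts)

# drop_na() keeps only the rows with no missing value in any column
toy_clean <- drop_na(toy)
nrow(toy_clean)
```

Running the same colSums(is.na(...)) check on spotify before drop_na() would show how many rows, if any, the cleaning step removes.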

Let's filter information from the data set

For this project, I want to use the characteristics of songs from a single genre (pop) in my visualization.

pop_songs <- spotify %>% filter(genre == "pop")

head(pop_songs)
# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1
2 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8
3 Gigi D'… L'Am…      238759 FALSE     2011          1        0.617  0.728     7
4 Eiffel … Move…      268863 FALSE     1999         56        0.745  0.958     7
5 Bomfunk… Free…      306333 FALSE     2000         55        0.822  0.922    11
6 Anastac… I'm …      245400 FALSE     1999         64        0.761  0.716    10
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>
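One thing to keep in mind is that genre == "pop" keeps only songs whose genre is exactly "pop". Songs tagged with several genres, such as "rock, pop", are left out. The sketch below contrasts the two approaches on a hypothetical toy tibble built from the genre values shown in the str() output. The str_detect() pattern is an assumption about how a broader filter could be written, not something the analysis here uses.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data using genre strings seen in the str() output
toy <- tibble(
  song = c("s1", "s2", "s3", "s4"),
  genre = c("pop", "rock, pop", "pop, country", "rock, metal")
)

# Exact match: only rows whose genre is exactly "pop"
exact_pop <- toy %>% filter(genre == "pop")

# Broader match: any row where "pop" appears as a whole word
any_pop <- toy %>% filter(str_detect(genre, "\\bpop\\b"))

nrow(exact_pop)
nrow(any_pop)
```

The exact match is the stricter choice and keeps the sample to songs labeled only as pop, which is what the rest of this analysis works with.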

Visualization

ggplot(pop_songs, aes(x = year, y = danceability, color = as.factor(explicit))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, aes(color = "Trend Line")) +  
  labs(title = "Danceability of Pop Songs Over the Years",
       x = "Year",
       y = "Danceability",
       caption = "Source: Spotify",
       color = "Song Type") +  
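  scale_color_manual(values = c("TRUE" = "purple", "FALSE" = "orange", "Trend Line" = "green"),
                     # set the legend order explicitly so each label lines up with its value
                     breaks = c("TRUE", "FALSE", "Trend Line"),
                     labels = c("Explicit", "Non-Explicit", "Trend Line")) + 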
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")
`geom_smooth()` using formula = 'y ~ x'

Linear Regression Analysis

Let's analyze how danceability is affected by loudness, explicit content, popularity, and energy.

Linear regression analysis
model <- lm(danceability ~ loudness + as.numeric(explicit) + popularity + energy, data = pop_songs)
Summary of the linear regression analysis
summary(model)

Call:
lm(formula = danceability ~ loudness + as.numeric(explicit) + 
    popularity + energy, data = pop_songs)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.44939 -0.07688  0.01187  0.08866  0.30136 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.4950636  0.0618644   8.002 1.19e-14 ***
loudness             -0.0023410  0.0044841  -0.522  0.60190    
as.numeric(explicit)  0.0444760  0.0192404   2.312  0.02128 *  
popularity            0.0001389  0.0002834   0.490  0.62425    
energy                0.1753682  0.0544465   3.221  0.00138 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1346 on 423 degrees of freedom
Multiple R-squared:  0.0451,    Adjusted R-squared:  0.03607 
F-statistic: 4.995 on 4 and 423 DF,  p-value: 0.0006078

According to this information the formula for my model is:

Danceability = 0.4951 − 0.0023 × Loudness + 0.0445 × Explicit + 0.000139 × Popularity + 0.1754 × Energy

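Also according to this model, explicit content (p = 0.021) and energy (p = 0.001) are statistically significant predictors of danceability in pop songs at the 0.05 level. Loudness (p = 0.602) and popularity (p = 0.624) are not statistically significant in this model. The R-squared value is also low (0.045), meaning the model explains only about 4.5% of the variation in danceability, so other variables likely affect danceability as well.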
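The fitted equation can be checked by hand for a single song. The sketch below plugs in the rounded values for the first pop song as displayed in the str() and head() output (loudness -5.44, not explicit, popularity 77, energy 0.834), so the result is approximate. The names coefs and song are only for illustration.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(intercept = 0.4950636, loudness = -0.0023410,
           explicit = 0.0444760, popularity = 0.0001389, energy = 0.1753682)

# Rounded values for the first pop song, as displayed by str() and head()
song <- c(loudness = -5.44, explicit = 0, popularity = 77, energy = 0.834)

# Intercept plus the sum of each coefficient times its predictor value
predicted <- coefs[["intercept"]] + sum(coefs[names(song)] * song)
round(predicted, 3)
```

This gives a predicted danceability of about 0.665, compared with the observed value of 0.751 for that song. The gap of roughly 0.09 is in line with the residual standard error of 0.1346.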

Brief Essay

How you cleaned the dataset up

With str(), I was able to see the internal structure of the data set. Next, I used summary() to check for missing values. Then I ran spotify <- spotify %>% drop_na() to remove any rows with missing values. Finally, since I only wanted to work with one genre, pop, I used pop_songs <- spotify %>% filter(genre == "pop") to keep only the songs labeled with exactly that genre.

What the visualization represents, any interesting patterns or surprises that arise within the visualization.

The visualization represents the change in the danceability of pop songs from 1999 to 2020. Each point represents a song, the y-axis shows the danceability score, the x-axis shows the year of release, and the color of the point shows whether the song is explicit or not. Finally, the trend line shows the general trajectory of danceability over the years, that is, whether it has increased, decreased, or stabilized.

Patterns or surprises

A surprise for me was that the trend line was higher in 2020 than in earlier years, and that the line started increasing around the time explicit songs started appearing. Also, between 2000 and 2010, some of the outliers were explicit songs, which could relate to the change in song trends during that decade.

Anything that you might have shown that you could not get to work or that you wished you could have included.

I was also working on showing the percentage of explicit songs over the years, but I could not get it to work or include it in the graph.
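One possible way to compute that percentage is to group the songs by year and take the mean of the logical explicit column, since TRUE counts as 1 and FALSE as 0. This is only a sketch on a hypothetical toy tibble. The names toy, explicit_share, n_songs, and pct_explicit are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column names as pop_songs
toy <- tibble(
  year = c(2000, 2000, 2001, 2001, 2001, 2002),
  explicit = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Share of explicit songs per year, as a percentage
explicit_share <- toy %>%
  group_by(year) %>%
  summarise(
    n_songs = n(),
    pct_explicit = 100 * mean(explicit),
    .groups = "drop"
  )

print(explicit_share)
```

Replacing toy with pop_songs should give one row per year, which could then be drawn as its own line chart with year on the x-axis and pct_explicit on the y-axis. Years with very few songs would give unstable percentages, so the n_songs column is worth checking first.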