Assignment

In this assignment, we’ll explore working with data in R, using a dataset of top songs on Spotify from 2010 to 2019. The data set is based on this Kaggle dataset if you want to learn more about it.

Problem 1

Download the data set and save it in your data_raw folder. Read the dataset into R to an object called top.

top10s <- read.csv("~/Spring 2022/STAT 158/archive/top10s.csv")
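One detail worth checking right after the import: with its default `check.names = TRUE`, `read.csv` rewrites column names that are not syntactically valid, so a header such as `top genre` is stored as `top.genre`. The sketch below uses a tiny inline CSV with made-up rows (only the header style mirrors the real file; `demo` and `demo_raw` are hypothetical names) to show the effect.

```r
# Tiny inline CSV with made-up rows; not the real Spotify data.
csv_text <- "title,top genre,bpm\nSong A,dance pop,120\nSong B,pop,95"

# Default import: check.names = TRUE turns "top genre" into "top.genre".
demo <- read.csv(text = csv_text)
names(demo)

# The backtick form finds no such column and returns NULL.
is.null(demo$`top genre`)

# Keeping the original header requires check.names = FALSE plus backticks.
demo_raw <- read.csv(text = csv_text, check.names = FALSE)
names(demo_raw)
demo_raw$`top genre`
```

Running `names(top10s)` after the import is a quick way to confirm which spelling applies to the real data.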

Problem 2

Print a summary of each variable in the data frame.

length(unique(top10s$title))
## [1] 584
length(unique(top10s$artist))
## [1] 184
# read.csv() stored the "top genre" column as top.genre; the backtick form returns NULL
table(top10s$top.genre)
table(top10s$year)
## 
## 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 
##   51   53   35   71   58   95   80   65   64   31
bpm_min <- min(top10s$bpm)
bpm_max <- max(top10s$bpm)
bpm_mean <- mean(top10s$bpm)

bpm_summary <- list(bpm_min,bpm_max,bpm_mean)

print(bpm_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 206
## 
## [[3]]
## [1] 118.5456
nrgy_min <- min(top10s$nrgy)
nrgy_max <- max(top10s$nrgy)
nrgy_mean <- mean(top10s$nrgy)

nrgy_summary <- list(nrgy_min,nrgy_max,nrgy_mean)

print(nrgy_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 98
## 
## [[3]]
## [1] 70.50415
dnce_min <- min(top10s$dnce)
dnce_max <- max(top10s$dnce)
dnce_mean <- mean(top10s$dnce)

dnce_summary <- list(dnce_min,dnce_max,dnce_mean)

print(dnce_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 97
## 
## [[3]]
## [1] 64.37977
dB_min <- min(top10s$dB)
dB_max <- max(top10s$dB)
dB_mean <- mean(top10s$dB)

dB_summary <- list(dB_min,dB_max,dB_mean)

print(dB_summary)
## [[1]]
## [1] -60
## 
## [[2]]
## [1] -2
## 
## [[3]]
## [1] -5.578773
live_min <- min(top10s$live)
live_max <- max(top10s$live)
live_mean <- mean(top10s$live)

live_summary <- list(live_min,live_max,live_mean)

print(live_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 74
## 
## [[3]]
## [1] 17.77446
val_min <- min(top10s$val)
val_max <- max(top10s$val)
val_mean <- mean(top10s$val)

val_summary <- list(val_min,val_max,val_mean)

print(val_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 98
## 
## [[3]]
## [1] 52.22554
dur_min <- min(top10s$dur)
dur_max <- max(top10s$dur)
dur_mean <- mean(top10s$dur)

dur_summary <- list(dur_min,dur_max,dur_mean)

print(dur_summary)
## [[1]]
## [1] 134
## 
## [[2]]
## [1] 424
## 
## [[3]]
## [1] 224.675
acous_min <- min(top10s$acous)
acous_max <- max(top10s$acous)
acous_mean <- mean(top10s$acous)

acous_summary <- list(acous_min,acous_max,acous_mean)

print(acous_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 99
## 
## [[3]]
## [1] 14.3267
spch_min <- min(top10s$spch)
spch_max <- max(top10s$spch)
spch_mean <- mean(top10s$spch)

spch_summary <- list(spch_min,spch_max,spch_mean)

print(spch_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 48
## 
## [[3]]
## [1] 8.358209
pop_min <- min(top10s$pop)
pop_max <- max(top10s$pop)
pop_mean <- mean(top10s$pop)

pop_summary <- list(pop_min,pop_max,pop_mean)

print(pop_summary)
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 99
## 
## [[3]]
## [1] 66.52073
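The per-variable blocks above repeat the same three calls. A more compact alternative, sketched here on a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the dataset), is to call `summary()` on the whole data frame, or to apply one function to every numeric column with `sapply()`.

```r
# Hypothetical stand-in for top10s with made-up values.
toy <- data.frame(
  title = c("Song A", "Song B", "Song C"),
  bpm   = c(100, 120, 140),
  dur   = c(200, 220, 240)
)

# One call that summarises every column.
summary(toy)

# Min, max, and mean for the numeric columns only.
is_num <- sapply(toy, is.numeric)
stats <- sapply(toy[is_num], function(x) c(min = min(x), max = max(x), mean = mean(x)))
print(stats)
```

The result of the `sapply()` call is a matrix with one column per variable, which is easier to scan than a separate list for each variable.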

Problem 3

Create a histogram of song duration (dur). Make sure to set an appropriate title and axis titles.

hist(top10s$dur, breaks = 20, main = "Distribution of Song Duration", xlab = "Duration (seconds)", ylab = "Number of songs")

Problem 4

Which song has the highest bpm? (hint: find the max bpm, then index the data frame based on which bpm has that value).

max_bpm <- max(top10s$bpm)

print(max_bpm)
## [1] 206
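# Index the title column with the rows whose bpm equals the maximum
bpm_song <- top10s$title[top10s$bpm == max_bpm]

The song with the highest bpm (206) is Four Five Seconds.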
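The hint in this problem (find the maximum, then index by it) can be sketched in two ways. The example below uses a made-up data frame (`toy` and its values are hypothetical): `which.max()` returns the position of the first maximum, while a logical index returns every row tied for it.

```r
# Hypothetical stand-in for top10s with made-up values.
toy <- data.frame(
  title = c("Song A", "Song B", "Song C"),
  bpm   = c(100, 140, 120)
)

# which.max() gives the position of the first maximum.
fastest <- toy$title[which.max(toy$bpm)]
print(fastest)

# A logical index keeps every row whose bpm equals the maximum.
all_fastest <- toy[toy$bpm == max(toy$bpm), ]
print(all_fastest)
```

The logical form is the safer choice when ties are possible, since `which.max()` silently reports only one of them.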

Problem 5

Create a table of top.genre, then use the sort function to order by counts. Which genre has the most top songs?

# read.csv() stored the "top genre" column as top.genre
tg <- table(top10s$top.genre)

# decreasing = TRUE puts the most common genre first
sort(tg, decreasing = TRUE)

Dance pop has the most top songs, followed by pop.

Problem 6

Create a data frame called top3genres with only the rows in the top three genres. Using top3genres, create boxplots of popularity (pop), and split by genre. Make sure to set an appropriate title and axis titles. Which genre tends to have higher popularity values?

top3 <- c("dance pop", "pop", "canadian pop")

# %in% keeps the rows whose genre matches any of the three names
top_songs <- top10s$top.genre %in% top3

top3genres <- top10s[top_songs, ]
boxplot(pop ~ top.genre, data = top3genres, main = "Popularity of Songs in the Top 3 Genres", xlab = "Genre", ylab = "Popularity")

Pop has the highest popularity value.
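The three genre names do not have to be typed by hand. As a rough sketch on made-up data (the `toy` data frame, its values, and the `toy_top3` name are hypothetical), the names can be taken from the sorted table and passed to `%in%` before plotting.

```r
# Hypothetical stand-in for top10s with made-up values.
toy <- data.frame(
  top.genre = c("pop", "dance pop", "pop", "rock", "dance pop", "pop", "edm", "rock"),
  pop       = c(80, 70, 85, 60, 75, 90, 50, 65)
)

# Take the three most frequent genres from the sorted table.
top3 <- names(sort(table(toy$top.genre), decreasing = TRUE))[1:3]

# %in% keeps rows whose genre matches any of the three names.
toy_top3 <- toy[toy$top.genre %in% top3, ]

# The formula interface with data = keeps the call short.
boxplot(pop ~ top.genre, data = toy_top3,
        main = "Popularity by genre (toy data)",
        xlab = "Genre", ylab = "Popularity")
```

Deriving the names from the table keeps the subset in step with the data if the counts ever change.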

Problem 7

One of the genres is called “complextro”. Select the rows with this genre. Which artist is categorized in this genre?

# read.csv() stored the "top genre" column as top.genre
complextro <- top10s$top.genre == "complextro"

top10s$artist[complextro]

The artist Zedd is categorized in the complextro genre.

Problem 8

Which two artists have the most songs in the data set? Create a new data frame called top2artists that has only the two artists with the most songs.

topartists <- table(top10s$artist)

head(sort(topartists, decreasing = TRUE),2)
## 
##    Katy Perry Justin Bieber 
##            17            16
top2 <- top10s$artist == "Katy Perry" | top10s$artist == "Justin Bieber"

top2artists <- top10s[top2,]

Katy Perry (#1) and Justin Bieber (#2) have the most songs.

Problem 9

Using top2artists, create a scatter plot of duration (dur) vs. “acousticness” (acous), colored by artist. What do you notice about the plot?

colors <- as.factor(top2artists$artist)
plot(top2artists$dur, top2artists$acous, col=colors)

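Black = Justin Bieber, Red = Katy Perry. Factor levels are sorted alphabetically, so Justin Bieber is level 1 (drawn in black) and Katy Perry is level 2 (drawn in red).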
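Because the color of each point comes from the factor's integer codes, it is easy to mislabel the groups by eye. One way to avoid that, sketched here on made-up data (the `toy` data frame, the artist names, and the values are hypothetical), is to build a legend from the same factor levels used for the colors.

```r
# Hypothetical stand-in for top2artists with made-up values.
toy <- data.frame(
  artist = c("Artist B", "Artist A", "Artist B", "Artist A"),
  dur    = c(200, 230, 215, 245),
  acous  = c(10, 40, 5, 30)
)

# Factor levels are sorted alphabetically, so "Artist A" gets code 1.
grp <- factor(toy$artist)
levels(grp)

# Integer codes index into the active palette (code 1 is black by default).
plot(toy$dur, toy$acous, col = grp, pch = 19,
     xlab = "Duration (seconds)", ylab = "Acousticness",
     main = "Duration vs. acousticness (toy data)")

# Build the legend from the same levels so labels and colors stay in sync.
legend("topright", legend = levels(grp),
       col = seq_along(levels(grp)), pch = 19)
```

With the legend drawn from `levels(grp)`, the label-to-color mapping no longer depends on remembering the level order.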

Problem 10

Which of the top 2 artists has longer songs on average?

Justin Bieber has longer songs on average.
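No code accompanies this answer, so the following is only a sketch of how such a comparison could be computed, using made-up data (the `toy` data frame, the artist names, and the values are hypothetical). `tapply()` applies `mean` to the durations within each artist.

```r
# Hypothetical stand-in for top2artists with made-up values.
toy <- data.frame(
  artist = c("Artist B", "Artist A", "Artist B", "Artist A"),
  dur    = c(200, 230, 215, 245)
)

# Mean duration per artist, returned as a named vector.
mean_dur <- tapply(toy$dur, toy$artist, mean)
print(mean_dur)

# Name of the artist with the larger mean.
longest <- names(mean_dur)[which.max(mean_dur)]
print(longest)
```

On the real data, the same call would be `tapply(top2artists$dur, top2artists$artist, mean)`.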

End