R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Spotify <- read.csv("/Users/yashuvaishu/Downloads/Spotify.csv")

The Spotify data frame has 13 columns and 8,511 rows. The built-in R function summary(), called as summary(Spotify), gives an overview of each column:

##   trackName          artistName           msPlayed            genre          
##  Length:8511        Length:8511        Min.   :        0   Length:8511       
##  Class :character   Class :character   1st Qu.:   139977   Class :character  
##  Mode  :character   Mode  :character   Median :   269850   Mode  :character  
##                                        Mean   :  1539795                     
##                                        3rd Qu.:  1211910                     
##                                        Max.   :158367130                     
##   danceability        energy             key            loudness      
##  Min.   :0.0000   Min.   :0.00108   Min.   : 0.000   Min.   :-42.044  
##  1st Qu.:0.5070   1st Qu.:0.40700   1st Qu.: 2.000   1st Qu.:-10.016  
##  Median :0.6220   Median :0.59200   Median : 5.000   Median : -7.132  
##  Mean   :0.6016   Mean   :0.56681   Mean   : 5.243   Mean   : -8.580  
##  3rd Qu.:0.7140   3rd Qu.:0.75400   3rd Qu.: 8.000   3rd Qu.: -5.309  
##  Max.   :0.9760   Max.   :0.99900   Max.   :11.000   Max.   :  3.010  
##   speechiness         valence           tempo             id           
##  Min.   :0.00000   Min.   :0.0000   Min.   :  0.00   Length:8511       
##  1st Qu.:0.03610   1st Qu.:0.2380   1st Qu.: 97.18   Class :character  
##  Median :0.04790   Median :0.4100   Median :118.94   Mode  :character  
##  Mean   :0.07833   Mean   :0.4353   Mean   :119.10                     
##  3rd Qu.:0.08190   3rd Qu.:0.6180   3rd Qu.:139.32                     
##  Max.   :0.94100   Max.   :0.9860   Max.   :236.20                     
##   duration_ms     
##  Min.   :  10027  
##  1st Qu.: 163173  
##  Median : 195989  
##  Mean   : 203951  
##  3rd Qu.: 231378  
##  Max.   :1847210

Filtering of Data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Creating a separate table for the categorical data

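# Keep only the character (categorical) columns
Categorical_data <- Spotify %>%
  select(trackName, artistName, genre, id)
# For each column, tabulate every distinct value together with its frequency
Categorical_summaries <- lapply(Categorical_data, function(x) {
  setNames(as.data.frame(table(x)), c("Unique_value", "Counts"))
})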
filter(Spotify, artistName == "DJ Snake")
##                                         trackName artistName msPlayed genre
## 1                     A Different Way (with Lauv)   DJ Snake    66060   edm
## 2                                   Broken Summer   DJ Snake  3987042   edm
## 3                                 Let Me Love You   DJ Snake    64207   edm
## 4  Taki Taki (with Selena Gomez, Ozuna & Cardi B)   DJ Snake    44331   edm
## 5                       Try Me (with Plastic Toy)   DJ Snake    13010   edm
## 6                              Turn Down for What   DJ Snake     2222   edm
## 7                     A Different Way (with Lauv)   DJ Snake    66060   edm
## 8                                   Broken Summer   DJ Snake  3987042   edm
## 9                                 Let Me Love You   DJ Snake    64207   edm
## 10 Taki Taki (with Selena Gomez, Ozuna & Cardi B)   DJ Snake    44331   edm
## 11                      Try Me (with Plastic Toy)   DJ Snake    13010   edm
## 12                             Turn Down for What   DJ Snake     2222   edm
##    danceability energy key loudness speechiness valence   tempo
## 1         0.784  0.757   8   -3.912      0.0384  0.5870 104.996
## 2         0.683  0.415  10  -10.720      0.0841  0.5540  81.006
## 3         0.649  0.716   8   -5.371      0.0349  0.1630  99.988
## 4         0.842  0.801   8   -4.167      0.2280  0.6170  95.881
## 5         0.680  0.703   7   -3.360      0.0439  0.1930 102.289
## 6         0.818  0.799   1   -4.100      0.1560  0.0815 100.014
## 7         0.784  0.757   8   -3.912      0.0384  0.5870 104.996
## 8         0.683  0.415  10  -10.720      0.0841  0.5540  81.006
## 9         0.649  0.716   8   -5.371      0.0349  0.1630  99.988
## 10        0.842  0.801   8   -4.167      0.2280  0.6170  95.881
## 11        0.680  0.703   7   -3.360      0.0439  0.1930 102.289
## 12        0.818  0.799   1   -4.100      0.1560  0.0815 100.014
##                        id duration_ms
## 1  1YMBg7rOjxzbya0fPOYfNX      198286
## 2  60mrGpCA4OIUuRLwi4T5Nm      183779
## 3  0lYBSQXN6rCTvUZvg9S0lU      205947
## 4  4w8niZpiMy6qz1mntFA5uM      212500
## 5  5vTtANNlQK5UhwfooDek5y      198405
## 6  67awxiNHNyjMXhVgsHuIrs      213733
## 7  1YMBg7rOjxzbya0fPOYfNX      198286
## 8  60mrGpCA4OIUuRLwi4T5Nm      183779
## 9  0lYBSQXN6rCTvUZvg9S0lU      205947
## 10 4w8niZpiMy6qz1mntFA5uM      212500
## 11 5vTtANNlQK5UhwfooDek5y      198405
## 12 67awxiNHNyjMXhVgsHuIrs      213733
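
The filter output above lists every DJ Snake track twice (rows 7 to 12 repeat rows 1 to 6), which suggests the file contains duplicated rows. A minimal sketch of how such duplicates could be dropped with dplyr's distinct() follows. The data frame toy_plays and its values are made up for illustration and are not rows from Spotify.csv.

```r
library(dplyr)

# Hypothetical stand-in: two tracks, each listed twice
toy_plays <- data.frame(
  trackName  = c("Song A", "Song B", "Song A", "Song B"),
  artistName = c("Artist X", "Artist X", "Artist X", "Artist X"),
  msPlayed   = c(1000, 2000, 1000, 2000)
)

# Filter one artist, then keep a single copy of each identical row
deduped <- toy_plays %>%
  filter(artistName == "Artist X") %>%
  distinct()

print(nrow(toy_plays))  # rows before removing duplicates
print(nrow(deduped))    # rows after removing duplicates
```

Whether the duplicates should be removed depends on how the file was exported. If they are export artifacts, they would inflate totals such as the sum of msPlayed.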

Including Plots

library(ggplot2)
plot(Spotify)

Goal/Purpose

The goal of the project is to analyze the most played music by genre and to predict the most popular artist from the data in the Spotify dataset.

Data Documentation

“Spotify Song Attributes Dataset: Exploring the Musical Landscape”

The Spotify Song Attributes Dataset is a comprehensive collection of music tracks, encompassing various genres and artist names. This dataset provides valuable insights into the world of music, allowing enthusiasts, researchers, and data scientists to delve into the characteristics and nuances of each track.

The dataset includes the author’s streaming history throughout the year 2022. Its documentation lists key features such as danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, and time signature. The file loaded here does not contain acousticness, instrumentalness, liveness, or time signature; its 13 columns are listed under Dataset Features. These attributes provide a holistic view of the songs, enabling users to analyze and interpret different aspects of their musical composition.

Dataset Features:

trackName - The name of the track.
artistName - The name of the artist or band associated with the track.
msPlayed - The duration in milliseconds that the track was played.
genre - The genre or genres associated with the track.
danceability - A measure of how suitable a track is for dancing.
energy - The energy level of the track.
key - The key of the track, stored as an integer from 0 to 11 (e.g., 0 = C, 2 = D, 4 = E).
loudness - The overall loudness of the track in decibels (dB).
speechiness - The presence of spoken words in the track.
valence - The musical positiveness or happiness conveyed by the track.
tempo - The tempo of the track in beats per minute (BPM).
id - The unique identifier of the track.
duration_ms - The duration of the track in milliseconds.
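
The key column holds integers from 0 to 11 rather than note names. A small sketch of how those integers could be turned into readable labels is shown here, assuming the column follows Spotify's standard pitch-class encoding (0 = C, 1 = C#, and so on). The vectors key_names and toy_keys are illustrative names, not part of the dataset.

```r
# Pitch-class names for keys 0 to 11 (assumes Spotify's standard encoding)
key_names <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

# Made-up key values standing in for Spotify$key
toy_keys <- c(0, 2, 4, 8, 11)

# Keys are zero-based while R vectors are one-based, hence the + 1
key_labels <- key_names[toy_keys + 1]
print(key_labels)
```

On the real data, the same lookup would be key_names[Spotify$key + 1], which could then label plots of the key column.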

Aggregation functions

# Standard deviation of speechiness

std_spotify <- sd(Spotify$speechiness)
print(std_spotify)
## [1] 0.07851632
# Variance of speechiness

var_spotify <- var(Spotify$speechiness)
print(var_spotify)
## [1] 0.006164813
# Sum of speechiness

sum_spotify <- sum(Spotify$speechiness)
print(sum_spotify)
## [1] 666.6545
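
The first two results above are related: the variance is the square of the standard deviation (0.07851632 squared is approximately 0.006164813). Likewise, a sum equals the mean multiplied by the number of values. The sketch below checks both identities on a made-up vector x, which only stands in for a numeric column such as speechiness.

```r
# Made-up values standing in for a numeric column such as speechiness
x <- c(0.03, 0.05, 0.08, 0.22, 0.04)

sd_x  <- sd(x)
var_x <- var(x)

# The sample variance is the squared sample standard deviation
print(all.equal(var_x, sd_x^2))

# The sum equals the mean multiplied by the number of values
print(all.equal(sum(x), mean(x) * length(x)))
```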

Energy Vs Loudness Graph
This graph suggests that loudness tends to increase with energy, a positive association rather than strict proportionality.

Visual Summary

ggplot(Spotify, aes(energy, loudness)) + geom_point(size=2, color="purple") + labs(title = "energy vs loudness") + theme(axis.title.x=element_text(colour="red"),axis.title.y = element_text(colour="red"))
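
A scatter plot shows the direction of a relationship, and the Pearson correlation coefficient from cor() puts a number on its strength. The sketch below uses synthetic data: the data frame toy and the formula that generates it are assumptions for illustration, not the Spotify values.

```r
set.seed(1)

# Synthetic stand-in: loudness rises with energy, plus random noise
toy <- data.frame(energy = runif(200))
toy$loudness <- -20 + 15 * toy$energy + rnorm(200, sd = 2)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(toy$energy, toy$loudness)
print(r)
```

On the real data the equivalent call would be cor(Spotify$energy, Spotify$loudness).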

# Keep only the numeric columns (where() replaces the superseded select_if())
a <- select(Spotify, where(is.numeric))
summary(a)
##     msPlayed          danceability        energy             key        
##  Min.   :        0   Min.   :0.0000   Min.   :0.00108   Min.   : 0.000  
##  1st Qu.:   139977   1st Qu.:0.5070   1st Qu.:0.40700   1st Qu.: 2.000  
##  Median :   269850   Median :0.6220   Median :0.59200   Median : 5.000  
##  Mean   :  1539795   Mean   :0.6016   Mean   :0.56681   Mean   : 5.243  
##  3rd Qu.:  1211910   3rd Qu.:0.7140   3rd Qu.:0.75400   3rd Qu.: 8.000  
##  Max.   :158367130   Max.   :0.9760   Max.   :0.99900   Max.   :11.000  
##     loudness        speechiness         valence           tempo       
##  Min.   :-42.044   Min.   :0.00000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:-10.016   1st Qu.:0.03610   1st Qu.:0.2380   1st Qu.: 97.18  
##  Median : -7.132   Median :0.04790   Median :0.4100   Median :118.94  
##  Mean   : -8.580   Mean   :0.07833   Mean   :0.4353   Mean   :119.10  
##  3rd Qu.: -5.309   3rd Qu.:0.08190   3rd Qu.:0.6180   3rd Qu.:139.32  
##  Max.   :  3.010   Max.   :0.94100   Max.   :0.9860   Max.   :236.20  
##   duration_ms     
##  Min.   :  10027  
##  1st Qu.: 163173  
##  Median : 195989  
##  Mean   : 203951  
##  3rd Qu.: 231378  
##  Max.   :1847210

The graph below plots msPlayed against genre to show which genres account for the most listening time.

ggplot(Spotify, aes(genre, msPlayed)) + geom_line(linewidth = 2, color = "green") + geom_point(size = 3, color = "#008080") + labs(title = "genre vs msplayed") + theme(axis.text.x = element_text(angle = 90))
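
Because genre is categorical and each genre has many rows, a plot of the raw points can be hard to read. One alternative is to total msPlayed per genre first and plot the totals as bars. The sketch below does this on made-up data, where toy_genres is a hypothetical stand-in for the Spotify table.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the genre and msPlayed columns
toy_genres <- data.frame(
  genre    = c("edm", "pop", "edm", "rock", "pop", "edm"),
  msPlayed = c(1000, 2000, 3000, 500, 1500, 4000)
)

# Total listening time per genre, largest first
genre_totals <- toy_genres %>%
  group_by(genre) %>%
  summarise(total_ms = sum(msPlayed)) %>%
  arrange(desc(total_ms))

print(genre_totals)

# One bar per genre, ordered by total listening time
p <- ggplot(genre_totals, aes(reorder(genre, -total_ms), total_ms)) +
  geom_col(fill = "#008080") +
  labs(title = "total msPlayed by genre", x = "genre", y = "total msPlayed") +
  theme(axis.text.x = element_text(angle = 90))
print(p)
```

Replacing toy_genres with Spotify would apply the same aggregation to the real data.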

ggplot(Spotify, aes(danceability,energy)) + geom_point(alpha = 0.1,color="#008080") + labs(title = "energy vs danceability")+ theme(axis.title.x=element_text(colour="red"),axis.title.y = element_text(colour="red"))

ggplot(Spotify, aes(key,energy)) + geom_bar(stat="identity", fill="steelblue") + labs(title = "energy vs key")

ggplot(Spotify, aes(loudness,speechiness)) + geom_bar(stat="identity", fill="blue")+ labs(title = "loudness vs speechiness")

pie(table(Spotify$key), main="Pie chart for Key")

ggplot(Spotify, aes(x=energy)) + geom_histogram(fill="steelblue") + theme(axis.title.x=element_text(colour="orange"),axis.title.y = element_text(colour="orange")) + labs(title = "energy vs count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Spotify, aes(danceability,valence)) + 
  geom_point(size=0.5)+
  geom_smooth(method=lm,  linetype="dashed",
             color="darkred", fill="blue")+ theme(axis.title.x=element_text(colour="green"),axis.title.y = element_text(colour="red")) + labs(title = "danceability vs valence")
## `geom_smooth()` using formula = 'y ~ x'

# valence is continuous, so bin it into 0.1-wide groups to get one box per bin
ggplot(Spotify, aes(valence, loudness)) + 
    geom_boxplot(aes(group = cut_width(valence, 0.1)))+theme(axis.title.x=element_text(colour="blue"),axis.title.y = element_text(colour="blue"))+ labs(title = "valence vs loudness")