Below is the code that creates a random data frame to use in the first few examples. The sample() function selects a random set of numbers from a given range.
# before running any line of code that employs random selection, we need to set the seed to a number - any number. This means that, although we are generating random values, the values will be the same every time the code is run. This allows our code to be replicated.
set.seed(160)
ex.df <- data.frame(Person = 1:10,
Age = sample(18:60, 10, replace = TRUE), # 10 random ages between 18 and 60
Binary = sample(0:1, 10, replace = TRUE), # 10 random 0/1 values
Agree.Scale = sample(1:5, 10, replace = TRUE)) # 10 random values between 1 and 5
# sampling with replacement means that a number being selected once does not prevent it from possibly being selected again
datatable(ex.df)
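As a quick check of the point about seeds, the sketch below draws the same sample twice after resetting the seed and confirms the two draws match. The names first.draw and second.draw are only illustrative and are not used elsewhere.

```r
# Draw 10 ages, reset the seed to the same number, and draw again
set.seed(160)
first.draw <- sample(18:60, 10, replace = TRUE)

set.seed(160)
second.draw <- sample(18:60, 10, replace = TRUE)

# The two draws are element-for-element the same
identical(first.draw, second.draw) # TRUE
```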
if/else statements can only evaluate one value at a time. Below, we test our condition on the first observation of Age in our data frame, which has a value of 49.
if(ex.df$Age[1] > 40){ # our condition is that the age value is greater than 40 - if TRUE, "old" will be output
print("old")
} else{ # if our age value is 40 or less, "young" will be output
print("young")
}
## [1] "old"
# since our value is 49, this statement outputs "old"
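To see why a single value is required, here is a small sketch of what happens when if() is given a whole vector. The vector ages is made up for illustration. In R 4.2.0 and later this raises an error; earlier versions gave a warning and used only the first element, so the sketch handles both cases.

```r
# A hypothetical vector with more than one value
ages <- c(49, 23, 37)

# if() expects a condition of length one, so we catch whatever R signals
result <- tryCatch(
  if (ages > 40) "old" else "young",
  error = function(e) "error: the condition has length > 1",
  warning = function(w) "warning: only the first element was used"
)
print(result)
```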
In order to apply simple if/else conditions to an entire vector, we need to use the ifelse() function.
# supply the function with the condition that needs to be evaluated for each element of the vector, the operation when TRUE, and the operation when FALSE
ifelse(ex.df$Age > 40, "old", "young")
## [1] "old" "old" "young" "old" "young" "young" "young" "young" "old"
## [10] "young"
We can also evaluate multiple conditions by combining them into one logical statement using logical operators (AND/OR - &/|).
# print "both" for all observations with an age over 40 and an agree value over 2
ifelse(ex.df$Age > 40 & ex.df$Agree.Scale > 2, "both", "not both")
## [1] "not both" "not both" "not both" "both" "not both" "not both"
## [7] "not both" "not both" "both" "not both"
# print "either" for all observations that meet at least one of our conditions: an age over 40 OR an agree value over 2
ifelse(ex.df$Age > 40 | ex.df$Agree.Scale > 2, "either", "none")
## [1] "either" "either" "none" "either" "either" "none" "none" "either"
## [9] "either" "none"
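The & and | operators compare vectors element by element, which is what allows them to be used inside ifelse(). A minimal sketch with made-up values (age and agree are illustrative names, not columns of ex.df):

```r
# Three hypothetical observations
age <- c(45, 30, 52)
agree <- c(4, 5, 1)

# AND: TRUE only where both conditions hold
both <- age > 40 & agree > 2
both # TRUE FALSE FALSE

# OR: TRUE where at least one condition holds
either <- age > 40 | agree > 2
either # TRUE TRUE TRUE
```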
In order to evaluate multiple conditional statements that lead to multiple operations, we need to add else if.
If the first if statement evaluates to FALSE, then the next else if statement is evaluated. If that statement is also FALSE, then the else operation will be performed.
if(ex.df$Age[1] %in% 20:35){ # if the first value in Age is between 20 and 35
print("20-35")
} else if(ex.df$Age[1] %in% 35:50){ # if the first value in Age is between 35 and 50 (an age of exactly 35 is already caught by the first condition)
print("35-50")
} else{ #if none of above
print("Other")
}
## [1] "35-50"
In order to apply multiple conditional statements & operations to an entire vector, we need to nest multiple ifelse() statements.
The first condition will be our if statement, then our second condition - the else statement - will be another ifelse() function.
ifelse(ex.df$Age %in% 20:35,"20-35", # if a value in Age is between 20 and 35, output "20-35" -> if NOT, evaluate the next ifelse() statement for that value
ifelse(ex.df$Age %in% 35:50, "35-50", "Other"))
## [1] "35-50" "35-50" "20-35" "35-50" "20-35" "Other" "35-50" "35-50" "Other"
## [10] "20-35"
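When there are several ranges, nested ifelse() calls can become hard to read. One possible alternative, not used in the rest of this tutorial, is the base R cut() function, which assigns each value to an interval. The sketch below uses a made-up vector ages and assumes whole-number ages, so that the intervals (19, 35] and (35, 50] give the same groups as the nested version.

```r
# Hypothetical ages, including values outside both ranges
ages <- c(49, 23, 18, 35, 55)

# cut() labels each value by interval; values outside every interval become NA
groups <- as.character(cut(ages,
                           breaks = c(19, 35, 50),
                           labels = c("20-35", "35-50")))

# Replace the NA values with our "Other" label
groups[is.na(groups)] <- "Other"
groups # "35-50" "20-35" "Other" "20-35" "Other"
```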
Here is an example where our operations make use of other parts of our data, rather than just printing text:
In this case, when the age of an observation is between 20-35, the function outputs the observation’s value for Binary. When the age of an observation is between 35-50, the function outputs the observation’s value for Binary multiplied by 2. Otherwise, the function outputs NA.
ifelse(ex.df$Age %in% 20:35,ex.df$Binary,
ifelse(ex.df$Age %in% 35:50, ex.df$Binary*2, NA))
## [1] 2 2 1 2 0 NA 0 2 NA 1
We can write this same code more concisely by using the with() function, which evaluates an expression inside a data frame, so that only the variable names need to be entered.
with(ex.df, ifelse(Age %in% 20:35, Binary,
ifelse(Age %in% 35:50, Binary*2, NA)))
## [1] 2 2 1 2 0 NA 0 2 NA 1
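The result of with() is an ordinary vector, so it can be stored as a new column. A minimal sketch, using a small made-up data frame toy.df and a hypothetical column name Score:

```r
# A hypothetical three-row data frame
toy.df <- data.frame(Age = c(25, 40, 55),
                     Binary = c(1, 0, 1))

# Save the output of the nested ifelse() as a new column
toy.df$Score <- with(toy.df, ifelse(Age %in% 20:35, Binary,
                             ifelse(Age %in% 35:50, Binary * 2, NA)))
toy.df$Score # 1 0 NA
```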
For this example, we are going to use example data called top.songs.Apr2019 that contains information about the top 100 songs on Spotify from April 2019.
datatable(top.songs.Apr2019)
We can see that the data includes 17 variables and 100 songs - song is our unit of observation.
dim(top.songs.Apr2019)
## [1] 100 17
Other than the artist, song name, and song id (columns 1:3), all of the remaining variables (4:17) are numeric ratings of the song’s characteristics.
These are defined as:
- Danceability: Describes how suitable a track is for dancing
- Valence: Describes the musical positiveness conveyed by a track
- Energy: Represents a perceptual measure of intensity and activity
- Tempo: The overall estimated tempo of a track in beats per minute (BPM)
- Loudness: The overall loudness of a track in decibels (dB)
- Speechiness: This detects the presence of spoken words in a track
- Instrumentalness: Predicts whether a track contains no vocals
- Liveness: Detects the presence of an audience in the recording
- Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic
- Key: The estimated overall key of the track
- Mode: Indicates the type of scale from which melodic content is derived
- Duration: The duration of the track in milliseconds
- Time Signature: An estimated overall time signature of a track - how many beats are in each bar (or measure)
sapply(top.songs.Apr2019, class)
## artist_name track_id track_name acousticness
## "character" "character" "character" "numeric"
## danceability duration_ms energy instrumentalness
## "numeric" "numeric" "numeric" "numeric"
## key liveness loudness mode
## "numeric" "numeric" "numeric" "numeric"
## speechiness tempo time_signature valence
## "numeric" "numeric" "numeric" "numeric"
## popularity
## "numeric"
summary(top.songs.Apr2019)
## artist_name track_id track_name acousticness
## Length:100 Length:100 Length:100 Min. :0.00381
## Class :character Class :character Class :character 1st Qu.:0.05300
## Mode :character Mode :character Mode :character Median :0.17600
## Mean :0.27629
## 3rd Qu.:0.42875
## Max. :0.97900
## danceability duration_ms energy instrumentalness
## Min. :0.3510 Min. :113013 Min. :0.0549 Min. :0.0000000
## 1st Qu.:0.6520 1st Qu.:178600 1st Qu.:0.4830 1st Qu.:0.0000000
## Median :0.7455 Median :198195 Median :0.5965 Median :0.0000000
## Mean :0.7286 Mean :202339 Mean :0.5839 Mean :0.0140151
## 3rd Qu.:0.8343 3rd Qu.:221466 3rd Qu.:0.7300 3rd Qu.:0.0000479
## Max. :0.9500 Max. :360960 Max. :0.9040 Max. :0.3900000
## key liveness loudness mode
## Min. : 0.00 Min. :0.0574 Min. :-23.237 Min. :0.00
## 1st Qu.: 1.00 1st Qu.:0.1017 1st Qu.: -7.957 1st Qu.:0.00
## Median : 5.00 Median :0.1250 Median : -5.891 Median :0.00
## Mean : 4.92 Mean :0.1833 Mean : -6.862 Mean :0.49
## 3rd Qu.: 8.00 3rd Qu.:0.2482 3rd Qu.: -4.654 3rd Qu.:1.00
## Max. :11.00 Max. :0.8000 Max. : -2.652 Max. :1.00
## speechiness tempo time_signature valence
## Min. :0.03080 Min. : 70.14 Min. :3.00 Min. :0.0473
## 1st Qu.:0.05067 1st Qu.: 95.97 1st Qu.:4.00 1st Qu.:0.3190
## Median :0.08370 Median :115.15 Median :4.00 Median :0.4545
## Mean :0.12377 Mean :119.54 Mean :3.99 Mean :0.4607
## 3rd Qu.:0.14750 3rd Qu.:138.47 3rd Qu.:4.00 3rd Qu.:0.6308
## Max. :0.38800 Max. :202.01 Max. :5.00 Max. :0.9520
## popularity
## Min. : 88.00
## 1st Qu.: 89.00
## Median : 91.00
## Mean : 91.66
## 3rd Qu.: 94.00
## Max. :100.00
We want to run a correlation between a song’s popularity and every other numeric variable of the dataset to determine which song characteristics have the strongest relationship with popularity.
In order to do this we can use a loop:
for(i in 4:16){ # columns 4 - 16 are the numeric song characteristics (column 17 is popularity itself)
cor1 <- cor(top.songs.Apr2019$popularity, top.songs.Apr2019[,i]) # find the correlation between popularity and column i
print(colnames(top.songs.Apr2019)[i]) # print the name of column i
print(cor1) # print the value of the correlation
}
## [1] "acousticness"
## [1] -0.0629575
## [1] "danceability"
## [1] 0.1300912
## [1] "duration_ms"
## [1] -0.1253516
## [1] "energy"
## [1] 0.1446426
## [1] "instrumentalness"
## [1] -0.102631
## [1] "key"
## [1] 0.07892893
## [1] "liveness"
## [1] -0.2169122
## [1] "loudness"
## [1] 0.1787721
## [1] "mode"
## [1] 0.05811935
## [1] "speechiness"
## [1] 0.1295045
## [1] "tempo"
## [1] -0.06622938
## [1] "time_signature"
## [1] 0.1098163
## [1] "valence"
## [1] 0.2715716
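The same kind of calculation can also be written without an explicit loop. The sketch below is one possible alternative using sapply(). It runs on a small made-up data frame (toy.songs, with random values) rather than the Spotify data, so the numbers it produces are not meaningful.

```r
# Hypothetical stand-in data: two characteristics plus popularity
set.seed(1)
toy.songs <- data.frame(energy = runif(20),
                        tempo = runif(20, 70, 200),
                        popularity = sample(88:100, 20, replace = TRUE))

# Apply cor() to each characteristic column; the result is a named vector
cors <- sapply(toy.songs[, c("energy", "tempo")],
               function(column) cor(toy.songs$popularity, column))
cors
```

Because the result keeps the column names, each correlation stays labelled without a separate print() call.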
If we want to use our resulting values for further analysis or just display them in a structured way, we can save our values to a data frame.
Inside the loop, we assign each value to a cell of the data frame by its row index; paste() returns the column name as a character string.
# create an empty data frame to store the outputs of the loop
popular.data <- data.frame(Characteristic = character(),
Correlation = numeric(), # correlations are decimal values, so the column is numeric rather than integer
stringsAsFactors = FALSE)
for(i in 4:16){
cor1 <- cor(top.songs.Apr2019$popularity, top.songs.Apr2019[,i])
popular.data[i-3,] <- NA # create a new row in the data frame for the information from this iteration of the loop
popular.data$Characteristic[i-3] <- paste(colnames(top.songs.Apr2019)[i]) # store the name of each variable in our new row of the Characteristic column
popular.data$Correlation[i-3] <- as.numeric(cor1) # store the value of each correlation in our new row of the Correlation column
}
popular.data
## Characteristic Correlation
## 1 acousticness -0.06295750
## 2 danceability 0.13009117
## 3 duration_ms -0.12535155
## 4 energy 0.14464256
## 5 instrumentalness -0.10263102
## 6 key 0.07892893
## 7 liveness -0.21691220
## 8 loudness 0.17877205
## 9 mode 0.05811935
## 10 speechiness 0.12950445
## 11 tempo -0.06622938
## 12 time_signature 0.10981629
## 13 valence 0.27157160
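Adding one row per iteration works well for a handful of values. A common variation is to create the data frame at its final size first and then fill in the cells. The sketch below shows this on made-up data; toy.songs, feature.cols, and toy.results are illustrative names and the values are random.

```r
# Hypothetical stand-in data
set.seed(2)
toy.songs <- data.frame(energy = runif(20),
                        valence = runif(20),
                        popularity = runif(20, 88, 100))
feature.cols <- c("energy", "valence")

# Pre-allocate one row per characteristic, with placeholder NA correlations
toy.results <- data.frame(Characteristic = feature.cols,
                          Correlation = NA_real_,
                          stringsAsFactors = FALSE)

# Fill in each correlation by row index
for (i in seq_along(feature.cols)) {
  toy.results$Correlation[i] <- cor(toy.songs$popularity,
                                    toy.songs[[feature.cols[i]]])
}
toy.results
```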
Now that we have a data frame of correlation values, we can analyze and represent the data in different ways. For example, by creating a bar plot:
par(las=2)
barplot(abs(popular.data$Correlation),
names.arg = popular.data$Characteristic,
cex.names = .6,
main = "Value of Correlation with Song Popularity \n (Top 100 songs on Spotify, April 2019)",
xlab = "Song Characteristic",
ylab = "Correlation Value")
par(las=0)
Note: barplot() draws negative values as bars extending below zero. Because we want to compare the strength of each relationship regardless of its direction, we use the abs() function to plot the absolute values.
In order to specify positive and negative relationships, we could use color:
sign.colors <- ifelse(popular.data$Correlation < 0 , "darkred", "lightblue")
par(las=2)
barplot(abs(popular.data$Correlation), names.arg = popular.data$Characteristic,
cex.names = .6,
col = sign.colors,
main = "Value of Correlation with Song Popularity \n (Top 100 songs on Spotify, April 2019)",
xlab = "Song Characteristic",
ylab = "Correlation Value")
legend(0,.25,c("positive","negative"), fill = c("lightblue","darkred"), cex = 1)
par(las=0)
Example of a finalized plot:
ord.popular.data <- popular.data[order(abs(popular.data$Correlation)),]
sign.colors2 <- ifelse(ord.popular.data$Correlation < 0 , "salmon", "lightblue")
barplot(abs(ord.popular.data$Correlation), names.arg = ord.popular.data$Characteristic,
cex.names = .6,
col = sign.colors2,
horiz = TRUE,
yaxt = "n",
xlim = c(0,.3),
main = "Value of Correlation with Song Popularity \n (Top 100 songs on Spotify, April 2019)",
ylab = "Song Characteristic",
xlab = "Correlation Value")
legend(.18,5,c("positive","negative"), fill = c("lightblue","salmon"), cex = 1)
# label each bar with its characteristic name and its (signed) correlation value
y <- (1:nrow(ord.popular.data)) * 1.2 - 0.5 # bar midpoints: default bar width of 1 plus spacing of 0.2
x <- rep(0.031, nrow(ord.popular.data)) # fixed horizontal position for the names
text(x, y, ord.popular.data$Characteristic)
x1 <- abs(ord.popular.data$Correlation) + .014 # just past the end of each bar
text(x1, y, round(ord.popular.data$Correlation, 3))
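The label positions above are calculated by hand from the default bar width and spacing. barplot() also returns the midpoint of each bar, so a possible alternative is to save that return value and pass it to text(). A minimal sketch with made-up values (vals and mids are illustrative names):

```r
# Three hypothetical correlation strengths
vals <- c(0.05, 0.12, 0.27)

# barplot() returns the position of each bar's midpoint
mids <- barplot(vals, horiz = TRUE, xlim = c(0, 0.35))

# Place each value just past the end of its bar, at the bar's midpoint
text(vals + 0.02, mids, round(vals, 3))

as.numeric(mids) # 0.7 1.9 3.1
```

This keeps the labels aligned even if the bar width or spacing is changed later.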