Creating our Data

Below is the code that creates a random data frame to use in the first few examples. The sample() function selects a random set of numbers from a given range.

# Before running code that uses random selection, set the seed to a number (any number). The values are still randomly generated, but they will be the same every time the code is run, which makes the results reproducible.
set.seed(160)

ex.df <- data.frame(Person = 1:10,
                    Age = sample(18:60, 10, replace = TRUE),
                    Binary = sample(0:1, 10, replace = TRUE),
                    Agree.Scale = sample(1:5, 10, replace = TRUE)) # draws 10 random numbers between 1 and 5
# sampling with replacement means that a number being selected once does not prevent it from being selected again

library(DT) # datatable() comes from the DT package
datatable(ex.df)
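The effect of set.seed() can be checked directly. The sketch below is illustrative only: the object names draw.a, draw.b, and draw.c are hypothetical and not part of the example data. Resetting the seed to the same number before a call to sample() reproduces the same draw, while a draw made without resetting the seed continues the random sequence and will usually differ.

```r
set.seed(160)
draw.a <- sample(18:60, 10, replace = TRUE)

set.seed(160) # reset to the same seed
draw.b <- sample(18:60, 10, replace = TRUE)

draw.c <- sample(18:60, 10, replace = TRUE) # no reset: continues the random sequence

identical(draw.a, draw.b) # TRUE: same seed, same call, same values
```

Any seed value works; what matters for replication is that the same seed is set before the same sequence of random calls.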

If/Else Statements

Single Condition

An if/else statement tests a single value at a time. Below, we test our condition on the first observation of Age in our data frame, which has a value of 49.

if(ex.df$Age[1] > 40){ # our condition is that the age value is greater than 40 - if TRUE, "old" will be output
  print("old")
} else{ # if our age value is 40 or less, "young" will be output
  print("young")
}

## [1] "old"

# since our value is 49, this statement outputs "old"

In order to apply simple if/else conditions to an entire vector, we need to use the ifelse() function.

# supply the function with the condition that needs to be evaluated for each element of the vector, the operation when TRUE, and the operation when FALSE
ifelse(ex.df$Age > 40, "old", "young")

##  [1] "old"   "old"   "young" "old"   "young" "young" "young" "young" "old"  
## [10] "young"
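To make the element-by-element behavior concrete, here is a minimal sketch on a small made-up vector (ages, labels.vec, and labels.loop are hypothetical names, not part of ex.df). ifelse() evaluates the condition for every element and returns a vector of the same length; applying if/else to each element inside a loop gives the same result with more code.

```r
ages <- c(49, 23, 41) # hypothetical values for illustration

# vectorized: one call handles every element
labels.vec <- ifelse(ages > 40, "old", "young")

# equivalent element-by-element version using if/else inside a loop
labels.loop <- character(length(ages))
for (i in seq_along(ages)) {
  if (ages[i] > 40) {
    labels.loop[i] <- "old"
  } else {
    labels.loop[i] <- "young"
  }
}

labels.vec                         # "old" "young" "old"
identical(labels.vec, labels.loop) # TRUE
```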

Multiple Conditions

We can also evaluate multiple conditions by combining them into one logical statement using the logical operators & (AND) and | (OR).

# print "both" for all observations with an age over 40 and an agree value over 2
ifelse(ex.df$Age > 40 & ex.df$Agree.Scale > 2, "both", "not both")

##  [1] "not both" "not both" "not both" "both"     "not both" "not both"
##  [7] "not both" "not both" "both"     "not both"

# print "either" for all observations that meet at least one of our conditions: an age over 40 OR an agree value over 2
ifelse(ex.df$Age > 40 | ex.df$Agree.Scale > 2, "either", "none")

##  [1] "either" "either" "none"   "either" "either" "none"   "none"   "either"
##  [9] "either" "none"
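The difference between the two operators is easiest to see on a tiny made-up example that covers all four combinations of the two conditions. The vectors age and agree below are hypothetical and are not taken from ex.df.

```r
age   <- c(45, 45, 30, 30) # hypothetical values
agree <- c(4, 1, 4, 1)

both   <- age > 40 & agree > 2 # TRUE only where both conditions hold
either <- age > 40 | agree > 2 # TRUE where at least one condition holds

both   # TRUE FALSE FALSE FALSE
either # TRUE  TRUE  TRUE FALSE
```

Both operators work element by element, which is why they pair naturally with ifelse(). The doubled forms && and || compare single values and are typically used inside if() conditions instead.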

Multiple Operations

In order to evaluate multiple conditional statements that lead to multiple operations, we need to add else if.

If the first if statement evaluates to FALSE, then the next else if statement is evaluated. If that statement is also FALSE, then the else operation will be performed.

if(ex.df$Age[1] %in% 20:35){ # if the first value in Age is between 20 and 35
  print("20-35")
} else if(ex.df$Age[1] %in% 35:50){ # if the first value in Age is between 35 and 50 (a value of exactly 35 is already caught by the first condition)
  print("35-50")
} else{ # if neither condition is met
  print("Other")
}

## [1] "35-50"

In order to apply multiple conditional statements and operations to an entire vector, we need to nest multiple ifelse() statements.

The first condition will be our if statement, then our second condition - the else statement - will be another ifelse() function.

ifelse(ex.df$Age %in% 20:35, "20-35", # if a value in Age is between 20 and 35, output "20-35" -> if NOT, then evaluate the next ifelse statement
       ifelse(ex.df$Age %in% 35:50, "35-50", "Other"))

##  [1] "35-50" "35-50" "20-35" "35-50" "20-35" "Other" "35-50" "35-50" "Other"
## [10] "20-35"
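One way to convince ourselves that the nested ifelse() calls mirror the if / else if / else chain is to wrap the chain in a function and apply it to each value. This is only a sketch: age.group is a hypothetical helper name and ages is made-up data. Note that 35 appears in both ranges; because the first condition is checked first, it is labeled "20-35" in both versions.

```r
# hypothetical helper wrapping the if / else if / else chain for a single value
age.group <- function(age) {
  if (age %in% 20:35) {
    "20-35"
  } else if (age %in% 35:50) {
    "35-50"
  } else {
    "Other"
  }
}

ages <- c(49, 35, 22, 60) # made-up values; 35 falls in both ranges

by.value  <- vapply(ages, age.group, character(1)) # one value at a time
by.vector <- ifelse(ages %in% 20:35, "20-35",
                    ifelse(ages %in% 35:50, "35-50", "Other"))

by.vector                      # "35-50" "20-35" "20-35" "Other"
identical(by.value, by.vector) # TRUE
```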

Here is an example where our operations make use of other parts of our data, rather than just printing text:

In this case, when the age of an observation is between 20-35, the function outputs the observation’s value for Binary. When the age of an observation is between 35-50, the function outputs the observation’s value for Binary multiplied by 2. Otherwise, the function outputs NA.

ifelse(ex.df$Age %in% 20:35,ex.df$Binary,
       ifelse(ex.df$Age %in% 35:50, ex.df$Binary*2, NA))

##  [1]  2  2  1  2  0 NA  0  2 NA  1

We can write this same code more concisely by using the with() function. Its first argument names the data frame, so the expression that follows can refer to the variables by name alone.

with(ex.df, ifelse(Age %in% 20:35, Binary,
       ifelse(Age %in% 35:50, Binary*2, NA)))

##  [1]  2  2  1  2  0 NA  0  2 NA  1
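The equivalence of the two forms can be checked on a small made-up data frame. In this sketch, toy.df, long.form, short.form, and the Recode column are all hypothetical names chosen for illustration.

```r
toy.df <- data.frame(Age = c(25, 40, 55), Binary = c(1, 0, 1)) # hypothetical data

# referring to each column through the data frame
long.form <- ifelse(toy.df$Age %in% 20:35, toy.df$Binary,
                    ifelse(toy.df$Age %in% 35:50, toy.df$Binary * 2, NA))

# naming the data frame once with with()
short.form <- with(toy.df, ifelse(Age %in% 20:35, Binary,
                                  ifelse(Age %in% 35:50, Binary * 2, NA)))

identical(long.form, short.form) # TRUE

toy.df$Recode <- short.form # store the result as a new column
toy.df$Recode               # 1 0 NA
```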

For Loops

For this example, we are going to use example data called top.songs.Apr2019 that contains information about the top 100 songs on Spotify from April 2019.

datatable(top.songs.Apr2019)

We can see that the data includes 17 variables and 100 songs - song is our unit of observation.

dim(top.songs.Apr2019)

## [1] 100  17

Other than the artist name, track id, and track name (columns 1:3), all of the remaining variables (columns 4:17) are numeric: columns 4:16 rate the song’s characteristics, and column 17 is the song’s popularity.

These are defined as:
- Danceability: Describes how suitable a track is for dancing
- Valence: Describes the musical positiveness conveyed by a track
- Energy: Represents a perceptual measure of intensity and activity
- Tempo: The overall estimated tempo of a track in beats per minute (BPM)
- Loudness: The overall loudness of a track in decibels (dB)
- Speechiness: This detects the presence of spoken words in a track
- Instrumentalness: Predicts whether a track contains no vocals
- Liveness: Detects the presence of an audience in the recording
- Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic
- Key: The estimated overall key of the track
- Mode: Indicates the type of scale from which melodic content is derived
- Duration: The duration of the track in milliseconds
- Time Signature: An estimated overall time signature of a track - how many beats are in each bar (or measure)

sapply(top.songs.Apr2019, class)

##      artist_name         track_id       track_name     acousticness 
##      "character"      "character"      "character"        "numeric" 
##     danceability      duration_ms           energy instrumentalness 
##        "numeric"        "numeric"        "numeric"        "numeric" 
##              key         liveness         loudness             mode 
##        "numeric"        "numeric"        "numeric"        "numeric" 
##      speechiness            tempo   time_signature          valence 
##        "numeric"        "numeric"        "numeric"        "numeric" 
##       popularity 
##        "numeric"

summary(top.songs.Apr2019)

##  artist_name          track_id          track_name         acousticness    
##  Length:100         Length:100         Length:100         Min.   :0.00381  
##  Class :character   Class :character   Class :character   1st Qu.:0.05300  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.17600  
##                                                           Mean   :0.27629  
##                                                           3rd Qu.:0.42875  
##                                                           Max.   :0.97900  
##   danceability     duration_ms         energy       instrumentalness   
##  Min.   :0.3510   Min.   :113013   Min.   :0.0549   Min.   :0.0000000  
##  1st Qu.:0.6520   1st Qu.:178600   1st Qu.:0.4830   1st Qu.:0.0000000  
##  Median :0.7455   Median :198195   Median :0.5965   Median :0.0000000  
##  Mean   :0.7286   Mean   :202339   Mean   :0.5839   Mean   :0.0140151  
##  3rd Qu.:0.8343   3rd Qu.:221466   3rd Qu.:0.7300   3rd Qu.:0.0000479  
##  Max.   :0.9500   Max.   :360960   Max.   :0.9040   Max.   :0.3900000  
##       key           liveness         loudness            mode     
##  Min.   : 0.00   Min.   :0.0574   Min.   :-23.237   Min.   :0.00  
##  1st Qu.: 1.00   1st Qu.:0.1017   1st Qu.: -7.957   1st Qu.:0.00  
##  Median : 5.00   Median :0.1250   Median : -5.891   Median :0.00  
##  Mean   : 4.92   Mean   :0.1833   Mean   : -6.862   Mean   :0.49  
##  3rd Qu.: 8.00   3rd Qu.:0.2482   3rd Qu.: -4.654   3rd Qu.:1.00  
##  Max.   :11.00   Max.   :0.8000   Max.   : -2.652   Max.   :1.00  
##   speechiness          tempo        time_signature    valence      
##  Min.   :0.03080   Min.   : 70.14   Min.   :3.00   Min.   :0.0473  
##  1st Qu.:0.05067   1st Qu.: 95.97   1st Qu.:4.00   1st Qu.:0.3190  
##  Median :0.08370   Median :115.15   Median :4.00   Median :0.4545  
##  Mean   :0.12377   Mean   :119.54   Mean   :3.99   Mean   :0.4607  
##  3rd Qu.:0.14750   3rd Qu.:138.47   3rd Qu.:4.00   3rd Qu.:0.6308  
##  Max.   :0.38800   Max.   :202.01   Max.   :5.00   Max.   :0.9520  
##    popularity    
##  Min.   : 88.00  
##  1st Qu.: 89.00  
##  Median : 91.00  
##  Mean   : 91.66  
##  3rd Qu.: 94.00  
##  Max.   :100.00

Loop Over Variables

We want to run a correlation between a song’s popularity and every other numeric variable of the dataset to determine which song characteristics have the strongest relationship with popularity.

In order to do this we can use a loop:

for(i in 4:16){ # columns 4 - 16 hold the numeric song characteristics (column 17 is popularity itself)
  cor1 <- cor(top.songs.Apr2019$popularity, top.songs.Apr2019[,i]) # find the correlation between popularity and column i 
  print(colnames(top.songs.Apr2019)[i]) # print the name of column i
  print(cor1) # print the value of the correlation
}

## [1] "acousticness"
## [1] -0.0629575
## [1] "danceability"
## [1] 0.1300912
## [1] "duration_ms"
## [1] -0.1253516
## [1] "energy"
## [1] 0.1446426
## [1] "instrumentalness"
## [1] -0.102631
## [1] "key"
## [1] 0.07892893
## [1] "liveness"
## [1] -0.2169122
## [1] "loudness"
## [1] 0.1787721
## [1] "mode"
## [1] 0.05811935
## [1] "speechiness"
## [1] 0.1295045
## [1] "tempo"
## [1] -0.06622938
## [1] "time_signature"
## [1] 0.1098163
## [1] "valence"
## [1] 0.2715716
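The same pattern can be written so that it loops over column names instead of column positions, which keeps working if the column order changes. Because top.songs.Apr2019 is not bundled with R, this sketch uses the built-in mtcars data as a stand-in and treats mpg as the outcome; the names predictors and var are hypothetical.

```r
# mtcars ships with R and stands in for top.songs.Apr2019 here
predictors <- setdiff(colnames(mtcars), "mpg") # every column except the outcome

for (var in predictors) {
  cor1 <- cor(mtcars$mpg, mtcars[[var]]) # correlation between mpg and the current column
  cat(var, ":", round(cor1, 3), "\n")    # print the name and the value on one line
}
```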

If we want to use our resulting values for further analysis or just display them in a structured way, we can save our values to a data frame.
Each iteration of the loop adds a new row to the data frame and then assigns the variable name and the correlation value to the cells of that row.

# create an empty data frame to store the outputs of the loop
popular.data <- data.frame(Characteristic = character(),
                           Correlation = numeric(), # correlations are decimals, so this column is numeric rather than integer
                           stringsAsFactors = FALSE)

for(i in 4:16){
  cor1 <- cor(top.songs.Apr2019$popularity, top.songs.Apr2019[,i])
  
  popular.data[i-3,] <- NA # create a new row in the data frame for the information from this iteration of the loop
  popular.data$Characteristic[i-3] <- colnames(top.songs.Apr2019)[i] # store the name of the variable in the Characteristic column of the new row
  popular.data$Correlation[i-3] <- as.numeric(cor1) # store the value of the correlation in the Correlation column of the new row
}

popular.data

##      Characteristic Correlation
## 1      acousticness -0.06295750
## 2      danceability  0.13009117
## 3       duration_ms -0.12535155
## 4            energy  0.14464256
## 5  instrumentalness -0.10263102
## 6               key  0.07892893
## 7          liveness -0.21691220
## 8          loudness  0.17877205
## 9              mode  0.05811935
## 10      speechiness  0.12950445
## 11            tempo -0.06622938
## 12   time_signature  0.10981629
## 13          valence  0.27157160

Now that we have a data frame of correlation values, we can analyze and represent the data in different ways. For example, by creating a bar plot:

par(las=2)
barplot(abs(popular.data$Correlation), 
        names.arg = popular.data$Characteristic, 
        cex.names = .6,
        main = "Value of Correlation with Song Popularity \n (Top 100 songs on Spotify, April 2019)",
        xlab = "Song Characteristic",
        ylab = "Correlation Value")

par(las=0)

Note: barplot() draws negative values as bars that extend below zero. Because we want to compare the strength of each relationship regardless of its direction, we use the abs() function to plot the absolute value of each correlation.

In order to distinguish positive from negative relationships, we could use color:

sign.colors <- ifelse(popular.data$Correlation < 0 , "darkred", "lightblue")

par(las=2)
barplot(abs(popular.data$Correlation), names.arg = popular.data$Characteristic, 
        cex.names = .6,
        col = sign.colors,
        main = "Value of Correlation with Song Popularity \n (Top 100 songs on Spotify, April 2019)",
        xlab = "Song Characteristic",
        ylab = "Correlation Value")
legend(0,.25,c("positive","negative"), fill = c("lightblue","darkred"), cex = 1)

par(las=0)

Example of a finalized plot:

ord.popular.data <- popular.data[order(abs(popular.data$Correlation)),]

sign.colors2 <- ifelse(ord.popular.data$Correlation < 0 , "salmon", "lightblue")

barplot(abs(ord.popular.data$Correlation), names.arg = ord.popular.data$Characteristic, 
        cex.names = .6,
        col = sign.colors2,
        horiz = TRUE,
        yaxt = "n",
        xlim = c(0,.3),
        main = "Value of Correlation with Song Popularity \n (Top 100 songs on Spotify, April 2019)",
        ylab = "Song Characteristic",
        xlab = "Correlation Value")

legend(.18,5,c("positive","negative"), fill = c("lightblue","salmon"), cex = 1)

# positions for the variable names: a fixed x value and the midpoint of each bar
x <- rep(0.031, nrow(ord.popular.data))
y <- seq_len(nrow(ord.popular.data))*1.2 - 0.5 # bar midpoints under barplot()'s default bar width (1) and spacing (0.2)

text(x,y,ord.popular.data$Characteristic)

# positions for the correlation values: just past the end of each bar
x1 <- abs(ord.popular.data$Correlation) + .014
text(x1,y,round(ord.popular.data$Correlation,3))
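The y positions above are computed by hand from barplot()'s default bar width and spacing. As an alternative sketch, barplot() returns the midpoint of each bar, so the value can be saved and passed to text(). The data here is made up, and vals, labs, and mids are hypothetical names.

```r
vals <- c(0.06, 0.13, 0.27)             # made-up correlation strengths
labs <- c("tempo", "energy", "valence") # made-up labels

# barplot() returns the midpoint of each bar along the category axis
mids <- barplot(vals, horiz = TRUE, xlim = c(0, 0.35))

text(vals + 0.02, mids, labels = round(vals, 2)) # value just past the end of each bar
text(0.03, mids, labels = labs)                  # name near the start of each bar
```

Using the returned midpoints keeps the labels aligned even if the bar width or spacing is changed later.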