This lab will introduce finding correlations and covariance in data, as well as simple regression.

First, to reduce issues involved in reading the data, we are going to put all relevant files into a folder and then set it as R’s working directory. This can be done using the “Files” tab of the graphic interface, or using the setwd() function. Please note that all file paths that appear in the example code are the locations of files on my computer. You will need to change the path to reflect the location of files on your computer. By setting the working directory, you should only need to define the path once.

#setwd("C:/Users/ropadenga/Downloads/RMS-1 R Labs/Lab2/RFiles")

Now that you have defined a working directory, you can read any file inside that folder without specifying the whole path.

However, depending on your versions of RStudio, your operating system, or other arcane factors, you may only be able to set a working directory for the active chunk, and not the rest of the markdown file, and you’ll still get errors when trying to read in your data files. The simplest workaround for this is, when working on your lab in markdown form, use the RStudio interface to import your files, and then copy the path from the code generated in the console into the import line of your markdown file. Just remember to keep your variable names the same.
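One way to sidestep the per-chunk working directory problem is to build each file path explicitly. The sketch below is only an illustration: `data_dir` is a hypothetical folder (a temporary directory here, so the example runs anywhere), and the small CSV is made up for the demonstration.

```r
# Hypothetical data folder; replace with the folder that holds your lab files
data_dir <- tempdir()

# Write a tiny made-up CSV so the example is self-contained
demo_path <- file.path(data_dir, "demo.csv")
write.csv(data.frame(x = 1:3, y = c(2, 4, 6)), demo_path, row.names = FALSE)

# Confirm the file is where we think it is, then read it by its full path
stopifnot(file.exists(demo_path))
demo <- read.csv(demo_path)
print(demo)
```

Because the full path is passed to read.csv(), the call does not depend on which working directory the current chunk happens to use.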

Example Problem: Exam Anxiety

A researcher is interested in examining the effects of exam stress and revision time on exam performance. The data has five variables: Code (a unique identifier for each subject in the study), Revise (the number of hours spent revising), Exam (performance on the exam, as a percentage score), Anxiety (a standardized measurement of anxiety), and Gender (a factor identifying each subject as male or female). This analysis will focus on exam performance as the dependent variable and how it is affected by the two independent variables of interest, revision time and anxiety.

#read in exam anxiety data
examAnxiety <- read.csv("ExamAnxiety.csv")

The base dataset has more columns than we really need. We can either work with the full set and ignore the columns we don’t want, or we can make a smaller dataset containing just the variables of interest.

#create a new dataframe containing variables of interest
newAnxiety <- data.frame(Revise = examAnxiety$Revise,
                         Anxiety = examAnxiety$Anxiety,
                         Exam = examAnxiety$Exam)
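The same kind of subset can also be taken by column name. This is a minimal sketch that uses a made-up stand-in data frame (`toy`, with invented values) because the real file is not reproduced here; only the column names follow the description above.

```r
# Made-up stand-in with the same column names as the exam anxiety data
toy <- data.frame(Code = 1:4,
                  Revise = c(4, 11, 27, 53),
                  Exam = c(40, 65, 80, 80),
                  Anxiety = c(86.3, 88.7, 70.2, 61.3),
                  Gender = c("Male", "Female", "Male", "Male"))

# Keep only the variables of interest, in the order we want them
toy_subset <- toy[, c("Revise", "Anxiety", "Exam")]
str(toy_subset)
```

Selecting by name avoids retyping each column and keeps the result a data frame.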

Before we look at the data, we want to state our hypotheses about the data. To help formally define them, we can first diagram our ideas.

[Anxiety] <————–> [Revision Time]

^                            ^

|                            |

----> [Exam Performance] <----

H0 - The null hypothesis is that there will be no significant correlation between anxiety, revision time, and exam performance.

H1 - The alternative hypothesis is that there is a relationship between anxiety, revision time, and exam performance.

In simple terms, we are hypothesizing that the amount of time spent revising and anxiety levels will have an effect on overall exam performance.

If you were doing this study for real, you would probably want to define your hypotheses even further, identifying each of the three relationships and speculating about their direction, but that is beyond the scope of this lab.

To start our exploration of the data, we want to get some basic information about our dataset such as mean and sd.

## Exam Performance
exam_avg = mean(newAnxiety$Exam)
exam_sd = sd(newAnxiety$Exam)
print(exam_sd)
## [1] 25.94058
## Anxiety Level
anxiety_avg = mean(newAnxiety$Anxiety)
anxiety_sd = sd(newAnxiety$Anxiety)
print(anxiety_sd)
## [1] 17.18186
## Revision Time
rev_avg = mean(newAnxiety$Revise)
rev_sd = sd(newAnxiety$Revise)
print(rev_sd)
## [1] 18.1591
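If you want the mean and sd of every column at once, sapply() can apply both functions to each variable. This is a sketch on invented values (`toy` is a hypothetical stand-in, not the lab data).

```r
# Invented values standing in for the three variables of interest
toy <- data.frame(Revise = c(4, 11, 27, 53),
                  Anxiety = c(86.3, 88.7, 70.2, 61.3),
                  Exam = c(40, 65, 80, 80))

# One column per variable, one row per statistic
stats <- sapply(toy, function(col) c(mean = mean(col), sd = sd(col)))
print(round(stats, 2))
```

The result is a small matrix, so a single value can be pulled out with, for example, stats["sd", "Exam"].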

We can use correlations to get general ideas about the relationships between variables. We are primarily interested in the relationship between exam performance and anxiety levels.

#print correlation matrix for the Anxiety Subset
cor(newAnxiety, method = "pearson")  #Pearson correlation coefficient (other options: "spearman" or "kendall")
##             Revise    Anxiety       Exam
## Revise   1.0000000 -0.7092493  0.3967207
## Anxiety -0.7092493  1.0000000 -0.4409934
## Exam     0.3967207 -0.4409934  1.0000000
#perform a correlation test of the relationship between Exam and Anxiety
cor.test(newAnxiety$Exam,
         newAnxiety$Anxiety, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  newAnxiety$Exam and newAnxiety$Anxiety
## t = -4.938, df = 101, p-value = 3.128e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5846244 -0.2705591
## sample estimates:
##        cor 
## -0.4409934
cor.test(newAnxiety$Exam,
         newAnxiety$Revise, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  newAnxiety$Exam and newAnxiety$Revise
## t = 4.3434, df = 101, p-value = 3.343e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2200938 0.5481602
## sample estimates:
##       cor 
## 0.3967207

From this we can see that exam performance correlates with anxiety with an r = -0.441, giving an r^2 = 0.194, meaning that anxiety explains 19.4% of the variance in exam performance.

The whole correlation matrix does not provide significance levels, but we can test individual relationships. The cor.test() function tells us that the correlation between exam performance and anxiety has a p-value of 0.00000312, p < 0.05, suggesting the relationship is significant.
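The figures reported above can be recomputed by hand from the correlation and its degrees of freedom. The sketch below copies r and df from the cor.test() output for Exam and Anxiety; the variable names are just for illustration.

```r
# Values copied from the cor.test() output for Exam and Anxiety
r  <- -0.4409934
df <- 101                      # n - 2, so n = 103

r_squared <- r^2                         # proportion of shared variance
t_stat    <- r * sqrt(df / (1 - r^2))    # t statistic for testing r = 0
p_value   <- 2 * pt(-abs(t_stat), df)    # two-sided p-value

print(round(r_squared, 3))
print(round(t_stat, 3))
print(signif(p_value, 4))
```

These should agree with the r^2, t, and p-value quoted in the text and in the test output.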

However, anxiety also correlates with revision time with an r = -0.709, suggesting that some of the variance in performance explained by anxiety could also be explained by revision time. To determine how much of that variance comes from anxiety alone, we need to do a partial correlation, which will estimate the variance while holding revision time constant.

To do this, we will use the package “ppcor”.

library(ppcor)
## Loading required package: MASS
#use the function pcor.test to estimate the correlation while holding revision time constant
pcor.test(newAnxiety$Anxiety, 
          newAnxiety$Exam,
          newAnxiety$Revise)
##     estimate    p.value statistic   n gp  Method
## 1 -0.2466658 0.01244581 -2.545307 103  1 pearson

Here, holding revision time constant, exam performance correlates with anxiety with an r = -0.247, r^2 = 0.06. This means that anxiety alone explains only about 6% of the variance in exam performance. The result is still significant, p = 0.012, but the effect size is smaller.
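With a single control variable, the partial correlation can also be worked out directly from the three pairwise correlations in the matrix printed earlier. This sketch uses the standard first-order partial correlation formula with values copied from that matrix; it is meant as a check on the pcor.test() result, not a replacement for the package.

```r
# Pairwise Pearson correlations copied from the correlation matrix
r_ae <- -0.4409934   # Anxiety and Exam
r_ar <- -0.7092493   # Anxiety and Revise
r_er <-  0.3967207   # Exam and Revise
n    <- 103          # number of subjects

# First-order partial correlation of Anxiety and Exam, controlling for Revise
partial_r <- (r_ae - r_ar * r_er) / sqrt((1 - r_ar^2) * (1 - r_er^2))

# One control variable, so the test uses n - 3 degrees of freedom
t_stat  <- partial_r * sqrt((n - 3) / (1 - partial_r^2))
p_value <- 2 * pt(-abs(t_stat), df = n - 3)

print(round(c(estimate = partial_r, statistic = t_stat, p.value = p_value), 4))
```

The estimate, statistic, and p-value should match the pcor.test() output up to rounding.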

Example Problem - Non-Parametric Correlation

For most things, the basic Pearson correlation will work fine. But sometimes you will be looking at non-continuous data, or data that otherwise violates assumptions, and you’ll need to use an alternative test.

This data is taken from The Biggest Liar competition. It has three variables: Creativity (as measured by some standardized test), Position (the rank earned in the tournament), and Novice (a factor determining if this was the subject’s first time in the tournament). We are interested in exploring the relationship between Position, the dependent variable, and Creativity, the independent variable.

#read in the BiggestLiar dataset
bigliar = read.csv("BiggestLiar.csv")

#create a data frame containing only the variables of interest, i.e. Position and Creativity
newliar = data.frame(Creativity = bigliar$Creativity,
                     Position = bigliar$Position)

#compute mean and sd
creat_mean = mean(newliar$Creativity)
creat_sd = sd(newliar$Creativity)

#no useful information 
#pos_mean = mean(newliar$Position)
#pos_sd = sd(newliar$Position)

This data comes from the top finishers at The Biggest Liar Competition. Of these, 26 were first place finishers, 15 placed second, 11 placed third, and the remaining subjects placed between fourth and sixth. The mean creativity score was 39.99 (sd = 8.12).

Because one of our variables (Position) is an ordinal rank rather than a continuous measurement, we can’t do a traditional correlation. Instead, we have to use Spearman’s coefficient or Kendall’s tau.

#compute the correlation using Kendall's tau (method = "spearman" gives Spearman's coefficient)
cor(newliar$Creativity, newliar$Position, method = "kendall") #lower rank, meaning better performance
## [1] -0.3002413
#note: no method argument is given here, so cor.test() uses its default, Pearson
cor.test(newliar$Creativity, newliar$Position)
## 
##  Pearson's product-moment correlation
## 
## data:  newliar$Creativity and newliar$Position
## t = -2.6115, df = 66, p-value = 0.01115
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.50743150 -0.07292754
## sample estimates:
##        cor 
## -0.3060314
s = 1 / (sqrt(68-3))

print(-0.3 ^ -2.6)
## [1] -22.88151
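The quantity 1 / sqrt(n - 3) is the standard error of a correlation after Fisher's z-transformation, which is how the confidence interval in the Pearson output is obtained. The sketch below rebuilds that interval from the printed r and n; the variable names are illustrative only.

```r
# Values copied from the Pearson cor.test() output
r <- -0.3060314
n <- 68                       # df = 66, so n = 68

z  <- atanh(r)                # Fisher z-transform of r
se <- 1 / sqrt(n - 3)         # standard error on the z scale

# 95% interval on the z scale, transformed back to the r scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
print(round(ci, 4))
```

The two limits should reproduce the 95 percent confidence interval shown in the test output.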

Using Kendall’s tau, the correlation between creativity and placement in the tournament is r = -0.300, r^2 = 0.09, p < 0.05.

Using Spearman’s coefficient, the correlation between creativity and placement in the tournament is r = -0.373, r^2 = 0.139, p < 0.05. Both measures indicate a small but significant effect.
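To get a significance test for a rank-based coefficient, the method argument has to be passed to cor.test() as well as to cor(); otherwise the test is a Pearson test. The sketch below uses invented scores (`creativity` and `position` are hypothetical vectors, not the lab data) and sets exact = FALSE, an assumption made here so that tied ranks do not trigger a warning about exact p-values.

```r
# Invented data: higher creativity tends to go with a better (lower) position
creativity <- c(53, 36, 31, 43, 30, 41, 32, 54, 47, 50)
position   <- c(1, 3, 4, 2, 4, 1, 4, 1, 2, 2)

# Rank-based tests; method must be given explicitly
kendall_res  <- cor.test(creativity, position, method = "kendall",  exact = FALSE)
spearman_res <- cor.test(creativity, position, method = "spearman", exact = FALSE)

print(kendall_res$estimate)    # Kendall's tau
print(kendall_res$p.value)
print(spearman_res$estimate)   # Spearman's rho
print(spearman_res$p.value)
```

Each result is an ordinary test object, so the estimate and p-value can be reported together from the same call.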

Walkthrough Problem - Work Accidents

A researcher wants to know if the number of hours worked is related to the number of mistakes made by a worker.

  1. Read the data into R and describe the dataset (2 points)
#read in workaccidents data 



#compute summary stats

The data examines the relationship between the number of hours worked and the number of mistakes made. The data has two columns: hours (the number of hours worked in a day) and mistakes (the number of mistakes made in those hours). The mean number of mistakes made was 7.37 with a standard deviation of 3.56, while the mean number of hours worked was 8.57 with a standard deviation of 3.07.

  1. Diagram your hypotheses and state them formally (2 points)

[hours] <————-> [mistakes]

H0

H1

  1. Test your hypothesis (3 points)
#print out the correlation matrix

#perform a correlation test on the variables of interest
  1. Report your results (3 points)

Lab Problem 1 - Ice Cream Flavors

The owner of an ice cream store is interested to know if customers who buy chocolate ice cream also tend to buy mocha ice cream.

  1. Read the data into R and describe the dataset (2 points)
#read in icecream data
iceCream_file <- read.csv("icecream.csv")
iceCream <- data.frame(Choco = iceCream_file$chocolate,
                         Mocha = iceCream_file$mocha)
#compute summary stats
choco_mean = mean(iceCream$Choco)
choco_sd = sd(iceCream$Choco)

mocha_mean = mean(iceCream$Mocha)
mocha_sd = sd(iceCream$Mocha)

print("Chocolate mean:")
## [1] "Chocolate mean:"
print(choco_mean)
## [1] 7.333333
print("Chocolate standard deviation:")
## [1] "Chocolate standard deviation:"
print(choco_sd)
## [1] 3.220867
print("Mocha mean:")
## [1] "Mocha mean:"
print(mocha_mean)
## [1] 4.904762
print("Mocha standard deviation:")
## [1] "Mocha standard deviation:"
print(mocha_sd)
## [1] 3.098462
  1. Diagram your hypotheses and state them formally (2 points)

[Chocolate bought] <————–> [Mocha bought]

    ^                                       ^

    |                                       |

  --> [Mocha Bought after buying Chocolate] <--

H0 - The null hypothesis is that there will be no significant correlation between buying chocolate ice cream and buying mocha ice cream.

H1 - The alternative hypothesis is that there is a relationship between buying chocolate ice cream and buying mocha ice cream.

  1. Test your hypothesis (3 points)
cor(iceCream, method= "kendall")
##             Choco       Mocha
## Choco  1.00000000 -0.03323558
## Mocha -0.03323558  1.00000000
cor(iceCream$Choco, iceCream$Mocha, method = "kendall") 
## [1] -0.03323558
cor.test(iceCream$Choco, iceCream$Mocha)
## 
##  Pearson's product-moment correlation
## 
## data:  iceCream$Choco and iceCream$Mocha
## t = 1.5997, df = 40, p-value = 0.1175
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06344753  0.51105594
## sample estimates:
##       cor 
## 0.2452124
cor(iceCream$Choco, iceCream$Mocha, method = "spearman") 
## [1] -0.05333265
cor.test(iceCream$Choco, iceCream$Mocha)
## 
##  Pearson's product-moment correlation
## 
## data:  iceCream$Choco and iceCream$Mocha
## t = 1.5997, df = 40, p-value = 0.1175
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06344753  0.51105594
## sample estimates:
##       cor 
## 0.2452124
  1. Report your results (3 points)

Using Kendall’s tau, the correlation between buying chocolate ice cream and buying mocha ice cream is r = -0.033, r^2 = 0.001, p > 0.05.

Using Spearman’s coefficient, the correlation between buying chocolate ice cream and buying mocha ice cream is r = -0.053, r^2 = 0.003, p > 0.05. Both measures indicate a very weak, non-significant effect. The Pearson test printed above is also non-significant (r = 0.245, p = 0.118).

The null hypothesis is NOT rejected.

Lab Problem 2 - Ice Cream and Heart Attacks

A doctor wants to find out if the number of spoonfuls of ice cream eaten by patients is related to the number of heart attacks they have.

R Hint - this dataset contains missing data. For some functions, R can deal with this for you. Other times, you have to tell R that the data is incomplete. If you aren’t getting the results you expect from your summary statistics, try running the function na.omit() on your dataset.
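The effect of missing values is easy to see on a tiny example. This sketch uses an invented data frame (`toy`); only the column names ice_cream and heart_attacks follow the lab file.

```r
# Invented data with one missing value in each column
toy <- data.frame(ice_cream = c(5, NA, 9, 12),
                  heart_attacks = c(6, 8, NA, 11))

mean(toy$ice_cream)                        # NA, because one value is missing
mean_spoons <- mean(toy$ice_cream, na.rm = TRUE)   # ignores the missing value

# Drop every row that has at least one missing value
complete <- na.omit(toy)
print(nrow(complete))

# cor() can also be told to use only the complete rows
r_complete <- cor(toy$ice_cream, toy$heart_attacks, use = "complete.obs")
print(r_complete)
```

Note that na.omit() removes the whole row, so the means computed afterwards are based only on subjects with both measurements.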

  1. Read the data into R and describe the dataset (2 points)
#read in icecreamattacks data
attacks_file <- read.csv("icecreamattacks.csv")

attacks <- data.frame(spoons = attacks_file$ice_cream, h_attacks = attacks_file$heart_attacks)

attacks = na.omit(attacks)

spoon_mean = mean(attacks$spoons) 
spoon_sd = sd(attacks$spoons)
  
attacks_mean = mean(attacks$h_attacks)
attacks_sd = sd(attacks$h_attacks)

print("Ice Cream spoons mean:")
## [1] "Ice Cream spoons mean:"
print(spoon_mean)
## [1] 7.8
print("Ice Cream spoons deviation:")
## [1] "Ice Cream spoons deviation:"
print(spoon_sd)
## [1] 3.903976
print("Heart attacks mean:")
## [1] "Heart attacks mean:"
print(attacks_mean)
## [1] 8.35
print("Heart attacks deviation:")
## [1] "Heart attacks deviation:"
print(attacks_sd)
## [1] 2.983287
  1. Diagram your hypotheses and state them formally (2 points)

[Ice cream scoops #] <————–> [Heart attacks]

         ^                            ^
    
         |                            |
    
         ----> [More scoops of ice cream, more heart attacks] <----

H0 - The null hypothesis is that there will be no significant correlation between the number of spoonfuls of ice cream a patient eats and the number of heart attacks they have.

H1 - The alternative hypothesis is that there is a relationship between the number of spoonfuls of ice cream a patient eats and the number of heart attacks they have.

  1. Test your hypothesis (3 points)
cor(attacks, method = "pearson")  #Pearson correlation coefficient (other options: "spearman" or "kendall")
##              spoons h_attacks
## spoons    1.0000000 0.5961862
## h_attacks 0.5961862 1.0000000
#compute rank-based correlations and run correlation tests for spoons and heart attacks
cor(attacks$spoons,
         attacks$h_attacks, method = "kendall")
## [1] 0.4294711
cor.test(attacks$spoons,
         attacks$h_attacks)
## 
##  Pearson's product-moment correlation
## 
## data:  attacks$spoons and attacks$h_attacks
## t = 4.5776, df = 38, p-value = 4.919e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3496046 0.7655243
## sample estimates:
##       cor 
## 0.5961862
cor(attacks$spoons,
         attacks$h_attacks, method = "spearman")
## [1] 0.5135512
cor.test(attacks$spoons,
         attacks$h_attacks)
## 
##  Pearson's product-moment correlation
## 
## data:  attacks$spoons and attacks$h_attacks
## t = 4.5776, df = 38, p-value = 4.919e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3496046 0.7655243
## sample estimates:
##       cor 
## 0.5961862
  1. Report your results (3 points)

Using Kendall’s tau, the correlation between spoonfuls of ice cream eaten and number of heart attacks is r = 0.43, r^2 = 0.184, p < 0.05.

Using Spearman’s coefficient, the correlation between spoonfuls of ice cream eaten and number of heart attacks is r = 0.51, r^2 = 0.264, p < 0.05. Both measures indicate a moderately strong (medium) effect. The Pearson test printed above agrees (r = 0.596, p < 0.001).

The null hypothesis is rejected.