RPubs Log Week 1

This is my first RPubs log. I am somewhat familiar with R already from other uni courses and from experience as a research assistant. However, I am completely unfamiliar with R Markdown, so my goal for Week 1 is to familiarise myself with its basics. To achieve this, I will use this first markdown document to explain a practice analysis I did a few months ago, and hopefully I will get the hang of some of the markdown tools.

Late last year I took some time to learn basic data analysis with R, on the advice of a post-grad researcher I worked with at the uni. She directed me to Dr. Erin Buchanan’s amazing website ‘Stats of DOOM’, where I found an array of free statistics courses for R, with lecture videos, worked examples and practice questions.

Practice Two-Way Between Subjects ANOVA Example

In this example we are looking at two years of spending data for five different sports. We want to know if there are any differences in spending across years and sports.

We have two IVs:

Year: 2007 and 2008
Sport: Basketball, baseball, volleyball, football, soccer

We have one DV:

Money spent: transactions for each individual sport in dollar amounts.

Step 1: Check Accuracy of Data

We must check our data for anything out of the ordinary, such as scores outside the maximum and minimum values, variables that aren’t factored, etc. This is easily done by loading the data set and using the summary function:

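# Set the working directory and load the data set
setwd("~/Desktop/Uni/R Practice")
# stringsAsFactors = TRUE reads "type" as a factor (needed in R 4.0 and later)
master = read.csv("bn 2 anova.csv", stringsAsFactors = TRUE)
summary(master)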

##       year              type          money        
##  Min.   :2007   Baseball  :3102   Min.   :  22.19  
##  1st Qu.:2007   Basketball:3543   1st Qu.: 131.61  
##  Median :2007   Football  :4419   Median : 232.88  
##  Mean   :2007   Soccer    :3219   Mean   : 282.45  
##  3rd Qu.:2008   Volleyball:3977   3rd Qu.: 389.00  
##  Max.   :2008                     Max.   :1439.37

Here we can see that our “year” variable is being read as a continuous variable, rather than a factor variable. To fix this, we convert “year” to a factor with one level for each year. The table function then allows us to check the frequencies for each year.

master$year = factor(master$year, levels = c("2007", "2008"))
table(master$year)

## 
## 2007 2008 
## 9586 8674
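
A quick way to confirm that a conversion like this worked is to compare the class of each column before and after. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the real spending data):

```r
# Hypothetical toy data standing in for the real spending data
toy <- data.frame(
  year  = c(2007, 2007, 2008, 2008),
  type  = factor(c("Soccer", "Football", "Soccer", "Football")),
  money = c(120.50, 310.00, 95.25, 402.10)
)

# Before conversion, year is stored as a number
class_before <- sapply(toy, class)

# Convert year to a factor with one level per year
toy$year <- factor(toy$year, levels = c("2007", "2008"))
class_after <- sapply(toy, class)

print(class_before)
print(class_after)
```

If the conversion worked, the "year" entry changes from "numeric" to "factor" while the other columns stay the same.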

Step 2: Check for Missing Data

The summary of the data above revealed no missing data. As such, we have no issue here, but in a realistic research environment, one must know how to deal with missing data.
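
When data are missing, summary reports an "NA's" count for the affected column. A more direct check is to count the missing values in each column. The sketch below uses a made-up data frame with one missing value (`toy` is hypothetical), and dropping incomplete rows is only one of several possible ways to handle them:

```r
# Hypothetical toy data with one missing value in money
toy <- data.frame(
  year  = factor(c("2007", "2007", "2008", "2008")),
  money = c(120.50, NA, 95.25, 402.10)
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only the rows with no missing values (listwise deletion)
complete_rows <- na.omit(toy)
print(nrow(complete_rows))
```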

Step 3: Check for Outliers

To check for outliers in the “money” variable, we can create z-scores for all the spending values and see if any fall 3 or more SDs above or below the mean. We can then create a new dataset called “noout” where we have excluded any outliers. From this point onwards, all analysis will be carried out on this new outlier-free dataset.

zscore = scale(master$money)
summary(abs(zscore) < 3)

##      V1         
##  Mode :logical  
##  FALSE:223      
##  TRUE :18037

noout = subset(master, abs(zscore) < 3)
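
To see what the z-score step is doing, here is a minimal sketch on a made-up vector (the values are hypothetical). The scale function subtracts the mean from each value and divides by the standard deviation, so a value of 1000 among values near 100 ends up far more than 3 SDs from the mean:

```r
# Hypothetical spending values: 19 typical scores and one extreme score
money <- c(seq(91, 109, by = 1), 1000)

# scale() returns a one-column matrix, so convert it to a plain vector
z <- as.vector(scale(money))

# TRUE for values within 3 SDs of the mean, FALSE for outliers
keep <- abs(z) < 3
print(table(keep))

# Drop the outliers
trimmed <- money[keep]
print(max(trimmed))
```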

Step 4: Check Assumptions

Before we conduct our ANOVA we must ensure that our assumptions of normality, linearity and homogeneity are satisfied. To save time and space in this RPubs log, I will not show how to check each assumption. For the purpose of this example, every assumption was satisfied.
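
For reference, here is a rough sketch of what two of these checks might look like in base R. It is not the procedure used for this analysis: the data are simulated (`toy` is hypothetical), Bartlett's test is used as a base-R stand-in for a homogeneity check, and the linearity check, usually done with plots, is left out. Note also that shapiro.test only accepts up to 5000 values, so it could not be applied directly to a data set as large as ours.

```r
# Simulated stand-in data (hypothetical): 2 sports x 2 years, 50 scores per cell
set.seed(1)
toy <- expand.grid(rep  = 1:50,
                   type = c("Soccer", "Football"),
                   year = c("2007", "2008"))
toy$type  <- factor(toy$type)
toy$year  <- factor(toy$year)
toy$money <- rnorm(nrow(toy), mean = 250, sd = 40)

# Fit the two-way model and pull out the residuals
fit <- aov(money ~ type * year, data = toy)
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals
normality <- shapiro.test(res)

# Homogeneity: Bartlett's test across the four type-by-year cells
homogeneity <- bartlett.test(toy$money, interaction(toy$type, toy$year))

print(normality$p.value)
print(homogeneity$p.value)
```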

Step 5: Run ANOVA

To run the ANOVA on our “noout” dataset we must assign each score a participant number:

noout$partno = 1:nrow(noout)

And then we can run the ANOVA using the “ez” package.

library(ez)

ezANOVA(data = noout, 
        dv = money, 
        wid = partno, 
        between = .(type, year), 
        type = 3)

## $ANOVA
##      Effect DFn   DFd           F            p p<.05         ges
## 2      type   4 18027 501.2736405 0.000000e+00     * 0.100094104
## 3      year   1 18027  35.1955372 3.036590e-09     * 0.001948575
## 4 type:year   4 18027   0.7053989 5.881218e-01       0.000156496
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn   DFd      SSn       SSd        F p p<.05
## 1   9 18027 18598534 200366595 185.9235 0     *

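We can see that the ANOVA found a significant main effect for both the “type” and “year” variables, but not for the interaction. The output also includes Levene's test, which is significant here; strictly speaking this points to unequal variances across the groups, so the homogeneity assumption deserves a closer look. We can now conduct individual t-tests across the type of sport and year to search for significant differences.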

Step 6: Pairwise t-tests

First, we can compare the mean spending of each sport. We will also specify to R that we would like to use the Bonferroni adjustment method to control for the familywise Type I error rate.

# pool.sd = TRUE uses one pooled SD across all groups
pairwise.t.test(noout$money, noout$type,
                paired = FALSE,
                pool.sd = TRUE,
                p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  noout$money and noout$type 
## 
##            Baseball Basketball Football Soccer
## Basketball <2e-16   -          -        -     
## Football   <2e-16   <2e-16     -        -     
## Soccer     <2e-16   <2e-16     <2e-16   -     
## Volleyball <2e-16   <2e-16     <2e-16   <2e-16
## 
## P value adjustment method: bonferroni
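
As a side note on the adjustment itself: with five sports there are 10 pairwise comparisons, and the Bonferroni method multiplies each p-value by the number of comparisons (capped at 1). The sketch below shows that arithmetic on some made-up p-values (`raw_p` is hypothetical, not taken from the analysis):

```r
# Number of pairwise comparisons among five groups
n_groups <- 5
n_comparisons <- choose(n_groups, 2)

# Hypothetical unadjusted p-values
raw_p <- c(0.001, 0.004, 0.03)

# Bonferroni adjustment using the built-in function
adjusted <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

# The same adjustment done by hand
manual <- pmin(1, raw_p * n_comparisons)

print(n_comparisons)
print(adjusted)
print(manual)
```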

We have a significant result for each of the t-tests. As such, the spending in each sport is significantly different from the spending in every other sport. We can interpret the results more accurately if we compare the means for each sport. We can also investigate the other main effect of spending across each year, but to save time and space in this log, I won’t show that.
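
One simple way to get the group means is the tapply function. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical); on the real data, the same calls would use the columns of “noout” instead:

```r
# Hypothetical toy data standing in for the outlier-free data set
toy <- data.frame(
  type  = factor(c("Soccer", "Soccer", "Football", "Football", "Soccer", "Football")),
  year  = factor(c("2007", "2008", "2007", "2008", "2007", "2008")),
  money = c(100, 120, 300, 340, 110, 360)
)

# Mean spending for each sport
sport_means <- tapply(toy$money, toy$type, mean)

# Mean spending for each year
year_means <- tapply(toy$money, toy$year, mean)

# Mean spending for each sport-by-year cell
cell_means <- tapply(toy$money, list(toy$type, toy$year), mean)

print(sport_means)
print(year_means)
print(cell_means)
```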

Luke Keevers

21/02/2021