This is my first RPubs log. I am already somewhat familiar with R from other uni courses and from experience as a research assistant, but I am completely unfamiliar with R Markdown. So my goal for Week 1 is to familiarise myself with the basics of R Markdown. To achieve this, I will use this first markdown document to explain a practice analysis I did a few months ago, and hopefully I will get the hang of some of the markdown tools along the way.
Late last year I took some time to learn basic data analysis with R, on the advice of a post-grad researcher I worked with at the uni. She directed me to Dr. Erin Buchanan’s amazing website ‘Stats of DOOM’, where I found an array of free statistics courses for R, with lecture videos, worked examples and practice questions.
You can find her website here –> https://statisticsofdoom.com/
In this example we are looking at 2 years of spending data for five different sports. We want to know if there are any differences in spending across years and sports.
We have two IVs:
Year: 2007 and 2008
Sport: Basketball, baseball, volleyball, football, soccer
We have one DV:
Money: the amount spent (the “money” column in the data)
We must check our data for anything out of the ordinary, such as scores outside the expected minimum and maximum values, variables that aren’t stored as factors, etc. Our data can be easily checked by loading our data set and using the summary function:
setwd("~/Desktop/Uni/R Practice")
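# stringsAsFactors = TRUE reads "type" as a factor (no longer the default since R 4.0)
master = read.csv("bn 2 anova.csv", stringsAsFactors = TRUE)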
summary(master)
##       year              type          money
##  Min.   :2007   Baseball  :3102   Min.   :  22.19
##  1st Qu.:2007   Basketball:3543   1st Qu.: 131.61
##  Median :2007   Football  :4419   Median : 232.88
##  Mean   :2007   Soccer    :3219   Mean   : 282.45
##  3rd Qu.:2008   Volleyball:3977   3rd Qu.: 389.00
##  Max.   :2008                     Max.   :1439.37
Here we can see that our “year” variable is being read as a continuous variable, rather than a factor. To fix this, we convert “year” to a factor with a level for each year. The table function then allows us to check the frequencies for each year.
master$year = factor(master$year, levels = c("2007", "2008"))
table(master$year)
##
## 2007 2008
## 9586 8674
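To see why the conversion matters, here is a minimal sketch on a tiny made-up data frame (the `toy` object and its values are my own invention, not the spending data). It shows how `summary()` treats the same column differently before and after factoring.

```r
# Toy data, made up for illustration (not the spending dataset)
toy <- data.frame(year = c(2007, 2007, 2008, 2008, 2008))

# Stored as a number, summary() reports quartiles and a mean
summary(toy$year)

# Stored as a factor, summary() reports a count per level instead
toy$year_f <- factor(toy$year, levels = c(2007, 2008))
summary(toy$year_f)

# table() gives the same per-level frequencies
table(toy$year_f)
```

A mean of a year is not meaningful for a grouping variable, which is why the factor version is the one we want going into the ANOVA.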
The summary of the data above revealed no missing data: summary() lists a count of NA's for any column that has them, and none appear in the output. As such, we have no issue here, but in a realistic research environment, one must know how to deal with missing data.
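For reference, a minimal sketch of how missing values could be counted and dropped in base R. The `toy` data frame is made up, and dropping incomplete rows (listwise deletion) is only one of several possible strategies, so treat this as an illustration rather than a recommendation.

```r
# Toy data with a few missing values, made up for illustration
toy <- data.frame(
  year  = c(2007, 2007, 2008, NA, 2008),
  type  = c("Soccer", "Football", NA, "Soccer", "Baseball"),
  money = c(120.5, NA, 310.0, 95.2, 210.7)
)

# Count the missing values in each column
colSums(is.na(toy))

# Keep only the rows with no missing values (listwise deletion)
complete <- na.omit(toy)
nrow(complete)
```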
To check for outliers in the “money” variable, we can create z-scores for all the spending values and see if any fall more than 3 SDs above or below the mean. We can then create a new dataset called “noout” from which those outliers have been excluded. From this point onwards, all analyses will be carried out on this new outlier-free dataset.
zscore = scale(master$money)
summary(abs(zscore) < 3)
## V1
## Mode :logical
## FALSE:223
## TRUE :18037
noout = subset(master, abs(zscore) < 3)
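To make the z-score step less of a black box, here is a small sketch on simulated values (the numbers are made up, with one deliberately extreme value added). It computes the z-scores by hand and checks that they agree with what `scale()` returns.

```r
set.seed(1)

# Simulated spending values plus one deliberately extreme value (made up)
money <- c(rnorm(50, mean = 280, sd = 60), 2000)

# A z-score is the distance from the mean in units of standard deviation
z_manual <- (money - mean(money)) / sd(money)

# scale() does the same calculation but returns a one-column matrix
z_scale <- as.vector(scale(money))
all.equal(z_manual, z_scale)

# Flag values within 3 SDs of the mean; FALSE marks an outlier
keep <- abs(z_manual) < 3
table(keep)
```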
Before we conduct our ANOVA we must ensure that our assumptions of normality, linearity and homogeneity are satisfied. To save time and space in this RPubs log, I will not show how to check each assumption. For the purpose of this example, I will treat every assumption as satisfied.
To run the ANOVA on our “noout” dataset we must first assign each score a participant number, because ezANOVA needs a column that identifies each case (its “wid” argument):
noout$partno = 1:nrow(noout)
And then we can run the ANOVA using the “ez” package.
library(ez)
ezANOVA(data = noout,
        dv = money,
        wid = partno,
        between = .(type, year),
        type = 3)
## $ANOVA
##      Effect DFn   DFd           F            p p<.05         ges
## 2      type   4 18027 501.2736405 0.000000e+00     * 0.100094104
## 3      year   1 18027  35.1955372 3.036590e-09     * 0.001948575
## 4 type:year   4 18027   0.7053989 5.881218e-01       0.000156496
##
## $`Levene's Test for Homogeneity of Variance`
##   DFn   DFd      SSn       SSd        F p p<.05
## 1   9 18027 18598534 200366595 185.9235 0     *
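For anyone without the ez package, a rough base R counterpart is `aov()`. The sketch below runs it on simulated, balanced data (the `sim` data frame and its effect sizes are invented). Note that `aov()` uses sequential (Type I) sums of squares, which agree with Type III only when the design is balanced, so on the real, unbalanced spending data its numbers would not necessarily match the ezANOVA output above.

```r
set.seed(42)

# Simulated balanced design: 5 sports x 2 years x 20 cases per cell (made up)
sports <- c("Baseball", "Basketball", "Football", "Soccer", "Volleyball")
sim <- expand.grid(rep = 1:20, type = sports, year = c("2007", "2008"))
sim$type <- factor(sim$type)
sim$year <- factor(sim$year)

# Invented spending values with a built-in difference between sports
sim$money <- 250 + 20 * as.integer(sim$type) + rnorm(nrow(sim), sd = 50)

# Two-way between-subjects ANOVA with both main effects and the interaction
fit <- aov(money ~ type * year, data = sim)
summary(fit)
```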
We can see that the ANOVA found significant main effects of both “type” and “year”, but no significant interaction between them. We can now run post hoc pairwise t-tests across the type of sport and year to find where the significant differences lie.
First, we can compare the mean spending of each sport. We will also tell R to use the Bonferroni adjustment to control the familywise Type I error rate.
pairwise.t.test(noout$money, noout$type,
                paired = FALSE,
                var.equal = TRUE,
                p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: noout$money and noout$type
##
##            Baseball Basketball Football Soccer
## Basketball <2e-16   -          -        -
## Football   <2e-16   <2e-16     -        -
## Soccer     <2e-16   <2e-16     <2e-16   -
## Volleyball <2e-16   <2e-16     <2e-16   <2e-16
##
## P value adjustment method: bonferroni
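As a quick illustration of what the Bonferroni adjustment does, here is a sketch with hypothetical raw p-values (they are not taken from the analysis above). With five sports there are 10 pairwise comparisons, and each raw p-value is multiplied by that number and capped at 1.

```r
# Five sports give choose(5, 2) pairwise comparisons
n_comparisons <- choose(5, 2)

# Hypothetical raw p-values, made up for illustration
raw_p <- c(0.001, 0.004, 0.03, 0.2)

# Bonferroni by hand: multiply by the number of comparisons, cap at 1
by_hand <- pmin(1, raw_p * n_comparisons)

# The same adjustment using base R's p.adjust()
built_in <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

all.equal(by_hand, built_in)
```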
We have a significant result for each of the t-tests. As such, the spending in each sport is significantly different from the spending in every other sport. The p-values alone do not tell us the direction or size of these differences, so to interpret the results properly we need to compare the mean spending for each sport. We can also investigate the other main effect of spending across each year, but to save time and space in this log, I won’t show that.
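Comparing the group means could look something like the sketch below. It uses a small made-up data frame (`toy`) rather than the real data, so the values mean nothing; the same `aggregate()` call should work on any data frame with a grouping column and a numeric column.

```r
# Toy spending data, made up for illustration
toy <- data.frame(
  type  = c("Soccer", "Soccer", "Football", "Football", "Baseball", "Baseball"),
  money = c(100, 140, 300, 340, 200, 220)
)

# Mean spending for each sport
means <- aggregate(money ~ type, data = toy, FUN = mean)

# Sort from lowest to highest mean to make the ordering easy to read
means[order(means$money), ]
```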
This week I successfully managed to transfer some previous R code into a nice-looking markdown document. I am now familiar with the basics of markdown, including how to format my doc, make it look nice and embed R code.
One challenge I encountered was how to exclude all the warning messages and irrelevant information in the code output from the markdown document. It took me a while, but I found that inserting “message=FALSE” and “warning=FALSE” in the {r} section at the beginning of the code chunk helps trim the fat, so that what you see in the markdown document is just the source code and relevant output.
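As far as I can tell, the same options can also be set once for the whole document instead of chunk by chunk. This is a sketch assuming the knitr package, which R Markdown uses to run the code chunks; note that the option names are the singular `message` and `warning`.

```r
# Placed in a setup chunk near the top of the .Rmd file.
# Applies to every chunk that comes after it.
knitr::opts_chunk$set(
  message = FALSE,  # hide messages, e.g. package start-up notes
  warning = FALSE   # hide warnings
)
```

An individual chunk can still override these defaults by setting the option in its own {r} header.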
The obvious next step for me is to learn how to display data visually, using the ggplot2 package and Associate Professor Navarro’s helpful guides.