Handout for week 1 practical

ANOVA script

You’ll see the first thing I did was make sure there was nothing in R’s memory (rm command). Then I put in all the libraries I would need. That’s just a stylistic preference.

# clear the decks
rm(list = ls())

# Libraries
library("ggplot2")
library("ggfortify") # required for autoplot
library("skimr")
library("dplyr") # needed for group_by
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dioecious <- read.csv("~/Dropbox/Teaching/second_year_stats/r_sessions/1.ANOVAS_regressions/dioecious.csv")
dioecious$SEX<-as.factor(dioecious$SEX) #Needs to be a factor for later analysis.


skim(dioecious)
Data summary
Name dioecious
Number of rows 50
Number of columns 3
_______________________
Column type frequency:
factor 1
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SEX 0 1 FALSE 2 2: 30, 1: 20

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DBH 0 1 191.6 71.21 68 140.50 188 242 316 ▅▇▇▅▇
FLOWERS 0 1 383.5 383.04 1 115.75 253 533 1482 ▇▅▁▂▁
ggplot(dioecious, aes(x = SEX, y = FLOWERS)) +
  geom_boxplot() +
  theme_bw()

You can answer the MCQ by just looking at the boxplot plots, but I wanted to get exact figures (to put into MCQs). So I used the following code using pipes. Might come in handy.

dioecious %>%
  group_by(SEX) %>%
  summarize(median_flower= median(FLOWERS, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   SEX   median_flower
##   <fct>         <dbl>
## 1 1              292 
## 2 2              230.

Carrying out the ANOVA

model_flowers<-lm(FLOWERS~SEX, data = dioecious)
sex_anova<-aov(model_flowers)
summary(sex_anova)
##             Df  Sum Sq Mean Sq F value Pr(>F)
## SEX          1  171841  171841   1.175  0.284
## Residuals   48 7017255  146193

So we can say sex has no significant effect on the number of flowers (One way ANOVA F1,48 = 1.175, p = 0.284)

autoplot(model_flowers, colour = 'SEX')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Regression script

# clear the decks
rm(list = ls())

# Libraries
library("ggplot2")
library("openintro") #Where I'm getting the data from
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library("ggpubr")
library("ggfortify") # required for autoplot

summary(gpa_iq)
##       obs             gpa               iq            gender     
##  Min.   : 1.00   Min.   : 0.530   Min.   : 72.0   Min.   :1.000  
##  1st Qu.:20.25   1st Qu.: 6.278   1st Qu.:103.0   1st Qu.:1.000  
##  Median :42.00   Median : 7.829   Median :110.0   Median :2.000  
##  Mean   :42.97   Mean   : 7.447   Mean   :108.9   Mean   :1.603  
##  3rd Qu.:62.75   3rd Qu.: 8.983   3rd Qu.:117.5   3rd Qu.:2.000  
##  Max.   :89.00   Max.   :10.760   Max.   :136.0   Max.   :2.000  
##     concept     
##  Min.   :20.00  
##  1st Qu.:51.00  
##  Median :59.50  
##  Mean   :56.96  
##  3rd Qu.:66.00  
##  Max.   :80.00
ggplot(gpa_iq, aes(x = iq, y = gpa)) +
  geom_point() +
  theme_bw()

It looks like the higher the student’s IQ the better they do on their GPA

model_iq <- lm(gpa_iq$gpa~gpa_iq$iq)
summary(model_iq)
## 
## Call:
## lm(formula = gpa_iq$gpa ~ gpa_iq$iq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3182 -0.5377  0.2178  1.0268  3.5785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.55706    1.55176  -2.292   0.0247 *  
## gpa_iq$iq    0.10102    0.01414   7.142 4.74e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.635 on 76 degrees of freedom
## Multiple R-squared:  0.4016, Adjusted R-squared:  0.3937 
## F-statistic: 51.01 on 1 and 76 DF,  p-value: 4.737e-10

The equation of the line would be y = -3.56 + 0.1x IQ scores affect GPA scores (R squared = 0.39, F1,76, = 51.01, p = 4.7 x 10^-10)

autoplot(model_iq)

ggscatter(gpa_iq, x = "iq", y = "gpa", add = "reg.line") +
  stat_cor(label.y = 10, 
           aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
  stat_regline_equation(label.y = 12)
## `geom_smooth()` using formula 'y ~ x'