Spooky Stats Workshop

Data Link
Time Series: Atmospheric CO$_2$ Concentrations
Linear Modelling: CO$_2$ Uptake in Chilled Environments
More Modelling: mtcars and Logistic Regression
ggplot2 Examples
RMarkdown
Further Topics

Data Link

This is the data we used in the Time Series section to compare to the data that’s built in to R. You can download it by right clicking the following link and clicking “Open in new tab”.

https://drive.google.com/file/d/0B0xWkUJMicalUUJBejNpSjJneFE/view?usp=sharing

Time Series: Atmospheric CO$_2$ Concentrations

In this section, we looked at some basic time series models and plots in R. Below is a summary of what we covered, followed by the script file.

The code below comes from a script file. This is a basically a text ile, but the extension (.R) tells your operating system to open it in RStudio instead of Notepad. It is good practice to add a description at the start of a file and comments to your code along the way to remind yourself what your code was trying to accomplish. Always treat script files as self-contained; they should contain the data imports, cleaning, and analysis. You should be able to open the script file and run it line by line to recreate your previous analysis.
Packages need to be installed and then loaded. If you get an error that says “object not found”, make sure you’ve loaded the package (with library()) and have spelled everything correctly.
R uses different types of objects. Functions may act differently for different types of objects. Time series objects contain information about the periodic/seasonal nature of the time series as well as the actual time points. Because of this, the basic plot command will know how to plot the data - you don’t need to use y and x. The other arguments, such as col (colour), lty (line type), pch (plotting character), main, xlab, xlim, etc. will still work.
It is possible to take advantage of the structure of different types of objects. The results of the forecast() function include something called the mean. If we want to plot multiple time series forecasts, we can extract the mean label out of the forecast object with a dollar sign (forecast.obj$mean).
There are many ways to import data. The easiest is to use the Import Dataset button in the Environment tab of RStudio. This automatically generates R code, including the path to the file. I recommend that you copy and paste the code into your script file (as I have done below) so that you don’t have to go through the import process every time you open up you files.
There is a common expression in IT and computer science: RTFS. It stands for READ THE SCREEN. The F in RTFS is there because people tend to ignore error messages and the people helping them get very frustrated. Post these letters on your desk when programming with R. Here are some of the interpretations for R’s most common errors:
- “Object not found”: You either didn’t spell it correctly, didn’t load the package, or wrote the code in the script file but never hit run
- “Unexpected symbol”: R will use quotation marks to show where it found the error. The end of the quotation marks will be where the unexpected symbol is. You are probably missing a comma or a bracket.
- + instead of >: Hit escape. You are missing a bracket or a quotation mark somewhere.
- “Could not open connection” or “file not found”: You might not have internet, or you typed the path to a file wrong. Double check spelling and networks.

# Time series data for co2 conc.
# 2017-10-29
# Forecasted using co2 data that only goes up to 1997.
# Since the NOAA still collects data, we were able to 
# compare there forecasts to the actual data.
# It seems that the CO2 concentration is increasing 
# faster than our simple models predicted.

#install.packages("forecast") # Make this a comment after installing
# so that you don't try to install it every time
library(forecast)

## Warning: package 'forecast' was built under R version 3.3.3

data(co2)
#?co2 # comment out help files so they don't run every time

plot(co2)
class(co2)

## [1] "ts"

# I commented these out because I don't want the data to show up
# in the output
#co2
#time(co2)
#frequency(co2)

plot(co2)
abline(lm(co2 ~ time(co2)))
lines(aggregate(co2, FUN=mean), col = 'red')

aggregate(co2, FUN=mean)

## Time Series:
## Start = 1959 
## End = 1997 
## Frequency = 1 
##  [1] 315.8258 316.7475 317.4850 318.2975 318.8325 319.4625 319.8725
##  [8] 321.2100 322.0200 322.8900 324.4592 325.5175 326.1550 327.2933
## [15] 329.5117 330.0792 330.9858 331.9858 333.7300 335.3358 336.6808
## [22] 338.5150 339.7608 340.9592 342.6083 344.2467 345.7258 346.9750
## [29] 348.7508 351.3133 352.7542 354.0367 355.4783 356.2917 356.9958
## [36] 358.8800 360.9142 362.6867 363.8175

# Hey look what happens here!
monthplot(co2)

seasonplot(co2, col = heat.colors(39))

# Autocorrelation function
acf(co2)

pacf(co2)

co2.ar <- auto.arima(co2)
summary(co2.ar)

## Series: co2 
## ARIMA(1,1,1)(1,1,2)[12]                    
## 
## Coefficients:
##          ar1      ma1     sar1     sma1     sma2
##       0.2569  -0.5847  -0.5489  -0.2620  -0.5123
## s.e.  0.1406   0.1204   0.5879   0.5701   0.4819
## 
## sigma^2 estimated as 0.08576:  log likelihood=-84.39
## AIC=180.78   AICc=180.97   BIC=205.5
## 
## Training set error measures:
##                      ME     RMSE       MAE        MPE       MAPE      MASE
## Training set 0.01742092 0.287159 0.2303994 0.00507377 0.06845665 0.1819636
##                     ACF1
## Training set -0.00285809

co2.fore <- forecast(co2.ar, h = 4*12)
co2.fore

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 1998       365.1519 364.7766 365.5272 364.5779 365.7259
## Feb 1998       365.9244 365.4722 366.3766 365.2329 366.6160
## Mar 1998       366.7579 366.2548 367.2611 365.9884 367.5275
## Apr 1998       368.1529 367.6066 368.6991 367.3175 368.9882
## May 1998       368.7106 368.1253 369.2960 367.8154 369.6059
## Jun 1998       367.9974 367.3756 368.6193 367.0464 368.9485
## Jul 1998       366.5046 365.8483 367.1609 365.5009 367.5083
## Aug 1998       364.4546 363.7656 365.1436 363.4009 365.5083
## Sep 1998       362.5883 361.8681 363.3085 361.4868 363.6898
## Oct 1998       362.7245 361.9744 363.4746 361.5773 363.8717
## Nov 1998       364.1808 363.4019 364.9597 362.9896 365.3720
## Dec 1998       365.6001 364.7934 366.4067 364.3664 366.8337
## Jan 1999       366.6541 365.8001 367.5082 365.3479 367.9603
## Feb 1999       367.4890 366.5970 368.3811 366.1248 368.8533
## Mar 1999       368.3600 367.4332 369.2868 366.9426 369.7774
## Apr 1999       369.6847 368.7249 370.6446 368.2167 371.1528
## May 1999       370.2598 369.2680 371.2516 368.7430 371.7766
## Jun 1999       369.5900 368.5673 370.6127 368.0260 371.1540
## Jul 1999       368.0811 367.0284 369.1337 366.4712 369.6909
## Aug 1999       366.0296 364.9479 367.1114 364.3752 367.6841
## Sep 1999       364.1807 363.0705 365.2908 362.4829 365.8785
## Oct 1999       364.2948 363.1570 365.4326 362.5547 366.0349
## Nov 1999       365.7120 364.5472 366.8769 363.9306 367.4935
## Dec 1999       367.1133 365.9221 368.3045 365.2915 368.9351
## Jan 2000       368.1622 366.9350 369.3895 366.2854 370.0391
## Feb 2000       368.9957 367.7369 370.2545 367.0706 370.9209
## Mar 2000       369.8545 368.5658 371.1433 367.8836 371.8255
## Apr 2000       371.2200 369.9022 372.5378 369.2046 373.2354
## May 2000       371.7861 370.4399 373.1323 369.7272 373.8450
## Jun 2000       371.0926 369.7186 372.4666 368.9913 373.1940
## Jul 2000       369.5926 368.1913 370.9938 367.4495 371.7356
## Aug 2000       367.5420 366.1140 368.9699 365.3581 369.7258
## Sep 2000       365.6835 364.2293 367.1377 363.4595 367.9075
## Oct 2000       365.8097 364.3298 367.2897 363.5463 368.0731
## Nov 2000       367.2484 365.7431 368.7537 364.9463 369.5505
## Dec 2000       368.6595 367.1294 370.1897 366.3193 370.9998
## Jan 2001       369.7113 368.1450 371.2776 367.3158 372.1068
## Feb 2001       370.5456 368.9479 372.1432 368.1021 372.9890
## Mar 2001       371.4110 369.7835 373.0385 368.9220 373.9000
## Apr 2001       372.7541 371.0977 374.4106 370.2208 375.2875
## May 2001       373.3252 371.6402 375.0101 370.7483 375.9021
## Jun 2001       372.6447 370.9318 374.3576 370.0250 375.2644
## Jul 2001       371.1398 369.3993 372.8802 368.4780 373.8015
## Aug 2001       369.0887 367.3212 370.8562 366.3855 371.7919
## Sep 2001       367.2354 365.4412 369.0296 364.4915 369.9794
## Oct 2001       367.3550 365.5346 369.1755 364.5709 370.1392
## Nov 2001       368.7819 366.9356 370.6283 365.9581 371.6057
## Dec 2001       370.1877 368.3157 372.0596 367.3248 373.0506

plot(co2.fore, main = "Forecasted co2",
     xlim = c(1995, 2002), ylim = c(355, 375))

# Forecasting with loess smoothers
#?stl
co2.stl <- stl(co2, s.window = "period")
plot(co2.stl)

plot(forecast(co2.stl, h = 4*12),
     xlim= c(1995, 2002), ylim = c(355, 375))
lines(co2.fore$mean, col = 'purple', lwd = 2)
legend('topleft', legend = c("ARIMA", "STL"),
       col = c("purple", "blue"), lwd=2)

# Load in the new data - contains co2 up to 2017
newco2 <- read.csv("C:/Users/d7bec/Desktop/newco2.csv")
plot(newco2$Average, type = 'l')

# Some of the average co2 conc's are below 0 - that's
# not how concentrations work.
# Missing data was recorded as -99.99.
ave <- newco2$Average
which(ave < 0)

## [1]   4   8  72  73  74 214 314

ave[which(ave < 0)] <- NA

plot(ave, type = 'l')

# Create a time series object to take advantage of 
# time series functions
newts <- ts(ave, start = c(1958, 3),
            end = c(2017, 9), frequency = 12)

plot(newts)

plot(newts, xlim = c(1995, 2017), ylim = c(355,415))
forc.arima <- forecast(co2.ar, h = 20*12)
forc.stl <- forecast(co2.stl, h = 20*12)

lines(forc.arima$mean, col = "purple")
lines(forc.stl$mean, col = "blue")

plot(forc.arima, xlim = c(1995, 2017), ylim = c(355, 415))
lines(newts)

Linear Modelling: CO$_2$ Uptake in Chilled Environments

Formula notation, y ~ x1 + x2, is used by almost all models in R. The left side is the response/dependent variable, and the explanatory/independent variables are on the right. There are a lot of interesting things to do with formulae, the table below shows some of them. Note that multiplication is done with :. Also, if you want to transform your data, you should do it before you put it in a formula - math doesn’t quite work the same here. If you do want to take, say, the log of your data, you need to use I(log(x)) - the I() tells R to interpret it as math instead of a formula.

Model	Notation
$y = \beta_0+\beta_1x_1+\beta_2x_2$	`y ~ x1 + x2`
$y = \beta_1x_1 +\beta_2x_2$	`y ~ x1 + x2 - 1`
$y = \beta_0 + \beta_1x_1x_2$	`y ~ x1 : x2`
$y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2$	`y ~ x1 + x2 + x1:x2`
$y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2$	`y ~ x1 * x2`
$y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2$	`y ~ (x1 + x2)^2`
$y = \beta_0 + \beta_1x_1^2$	`y ~ I(x1^2)`
$y = \beta_0 + \beta_1x_1 + \beta_2x_1^2$	`y ~ poly(x1, 2)`
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...$	`y ~ .`

There are a lot of important plots to make when exploring the data, but very few of them are actually needed when doing a final report. It’s good to keep these in a script file then have a second file with only the results you need.
I don’t like using the attach function. attach() will clone your data so that you can access columns outside of a data.frame. Here are my arguments against doing this:
- Changing a cloned version doesn’t change the original and vice versa, so you may end up thinking you changed something that you haven’t. This is especially problematic if you need to fix bad observations, which happens a lot.
- Assigning a new value to a column is done with mydataframe$columnname <- newthings, which is a little bit cumbersome. This forces you to acknowledge that you are changing your data so that you don’t accidentally overwrite a variable.
- Inside a dataframe, all columns must have the same number of observations. If you try and change your data and you accidentally change the number of observations, the data frame will give you an error. This won’t happen if you use the attach function, then you’ll get confusing errors later.
- You can have two different dataframes with the same column names. You can do plot(y~x, data=df1) and plot(y~x, data=df2) without error, but attaching both will destroy the attached data from either df1 or df2 and sometimes it’s hard to know which.
The abline() function adds a line $y = a + bx$ to a plot. If you have a linear model with an intercep and a slope, abline() will automatically add this line. If you have several x’s in your data, you can look through summary(mylm) and get the coefficients from there.
The results of summary(mylm) are an object as well. The output can be used to excract coefficients and p-values, where necessary.
R makes prominent use of dummy variables (0’s and 1’s). For factors, the coeffeicients are the coefficients for the 1’s. The 0’s are called the base level. In the table, the coefficient shows how much that variable changes the mean. For instance, if your model is $y = a + bx$ where $x$ is 0 or 1, then $a$ is the mean of $y$ when $x=0$ and $a + b$ is the mean of $y$ when $x=1$. The coefficient table reports the value of $b$, but note that $b$ only makes sense in relation to $a$.

# Linear Modelling - CO2 data

data(CO2)
#?CO2

plot(uptake ~ conc, data = CO2)

boxplot(uptake ~ Treatment, data = CO2)

hist(CO2$uptake)

hist(CO2$conc)

# I probably shouldn't fit a linear model to this data.

# Fitting a linear model to this data ----

colm <- lm(uptake ~ conc, data = CO2)
summary(colm)

## 
## Call:
## lm(formula = uptake ~ conc, data = CO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.831  -7.729   1.483   7.748  16.394 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.500290   1.853080  10.523  < 2e-16 ***
## conc         0.017731   0.003529   5.024 2.91e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.514 on 82 degrees of freedom
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2261 
## F-statistic: 25.25 on 1 and 82 DF,  p-value: 2.906e-06

plot(uptake ~ conc, data = CO2)
abline(colm)

coef(colm)

## (Intercept)        conc 
## 19.50028981  0.01773059

abline(a = 19.5, b = 0.0177, col = 'hotpink')

colm# not interesting

## 
## Call:
## lm(formula = uptake ~ conc, data = CO2)
## 
## Coefficients:
## (Intercept)         conc  
##    19.50029      0.01773

summary(colm)

## 
## Call:
## lm(formula = uptake ~ conc, data = CO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.831  -7.729   1.483   7.748  16.394 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.500290   1.853080  10.523  < 2e-16 ***
## conc         0.017731   0.003529   5.024 2.91e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.514 on 82 degrees of freedom
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2261 
## F-statistic: 25.25 on 1 and 82 DF,  p-value: 2.906e-06

names(summary(colm))

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

str(summary(colm))

## List of 11
##  $ call         : language lm(formula = uptake ~ conc, data = CO2)
##  $ terms        :Classes 'terms', 'formula'  language uptake ~ conc
##   .. ..- attr(*, "variables")= language list(uptake, conc)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "uptake" "conc"
##   .. .. .. ..$ : chr "conc"
##   .. ..- attr(*, "term.labels")= chr "conc"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(uptake, conc)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "uptake" "conc"
##  $ residuals    : Named num [1:84] -5.18 7.8 10.87 11.49 6.93 ...
##   ..- attr(*, "names")= chr [1:84] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 19.50029 0.01773 1.85308 0.00353 10.52318 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "conc"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "conc"
##  $ sigma        : num 9.51
##  $ df           : int [1:3] 2 82 2
##  $ r.squared    : num 0.235
##  $ adj.r.squared: num 0.226
##  $ fstatistic   : Named num [1:3] 25.2 1 82
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 3.79e-02 -5.98e-05 -5.98e-05 1.38e-07
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "conc"
##   .. ..$ : chr [1:2] "(Intercept)" "conc"
##  - attr(*, "class")= chr "summary.lm"

# Extract the coefficients
summary(colm)$coef

##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 19.50028981 1.853080024 10.523177 6.967392e-17
## conc         0.01773059 0.003528853  5.024461 2.905607e-06

summary(colm)$coef[1,1]

## [1] 19.50029

summary(colm)$coef[1,]

##     Estimate   Std. Error      t value     Pr(>|t|) 
## 1.950029e+01 1.853080e+00 1.052318e+01 6.967392e-17

summary(colm)$coef[,4]

##  (Intercept)         conc 
## 6.967392e-17 2.905607e-06

# Multiple Linear Regression ----
colmm <- lm(uptake ~ conc + Treatment, data = CO2)
summary(colmm)

## 
## Call:
## lm(formula = uptake ~ conc + Treatment, data = CO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.401  -7.066  -1.168   7.573  17.597 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.930052   1.989746  11.524  < 2e-16 ***
## conc              0.017731   0.003306   5.364 7.55e-07 ***
## Treatmentchilled -6.859524   1.944840  -3.527 0.000695 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.912 on 81 degrees of freedom
## Multiple R-squared:  0.3372, Adjusted R-squared:  0.3208 
## F-statistic:  20.6 on 2 and 81 DF,  p-value: 5.837e-08

# This only changes the intercept for Treatment = chilled

# Slope and intercept
colmm2 <- lm(uptake ~ conc * Treatment, data = CO2)
# * means conc + Treatment + conc:Treatment
summary(colmm2)

## 
## Call:
## lm(formula = uptake ~ conc * Treatment, data = CO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.218  -7.401  -1.117   7.835  17.209 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           22.019163   2.464160   8.936 1.17e-13 ***
## conc                   0.019825   0.004693   4.225 6.29e-05 ***
## Treatmentchilled      -5.037747   3.484848  -1.446    0.152    
## conc:Treatmentchilled -0.004188   0.006636  -0.631    0.530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.946 on 80 degrees of freedom
## Multiple R-squared:  0.3405, Adjusted R-squared:  0.3157 
## F-statistic: 13.77 on 3 and 80 DF,  p-value: 2.528e-07

# Comparing Models ----
anova(colmm, colmm2)

## Analysis of Variance Table
## 
## Model 1: uptake ~ conc + Treatment
## Model 2: uptake ~ conc * Treatment
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     81 6433.9                           
## 2     80 6402.0  1    31.871 0.3983 0.5298

# p-value is 0.53 so there is no significant difference
# between the variances in the model.
# The models have approximately the same fit.

AIC(colmm)

## [1] 610.8169

AIC(colmm2)

## [1] 612.3997

# Model without conc:Treatment is better (lower AIC).
# Use uptake ~ conc + Treatment as the model

BIC(colmm)

## [1] 620.5401

BIC(colmm2)

## [1] 624.5538

# Checking the residuals ----
plot(colmm$residuals ~ CO2$uptake)

# This looks really bad - there shouldn't be patterns!

qqnorm(colmm$residuals)
qqline(colmm$residuals)

hist(colmm$residuals)

# Conclusion: Our best model is still pretty bad.

# Challenge
data(mtcars)
plot(mpg ~ disp, data = mtcars)

plot(mpg ~ wt, data = mtcars)

boxplot(mpg ~ am, data = mtcars)

# fit a linear model and tell me about it.

More Modelling: mtcars and Logistic Regression

In this section, I include some of the example code. The first part shows the effect of interaction in a linear model; using * instead of + in your formula (see the table in the previous section).

An interaction between a continous variable and a 0/1 variable is essentially fitting two different lines - onle line for 0, another line for 1. The advantage of this is that you get to test if these two lines are statistically significantly different.
ANOVA models are made using anova(aov(response ~ category)). ANOVA is secretly just linear models applied to a categorical predictor, so you can use lm() instead of aov. ANOVA can also be used to compare two different models, regardless of what kind of predictors there are.
Logistic regression has basically the same syntax as linear models, but with glm() and you must include a family argument. Logistic models are fit with family = binomial() and Poisson models are fit with family = Poisson(). To actually add the fitted model to a plot, use the predict() function. An example is given, and this example applies to any glm() model.
The step() function can be used to find the predictors which give the lowest AIC. This is an automated method, and relies on doing a lot of hypothesis tests.
- Each hypothesis test has a 5% chance of incorrectly concluding that something is significant when it is not (type 1 error). When you do two of these, both have a 5% chance, which means there is a 9.75% chance that at least one of them is falsely significant. With 3, there’s a 14.27% chance. The step() function will do 40 or 50 of these tests for moderate sized data sets, which gives a 92.3% chance that one of the tests was falsely significant. Becareful with this. See the Wikipedia Multiple Comparisons artcle for more info.

data(mtcars)

lm1 <- lm(mpg ~ disp + am, data = mtcars)
lm2 <- lm(mpg ~ disp * am, data = mtcars)

plot(mpg ~ disp, col = am + 1, data=mtcars, pch = 16)
coef(lm1)

## (Intercept)        disp          am 
## 27.84808111 -0.03685086  1.83345825

coef(lm2)

## (Intercept)        disp          am     disp:am 
## 25.15706407 -0.02758360  7.70907298 -0.03145482

abline(a = 27.848 + 1.833, b = -0.03685, col = 1)
abline(a = 27.848, b = -0.03685, col = 2)

abline(a = 25.15706407, 
       b = -0.02758360, 
       col = 1, lty = 2)
abline(a = 25.15706407 + 7.70907298, 
       b = -0.02758360-0.03145482 , 
       col = 2, lty = 2)
legend('topright', legend=c("With *", "With +"), lty = 2:1)

# ~. performs 
mtcars.all <- lm (mpg ~ ., data = mtcars)
mtfinal <- step(mtcars.all)

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790

plot(mtfinal$residuals ~ mtcars$mpg)

summary(lm(mpg ~ am, data = mtcars))

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

# Anova and linear models are the same!
anova(aov(mpg ~ am, data = mtcars))

## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## am         1 405.15  405.15   16.86 0.000285 ***
## Residuals 30 720.90   24.03                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Silly examples
# Logistic Regression
plot(am ~ mpg, data = mtcars)

ampg <- glm(am ~ mpg, data = mtcars, family = binomial())
summary(ampg)

## 
## Call:
## glm(formula = am ~ mpg, family = binomial(), data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5701  -0.7531  -0.4245   0.5866   2.0617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 29.675  on 30  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 5

#?predict
newdata <- data.frame(mpg = seq(10,35,0.5))
ampg.pred <- predict(object = ampg, newdata = newdata,
                     type = 'response')

plot(am ~ mpg, data = mtcars)
lines(ampg.pred ~ newdata$mpg)

# Do not run:
# Repeated measures anova
anova(aov(response ~ predictor + Error(subject_id)))
?Error

ggplot2 Examples

ggplot2 is a relatively new plotting method that creates beautiful plots easily. I believe it is worth learning the base R functions for exploratory plotting, but look into ggplot2 for final reports. It makes a lot of plots very easy, but fine-tuning plots gets finnicky. If it’s really hard to do in ggplot2, then it’s probably not a good plot in the first place.

ggplot objects start with a ggplot() function, which sets up the data. You add geoms to the data to make plots. You give the data plotting functions, rather than plotting data. It’s a subtle but powerful change in philosophy.
Recall that the following plots were slowly built up - I didn’t start with the final products. This is part of the ggplot philosophy - you add ideas to a plot.

#install.packages("ggplot2")
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

#Base R
plot(mpg ~ disp, col = am + 1, data = mtcars)

## ggplot2
ggplot(data = mtcars, 
       mapping = aes(x = disp, 
                     y = mpg, 
                     colour = factor(am))) +
    geom_point() + geom_smooth(method = "lm") + 
    theme_bw() + xlab("Displacement (Cu.in)") +
    ylab("Miles Per Gallon") +
    ggtitle(label = "MPG against displacement",
            subtitle = "Manual cars are better")

ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) + 
    geom_density(alpha = 0.5) + theme_minimal() +
    facet_grid(~am)

RMarkdown

RMarkdown is a fantastic tool for creating reports. You can use some plain English text and then switch to LaTeX code for equations and then add some R code that will be evaluated. The R code chunks can show the code, output, neither, or both. It can be used to make webpages, blogs, and applications in html, papers, books, and reports in pdf (requires LaTeX), and reports in docx (requires Word or LibreOffice).

Your code will be run each time you compile, so make sure the code can run on its own (you’ll need to have code chunks that import data and do all of the cleaning). The great thing about this is that if your data changes, your report changes too! You don’t have to re-make your plots and tables, they automatically update.

If you click the green plus in RStudio to open a new file, select RMarkdown. The default file has a nice, short overview to get you started. Fine-tuning reports can be fickle.

Further Topics

I obviously couldn’t cover everything. Here are some things to help you keep going with R:

dplyr and the tidyverse are a collection of packages written by Hadley Wickham. They are designed to make data manipulation easier for most tasks. Packages like forcats make it easier to deal with factor variables (forcats is an anagram of factors), lubridate makes it easier to deal with date/time variables, and the dplyr funcntions are made for manipulating dataframes. data.table is a package that makes R better at dealing with large data frames and has some unique syntax to go along with it.
Datacamp has some good R courses, but I believe some of them are paid. There are also great courses from MIT and Stanford that are available online for free. Note that some courses use a lot of base R functions and others will focus on dplyr. This may be confusing at first. I believe base R is a little bit more useful and harder to learn, so when you come across dplyr it will be easier to adjust.
R-Bloggers.com aggregates blogs from many different places and is a good thing to browse in order to expose yourself to more R code. #rstats on twitter is also useful.
Any time you have a problem, Googling “r functionname errormessage” will usually get you a good solution. Most of the time, this will take you to R-Bloggers, StackExchange, or the RStudio help forums. These are all great resources.

Thanks for attending this workshop!

Spooky Stats Workshop

Devan Becker

October 29, 2017

Data Link

Time Series: Atmospheric CO\(_2\) Concentrations

Linear Modelling: CO\(_2\) Uptake in Chilled Environments

More Modelling: mtcars and Logistic Regression

ggplot2 Examples

RMarkdown

Further Topics

Model	Notation
\(y = \beta_0+\beta_1x_1+\beta_2x_2\)	`y ~ x1 + x2`
\(y = \beta_1x_1 +\beta_2x_2\)	`y ~ x1 + x2 - 1`
\(y = \beta_0 + \beta_1x_1x_2\)	`y ~ x1 : x2`
\(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\)	`y ~ x1 + x2 + x1:x2`
\(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\)	`y ~ x1 * x2`
\(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\)	`y ~ (x1 + x2)^2`
\(y = \beta_0 + \beta_1x_1^2\)	`y ~ I(x1^2)`
\(y = \beta_0 + \beta_1x_1 + \beta_2x_1^2\)	`y ~ poly(x1, 2)`
\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...\)	`y ~ .`