This is the data we used in the Time Series section to compare to the data that’s built in to R. You can download it by right clicking the following link and clicking “Open in new tab”.
https://drive.google.com/file/d/0B0xWkUJMicalUUJBejNpSjJneFE/view?usp=sharing
In this section, we looked at some basic time series models and plots in R. Below is a summary of what we covered, followed by the script file.
library()) and have spelled everything correctly.forecast() function include something called the mean. If we want to plot multiple time series forecasts, we can extract the mean label out of the forecast object with a dollar sign (forecast.obj$mean).+ instead of >: Hit escape. You are missing a bracket or a quotation mark somewhere.# Time series data for co2 conc.
# 2017-10-29
# Forecasted using co2 data that only goes up to 1997.
# Since the NOAA still collects data, we were able to
# compare there forecasts to the actual data.
# It seems that the CO2 concentration is increasing
# faster than our simple models predicted.
#install.packages("forecast") # Make this a comment after installing
# so that you don't try to install it every time
library(forecast)
## Warning: package 'forecast' was built under R version 3.3.3
data(co2)
#?co2 # comment out help files so they don't run every time
plot(co2)
class(co2)
## [1] "ts"
# I commented these out because I don't want the data to show up
# in the output
#co2
#time(co2)
#frequency(co2)
plot(co2)
abline(lm(co2 ~ time(co2)))
lines(aggregate(co2, FUN=mean), col = 'red')
aggregate(co2, FUN=mean)
## Time Series:
## Start = 1959
## End = 1997
## Frequency = 1
## [1] 315.8258 316.7475 317.4850 318.2975 318.8325 319.4625 319.8725
## [8] 321.2100 322.0200 322.8900 324.4592 325.5175 326.1550 327.2933
## [15] 329.5117 330.0792 330.9858 331.9858 333.7300 335.3358 336.6808
## [22] 338.5150 339.7608 340.9592 342.6083 344.2467 345.7258 346.9750
## [29] 348.7508 351.3133 352.7542 354.0367 355.4783 356.2917 356.9958
## [36] 358.8800 360.9142 362.6867 363.8175
# Hey look what happens here!
monthplot(co2)
seasonplot(co2, col = heat.colors(39))
# Autocorrelation function
acf(co2)
pacf(co2)
co2.ar <- auto.arima(co2)
summary(co2.ar)
## Series: co2
## ARIMA(1,1,1)(1,1,2)[12]
##
## Coefficients:
## ar1 ma1 sar1 sma1 sma2
## 0.2569 -0.5847 -0.5489 -0.2620 -0.5123
## s.e. 0.1406 0.1204 0.5879 0.5701 0.4819
##
## sigma^2 estimated as 0.08576: log likelihood=-84.39
## AIC=180.78 AICc=180.97 BIC=205.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01742092 0.287159 0.2303994 0.00507377 0.06845665 0.1819636
## ACF1
## Training set -0.00285809
co2.fore <- forecast(co2.ar, h = 4*12)
co2.fore
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1998 365.1519 364.7766 365.5272 364.5779 365.7259
## Feb 1998 365.9244 365.4722 366.3766 365.2329 366.6160
## Mar 1998 366.7579 366.2548 367.2611 365.9884 367.5275
## Apr 1998 368.1529 367.6066 368.6991 367.3175 368.9882
## May 1998 368.7106 368.1253 369.2960 367.8154 369.6059
## Jun 1998 367.9974 367.3756 368.6193 367.0464 368.9485
## Jul 1998 366.5046 365.8483 367.1609 365.5009 367.5083
## Aug 1998 364.4546 363.7656 365.1436 363.4009 365.5083
## Sep 1998 362.5883 361.8681 363.3085 361.4868 363.6898
## Oct 1998 362.7245 361.9744 363.4746 361.5773 363.8717
## Nov 1998 364.1808 363.4019 364.9597 362.9896 365.3720
## Dec 1998 365.6001 364.7934 366.4067 364.3664 366.8337
## Jan 1999 366.6541 365.8001 367.5082 365.3479 367.9603
## Feb 1999 367.4890 366.5970 368.3811 366.1248 368.8533
## Mar 1999 368.3600 367.4332 369.2868 366.9426 369.7774
## Apr 1999 369.6847 368.7249 370.6446 368.2167 371.1528
## May 1999 370.2598 369.2680 371.2516 368.7430 371.7766
## Jun 1999 369.5900 368.5673 370.6127 368.0260 371.1540
## Jul 1999 368.0811 367.0284 369.1337 366.4712 369.6909
## Aug 1999 366.0296 364.9479 367.1114 364.3752 367.6841
## Sep 1999 364.1807 363.0705 365.2908 362.4829 365.8785
## Oct 1999 364.2948 363.1570 365.4326 362.5547 366.0349
## Nov 1999 365.7120 364.5472 366.8769 363.9306 367.4935
## Dec 1999 367.1133 365.9221 368.3045 365.2915 368.9351
## Jan 2000 368.1622 366.9350 369.3895 366.2854 370.0391
## Feb 2000 368.9957 367.7369 370.2545 367.0706 370.9209
## Mar 2000 369.8545 368.5658 371.1433 367.8836 371.8255
## Apr 2000 371.2200 369.9022 372.5378 369.2046 373.2354
## May 2000 371.7861 370.4399 373.1323 369.7272 373.8450
## Jun 2000 371.0926 369.7186 372.4666 368.9913 373.1940
## Jul 2000 369.5926 368.1913 370.9938 367.4495 371.7356
## Aug 2000 367.5420 366.1140 368.9699 365.3581 369.7258
## Sep 2000 365.6835 364.2293 367.1377 363.4595 367.9075
## Oct 2000 365.8097 364.3298 367.2897 363.5463 368.0731
## Nov 2000 367.2484 365.7431 368.7537 364.9463 369.5505
## Dec 2000 368.6595 367.1294 370.1897 366.3193 370.9998
## Jan 2001 369.7113 368.1450 371.2776 367.3158 372.1068
## Feb 2001 370.5456 368.9479 372.1432 368.1021 372.9890
## Mar 2001 371.4110 369.7835 373.0385 368.9220 373.9000
## Apr 2001 372.7541 371.0977 374.4106 370.2208 375.2875
## May 2001 373.3252 371.6402 375.0101 370.7483 375.9021
## Jun 2001 372.6447 370.9318 374.3576 370.0250 375.2644
## Jul 2001 371.1398 369.3993 372.8802 368.4780 373.8015
## Aug 2001 369.0887 367.3212 370.8562 366.3855 371.7919
## Sep 2001 367.2354 365.4412 369.0296 364.4915 369.9794
## Oct 2001 367.3550 365.5346 369.1755 364.5709 370.1392
## Nov 2001 368.7819 366.9356 370.6283 365.9581 371.6057
## Dec 2001 370.1877 368.3157 372.0596 367.3248 373.0506
plot(co2.fore, main = "Forecasted co2",
xlim = c(1995, 2002), ylim = c(355, 375))
# Forecasting with loess smoothers
#?stl
co2.stl <- stl(co2, s.window = "period")
plot(co2.stl)
plot(forecast(co2.stl, h = 4*12),
xlim= c(1995, 2002), ylim = c(355, 375))
lines(co2.fore$mean, col = 'purple', lwd = 2)
legend('topleft', legend = c("ARIMA", "STL"),
col = c("purple", "blue"), lwd=2)
# Load in the new data - contains co2 up to 2017
newco2 <- read.csv("C:/Users/d7bec/Desktop/newco2.csv")
plot(newco2$Average, type = 'l')
# Some of the average co2 conc's are below 0 - that's
# not how concentrations work.
# Missing data was recorded as -99.99.
ave <- newco2$Average
which(ave < 0)
## [1] 4 8 72 73 74 214 314
ave[which(ave < 0)] <- NA
plot(ave, type = 'l')
# Create a time series object to take advantage of
# time series functions
newts <- ts(ave, start = c(1958, 3),
end = c(2017, 9), frequency = 12)
plot(newts)
plot(newts, xlim = c(1995, 2017), ylim = c(355,415))
forc.arima <- forecast(co2.ar, h = 20*12)
forc.stl <- forecast(co2.stl, h = 20*12)
lines(forc.arima$mean, col = "purple")
lines(forc.stl$mean, col = "blue")
plot(forc.arima, xlim = c(1995, 2017), ylim = c(355, 415))
lines(newts)
y ~ x1 + x2, is used by almost all models in R. The left side is the response/dependent variable, and the explanatory/independent variables are on the right. There are a lot of interesting things to do with formulae, the table below shows some of them. Note that multiplication is done with :. Also, if you want to transform your data, you should do it before you put it in a formula - math doesn’t quite work the same here. If you do want to take, say, the log of your data, you need to use I(log(x)) - the I() tells R to interpret it as math instead of a formula.| Model | Notation |
|---|---|
| \(y = \beta_0+\beta_1x_1+\beta_2x_2\) | y ~ x1 + x2 |
| \(y = \beta_1x_1 +\beta_2x_2\) | y ~ x1 + x2 - 1 |
| \(y = \beta_0 + \beta_1x_1x_2\) | y ~ x1 : x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ x1 + x2 + x1:x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ x1 * x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ (x1 + x2)^2 |
| \(y = \beta_0 + \beta_1x_1^2\) | y ~ I(x1^2) |
| \(y = \beta_0 + \beta_1x_1 + \beta_2x_1^2\) | y ~ poly(x1, 2) |
| \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...\) | y ~ . |
attach() will clone your data so that you can access columns outside of a data.frame. Here are my arguments against doing this:
mydataframe$columnname <- newthings, which is a little bit cumbersome. This forces you to acknowledge that you are changing your data so that you don’t accidentally overwrite a variable.plot(y~x, data=df1) and plot(y~x, data=df2) without error, but attaching both will destroy the attached data from either df1 or df2 and sometimes it’s hard to know which.abline() function adds a line \(y = a + bx\) to a plot. If you have a linear model with an intercep and a slope, abline() will automatically add this line. If you have several x’s in your data, you can look through summary(mylm) and get the coefficients from there.summary(mylm) are an object as well. The output can be used to excract coefficients and p-values, where necessary.# Linear Modelling - CO2 data
data(CO2)
#?CO2
plot(uptake ~ conc, data = CO2)
boxplot(uptake ~ Treatment, data = CO2)
hist(CO2$uptake)
hist(CO2$conc)
# I probably shouldn't fit a linear model to this data.
# Fitting a linear model to this data ----
colm <- lm(uptake ~ conc, data = CO2)
summary(colm)
##
## Call:
## lm(formula = uptake ~ conc, data = CO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.831 -7.729 1.483 7.748 16.394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.500290 1.853080 10.523 < 2e-16 ***
## conc 0.017731 0.003529 5.024 2.91e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.514 on 82 degrees of freedom
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2261
## F-statistic: 25.25 on 1 and 82 DF, p-value: 2.906e-06
plot(uptake ~ conc, data = CO2)
abline(colm)
coef(colm)
## (Intercept) conc
## 19.50028981 0.01773059
abline(a = 19.5, b = 0.0177, col = 'hotpink')
colm# not interesting
##
## Call:
## lm(formula = uptake ~ conc, data = CO2)
##
## Coefficients:
## (Intercept) conc
## 19.50029 0.01773
summary(colm)
##
## Call:
## lm(formula = uptake ~ conc, data = CO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.831 -7.729 1.483 7.748 16.394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.500290 1.853080 10.523 < 2e-16 ***
## conc 0.017731 0.003529 5.024 2.91e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.514 on 82 degrees of freedom
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2261
## F-statistic: 25.25 on 1 and 82 DF, p-value: 2.906e-06
names(summary(colm))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
str(summary(colm))
## List of 11
## $ call : language lm(formula = uptake ~ conc, data = CO2)
## $ terms :Classes 'terms', 'formula' language uptake ~ conc
## .. ..- attr(*, "variables")= language list(uptake, conc)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "uptake" "conc"
## .. .. .. ..$ : chr "conc"
## .. ..- attr(*, "term.labels")= chr "conc"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(uptake, conc)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "uptake" "conc"
## $ residuals : Named num [1:84] -5.18 7.8 10.87 11.49 6.93 ...
## ..- attr(*, "names")= chr [1:84] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] 19.50029 0.01773 1.85308 0.00353 10.52318 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "conc"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "conc"
## $ sigma : num 9.51
## $ df : int [1:3] 2 82 2
## $ r.squared : num 0.235
## $ adj.r.squared: num 0.226
## $ fstatistic : Named num [1:3] 25.2 1 82
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 3.79e-02 -5.98e-05 -5.98e-05 1.38e-07
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "conc"
## .. ..$ : chr [1:2] "(Intercept)" "conc"
## - attr(*, "class")= chr "summary.lm"
# Extract the coefficients
summary(colm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.50028981 1.853080024 10.523177 6.967392e-17
## conc 0.01773059 0.003528853 5.024461 2.905607e-06
summary(colm)$coef[1,1]
## [1] 19.50029
summary(colm)$coef[1,]
## Estimate Std. Error t value Pr(>|t|)
## 1.950029e+01 1.853080e+00 1.052318e+01 6.967392e-17
summary(colm)$coef[,4]
## (Intercept) conc
## 6.967392e-17 2.905607e-06
# Multiple Linear Regression ----
colmm <- lm(uptake ~ conc + Treatment, data = CO2)
summary(colmm)
##
## Call:
## lm(formula = uptake ~ conc + Treatment, data = CO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.401 -7.066 -1.168 7.573 17.597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.930052 1.989746 11.524 < 2e-16 ***
## conc 0.017731 0.003306 5.364 7.55e-07 ***
## Treatmentchilled -6.859524 1.944840 -3.527 0.000695 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.912 on 81 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.3208
## F-statistic: 20.6 on 2 and 81 DF, p-value: 5.837e-08
# This only changes the intercept for Treatment = chilled
# Slope and intercept
colmm2 <- lm(uptake ~ conc * Treatment, data = CO2)
# * means conc + Treatment + conc:Treatment
summary(colmm2)
##
## Call:
## lm(formula = uptake ~ conc * Treatment, data = CO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.218 -7.401 -1.117 7.835 17.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.019163 2.464160 8.936 1.17e-13 ***
## conc 0.019825 0.004693 4.225 6.29e-05 ***
## Treatmentchilled -5.037747 3.484848 -1.446 0.152
## conc:Treatmentchilled -0.004188 0.006636 -0.631 0.530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.946 on 80 degrees of freedom
## Multiple R-squared: 0.3405, Adjusted R-squared: 0.3157
## F-statistic: 13.77 on 3 and 80 DF, p-value: 2.528e-07
# Comparing Models ----
anova(colmm, colmm2)
## Analysis of Variance Table
##
## Model 1: uptake ~ conc + Treatment
## Model 2: uptake ~ conc * Treatment
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6433.9
## 2 80 6402.0 1 31.871 0.3983 0.5298
# p-value is 0.53 so there is no significant difference
# between the variances in the model.
# The models have approximately the same fit.
AIC(colmm)
## [1] 610.8169
AIC(colmm2)
## [1] 612.3997
# Model without conc:Treatment is better (lower AIC).
# Use uptake ~ conc + Treatment as the model
BIC(colmm)
## [1] 620.5401
BIC(colmm2)
## [1] 624.5538
# Checking the residuals ----
plot(colmm$residuals ~ CO2$uptake)
# This looks really bad - there shouldn't be patterns!
qqnorm(colmm$residuals)
qqline(colmm$residuals)
hist(colmm$residuals)
# Conclusion: Our best model is still pretty bad.
# Challenge
data(mtcars)
plot(mpg ~ disp, data = mtcars)
plot(mpg ~ wt, data = mtcars)
boxplot(mpg ~ am, data = mtcars)
# fit a linear model and tell me about it.
In this section, I include some of the example code. The first part shows the effect of interaction in a linear model; using * instead of + in your formula (see the table in the previous section).
anova(aov(response ~ category)). ANOVA is secretly just linear models applied to a categorical predictor, so you can use lm() instead of aov. ANOVA can also be used to compare two different models, regardless of what kind of predictors there are.glm() and you must include a family argument. Logistic models are fit with family = binomial() and Poisson models are fit with family = Poisson(). To actually add the fitted model to a plot, use the predict() function. An example is given, and this example applies to any glm() model.step() function can be used to find the predictors which give the lowest AIC. This is an automated method, and relies on doing a lot of hypothesis tests.
step() function will do 40 or 50 of these tests for moderate sized data sets, which gives a 92.3% chance that one of the tests was falsely significant. Becareful with this. See the Wikipedia Multiple Comparisons artcle for more info.data(mtcars)
lm1 <- lm(mpg ~ disp + am, data = mtcars)
lm2 <- lm(mpg ~ disp * am, data = mtcars)
plot(mpg ~ disp, col = am + 1, data=mtcars, pch = 16)
coef(lm1)
## (Intercept) disp am
## 27.84808111 -0.03685086 1.83345825
coef(lm2)
## (Intercept) disp am disp:am
## 25.15706407 -0.02758360 7.70907298 -0.03145482
abline(a = 27.848 + 1.833, b = -0.03685, col = 1)
abline(a = 27.848, b = -0.03685, col = 2)
abline(a = 25.15706407,
b = -0.02758360,
col = 1, lty = 2)
abline(a = 25.15706407 + 7.70907298,
b = -0.02758360-0.03145482 ,
col = 2, lty = 2)
legend('topright', legend=c("With *", "With +"), lty = 2:1)
# ~. performs
mtcars.all <- lm (mpg ~ ., data = mtcars)
mtfinal <- step(mtcars.all)
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
plot(mtfinal$residuals ~ mtcars$mpg)
summary(lm(mpg ~ am, data = mtcars))
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
# Anova and linear models are the same!
anova(aov(mpg ~ am, data = mtcars))
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## am 1 405.15 405.15 16.86 0.000285 ***
## Residuals 30 720.90 24.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Silly examples
# Logistic Regression
plot(am ~ mpg, data = mtcars)
ampg <- glm(am ~ mpg, data = mtcars, family = binomial())
summary(ampg)
##
## Call:
## glm(formula = am ~ mpg, family = binomial(), data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5701 -0.7531 -0.4245 0.5866 2.0617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.6035 2.3514 -2.808 0.00498 **
## mpg 0.3070 0.1148 2.673 0.00751 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 29.675 on 30 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5
#?predict
newdata <- data.frame(mpg = seq(10,35,0.5))
ampg.pred <- predict(object = ampg, newdata = newdata,
type = 'response')
plot(am ~ mpg, data = mtcars)
lines(ampg.pred ~ newdata$mpg)
# Do not run:
# Repeated measures anova
anova(aov(response ~ predictor + Error(subject_id)))
?Error
ggplot2 is a relatively new plotting method that creates beautiful plots easily. I believe it is worth learning the base R functions for exploratory plotting, but look into ggplot2 for final reports. It makes a lot of plots very easy, but fine-tuning plots gets finnicky. If it’s really hard to do in ggplot2, then it’s probably not a good plot in the first place.
ggplot() function, which sets up the data. You add geoms to the data to make plots. You give the data plotting functions, rather than plotting data. It’s a subtle but powerful change in philosophy.#install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
#Base R
plot(mpg ~ disp, col = am + 1, data = mtcars)
## ggplot2
ggplot(data = mtcars,
mapping = aes(x = disp,
y = mpg,
colour = factor(am))) +
geom_point() + geom_smooth(method = "lm") +
theme_bw() + xlab("Displacement (Cu.in)") +
ylab("Miles Per Gallon") +
ggtitle(label = "MPG against displacement",
subtitle = "Manual cars are better")
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
geom_density(alpha = 0.5) + theme_minimal() +
facet_grid(~am)
RMarkdown is a fantastic tool for creating reports. You can use some plain English text and then switch to LaTeX code for equations and then add some R code that will be evaluated. The R code chunks can show the code, output, neither, or both. It can be used to make webpages, blogs, and applications in html, papers, books, and reports in pdf (requires LaTeX), and reports in docx (requires Word or LibreOffice).
Your code will be run each time you compile, so make sure the code can run on its own (you’ll need to have code chunks that import data and do all of the cleaning). The great thing about this is that if your data changes, your report changes too! You don’t have to re-make your plots and tables, they automatically update.
If you click the green plus in RStudio to open a new file, select RMarkdown. The default file has a nice, short overview to get you started. Fine-tuning reports can be fickle.
I obviously couldn’t cover everything. Here are some things to help you keep going with R:
dplyr and the tidyverse are a collection of packages written by Hadley Wickham. They are designed to make data manipulation easier for most tasks. Packages like forcats make it easier to deal with factor variables (forcats is an anagram of factors), lubridate makes it easier to deal with date/time variables, and the dplyr funcntions are made for manipulating dataframes. data.table is a package that makes R better at dealing with large data frames and has some unique syntax to go along with it.dplyr. This may be confusing at first. I believe base R is a little bit more useful and harder to learn, so when you come across dplyr it will be easier to adjust.Thanks for attending this workshop!