R Markdown Example

We're going to create a document that combines text, code, and output to record everything that we did in an analysis.

First let's read in the data.

dat <- read.csv("/Users/maryjane/hampshire/dataHampshire.csv")
dat$x2 <- as.factor(dat$x2)

We now have a data set named “dat”.

Let's look at a summary of the data and some univariate plots.

dat$X <- NULL
summary(dat)
##        y                x1         x2    
##  Min.   : 0.115   Min.   :-0.207   0:25  
##  1st Qu.: 2.918   1st Qu.: 2.330   1:25  
##  Median : 5.189   Median : 4.826         
##  Mean   : 4.768   Mean   : 5.047         
##  3rd Qu.: 6.239   3rd Qu.: 7.061         
##  Max.   :10.797   Max.   :12.416
# boxplots of continuous variables.
boxplot(y, x1, names = c("y", "x1"))
## Error: object 'y' not found

Now let's look at some bivariate relationships between the variables?

# Scatterplot matrix
plot(dat)

plot of chunk unnamed-chunk-3

Now we will build some linear models.

Let's first fit the model \( y_i = \beta_{0} + \beta_{1} x_i + \epsilon_i, i=1,\dots,n \)

lmOut <- lm(y ~ x1, data = dat)
summary(lmOut)
## 
## Call:
## lm(formula = y ~ x1, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.772 -2.112  0.059  1.384  6.249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.984      0.753    7.94  2.7e-10 ***
## x1            -0.241      0.129   -1.86    0.069 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 48 degrees of freedom
## Multiple R-squared:  0.0673, Adjusted R-squared:  0.0479 
## F-statistic: 3.46 on 1 and 48 DF,  p-value: 0.0688
plot(dat$x1, dat$y, pch = 16)
points(dat$x1, lmOut$fitted.values, type = "l", col = "red")

plot of chunk unnamed-chunk-4

The p-value associated with \( \widehat{\beta}_1 \) is 0.0688. So we could conclude that there is a negative relationship between \( y \) and \( x_1 \).

Let's look at that scatterplot matrix again, but now we'll use colors to distinguish the two groups.

# Scatterplot matrix
col <- rep("red", 50)
col[dat$x2 == 1] <- "blue"
plot(dat, col = col, pch = 16)

plot of chunk unnamed-chunk-5

It actually looks like the relationship between \( y \) and \( x_1 \) is positive, but that there are two distinct groups. Let's now fit a multiple linear regression model with \( x_2 \) added to the model.

lmOut <- lm(y ~ x1 + x2, data = dat)
summary(lmOut)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4441 -0.9430  0.0231  0.8579  1.3153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.2148     0.6160   -6.84  1.4e-08 ***
## x1            0.9450     0.0791   11.94  7.8e-16 ***
## x21           8.4275     0.4590   18.36  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.938 on 47 degrees of freedom
## Multiple R-squared:  0.886,  Adjusted R-squared:  0.881 
## F-statistic:  182 on 2 and 47 DF,  p-value: <2e-16
plot(dat$x1, dat$y, pch = 16)
points(dat$x1[dat$x2 == 1], lmOut$fitted.values[dat$x2 == 1], type = "l", col = "blue", 
    lwd = 3)
points(dat$x1[dat$x2 == 0], lmOut$fitted.values[dat$x2 == 0], type = "l", col = "red", 
    lwd = 3)

plot of chunk unnamed-chunk-6

The p-value associated with \( \widehat{\beta}_1 \) is 7.7671 × 10-16 and the p-value associated with \( \widehat{\beta}_2 \) is 4.4704 × 10-23. Both are highly significant.

Now let's add some diagnostic plots to show that we are checking out assumptions.

# boxplot to check for outliers
boxplot(lmOut$residuals)

plot of chunk unnamed-chunk-7

# Plot fitted values versus residual to check for homoskedasticity
plot(lmOut$fitted.values, lmOut$residuals, pch = 16)
abline(h = 0)

plot of chunk unnamed-chunk-7

# Plot predictors versus residual to check for homoskedasticity
plot(dat$x1, lmOut$residuals, pch = 16)
abline(h = 0)

plot of chunk unnamed-chunk-7

# check for normality
qqnorm(lmOut$residuals)

plot of chunk unnamed-chunk-7

# Influence plot
library(car)
## Error: there is no package called 'car'
influencePlot(lmOut)
## Error: could not find function "influencePlot"