We're going to create a document that combines text, code, and output to record everything that we did in an analysis.
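This document was most likely written with knitr/R Markdown (an assumption on our part; the "##" prefixes on the output below are knitr's default). In that workflow, code lives in chunks like the following sketch, and each chunk's printed output is inserted beneath it when the document is rendered (the chunk label is illustrative):

```{r example-chunk}
# Code in a chunk is run at render time; its printed
# output appears below the chunk in the final document
summary(dat)
```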
First let's read in the data.
# Read in the data; x2 is a 0/1 grouping variable, so store it as a factor
dat <- read.csv("/Users/maryjane/hampshire/dataHampshire.csv")
dat$x2 <- as.factor(dat$x2)
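Before going further, it's worth a quick sanity check that the file was read in as expected (optional; base R):

# Inspect the structure and the first few rows of the data
str(dat)
head(dat)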
We now have a data set named “dat”.
Let's look at a summary of the data and some univariate plots.
# Remove the unneeded row-index column X before summarizing
dat$X <- NULL
summary(dat)
## y x1 x2
## Min. : 0.115 Min. :-0.207 0:25
## 1st Qu.: 2.918 1st Qu.: 2.330 1:25
## Median : 5.189 Median : 4.826
## Mean : 4.768 Mean : 5.047
## 3rd Qu.: 6.239 3rd Qu.: 7.061
## Max. :10.797 Max. :12.416
# Boxplots of the continuous variables; the columns must be referenced
# through dat, since y and x1 do not exist as standalone objects
boxplot(dat$y, dat$x1, names = c("y", "x1"))
Now let's look at some bivariate relationships between the variables.
# Scatterplot matrix
plot(dat)
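As an aside, calling plot() on a data frame dispatches to pairs(), so the same scatterplot matrix can be requested directly:

# Equivalent call: pairs() is what plot() uses for data frames
pairs(dat)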
Now we will build some linear models.
Let's first fit the simple linear regression model \( y_i = \beta_{0} + \beta_{1} x_{i1} + \epsilon_i, \quad i = 1, \dots, n \).
lmOut <- lm(y ~ x1, data = dat)
summary(lmOut)
##
## Call:
## lm(formula = y ~ x1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.772 -2.112 0.059 1.384 6.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.984 0.753 7.94 2.7e-10 ***
## x1 -0.241 0.129 -1.86 0.069 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 48 degrees of freedom
## Multiple R-squared: 0.0673, Adjusted R-squared: 0.0479
## F-statistic: 3.46 on 1 and 48 DF, p-value: 0.0688
# Scatterplot of y against x1, with the fitted regression line overlaid
plot(dat$x1, dat$y, pch = 16)
points(dat$x1, lmOut$fitted.values, type = "l", col = "red")
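For a simple linear regression, the fitted line can also be added in one step, since abline() accepts an lm fit (an equivalent base R shortcut):

# Draw the regression line directly from the fitted model
abline(lmOut, col = "red")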
The p-value associated with \( \widehat{\beta}_1 \) is 0.0688, so at the 0.1 level we could conclude that there is a weak negative relationship between \( y \) and \( x_1 \).
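To complement the p-value, we can also report a confidence interval for \( \beta_1 \) (base R; this output is not part of the original document):

# 95% confidence intervals for the intercept and slope
confint(lmOut, level = 0.95)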
Let's look at that scatterplot matrix again, but now we'll use colors to distinguish the two groups.
# Scatterplot matrix, colored by group: red for x2 = 0, blue for x2 = 1
col <- rep("red", nrow(dat))
col[dat$x2 == 1] <- "blue"
plot(dat, col = col, pch = 16)
It actually looks like the relationship between \( y \) and \( x_1 \) is positive, but that there are two distinct groups. Let's now fit a multiple linear regression model with \( x_2 \) added as a predictor.
lmOut <- lm(y ~ x1 + x2, data = dat)
summary(lmOut)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4441 -0.9430 0.0231 0.8579 1.3153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2148 0.6160 -6.84 1.4e-08 ***
## x1 0.9450 0.0791 11.94 7.8e-16 ***
## x21 8.4275 0.4590 18.36 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.938 on 47 degrees of freedom
## Multiple R-squared: 0.886, Adjusted R-squared: 0.881
## F-statistic: 182 on 2 and 47 DF, p-value: <2e-16
# Scatterplot of y against x1, with a fitted line for each group
plot(dat$x1, dat$y, pch = 16)
points(dat$x1[dat$x2 == 1], lmOut$fitted.values[dat$x2 == 1], type = "l",
       col = "blue", lwd = 3)
points(dat$x1[dat$x2 == 0], lmOut$fitted.values[dat$x2 == 0], type = "l",
       col = "red", lwd = 3)
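Because the model is additive, the two fitted lines are parallel, and an equivalent way to draw them is directly from the coefficients (names taken from the summary above):

cf <- coef(lmOut)
# Red line for the x2 = 0 group; the x21 coefficient shifts the blue line up
abline(a = cf["(Intercept)"], b = cf["x1"], col = "red", lwd = 3)
abline(a = cf["(Intercept)"] + cf["x21"], b = cf["x1"], col = "blue", lwd = 3)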
The p-value associated with \( \widehat{\beta}_1 \) is \( 7.7671 \times 10^{-16} \) and the p-value associated with \( \widehat{\beta}_2 \) is \( 4.4704 \times 10^{-23} \). Both are highly significant.
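The printed summary truncates very small p-values to < 2e-16; the exact values quoted above can be extracted from the summary object (base R):

# The fourth column of the coefficient table holds the p-values
coef(summary(lmOut))[, "Pr(>|t|)"]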
Now let's add some diagnostic plots to show that we are checking our assumptions.
# Boxplot of the residuals to check for outliers
boxplot(lmOut$residuals)
# Plot fitted values versus residuals to check for homoskedasticity
plot(lmOut$fitted.values, lmOut$residuals, pch = 16)
abline(h = 0)
# Plot the predictor x1 versus residuals to check for homoskedasticity
plot(dat$x1, lmOut$residuals, pch = 16)
abline(h = 0)
# Check the residuals for normality with a normal Q-Q plot
qqnorm(lmOut$residuals)
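A reference line makes the normal Q-Q plot much easier to judge (a base R addition not in the original code):

# Line that the points should follow if the residuals are normal
qqline(lmOut$residuals)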
# Influence plot
# The car package must be installed before it can be loaded:
# install.packages("car")
library(car)
influencePlot(lmOut)
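If installing car is not an option, base R offers a comparable diagnostic (an alternative to influencePlot(), not part of the original analysis):

# Residuals vs. leverage, with Cook's distance contours
plot(lmOut, which = 5)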