The basic syntax for linear regression in R is straightforward: it uses the lm() function. One of the nice things about R is that categorical variables can be coded as factors, which the modeling functions handle automatically; in SAS, by contrast, you would have to create dummy variables yourself before incorporating them into the model. Let's use the 'diamonds' dataset that ships with the ggplot2 package.
Suppose we want to build a diamond pricing model.
library(ggplot2)
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
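Because cut, color, and clarity are stored as ordered factors, lm() builds the contrast columns for us; the .L, .Q, and .C terms in the summary below are the polynomial contrasts R uses for ordered factors. A quick check of the coding:
str(diamonds[, c("cut", "color", "clarity")]) # all three are ordered factors
head(model.matrix(~ cut, data = diamonds))    # the dummy/contrast columns lm() will generate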
linmodel <- lm(price ~ cut + color + clarity + (x * y * z), # x*y*z expands to x, y, z and all their interactions
               data = diamonds)
summary(linmodel) # to show the results
##
## Call:
## lm(formula = price ~ cut + color + clarity + (x * y * z), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16260.7 -561.7 -161.7 371.2 11891.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18191.875 341.204 53.317 < 2e-16 ***
## cut.L 271.273 19.987 13.572 < 2e-16 ***
## cut.Q -111.783 17.566 -6.364 1.99e-10 ***
## cut.C 138.708 15.268 9.085 < 2e-16 ***
## cut^4 30.242 12.100 2.499 0.0124 *
## color.L -1934.049 17.115 -113.006 < 2e-16 ***
## color.Q -659.710 15.596 -42.300 < 2e-16 ***
## color.C -160.126 14.549 -11.006 < 2e-16 ***
## color^4 53.593 13.364 4.010 6.07e-05 ***
## color^5 -97.719 12.624 -7.741 1.01e-14 ***
## color^6 -53.904 11.475 -4.697 2.64e-06 ***
## clarity.L 3898.696 29.865 130.545 < 2e-16 ***
## clarity.Q -1826.088 27.896 -65.461 < 2e-16 ***
## clarity.C 915.093 23.861 38.350 < 2e-16 ***
## clarity^4 -308.229 19.059 -16.172 < 2e-16 ***
## clarity^5 192.269 15.559 12.357 < 2e-16 ***
## clarity^6 18.024 13.548 1.330 0.1834
## clarity^7 83.751 11.959 7.003 2.53e-12 ***
## x -6437.213 116.440 -55.283 < 2e-16 ***
## y -1913.808 120.064 -15.940 < 2e-16 ***
## z -2541.423 173.117 -14.680 < 2e-16 ***
## x:y 893.877 16.756 53.347 < 2e-16 ***
## x:z 1119.344 19.762 56.640 < 2e-16 ***
## y:z -238.581 27.662 -8.625 < 2e-16 ***
## x:y:z -53.058 2.516 -21.092 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1117 on 53915 degrees of freedom
## Multiple R-squared: 0.9217, Adjusted R-squared: 0.9217
## F-statistic: 2.645e+04 on 24 and 53915 DF, p-value: < 2.2e-16
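Individual pieces of the summary can also be pulled out programmatically:
summary(linmodel)$r.squared     # multiple R-squared
summary(linmodel)$adj.r.squared # adjusted R-squared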
diamonds$predictedPrice <- fitted(linmodel)                 # add predicted values to the dataset
diamonds$residuals <- residuals(linmodel, type = "working") # add residuals to the dataset
coef(linmodel)                                              # extract the linear model betas
## (Intercept) cut.L cut.Q cut.C cut^4 color.L
## 18191.87475 271.27350 -111.78323 138.70807 30.24246 -1934.04867
## color.Q color.C color^4 color^5 color^6 clarity.L
## -659.71030 -160.12595 53.59269 -97.71936 -53.90434 3898.69644
## clarity.Q clarity.C clarity^4 clarity^5 clarity^6 clarity^7
## -1826.08769 915.09269 -308.22927 192.26930 18.02413 83.75081
## x y z x:y x:z y:z
## -6437.21267 -1913.80784 -2541.42287 893.87663 1119.34423 -238.58136
## x:y:z
## -53.05786
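With the model in hand, predict() scores new observations from a data frame with the same predictor columns; as a minimal sketch, we can reuse the first few rows of diamonds as stand-in "new" data:
predict(linmodel, newdata = head(diamonds), interval = "prediction") # predictions with prediction intervals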
We can create some diagnostic plots fairly easily:
layout(matrix(c(1, 2, 3, 4), 2, 2)) # arrange all four diagnostic plots on one page
plot(linmodel)
For checking outliers and residuals, we will need these packages: car, MASS, and stats (stats is loaded automatically with base R).
# Using the car package
library(car)
outlierTest(linmodel) # Bonferroni p-values for the most extreme observations
## rstudent unadjusted p-value Bonferroni p
## 48411 -30.535541 4.7756e-203 2.5760e-198
## 49557 -14.635577 2.0634e-48 1.1130e-43
## 49558 -14.635577 2.0634e-48 1.1130e-43
## 11964 -12.640322 1.4261e-36 7.6926e-32
## 27416 -12.136952 7.4497e-34 4.0183e-29
## 15952 -11.890628 1.4543e-32 7.8447e-28
## 16284 -11.638636 2.8578e-31 1.5415e-26
## 26496 10.668164 1.5246e-26 8.2236e-22
## 25999 -10.482954 1.0952e-25 5.9076e-21
## 24329 -9.579958 1.0103e-21 5.4497e-17
leveragePlots(linmodel) # leverage plots
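Cook's distance is another standard influence diagnostic; a minimal sketch using base R's stats functions, with the common 4/n rule-of-thumb cutoff:
cooksd <- cooks.distance(linmodel)    # influence of each observation on the fitted model
plot(cooksd, type = "h", main = "Cook's Distance")
abline(h = 4/length(cooksd), lty = 2) # rule-of-thumb cutoff for flagging influential points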
For checking the residuals:
# distribution of studentized residuals using the MASS package
library(MASS)
sresid <- studres(linmodel)
hist(sresid, freq=FALSE,xlim = c(-4,4), ylim = c(0,0.45),
main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit) # standard normal density for comparison
lines(xfit, yfit)
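The car package also provides a Q-Q plot of the studentized residuals with a confidence envelope, which can be easier to judge than a histogram:
qqPlot(linmodel, main = "QQ Plot") # points outside the envelope suggest departures from normality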
To check for multicollinearity with the car package:
library(car) # for outlier and multicollinearity tests
vif(linmodel) # variance inflation factor
## GVIF Df GVIF^(1/(2*Df))
## cut 1.230904 4 1.026309
## color 1.176448 6 1.013634
## clarity 1.344080 7 1.021347
## x 738.218976 1 27.170185
## y 813.655109 1 28.524640
## z 645.799344 1 25.412582
## x:y 2212.622354 1 47.038520
## x:z 1159.291034 1 34.048363
## y:z 2390.947463 1 48.897315
## x:y:z 1676.385448 1 40.943686
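The very large GVIF values for x, y, z, and their interactions are expected: interaction terms are highly correlated with their component main effects. Mean-centering the size variables before forming the interactions typically reduces this; a sketch:
diamonds_c <- transform(diamonds,
                        x = x - mean(x), y = y - mean(y), z = z - mean(z)) # center the size variables
linmodel_c <- lm(price ~ cut + color + clarity + x * y * z, data = diamonds_c)
vif(linmodel_c) # the interaction GVIFs should drop considerably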
In R, we can run a logistic regression with the glm() function by specifying the family = binomial option. For this example, let's use the sample 'menarche' dataset that comes with the MASS package.
library(MASS)
data(menarche)
menarche # print the full dataset
## Age Total Menarche
## 1 9.21 376 0
## 2 10.21 200 0
## 3 10.58 93 0
## 4 10.83 120 2
## 5 11.08 90 2
## 6 11.33 88 5
## 7 11.58 105 10
## 8 11.83 111 17
## 9 12.08 100 16
## 10 12.33 93 29
## 11 12.58 100 39
## 12 12.83 108 51
## 13 13.08 99 47
## 14 13.33 106 67
## 15 13.58 105 81
## 16 13.83 117 88
## 17 14.08 98 79
## 18 14.33 97 90
## 19 14.58 120 113
## 20 14.83 102 95
## 21 15.08 122 117
## 22 15.33 111 107
## 23 15.58 94 92
## 24 15.83 114 112
## 25 17.58 1049 1049
The menarche dataset contains the following variables: Age, the average age of each group of girls; Total, the total number of girls in each group; and Menarche, the number of girls in the group who have reached menarche.
This dataset lends itself nicely to a logistic regression (if we choose Menarche/Total as the dependent variable). Notice the characteristic S-shaped logistic curve:
library(ggplot2)
qplot(data = menarche, x = Age, y = Menarche/Total)
logmodel <- glm(Menarche/Total ~ Age,
                data = menarche,
                family = binomial()) # R will warn about non-integer successes here; see the weighted alternative below
summary(logmodel) # to show the results
##
## Call:
## glm(formula = Menarche/Total ~ Age, family = binomial(), data = menarche)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.20047 -0.08220 -0.04327 0.05982 0.13236
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.9117 8.1110 -2.578 0.00993 **
## Age 1.6082 0.6215 2.588 0.00967 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19.12138 on 24 degrees of freedom
## Residual deviance: 0.22143 on 23 degrees of freedom
## AIC: 12.314
##
## Number of Fisher Scoring iterations: 6
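As noted above, fitting the raw proportions makes R warn about non-integer successes and gives every age group equal weight regardless of size. The more conventional call weights each proportion by its group total, or equivalently passes successes and failures as a two-column response; a sketch (its coefficients and deviances will differ from the unweighted fit shown above):
logmodel_w <- glm(Menarche/Total ~ Age, data = menarche,
                  family = binomial, weights = Total) # weight each proportion by group size
logmodel_cb <- glm(cbind(Menarche, Total - Menarche) ~ Age,
                   data = menarche, family = binomial) # equivalent successes/failures form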
confint(logmodel) # 95% CI for the coefficients
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -42.5560759 -8.785764
## Age 0.6801044 3.268012
menarche$predicted <- fitted(logmodel)                      # fitted values, already on the response (probability) scale
menarche$residual <- residuals(logmodel, type = "deviance") # deviance residuals
And finally, we plot the fitted curve over the data:
ggplot(data = menarche, aes(x = Age, y = Menarche/Total)) +
  geom_point(color = 'blue', shape = 1, size = 3) +
  geom_line(aes(y = predicted), color = 'red') # color set outside aes(), so no spurious legend is created
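A natural use of the fitted model is estimating the median age at menarche, the age at which the predicted probability reaches 0.5; on the logit scale this is -intercept/slope, and MASS's dose.p() gives the same estimate with a standard error:
-coef(logmodel)[1] / coef(logmodel)[2] # age where the predicted probability equals 0.5
dose.p(logmodel, p = 0.5)              # same estimate, with a standard error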