SIMPLE LINEAR REGRESSION

The basic syntax for linear regression in R is easy to follow: it uses the lm function. One of the nice things about R is that it codes categorical data as factors, whereas in SAS you would have to create dummy variables to incorporate categorical predictors into the model. Let’s use the ‘diamonds’ dataset that comes with the ggplot2 package.
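
In its most basic form, the call looks like this (a generic sketch; y, x1, x2, and mydata are placeholders for your own variables and data frame):

mymodel <- lm(y ~ x1 + x2, data = mydata)   # fit the model
summary(mymodel)                            # coefficients, R-squared, F-statistic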

Let’s say we want to create a diamond pricing model:

library(ggplot2)
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
linmodel <- lm(price ~ cut + color + clarity + (x*y*z),   # x*y*z expands to x + y + z plus all their interactions
            data = diamonds)
summary(linmodel)  # to show the results
## 
## Call:
## lm(formula = price ~ cut + color + clarity + (x * y * z), data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16260.7   -561.7   -161.7    371.2  11891.5 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 18191.875    341.204   53.317  < 2e-16 ***
## cut.L         271.273     19.987   13.572  < 2e-16 ***
## cut.Q        -111.783     17.566   -6.364 1.99e-10 ***
## cut.C         138.708     15.268    9.085  < 2e-16 ***
## cut^4          30.242     12.100    2.499   0.0124 *  
## color.L     -1934.049     17.115 -113.006  < 2e-16 ***
## color.Q      -659.710     15.596  -42.300  < 2e-16 ***
## color.C      -160.126     14.549  -11.006  < 2e-16 ***
## color^4        53.593     13.364    4.010 6.07e-05 ***
## color^5       -97.719     12.624   -7.741 1.01e-14 ***
## color^6       -53.904     11.475   -4.697 2.64e-06 ***
## clarity.L    3898.696     29.865  130.545  < 2e-16 ***
## clarity.Q   -1826.088     27.896  -65.461  < 2e-16 ***
## clarity.C     915.093     23.861   38.350  < 2e-16 ***
## clarity^4    -308.229     19.059  -16.172  < 2e-16 ***
## clarity^5     192.269     15.559   12.357  < 2e-16 ***
## clarity^6      18.024     13.548    1.330   0.1834    
## clarity^7      83.751     11.959    7.003 2.53e-12 ***
## x           -6437.213    116.440  -55.283  < 2e-16 ***
## y           -1913.808    120.064  -15.940  < 2e-16 ***
## z           -2541.423    173.117  -14.680  < 2e-16 ***
## x:y           893.877     16.756   53.347  < 2e-16 ***
## x:z          1119.344     19.762   56.640  < 2e-16 ***
## y:z          -238.581     27.662   -8.625  < 2e-16 ***
## x:y:z         -53.058      2.516  -21.092  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1117 on 53915 degrees of freedom
## Multiple R-squared:  0.9217, Adjusted R-squared:  0.9217 
## F-statistic: 2.645e+04 on 24 and 53915 DF,  p-value: < 2.2e-16
diamonds$predictedPrice <- fitted(linmodel) # to add predicted values to the dataset
diamonds$residuals <- residuals(linmodel,type="working") # to add residuals to the dataset
coef(linmodel) # extracts the linear model betas
## (Intercept)       cut.L       cut.Q       cut.C       cut^4     color.L 
## 18191.87475   271.27350  -111.78323   138.70807    30.24246 -1934.04867 
##     color.Q     color.C     color^4     color^5     color^6   clarity.L 
##  -659.71030  -160.12595    53.59269   -97.71936   -53.90434  3898.69644 
##   clarity.Q   clarity.C   clarity^4   clarity^5   clarity^6   clarity^7 
## -1826.08769   915.09269  -308.22927   192.26930    18.02413    83.75081 
##           x           y           z         x:y         x:z         y:z 
## -6437.21267 -1913.80784 -2541.42287   893.87663  1119.34423  -238.58136 
##       x:y:z 
##   -53.05786
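
A note on the cut.L, cut.Q, cut.C, ... terms above: cut, color, and clarity are stored as ordered factors, so R codes them with orthogonal polynomial contrasts (linear, quadratic, cubic, ...) rather than simple dummy variables. You can inspect the coding directly:

class(diamonds$cut)       # "ordered" "factor"
contrasts(diamonds$cut)   # the polynomial contrast matrix R applies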

We can create some diagnostic plots fairly easily:

layout(matrix(c(1,2,3,4), 2, 2))   # facet all four diagnostic plots on one page
plot(linmodel)
layout(1)                          # reset the layout so later plots fill the page

To check outliers and residuals, we will need these packages: car, MASS, and stats.
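
If car is not already installed, install it once first (MASS and stats ship with R):

install.packages("car")   # one-time install; afterwards load it with library(car)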

# Using the car package
library(car)
outlierTest(linmodel)                 # Bonferroni p-value for most extreme obs
##         rstudent unadjusted p-value Bonferonni p
## 48411 -30.535541        4.7756e-203  2.5760e-198
## 49557 -14.635577         2.0634e-48   1.1130e-43
## 49558 -14.635577         2.0634e-48   1.1130e-43
## 11964 -12.640322         1.4261e-36   7.6926e-32
## 27416 -12.136952         7.4497e-34   4.0183e-29
## 15952 -11.890628         1.4543e-32   7.8447e-28
## 16284 -11.638636         2.8578e-31   1.5415e-26
## 26496  10.668164         1.5246e-26   8.2236e-22
## 25999 -10.482954         1.0952e-25   5.9076e-21
## 24329  -9.579958         1.0103e-21   5.4497e-17
leveragePlots(linmodel)               # leverage plots
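
Cook's distance is another common way to flag influential observations; a minimal sketch using base R (the 4/n cutoff is just a popular rule of thumb, not part of the original analysis):

cooksd <- cooks.distance(linmodel)          # influence of each observation on the fit
plot(cooksd, type = "h", main = "Cook's Distance")
abline(h = 4/nrow(diamonds), col = "red")   # rule-of-thumb cutoff
which(cooksd > 4/nrow(diamonds))            # observations worth a closer look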

To check the residuals:

# distribution of studentized residuals using the MASS package
library(MASS)
sresid <- studres(linmodel) 
hist(sresid, freq=FALSE,xlim = c(-4,4), ylim = c(0,0.45), 
     main="Distribution of Studentized Residuals")

xfit <- seq(min(sresid), max(sresid), length = 40)   # grid over the residual range
yfit <- dnorm(xfit)                                  # standard normal density
lines(xfit, yfit)                                    # overlay on the histogram
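
A Q-Q plot of the same studentized residuals is a quick complementary normality check; points should hug the reference line:

qqnorm(sresid, main = "Q-Q Plot of Studentized Residuals")
qqline(sresid, col = "red")   # reference line for a normal distribution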

To check for multicollinearity with the car package:

library(car)     # for outlier and multicollinearity tests
vif(linmodel)    # variance inflation factor
##                GVIF Df GVIF^(1/(2*Df))
## cut        1.230904  4        1.026309
## color      1.176448  6        1.013634
## clarity    1.344080  7        1.021347
## x        738.218976  1       27.170185
## y        813.655109  1       28.524640
## z        645.799344  1       25.412582
## x:y     2212.622354  1       47.038520
## x:z     1159.291034  1       34.048363
## y:z     2390.947463  1       48.897315
## x:y:z   1676.385448  1       40.943686
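
The enormous GVIF values for x, y, z and their interactions are largely structural: the interaction terms are built from the main effects, so they are necessarily correlated with them. Centering the size variables before forming the interactions is one common remedy (a sketch; xc, yc, and zc are new centered copies, not part of the original data):

diamonds$xc <- diamonds$x - mean(diamonds$x)   # center each size variable at its mean
diamonds$yc <- diamonds$y - mean(diamonds$y)
diamonds$zc <- diamonds$z - mean(diamonds$z)
linmodel_c <- lm(price ~ cut + color + clarity + xc*yc*zc, data = diamonds)
vif(linmodel_c)   # the interaction GVIFs should drop substantially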

LOGISTIC (BINOMIAL) REGRESSION

In R, we can fit a logistic regression with the glm function by specifying the family = binomial option. For this example, let’s use the sample data that comes with the MASS package.
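
The general pattern looks like this (a sketch; y, x, and mydata are placeholders, and logit is the default link for the binomial family, so spelling it out is optional):

glm(y ~ x, data = mydata, family = binomial(link = "logit"))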

library(MASS)
data(menarche)
menarche
##      Age Total Menarche
## 1   9.21   376        0
## 2  10.21   200        0
## 3  10.58    93        0
## 4  10.83   120        2
## 5  11.08    90        2
## 6  11.33    88        5
## 7  11.58   105       10
## 8  11.83   111       17
## 9  12.08   100       16
## 10 12.33    93       29
## 11 12.58   100       39
## 12 12.83   108       51
## 13 13.08    99       47
## 14 13.33   106       67
## 15 13.58   105       81
## 16 13.83   117       88
## 17 14.08    98       79
## 18 14.33    97       90
## 19 14.58   120      113
## 20 14.83   102       95
## 21 15.08   122      117
## 22 15.33   111      107
## 23 15.58    94       92
## 24 15.83   114      112
## 25 17.58  1049     1049

The ‘menarche’ dataset contains the following variables: Age, the average age of each group of girls; Total, the total number of girls in each group; and Menarche, the number of girls in the group who have reached menarche.

This dataset lends itself nicely to a logistic regression if we choose Menarche/Total as the dependent variable. Notice the nice S-shaped (logit) curve:

library(ggplot2)
qplot(data = menarche, x = Age, y = Menarche/Total)

logmodel <- glm(Menarche/Total ~ Age,
                data = menarche, 
                family=binomial())
summary(logmodel)                    # to show the results
## 
## Call:
## glm(formula = Menarche/Total ~ Age, family = binomial(), data = menarche)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.20047  -0.08220  -0.04327   0.05982   0.13236  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -20.9117     8.1110  -2.578  0.00993 **
## Age           1.6082     0.6215   2.588  0.00967 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19.12138  on 24  degrees of freedom
## Residual deviance:  0.22143  on 23  degrees of freedom
## AIC: 12.314
## 
## Number of Fisher Scoring iterations: 6
confint(logmodel)                    # 95% CI for the coefficients
## Waiting for profiling to be done...
##                   2.5 %    97.5 %
## (Intercept) -42.5560759 -8.785764
## Age           0.6801044  3.268012
menarche$predicted <- fitted(logmodel)   # predicted probabilities (already on the response scale)
menarche$residual <- residuals(logmodel, type="deviance")  # residuals
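
One caveat: the call above models the proportion Menarche/Total without telling glm how many girls each proportion is based on. For grouped binomial data, the more standard specification passes the successes and failures via cbind (or, equivalently, uses weights = Total), so the standard errors reflect the group sizes; a minimal sketch:

logmodel_w <- glm(cbind(Menarche, Total - Menarche) ~ Age,   # successes, failures
                  data = menarche,
                  family = binomial())
summary(logmodel_w)   # similar coefficients; standard errors now account for group size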

And finally, a plot of the data with the fitted model:

ggplot(data = menarche, aes(x = Age, y = Menarche/Total)) +
    geom_point(color = 'blue', shape = 1, size = 3) +   # observed proportions
    geom_line(aes(y = predicted), color = 'red')        # fitted logistic curve
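
The fitted model can also be queried directly, for example to estimate the probability of menarche at a given age or to express the Age coefficient as an odds ratio:

predict(logmodel, newdata = data.frame(Age = 13), type = "response")   # predicted probability at age 13
exp(coef(logmodel)["Age"])   # odds ratio per one-year increase in age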