Applied Regression Analysis: Project 2

Applied Regression Analysis: Project 2 Multiple Linear Regression

Ali Svoboda

RPI

3/19/15

1. Dataset Selection

The data used in this analysis were found in the Data and Story Library (DASL), which was listed as one of the 100 Interesting Datasets.

The code below reads in the dataset, saves and attaches the variable names, and displays the set. It also orders the data by price for purposes of running future code:

houses <- read.csv("~/1.RENSSELAER POLYTECHNIC INSTITUTE/a- Senior Spring/Applied Regression Analysis/houses.csv")

data <-houses[order(houses$PRICE),]
attach(data)

head(data)

##    PRICE SQFT AGE FEATS
## 79   540 1142   0     0
## 84   580 1051  15     2
## 71   619  837   0     2
## 83   660 1159   0     0
## 42   670 1181   0     4
## 70   670 1350   0     2

View(data)
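As a side note, the code in this report relies on attach(data) so that columns can be referenced by bare name. A minimal sketch of an alternative, which passes the data frame explicitly instead, is shown below. It uses only the six rows displayed by head(data) above, typed in by hand; the object names first_six, sorted, mean_price, and fit are assumptions made for this illustration, and a fit on six rows is shown only for the syntax, not for its results.

```r
# The six rows displayed by head(data) above, typed in by hand
first_six <- data.frame(
  PRICE = c(540, 580, 619, 660, 670, 670),
  SQFT  = c(1142, 1051, 837, 1159, 1181, 1350),
  AGE   = c(0, 15, 0, 0, 0, 0),
  FEATS = c(0, 2, 2, 0, 4, 2)
)

# Sort by price without attaching anything
sorted <- first_six[order(first_six$PRICE), ]

# Refer to columns through the data frame rather than the search path
mean_price <- with(sorted, mean(PRICE))

# Model functions accept the data frame through the data argument
fit <- lm(PRICE ~ SQFT, data = sorted)
coef(fit)
```

Passing data explicitly avoids surprises when another object in the workspace happens to share a column's name.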

Description of Dataset

The dataset contains 107 observations of 4 continuous variables:

PRICE = Selling price of the home (hundreds of dollars)
SQFT = Square feet of living space
AGE = Age of home (years)
FEATS = Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access)

For this project, the dependent/response variable will be the house price and the independent variables will be square footage, age of the home, and number of features the home has. Although “feats” can take only a small number of integer values (0 through 11) and there is a list of items that correspond to the features, it is still used as a continuous variable in this analysis. The experiment is set up this way because if a house has a feature value of 2, there are many different combinations of features this could represent (e.g. a compactor and dryer, handicap fit and refrigerator, skylight and dishwasher, etc.). The goal of including features in the model to predict price is to see if the sheer number of features included can explain the price of a home.

The goal is to see if the house price can be explained at all by the square footage, age, and features of the home.

Below is a summary of the data:

summary(data)

##      PRICE           SQFT           AGE            FEATS     
##  Min.   : 540   Min.   : 837   Min.   : 0.00   Min.   :0.00  
##  1st Qu.: 815   1st Qu.:1290   1st Qu.: 0.00   1st Qu.:3.00  
##  Median : 975   Median :1565   Median : 4.00   Median :4.00  
##  Mean   :1077   Mean   :1667   Mean   : 9.33   Mean   :3.53  
##  3rd Qu.:1190   3rd Qu.:1897   3rd Qu.:15.00   3rd Qu.:4.00  
##  Max.   :2150   Max.   :3750   Max.   :53.00   Max.   :8.00

Initial Plots

Below we create a boxplot of PRICE to examine its distribution and outliers:

boxplot(PRICE, main="Boxplot of House Prices")

[Figure: Boxplot of House Prices]

As you can see, there are some outliers at the top end of the price range.

Now we will look at boxplots of the price broken up by each of the independent variables as another initial way to examine the data:

boxplot(PRICE~SQFT, main="Boxplot of PRICE vs SQFT", xlab="Square footage", ylab="PRICE")

[Figure: Boxplot of PRICE vs SQFT]

boxplot(PRICE~AGE, main="Boxplot of PRICE vs AGE", xlab="Age of House", ylab="PRICE")

[Figure: Boxplot of PRICE vs AGE]

boxplot(PRICE~FEATS, main="Boxplot of PRICE vs FEATS", xlab="Number of Features", ylab="PRICE")

[Figure: Boxplot of PRICE vs FEATS]

By SQFT: There are no outliers, but as you can see from the individual lines, many square footage values correspond to only a single house, so most of the boxes collapse to a line. The plot also shows a general trend of price increasing with square footage.

By AGE: Grouping by age does reveal some outliers. For houses less than a year old (age=0) and 6 year old houses, there are a couple of observations that are outliers because they are more expensive than the others. For this independent variable, there is not a trend that jumps out from the boxplot.

By FEATS: Again, there are a few outliers within some of the feature counts. As we saw with square footage, price seems to increase as the number of features increases.

Since there are only a few outliers relative to the total number of observations, we will not remove them, so as not to manipulate the data. It is important to keep in mind that there were outliers, however, when it is time to interpret the model and results.
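For reference, the points that boxplot() draws individually are those lying more than 1.5 times the spread between the hinges beyond either hinge (the default setting). The sketch below shows how such points could be listed with boxplot.stats(). The vector prices is a made-up example in hundreds of dollars, not the DASL data.

```r
# Hypothetical prices (hundreds of dollars), invented for illustration
prices <- c(540, 670, 815, 900, 975, 1050, 1190, 1300, 2150)

# boxplot.stats() applies the same 1.5 x hinge-spread rule as boxplot()
stats <- boxplot.stats(prices)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond the whiskers, drawn as individual points
```

With these made-up values only the largest price, 2150, falls beyond the upper whisker.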

2. Building Multiple Linear Regression Models

Null Hypothesis

The null hypothesis is that the variation in the dependent variable, price, cannot be explained by anything other than randomization. In other words, for the model described below, the variation in square footage, age, or features of the home cannot explain the variation in price.

Examine the scatterplot matrix for relationships between Price and the independent variables:

plot(data)

[Figure: Scatterplot matrix of PRICE, SQFT, AGE, and FEATS]

There does seem to be a correlation between price and square footage, and it is expected that the correlation will be high in the matrix below. There also appears to be correlation between price and features, although it is harder to see since only integer values of FEATS are held in the dataset. The relationship is harder to see between age and price, so we will pay close attention in the further analysis.

From the plot, there is no evident correlation between the independent variables, so suppression may not be an issue.

2.1 Building an Entry-wise Model

To build an entry-wise model, all the independent variables are included at once. So we would build a model with sqft, age, and features to try to explain the difference in price.

Entry-Wise Model

ewmodel<-lm(PRICE~SQFT+FEATS+AGE)
summary(ewmodel)

## 
## Call:
## lm(formula = PRICE ~ SQFT + FEATS + AGE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1003.0   -83.0    -8.5    63.8   795.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.3819    71.4364    0.05    0.962    
## SQFT          0.5749     0.0408   14.10   <2e-16 ***
## FEATS        34.7173    15.5911    2.23    0.028 *  
## AGE          -0.7641     1.6142   -0.47    0.637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 205 on 103 degrees of freedom
## Multiple R-squared:  0.724,  Adjusted R-squared:  0.716 
## F-statistic:   90 on 3 and 103 DF,  p-value: <2e-16

Interpretation

Square footage, age, and number of features of a house together explain 71.57% of the variation in price, as evident from the adjusted R^2 value. However, when we look at the p-values for each independent variable, Age has a value greater than .05, suggesting that it does not help explain the variation in house price. For the independent variable Age, we fail to reject the null hypothesis.
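As a check on the summary output, the adjusted R^2 and the overall F-statistic can both be recovered from the multiple R^2, the number of observations, and the number of predictors. The sketch below applies the standard formulas to the rounded values printed above; the helper names adj_r2 and f_stat are introduced here for illustration only.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic from the same three quantities
f_stat <- function(r2, n, p) (r2 / p) / ((1 - r2) / (n - p - 1))

# Entry-wise model: R^2 = 0.724, n = 107 houses, p = 3 predictors
round(adj_r2(0.724, n = 107, p = 3), 3)  # 0.716
round(f_stat(0.724, n = 107, p = 3), 1)  # 90.1
```

Both results agree with the printed summary to the precision shown there. Because the adjustment penalizes each additional predictor, adjusted R^2 is the fairer quantity to compare across models with different numbers of variables.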

Next, a model for house price will be built using different methods to see if a higher R^2 value can be achieved (a better model).

Note, expanded analysis, including plots, regression lines, and confidence intervals will be conducted once the best model is identified from the summary results.

2.2 Building a Hierarchical Model

To build a hierarchical model, the independent variables are added one at a time in a theoretical manner. The variable that is believed to be the biggest predictor, or most likely to explain the variation in the dependent variable, is added first, followed by the second and then the third. Each time we add a variable, we examine the model output and interpret the p-value of the variable and the R^2 value of the model.

For this data, we will first add square footage, because this is most often referred to when house hunting:

Hierarchical Model 1: Square Foot

model_sqft<-lm(PRICE~SQFT)
summary(model_sqft)

## 
## Call:
## lm(formula = PRICE ~ SQFT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1050.9   -93.5     1.4    60.6   749.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   61.823     66.392    0.93     0.35    
## SQFT           0.609      0.038   16.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208 on 105 degrees of freedom
## Multiple R-squared:  0.71,   Adjusted R-squared:  0.708 
## F-statistic:  258 on 1 and 105 DF,  p-value: <2e-16

Interpretation

For this model, we reject the null hypothesis that the variation in Price cannot be explained by anything other than randomization. Square footage is a significant predictor of price, with a p-value of less than 2e-16. Square footage alone also explains 70.76% of the variation in price! The Entry-Wise model above, which in addition to square footage included age and features, explained less than one percentage point more than this model, meaning that adding both age and features doesn’t make a much better model than square footage alone.

Since square footage alone explains such a significant amount of the variation in house price, a scatterplot and regression line are plotted below:

Scatterplot With Regression Line for Square Footage Model

plot(SQFT,PRICE, main="Price vs Square Footage (with regression line)", xlab= "SQFT", ylab="PRICE", col="blue", pch=18)
abline(model_sqft$coef, lwd=2, col="dark blue")

[Figure: Price vs Square Footage (with regression line)]

The regression line fits the data well. The fit is better for the houses that are about 2000 sqft and under. This is likely because there are simply not as many observations in the dataset of houses larger than 2000 sqft.

To continue the hierarchical model building process, we will now add age to the model to see how square footage and age together predict price:

Hierarchical Model 2: Square Foot and Age

model_sqft_age<-lm(PRICE~SQFT+AGE)
summary(model_sqft_age)

## 
## Call:
## lm(formula = PRICE ~ SQFT + AGE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1056.1   -95.7     0.0    59.3   754.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.8772    67.3142    0.95     0.34    
## SQFT          0.6099     0.0383   15.92   <2e-16 ***
## AGE          -0.3683     1.6346   -0.23     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209 on 104 degrees of freedom
## Multiple R-squared:  0.71,   Adjusted R-squared:  0.705 
## F-statistic:  128 on 2 and 104 DF,  p-value: <2e-16

Interpretation

Previously, in the entry-wise model, we included Age and it turned out not to be a significant predictor. We still include it here because this is a slightly different model, and in some cases an independent variable can be significant in one model and not in another. That is not the case this time. Square footage once again returned a small p-value and we reject the null, but for age, we fail to reject the null. Age is not adding any value to the model. The adjusted R^2 value actually decreased, by only a fraction of a percent (0.708 to 0.705), indicating that the little variation age explains does not offset the penalty for adding another variable.
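One way to formalize the comparison between a model and the same model with one more variable is a partial F-test on the two nested fits, which anova() provides. The sketch below is not part of the original analysis and runs on simulated data: the data frame sim, its coefficients, and the noise level are all assumptions chosen so that price depends on square footage but not on age.

```r
set.seed(1)
n <- 107

# Simulated stand-in for the houses data (not the DASL values)
sim <- data.frame(
  SQFT = runif(n, 800, 3800),
  AGE  = sample(0:50, n, replace = TRUE)
)
# By construction, price depends on SQFT only, plus noise
sim$PRICE <- 60 + 0.6 * sim$SQFT + rnorm(n, sd = 200)

small <- lm(PRICE ~ SQFT, data = sim)
large <- lm(PRICE ~ SQFT + AGE, data = sim)

# Partial F-test: does adding AGE reduce the residual sum of squares
# by more than chance alone would?
cmp <- anova(small, large)
cmp

# Multiple R^2 cannot decrease when a predictor is added; adjusted R^2 can
c(small = summary(small)$r.squared,     large = summary(large)$r.squared)
c(small = summary(small)$adj.r.squared, large = summary(large)$adj.r.squared)
```

When only one variable is added, the p-value of this F-test equals the p-value of that variable's t-test in the larger model, so the two views of the question agree.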

The next step in the hierarchical model building process would be to add the 3rd variable, features:

Hierarchical Model 3: All 3 Variables

This model has already been run and interpreted in the Entry-Wise Model above. As a reminder, its adjusted R^2 value was slightly higher than those of the two models in this section, indicating that features may assist square footage in predicting price.

A model with square footage and features will be created in the Step-Wise section below.

2.3 Building the Step-Wise Model

To build a step-wise model, we add variables one at a time based on their correlation with the dependent variable. We add the most correlated independent variable first, then continue to add the other variables as long as they contribute to explaining the variation in the dependent variable (viewed through the R^2 value).

To determine the order, we build a correlation matrix:

cor(data)

##         PRICE    SQFT     AGE  FEATS
## PRICE 1.00000 0.84282 0.06905 0.4363
## SQFT  0.84282 1.00000 0.09597 0.3941
## AGE   0.06905 0.09597 1.00000 0.1386
## FEATS 0.43629 0.39410 0.13856 1.0000

Square footage is the most correlated with price (.84), followed by features (.44) and then by age (.07).
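The ordering can also be read off programmatically. The short sketch below copies the PRICE row of the correlation matrix above into a named vector and ranks the predictors by absolute correlation; the names cor_with_price and entry_order are introduced here for illustration.

```r
# Correlations of each predictor with PRICE, copied from the matrix above
cor_with_price <- c(SQFT = 0.84282, AGE = 0.06905, FEATS = 0.43629)

# Rank predictors by absolute correlation, strongest first
entry_order <- names(sort(abs(cor_with_price), decreasing = TRUE))
entry_order
```

Taking the absolute value matters in general, since a strongly negative correlation is as useful for prediction as a strongly positive one, although here all three correlations are positive.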

Therefore, we would first create a model with just square-footage (this model was created above).

Next, a model with square footage and features is constructed:

Step-Wise Model: Square Footage and Features

model_sqft_feat<-lm(PRICE~SQFT+FEATS)
summary(model_sqft_feat)

## 
## Call:
## lm(formula = PRICE ~ SQFT + FEATS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -993.6  -90.7   -3.4   64.9  783.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.5881    70.9261    0.01     0.99    
## SQFT          0.5740     0.0406   14.15   <2e-16 ***
## FEATS        33.9045    15.4384    2.20     0.03 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 204 on 104 degrees of freedom
## Multiple R-squared:  0.723,  Adjusted R-squared:  0.718 
## F-statistic:  136 on 2 and 104 DF,  p-value: <2e-16

Interpretation

In this model, we reject the null hypothesis for both variables. Both square footage and number of features in a house contribute to explaining the variation in house price. The adjusted R^2 value is the highest of all the models at 71.79%. Further analysis and interpretation of this model will be completed in section 3 below.
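To see the four candidate models side by side, the adjusted R^2 values printed in the summary outputs above can be collected into one small table. The sketch below types those printed values in by hand; the object names models and top_model are assumptions for this illustration.

```r
# Adjusted R^2 values as printed in the summary() outputs above
models <- data.frame(
  formula = c("PRICE ~ SQFT", "PRICE ~ SQFT + AGE",
              "PRICE ~ SQFT + FEATS", "PRICE ~ SQFT + FEATS + AGE"),
  adj_r2  = c(0.708, 0.705, 0.718, 0.716),
  stringsAsFactors = FALSE
)

# Sort from best to worst by adjusted R^2
models[order(models$adj_r2, decreasing = TRUE), ]

# Model with the highest adjusted R^2
top_model <- models$formula[which.max(models$adj_r2)]
top_model
```

The differences among the four values are small, so the ranking is a guide rather than strong evidence that one model is better than another.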

3. Final Model Analysis

Since the model with square footage and features had the highest adjusted R^2 of the models considered, it is analyzed further here.
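To make the fitted coefficients concrete, the sketch below plugs them into the regression equation for a single house. The coefficient values are copied from the summary output of the square footage and features model; the house itself (1500 square feet, 4 features) is a hypothetical example chosen for illustration.

```r
# Coefficient estimates printed for lm(PRICE ~ SQFT + FEATS) above
b0      <- 0.5881
b_sqft  <- 0.5740
b_feats <- 33.9045

# Hypothetical house chosen for illustration: 1500 sq ft with 4 features
predicted <- b0 + b_sqft * 1500 + b_feats * 4

predicted        # 997.2061, in hundreds of dollars
predicted * 100  # 99720.61 dollars
```

Read this way, each additional square foot is associated with about 57 dollars of price and each additional feature with about 3,390 dollars, holding the other variable fixed.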

Plot with Regression Plane

A 3-Dimensional Scatterplot will be required since there are two independent variables. To plot this, we must first install an additional package:

#install.packages("scatterplot3d")
library("scatterplot3d", lib.loc="~/R/win-library/3.1")

## Warning: package 'scatterplot3d' was built under R version 3.1.3

Create 3D plot with regression plane:

best<-scatterplot3d(SQFT, FEATS, PRICE, pch=18, main="Price vs Square Footage and Features (with Regression)", xlab="Square Footage", ylab="Number of Features", zlab="House Price (in Hundreds of Dollars)" )
best$plane3d(model_sqft_feat, col="blueviolet")

[Figure: Price vs Square Footage and Features (with Regression)]

Although it is hard to see at this angle, the plane seems to cut through the points, with some points just above and some just below. There are a few points that are not as close to the fitted plane; these can be attributed to the outliers that were left in the data.

Confidence Intervals

Next, to take a look at the confidence intervals of the model, we will go back to examining the model with just square footage, since the lines are easier to interpret on a 2D plot than on a 3D plot. We do this since the square footage model alone explains nearly as much of the variation in price as the two-variable model (~70% for square footage versus ~71% for square footage and features).

95% confidence interval for the model:

model_conf <- predict(model_sqft, interval="confidence")
plot(SQFT,PRICE, main="95% Confidence Interval for Price vs Square Footage", xlab= "Square Footage", ylab="House Price (hundreds of dollars)", col="blue", pch=18)

abline(model_sqft$coef, lwd=2, col="dark blue")

# Draw the bounds in order of SQFT (the data are sorted by PRICE, not SQFT)
ord <- order(SQFT)
lines(SQFT[ord], model_conf[ord,2], lty=2, lwd=2, col="blueviolet")
lines(SQFT[ord], model_conf[ord,3], lty=2, lwd=2, col="blueviolet")

[Figure: 95% Confidence Interval for Price vs Square Footage]

With 95% confidence, the true mean price of houses with a given square footage lies between the upper and lower confidence lines shown in the plot above. The band is tightest near the average square footage, where most of the observations fall, and widens toward the larger homes, for which there are far fewer observations.
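The behaviour of the band can be checked numerically. The sketch below fits a simple regression to simulated data (the data frame sim and its coefficients are assumptions, not the DASL values) and compares the width of the confidence interval for the mean with the width of the prediction interval for an individual house at three square footages.

```r
set.seed(2)
n <- 107

# Simulated stand-in for the houses data (not the DASL values)
sim <- data.frame(SQFT = runif(n, 800, 3800))
sim$PRICE <- 60 + 0.6 * sim$SQFT + rnorm(n, sd = 200)
fit <- lm(PRICE ~ SQFT, data = sim)

# Evaluate the intervals at the smallest, mean, and largest square footage
grid <- data.frame(SQFT = c(min(sim$SQFT), mean(sim$SQFT), max(sim$SQFT)))
conf <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

conf_width <- conf[, "upr"] - conf[, "lwr"]
pred_width <- pred[, "upr"] - pred[, "lwr"]
cbind(grid, conf_width, pred_width)
```

The confidence interval is narrowest at the mean square footage and widens toward both extremes, while the prediction interval is wider everywhere because it also accounts for the scatter of individual houses around the line.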

Check Model Assumptions

First, we will check that the model residuals meet the normality assumption:

qqnorm(residuals(model_sqft_feat))
qqline(residuals(model_sqft_feat))

[Figure: Normal Q-Q plot of the residuals]

The residuals appear to be normally distributed. They tail off at the upper and lower ends, but a majority fit the line well. The few that tail off could be due to the outliers that were left in the dataset for analysis.
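A visual check like the Q-Q plot can be supplemented with a formal test. The sketch below, which is not part of the original analysis, applies the Shapiro-Wilk test from base R to the residuals of a model fitted to simulated data; the data frame sim and its coefficients are assumptions made for illustration.

```r
set.seed(3)
n <- 107

# Simulated stand-in for the houses data (not the DASL values)
sim <- data.frame(
  SQFT  = runif(n, 800, 3800),
  FEATS = sample(0:8, n, replace = TRUE)
)
sim$PRICE <- 0.57 * sim$SQFT + 34 * sim$FEATS + rnorm(n, sd = 200)
fit <- lm(PRICE ~ SQFT + FEATS, data = sim)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(residuals(fit))
sw$statistic
sw$p.value
```

On the real data, heavy tails like those described above would tend to lower the p-value, so the test and the plot are best read together.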

Finally, to check the fit of the model, we will examine the model residuals:

model.resid<-model_sqft_feat$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="blueviolet")

[Figure: Residuals]

The residuals show the difference between the actual and fitted values of the model. They are spread across the full range of observations and scattered evenly about 0, indicating the model is a good fit!
