Executive Summary
The goal is to predict the sale price of a house from its covariates by building the best multivariate linear regression model, and to understand the interactions between the regressor variables.
The Cincinnati Housing data set contains 313 observations in total. Each classmate added at least 15 entries to a shared Google Sheet as part of a data-gathering task, using the “Zillow” website as the source of the housing data.
We have one response variable and six covariates.
List of independent variables:
X1:Age
X2:SqFt
X3:Bathrooms
X4:Zip
X5:Neighborhood
Dependent Variable:
y1:SalePrice
Our goal is to understand and study the regression model derived from this data set. We apply the statistical methods learned in class to obtain the best regression model. The objective is to build the model with the minimum number of variables without compromising prediction accuracy. We also study how the regressor variables depend on each other through covariance analysis.
The final model, fitted on the log of the sale price, is:
log(SalePrice) = 10.8 - 3.76 × 10^-3 * Age + 2.94 × 10^-4 * SqFt + 0.127 * Bathrooms + 0.874 * Zip_Indfour - 0.0779 * Zip_Indone + 0.208 * Zip_Indothers + 0.668 * Zip_Indsix + 0.776 * Zip_Indthree + 0.220 * Neighborhood_IndiNE2 + 0.899 * Neighborhood_IndiNE3 + 0.691 * Neighborhood_IndiNE5 + 0.791 * Neighborhood_IndiNE6 + 0.539 * Neighborhood_Indiothers
Data Preparation And Cleansing
Prepare data set:
• The data set was downloaded from the class Google Sheet into an Excel CSV file.
• Data columns were formatted as applicable in Excel.
• Duplicate records were identified and removed using Excel's Remove Duplicates feature on the address column.
Eliminate bad data based on the following criteria:
• Street addresses that included apartment numbers
• Street addresses outside the I-275 loop
• A high number of stories, where Zillow indicated a multi-family dwelling
• More than 2 obvious errors, which made the data collector's entries untrustworthy (e.g. a wrong year or an unrealistic square footage in any measurement column)
• Sale dates more than 3 months old, to reduce extrapolation effects
• Missing values due to poor data collection methods
Add neighborhood variable to data set:
• The file of 2019 year-to-date sales was downloaded from the Hamilton County Ohio Auditor: https://www.hamiltoncountyauditor.org/transfer_download_menu.asp
• Data columns were split or concatenated to match the formatting between files. VLOOKUP was used to add the neighborhood based on the street address.
• Missing neighborhoods were collected manually from Zillow and Google. Most of the missing values were in Clermont County and therefore not available from the Hamilton County Auditor.
Libraries Used For this model
library(MASS)
library(car)
library(psych)
library(dplyr)
library(DAAG)
library(leaps)
Loading the dataset
library(MASS)
library(psych)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(DAAG)
## Loading required package: lattice
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:psych':
##
## cities
## The following object is masked from 'package:MASS':
##
## hills
library(leaps)
neighbourhood_data <- read.csv('Project_1.csv', header = TRUE)
head(neighbourhood_data)
## ï..Index Address DateSold Year Zip SalePrice Bedrooms
## 1 1 1337 Voll Rd 9/27/2019 1959 45230 212000 3
## 2 2 5786 Brookstone Dr 8/9/2019 2004 45230 972500 7
## 3 3 6160 Woodlark Dr 8/28/2019 1987 45230 420000 3
## 4 4 6265 Salem Rd 10/11/2019 1937 45230 150000 3
## 5 5 7099 Petri Dr 9/13/2019 1959 45230 125001 3
## 6 6 7621 FOREST RD 10/24/2019 1941 45255 259000 3
## Bathrooms Stories SqFt LotSqFt Neighborhood
## 1 2 2 1384 6011 ANDERSON TOWNSHIP
## 2 5 2 4628 34412 ANDERSON TOWNSHIP
## 3 4 2 2634 17424 ANDERSON TOWNSHIP
## 4 1 2 1580 23958 ANDERSON TOWNSHIP
## 5 2 2 1404 8276 ANDERSON TOWNSHIP
## 6 2 1 1678 48918 ANDERSON TOWNSHIP
attach(neighbourhood_data)
str(neighbourhood_data)
## 'data.frame': 313 obs. of 12 variables:
## $ ï..Index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Address : Factor w/ 313 levels "1006 Rutledge Ave",..: 33 232 244 250 272 281 294 39 289 45 ...
## $ DateSold : Factor w/ 78 levels "10/1/2019","10/10/2019",..: 71 56 50 3 59 13 18 76 27 3 ...
## $ Year : int 1959 2004 1987 1937 1959 1941 1992 1946 1910 1936 ...
## $ Zip : int 45230 45230 45230 45230 45230 45255 45255 45217 45229 45237 ...
## $ SalePrice : int 212000 972500 420000 150000 125001 259000 370000 112000 70752 195000 ...
## $ Bedrooms : int 3 7 3 3 3 3 4 4 5 5 ...
## $ Bathrooms : int 2 5 4 1 2 2 2 2 3 3 ...
## $ Stories : int 2 2 2 2 2 1 2 1 2 3 ...
## $ SqFt : int 1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
## $ LotSqFt : int 6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
## $ Neighborhood: Factor w/ 63 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
Based on domain knowledge, Zip should be treated as a nominal factor rather than an integer.
neighbourhood_data$Zip <- as.factor(neighbourhood_data$Zip)
The year built carries little meaning on its own, so we transform it into the age of the property by subtracting the year built from the current year (2019).
Append Age to the data frame
Age <- (2019-neighbourhood_data$Year)
neighbourhood_data <- cbind(neighbourhood_data,Age)
head(neighbourhood_data)
## ï..Index Address DateSold Year Zip SalePrice Bedrooms
## 1 1 1337 Voll Rd 9/27/2019 1959 45230 212000 3
## 2 2 5786 Brookstone Dr 8/9/2019 2004 45230 972500 7
## 3 3 6160 Woodlark Dr 8/28/2019 1987 45230 420000 3
## 4 4 6265 Salem Rd 10/11/2019 1937 45230 150000 3
## 5 5 7099 Petri Dr 9/13/2019 1959 45230 125001 3
## 6 6 7621 FOREST RD 10/24/2019 1941 45255 259000 3
## Bathrooms Stories SqFt LotSqFt Neighborhood Age
## 1 2 2 1384 6011 ANDERSON TOWNSHIP 60
## 2 5 2 4628 34412 ANDERSON TOWNSHIP 15
## 3 4 2 2634 17424 ANDERSON TOWNSHIP 32
## 4 1 2 1580 23958 ANDERSON TOWNSHIP 82
## 5 2 2 1404 8276 ANDERSON TOWNSHIP 60
## 6 2 1 1678 48918 ANDERSON TOWNSHIP 78
Assigned the variables
SalePrice <- neighbourhood_data$SalePrice
Bedrooms <- neighbourhood_data$Bedrooms
Bathrooms <- neighbourhood_data$Bathrooms
Stories <- neighbourhood_data$Stories
SqFt <- neighbourhood_data$SqFt
LoftSqft <- neighbourhood_data$LotSqFt
Neighborhood <- neighbourhood_data$Neighborhood
Using our domain knowledge, we exclude the Index, Address and DateSold columns, along with Year, which has been replaced by Age.
housing_data <- neighbourhood_data[,5:13]
Final Structure of the dataset
str(housing_data)
## 'data.frame': 313 obs. of 9 variables:
## $ Zip : Factor w/ 42 levels "45002","45202",..: 25 25 25 25 25 42 42 16 24 30 ...
## $ SalePrice : int 212000 972500 420000 150000 125001 259000 370000 112000 70752 195000 ...
## $ Bedrooms : int 3 7 3 3 3 3 4 4 5 5 ...
## $ Bathrooms : int 2 5 4 1 2 2 2 2 3 3 ...
## $ Stories : int 2 2 2 2 2 1 2 1 2 3 ...
## $ SqFt : int 1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
## $ LotSqFt : int 6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
## $ Neighborhood: Factor w/ 63 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ Age : num 60 15 32 82 60 78 27 73 109 83 ...
Check for null or empty values
colSums(is.na(housing_data))
## Zip SalePrice Bedrooms Bathrooms Stories
## 0 0 0 0 0
## SqFt LotSqFt Neighborhood Age
## 0 0 0 0
Descriptive Statistics of housing_data
pairs.panels(housing_data)
The plot shows that Bedrooms, Bathrooms and Stories behave as ordinal factors. SqFt and Bathrooms are strongly correlated, and SqFt and LotSqFt also show a clear relationship. The other variables are not highly correlated.
The response variable, SalePrice, is not normally distributed.
Taking the log of SalePrice smooths out this skewed distribution.
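The effect of the log transformation on a skewed response can be sketched on simulated data. The example below is only an illustration: the `price` vector is synthetic (log-normal), not the Cincinnati data, and the `skewness` helper is a hypothetical function written for this sketch.

```r
# Synthetic right-skewed "prices" (log-normal); NOT the Cincinnati data.
set.seed(1)
price <- exp(rnorm(500, mean = 12, sd = 0.6))

# Moment-based skewness: third central moment divided by variance^1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_raw <- skewness(price)       # clearly positive for right-skewed data
skew_log <- skewness(log(price))  # close to zero after the transformation

cat("skewness raw:", round(skew_raw, 2),
    " skewness log:", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation is what makes the normality assumption of the linear model more plausible on the log scale.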
Part 2: Building the Model
The first model uses log(SalePrice) as the response and all remaining variables as covariates.
model1 <- lm(log(SalePrice) ~ ., data=housing_data)
summary(model1)
##
## Call:
## lm(formula = log(SalePrice) ~ ., data = housing_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1789 -0.1061 0.0000 0.1407 1.0797
##
## Coefficients: (15 not defined because of singularities)
## Estimate Std. Error
## (Intercept) 1.138e+01 3.962e-01
## Zip45202 -2.190e-01 7.256e-01
## Zip45203 5.769e-01 4.726e-01
## Zip45204 1.140e-01 7.090e-01
## Zip45205 -2.895e-01 6.419e-01
## Zip45206 -6.020e-03 4.299e-01
## Zip45207 -2.081e-01 5.523e-01
## Zip45208 -1.241e+00 8.227e-01
## Zip45209 6.810e-01 3.989e-01
## Zip45211 -1.040e-02 5.838e-01
## Zip45212 -7.703e-01 5.647e-01
## Zip45213 3.641e-01 4.722e-01
## Zip45214 2.887e-01 4.659e-01
## Zip45215 1.338e+00 6.261e-01
## Zip45216 1.888e-01 4.689e-01
## Zip45217 -5.271e-01 3.993e-01
## Zip45219 -9.632e-01 7.703e-01
## Zip45220 -1.163e+00 8.688e-01
## Zip45223 -8.985e-01 4.742e-01
## Zip45224 8.373e-02 4.682e-01
## Zip45225 -6.584e-01 5.457e-01
## Zip45226 -1.563e+00 8.461e-01
## Zip45227 -1.734e+00 9.116e-01
## Zip45229 -5.306e-01 4.816e-01
## Zip45230 1.829e-01 3.988e-01
## Zip45231 1.704e-01 5.381e-01
## Zip45232 -5.240e-01 4.727e-01
## Zip45233 4.878e-01 6.543e-01
## Zip45236 9.459e-01 6.027e-01
## Zip45237 -8.268e-01 4.448e-01
## Zip45238 -2.279e-01 6.076e-01
## Zip45239 2.081e-01 7.605e-01
## Zip45240 -2.314e-01 4.652e-01
## Zip45242 1.168e-01 4.324e-01
## Zip45243 1.358e+00 5.662e-01
## Zip45244 3.416e-01 4.643e-01
## Zip45245 4.190e-01 4.626e-01
## Zip45246 -1.058e-01 5.359e-01
## Zip45248 2.288e-01 6.919e-01
## Zip45249 5.166e-01 4.558e-01
## Zip45251 5.166e-02 7.593e-01
## Zip45255 3.887e-01 4.233e-01
## Bedrooms 8.942e-03 3.454e-02
## Bathrooms 1.279e-01 3.465e-02
## Stories 7.164e-02 5.056e-02
## SqFt 2.528e-04 4.865e-05
## LotSqFt 1.728e-07 2.812e-06
## NeighborhoodAVONDALE -7.661e-02 3.249e-01
## NeighborhoodBOND HILL 2.429e-01 3.155e-01
## NeighborhoodCAMP WASHINGTON NA NA
## NeighborhoodCHEVIOT -5.494e-01 5.199e-01
## NeighborhoodCLEVES NA NA
## NeighborhoodCLIFTON 1.546e+00 7.961e-01
## NeighborhoodCLIFTON HTS-UNIVERSITY HTS-FAIRVIEW 1.046e+00 6.936e-01
## NeighborhoodCOLERAIN TOWNSHIP -3.160e-01 5.396e-01
## NeighborhoodCOLLEGE HILL NA NA
## NeighborhoodCOLUMBIA TOWNSHIP -2.459e+00 6.358e-01
## NeighborhoodCOLUMBIA TUSCULUM 2.186e+00 7.782e-01
## NeighborhoodCORRYVILLE 9.129e-01 7.130e-01
## NeighborhoodDEER PARK -6.583e-01 5.205e-01
## NeighborhoodDELHI TOWNSHIP -1.748e-01 5.080e-01
## NeighborhoodEAST END 1.901e+00 8.473e-01
## NeighborhoodEAST PRICE HILL -5.641e-01 5.388e-01
## NeighborhoodEAST WALNUT HILLS 7.647e-01 3.347e-01
## NeighborhoodEVANSTON 4.641e-01 3.500e-01
## NeighborhoodFOREST PARK NA NA
## NeighborhoodForestville -3.002e-01 2.587e-01
## NeighborhoodGLENDALE 6.212e-01 5.400e-01
## NeighborhoodGREEN TOWNSHIP 4.092e-02 5.038e-01
## NeighborhoodHARTWELL NA NA
## NeighborhoodHYDE PARK 2.032e+00 7.259e-01
## NeighborhoodINDIAN HILL -4.481e-01 4.305e-01
## NeighborhoodKENNEDY HEIGHTS 3.497e-01 3.818e-01
## NeighborhoodLINCOLN HEIGHTS -1.292e-01 6.844e-01
## NeighborhoodLINWOOD 2.029e+00 7.918e-01
## NeighborhoodMack South -7.124e-01 6.552e-01
## NeighborhoodMADEIRA -7.037e-01 5.643e-01
## NeighborhoodMADISONVILLE 2.064e+00 8.429e-01
## NeighborhoodMARIEMONT 2.483e+00 9.282e-01
## NeighborhoodMIAMI TOWNSHIP 1.018e-01 6.931e-01
## NeighborhoodMONTGOMERY 1.077e-01 3.339e-01
## NeighborhoodMOUNT ADAMS 9.374e-01 6.622e-01
## NeighborhoodMOUNT AIRY -4.579e-01 6.857e-01
## NeighborhoodMOUNT AUBURN 5.556e-01 6.629e-01
## NeighborhoodMOUNT LOOKOUT 2.048e+00 7.462e-01
## NeighborhoodMOUNT WASHINGTON -2.345e-01 1.535e-01
## NeighborhoodNORTH AVONDALE 5.564e-01 2.663e-01
## NeighborhoodNORTH COLLEGE HILL -5.647e-01 7.625e-01
## NeighborhoodNORTHSIDE 8.968e-01 2.924e-01
## NeighborhoodNORWOOD 1.140e+00 4.676e-01
## NeighborhoodOAKLEY NA NA
## NeighborhoodOVER-THE-RHINE 1.217e+00 6.671e-01
## NeighborhoodPLEASANT RIDGE NA NA
## NeighborhoodREADING -1.025e+00 6.252e-01
## NeighborhoodROSELAWN NA NA
## NeighborhoodSaylor Park -5.561e-01 6.607e-01
## NeighborhoodSOUTH CUMMINSVILLE NA NA
## NeighborhoodSPRINGDALE NA NA
## NeighborhoodSPRINGFIELD TOWNSHIP NA NA
## NeighborhoodST. BERNARD NA NA
## NeighborhoodSYCAMORE TOWNSHIP -5.686e-01 3.885e-01
## NeighborhoodSYMMES TOWNSHIP NA NA
## NeighborhoodUnion Township -3.484e-01 2.408e-01
## NeighborhoodWALNUT HILLS NA NA
## NeighborhoodWEST END NA NA
## NeighborhoodWEST PRICE HILL -5.717e-03 4.852e-01
## NeighborhoodWESTWOOD -1.783e-01 4.304e-01
## NeighborhoodWITHAMSVILLE -4.636e-01 3.774e-01
## NeighborhoodWYOMING -6.082e-01 4.581e-01
## Age -3.290e-03 9.082e-04
## t value Pr(>|t|)
## (Intercept) 28.711 < 2e-16 ***
## Zip45202 -0.302 0.763050
## Zip45203 1.221 0.223533
## Zip45204 0.161 0.872397
## Zip45205 -0.451 0.652494
## Zip45206 -0.014 0.988841
## Zip45207 -0.377 0.706668
## Zip45208 -1.509 0.132803
## Zip45209 1.707 0.089256 .
## Zip45211 -0.018 0.985802
## Zip45212 -1.364 0.173911
## Zip45213 0.771 0.441419
## Zip45214 0.620 0.536116
## Zip45215 2.137 0.033728 *
## Zip45216 0.403 0.687692
## Zip45217 -1.320 0.188253
## Zip45219 -1.250 0.212496
## Zip45220 -1.339 0.182034
## Zip45223 -1.895 0.059432 .
## Zip45224 0.179 0.858235
## Zip45225 -1.207 0.228916
## Zip45226 -1.848 0.066013 .
## Zip45227 -1.902 0.058522 .
## Zip45229 -1.102 0.271779
## Zip45230 0.459 0.646947
## Zip45231 0.317 0.751765
## Zip45232 -1.108 0.268877
## Zip45233 0.745 0.456787
## Zip45236 1.570 0.117962
## Zip45237 -1.859 0.064396 .
## Zip45238 -0.375 0.707980
## Zip45239 0.274 0.784675
## Zip45240 -0.497 0.619432
## Zip45242 0.270 0.787379
## Zip45243 2.399 0.017283 *
## Zip45244 0.736 0.462681
## Zip45245 0.906 0.366158
## Zip45246 -0.198 0.843618
## Zip45248 0.331 0.741261
## Zip45249 1.133 0.258350
## Zip45251 0.068 0.945821
## Zip45255 0.918 0.359445
## Bedrooms 0.259 0.795970
## Bathrooms 3.691 0.000283 ***
## Stories 1.417 0.157912
## SqFt 5.196 4.67e-07 ***
## LotSqFt 0.061 0.951054
## NeighborhoodAVONDALE -0.236 0.813806
## NeighborhoodBOND HILL 0.770 0.442242
## NeighborhoodCAMP WASHINGTON NA NA
## NeighborhoodCHEVIOT -1.057 0.291758
## NeighborhoodCLEVES NA NA
## NeighborhoodCLIFTON 1.943 0.053362 .
## NeighborhoodCLIFTON HTS-UNIVERSITY HTS-FAIRVIEW 1.508 0.133105
## NeighborhoodCOLERAIN TOWNSHIP -0.586 0.558796
## NeighborhoodCOLLEGE HILL NA NA
## NeighborhoodCOLUMBIA TOWNSHIP -3.867 0.000145 ***
## NeighborhoodCOLUMBIA TUSCULUM 2.809 0.005428 **
## NeighborhoodCORRYVILLE 1.280 0.201803
## NeighborhoodDEER PARK -1.265 0.207323
## NeighborhoodDELHI TOWNSHIP -0.344 0.731136
## NeighborhoodEAST END 2.243 0.025888 *
## NeighborhoodEAST PRICE HILL -1.047 0.296252
## NeighborhoodEAST WALNUT HILLS 2.285 0.023303 *
## NeighborhoodEVANSTON 1.326 0.186309
## NeighborhoodFOREST PARK NA NA
## NeighborhoodForestville -1.161 0.247110
## NeighborhoodGLENDALE 1.150 0.251290
## NeighborhoodGREEN TOWNSHIP 0.081 0.935351
## NeighborhoodHARTWELL NA NA
## NeighborhoodHYDE PARK 2.800 0.005575 **
## NeighborhoodINDIAN HILL -1.041 0.299069
## NeighborhoodKENNEDY HEIGHTS 0.916 0.360747
## NeighborhoodLINCOLN HEIGHTS -0.189 0.850396
## NeighborhoodLINWOOD 2.563 0.011058 *
## NeighborhoodMack South -1.087 0.278113
## NeighborhoodMADEIRA -1.247 0.213716
## NeighborhoodMADISONVILLE 2.448 0.015142 *
## NeighborhoodMARIEMONT 2.675 0.008042 **
## NeighborhoodMIAMI TOWNSHIP 0.147 0.883407
## NeighborhoodMONTGOMERY 0.323 0.747348
## NeighborhoodMOUNT ADAMS 1.416 0.158308
## NeighborhoodMOUNT AIRY -0.668 0.505011
## NeighborhoodMOUNT AUBURN 0.838 0.402863
## NeighborhoodMOUNT LOOKOUT 2.744 0.006567 **
## NeighborhoodMOUNT WASHINGTON -1.528 0.128056
## NeighborhoodNORTH AVONDALE 2.089 0.037833 *
## NeighborhoodNORTH COLLEGE HILL -0.741 0.459737
## NeighborhoodNORTHSIDE 3.067 0.002435 **
## NeighborhoodNORWOOD 2.438 0.015555 *
## NeighborhoodOAKLEY NA NA
## NeighborhoodOVER-THE-RHINE 1.824 0.069475 .
## NeighborhoodPLEASANT RIDGE NA NA
## NeighborhoodREADING -1.639 0.102630
## NeighborhoodROSELAWN NA NA
## NeighborhoodSaylor Park -0.842 0.400861
## NeighborhoodSOUTH CUMMINSVILLE NA NA
## NeighborhoodSPRINGDALE NA NA
## NeighborhoodSPRINGFIELD TOWNSHIP NA NA
## NeighborhoodST. BERNARD NA NA
## NeighborhoodSYCAMORE TOWNSHIP -1.463 0.144800
## NeighborhoodSYMMES TOWNSHIP NA NA
## NeighborhoodUnion Township -1.447 0.149383
## NeighborhoodWALNUT HILLS NA NA
## NeighborhoodWEST END NA NA
## NeighborhoodWEST PRICE HILL -0.012 0.990609
## NeighborhoodWESTWOOD -0.414 0.679046
## NeighborhoodWITHAMSVILLE -1.229 0.220579
## NeighborhoodWYOMING -1.328 0.185633
## Age -3.622 0.000364 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3775 on 218 degrees of freedom
## Multiple R-squared: 0.8304, Adjusted R-squared: 0.7572
## F-statistic: 11.35 on 94 and 218 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model1)
## Warning: not plotting observations with leverage one:
## 14, 25, 39, 53, 65, 71, 77, 123, 142, 144, 150, 232, 233, 234, 235, 240, 243, 246, 254, 272, 290, 298, 306, 312
## Warning: not plotting observations with leverage one:
## 14, 25, 39, 53, 65, 71, 77, 123, 142, 144, 150, 232, 233, 234, 235, 240, 243, 246, 254, 272, 290, 298, 306, 312
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow = c(1,1))
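The adjusted R-squared is 75.7% and the overall F-test is significant (p < 2.2e-16), but the p-values for most of the Neighborhood and Zip levels are not significant. Fifteen coefficients are reported as NA because of singularities, meaning those dummy variables are perfectly collinear with others already in the model.
The residual analysis also shows that the LINE assumptions, in particular normality and equal variance, are not well satisfied.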
str(housing_data)
## 'data.frame': 313 obs. of 9 variables:
## $ Zip : Factor w/ 42 levels "45002","45202",..: 25 25 25 25 25 42 42 16 24 30 ...
## $ SalePrice : int 212000 972500 420000 150000 125001 259000 370000 112000 70752 195000 ...
## $ Bedrooms : int 3 7 3 3 3 3 4 4 5 5 ...
## $ Bathrooms : int 2 5 4 1 2 2 2 2 3 3 ...
## $ Stories : int 2 2 2 2 2 1 2 1 2 3 ...
## $ SqFt : int 1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
## $ LotSqFt : int 6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
## $ Neighborhood: Factor w/ 63 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ Age : num 60 15 32 82 60 78 27 73 109 83 ...
Since Neighborhood has 63 levels and Zip has 42 levels, instead of adding one variable at a time to explain SalePrice, we run regsubsets with its default exhaustive method over all the variables.
We also found that regsubsets, by default, only reports subsets of up to 8 variables (nvmax = 8). Running regsubsets with both Neighborhood and Zip stalled for a long time while processing. Hence, we ran two separate regsubsets searches, one with Neighborhood and one with Zip, keeping all other variables in both. The first search uses SalePrice as the response and all covariates except Zip.
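To make concrete what an exhaustive subset search does, here is a minimal base-R sketch on synthetic data. It is not the leaps implementation (regsubsets uses a much more efficient branch-and-bound search); the data frame `d` and the predictor names `x1` to `x4` are placeholders invented for this example.

```r
# Synthetic data; variable names are placeholders, not the housing columns.
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x3 + rnorm(n, sd = 0.5)

predictors <- setdiff(names(d), "y")

# For each subset size k, fit every combination of k predictors and keep
# the combination with the smallest residual sum of squares (RSS).
best_by_size <- lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  rss <- vapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = "y"), data = d)
    deviance(fit)  # for an lm fit, deviance() is the RSS
  }, numeric(1))
  list(size = k, vars = combos[[which.min(rss)]], rss = min(rss))
})

for (b in best_by_size) {
  cat(b$size, "variable(s):", paste(b$vars, collapse = " + "),
      " RSS =", round(b$rss, 2), "\n")
}
```

The number of candidate models grows very quickly with the number of dummy variables, which is consistent with the long run time observed when both Neighborhood and Zip were included.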
data1 <- regsubsets((SalePrice) ~ Bedrooms + Bathrooms + Stories+ SqFt + Age+ LotSqFt+ Neighborhood, data = housing_data, really.big=T)
Adj_R_sq <- summary(data1)$adjr2
RSS <- summary(data1)$rss
Adj_R_sq
## [1] 0.4738299 0.5251060 0.5720530 0.6104688 0.6354247 0.6574585 0.6766102
## [8] 0.6877442
RSS
## [1] 9.267728e+12 8.337678e+12 7.489197e+12 6.794847e+12 6.338877e+12
## [6] 5.936376e+12 5.586155e+12 5.376143e+12
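Here the adjusted R-squared reaches 68.8% with 8 variables, but the residual sum of squares is on the order of 10^12 on the raw dollar scale. Since SalePrice is also not normally distributed, this model is of little use, and a log transformation is applied to SalePrice.
After taking log(SalePrice), regsubsets was run again for the same set of covariates.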
data2 <- regsubsets(log(SalePrice) ~ Bedrooms + Bathrooms + Stories+ SqFt + Age+ LotSqFt+ Neighborhood, data = housing_data, really.big=T)
mb2 <-summary(data2)
Adj_R_sq <- summary(data2)$adjr2
RSS <- summary(data2)$rss
Adj_R_sq
## [1] 0.3847604 0.4493202 0.4869157 0.5192738 0.5495711 0.5763099 0.5955645
## [8] 0.6140310
RSS
## [1] 112.32841 100.21801 93.07480 86.92274 81.18009 76.11226 72.41592
## [8] 68.88283
AIC <- 313*log(RSS/313) + (1:8)*2
AIC
## [1] -318.7550 -352.4617 -373.6063 -393.0104 -412.4039 -430.5801 -444.1622
## [8] -457.8182
par(mfrow=c(1,1))
plot(AIC,main="AIC plot without Zip")
Here the adjusted R-squared is 61.4% and the residual sum of squares is 68.9 at 8 variables. Adjusted R-squared is highest and AIC is lowest at 8 variables, so 8 covariates will be used.
Another noticeable insight is that a few Neighborhood levels carry more weight than LotSqFt and Age. Therefore, these covariates will be dropped in the subsequent models.
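The AIC expression used here, n * log(RSS / n) + 2k, can be checked against R's own `extractAIC()` on a small example. The sketch below uses synthetic data (the data frame `d` is invented for illustration). One assumption worth stating: this report counts only the k predictors in the penalty, while `extractAIC()` also counts the intercept, so the two values differ by a constant 2 and rank models identically.

```r
# Synthetic data for illustration only.
set.seed(7)
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 + 1.2 * d$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
rss <- deviance(fit)  # residual sum of squares
k <- 2                # number of predictors, excluding the intercept

aic_report <- n * log(rss / n) + 2 * k  # form used in this report
aic_r <- extractAIC(fit)[2]             # penalty counts the intercept too

cat("report form:", aic_report, " extractAIC:", aic_r,
    " difference:", aic_r - aic_report, "\n")
```

Because the offset is the same for every model size, comparing AIC values across subset sizes leads to the same choice under either convention.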
data3 <- regsubsets(log(SalePrice) ~ Bedrooms + Bathrooms + Stories+ SqFt + Age+ LotSqFt+ Zip, data = housing_data,really.big=T)
mb3 <-summary(data3)
Adj_R_sq <- mb3$adjr2
RSS <- mb3$rss
Adj_R_sq
## [1] 0.3847604 0.4493202 0.5080247 0.5412928 0.5699893 0.5966092 0.6224446
## [8] 0.6363916
RSS
## [1] 112.32841 100.21801 89.24558 82.94136 77.50014 72.46566 67.60291
## [8] 64.89219
AIC <- 313*log(RSS/313) + (1:8)*2
AIC
## [1] -318.7550 -352.4617 -386.7559 -407.6857 -426.9240 -445.9473 -465.6888
## [8] -476.4980
par(mfrow=c(1,1))
plot(AIC,main="AIC plot without Neighborhood")
Here the adjusted R-squared is 63.6% and the residual sum of squares is 64.9 at 8 variables. Again, adjusted R-squared is highest and AIC is lowest at 8 variables.
Noticeable insight: a few Zip levels carry more weight than LotSqFt and Age. Therefore, these covariates will be dropped in the subsequent models.
housing_data_slim <- read.csv('Project_2.csv', header = TRUE)
attach(housing_data_slim)
## The following objects are masked _by_ .GlobalEnv:
##
## Bathrooms, Bedrooms, Neighborhood, SalePrice, SqFt, Stories
## The following objects are masked from neighbourhood_data:
##
## Address, Bathrooms, Bedrooms, DateSold, ï..Index, LotSqFt,
## Neighborhood, SalePrice, SqFt, Stories, Year, Zip
housing_data_slim$Zip_Ind
## [1] others others others others others others others five others others
## [11] others others others others others others others others others others
## [21] others others others others others others others others six six
## [31] six six others others others others others six others one
## [41] one one one one others others others others others others
## [51] others others others others others three three three three three
## [61] three three three four others others others six others others
## [71] others others others others others others others four four four
## [81] four four four others others others others others others others
## [91] others four four four three three three three three six
## [101] three six six six six others others others others others
## [111] others others others others others five others others others others
## [121] others others others others others others others others others others
## [131] others four others others others others others others others others
## [141] others others five five five five five five five five
## [151] five five others others others others others others others others
## [161] others others others others others one others others others others
## [171] one others others others others others others others others others
## [181] others others others others others others others others one three
## [191] others others three others others others three others three three
## [201] others others others others others others others others five four
## [211] others others five others others others others three three three
## [221] others others three others others others others three three others
## [231] six one others others others others others others others others
## [241] others four others others others others others five others others
## [251] others others others others others others others others others others
## [261] three others one others others others others others others others
## [271] others six four others others others others others others others
## [281] others others others others others others others others others others
## [291] others others others others others others others others
## Levels: five four one others six three
head(housing_data_slim)
## ï..Index Address DateSold Year Zip Neighborhood
## 1 1 1337 Voll Rd 9/27/2019 1959 45230 ANDERSON TOWNSHIP
## 2 2 5786 Brookstone Dr 8/9/2019 2004 45230 ANDERSON TOWNSHIP
## 3 3 6160 Woodlark Dr 8/28/2019 1987 45230 ANDERSON TOWNSHIP
## 4 4 6265 Salem Rd 10/11/2019 1937 45230 ANDERSON TOWNSHIP
## 5 5 7099 Petri Dr 9/13/2019 1959 45230 ANDERSON TOWNSHIP
## 6 6 7621 FOREST RD 10/24/2019 1941 45255 ANDERSON TOWNSHIP
## SalePrice Bedrooms Bathrooms Stories SqFt LotSqFt Zip_Ind
## 1 212000 3 2 2 1384 6011 others
## 2 972500 7 5 2 4628 34412 others
## 3 420000 3 4 2 2634 17424 others
## 4 150000 3 1 2 1580 23958 others
## 5 125001 3 2 2 1404 8276 others
## 6 259000 3 2 1 1678 48918 others
## Neighborhood_Indi
## 1 others
## 2 others
## 3 others
## 4 others
## 5 others
## 6 others
Bedrooms <- housing_data_slim$Bedrooms
Bathrooms <- housing_data_slim$Bathrooms
Stories <- housing_data_slim$Stories
Sqft <- housing_data_slim$SqFt
LotSqFt <- housing_data_slim$LotSqFt
Zip_Ind <- housing_data_slim$Zip_Ind
Neighborhood_Indi <- housing_data_slim$Neighborhood_Indi
Age <- (2019-housing_data_slim$Year)
housing_data_slim <- cbind(housing_data_slim,Age)
housing_data_slim$SalePrice <- log(housing_data_slim$SalePrice)
str(housing_data_slim)
## 'data.frame': 298 obs. of 15 variables:
## $ ï..Index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Address : Factor w/ 298 levels "1006 Rutledge Ave",..: 33 217 229 235 257 266 279 39 274 45 ...
## $ DateSold : Factor w/ 78 levels "10/1/2019","10/10/2019",..: 71 56 50 3 59 13 18 76 27 3 ...
## $ Year : int 1959 2004 1987 1937 1959 1941 1992 1946 1910 1936 ...
## $ Zip : int 45230 45230 45230 45230 45230 45255 45255 45217 45229 45237 ...
## $ Neighborhood : Factor w/ 62 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ SalePrice : num 12.3 13.8 12.9 11.9 11.7 ...
## $ Bedrooms : int 3 7 3 3 3 3 4 4 5 5 ...
## $ Bathrooms : int 2 5 4 1 2 2 2 2 3 3 ...
## $ Stories : int 2 2 2 2 2 1 2 1 2 3 ...
## $ SqFt : int 1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
## $ LotSqFt : int 6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
## $ Zip_Ind : Factor w/ 6 levels "five","four",..: 4 4 4 4 4 4 4 1 4 4 ...
## $ Neighborhood_Indi: Factor w/ 6 levels "NE1","NE2","NE3",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Age : num 60 15 32 82 60 78 27 73 109 83 ...
Based on the regsubsets results above, the most influential Neighborhood levels were converted into indicator levels (Neighborhood_Indi), with the remaining levels grouped as others.
Similarly, indicator levels were created for the most influential Zip codes (Zip_Ind).
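The grouping itself was prepared outside R (in Project_2.csv), but the same idea can be sketched in base R. In the example below the zip values and the `keep` vector are hypothetical; the report does not state which zip codes map to the levels one through six.

```r
# Hypothetical zip codes; the actual "critical" levels are not listed here.
zip <- factor(c("45208", "45208", "45230", "45243", "45243", "45255", "45202"))
keep <- c("45208", "45243")  # levels to keep as their own indicator

# Assigning the same label to several levels merges them into one level.
zip_ind <- zip
levels(zip_ind)[!(levels(zip_ind) %in% keep)] <- "others"

print(table(zip_ind))
```

Reducing a factor with dozens of levels to a handful in this way keeps the number of dummy variables small enough for regsubsets to search quickly.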
After these steps, regsubsets was run again with the other covariates:
data4 <- regsubsets(SalePrice ~ SqFt+Bedrooms + Bathrooms + Stories+ LotSqFt + Age+Zip_Ind + Neighborhood_Indi ,data = housing_data_slim,really.big=T)
mb4 <-summary(data4)
Adj_R_sq <- mb4$adjr2
RSS <- mb4$rss
Adj_R_sq
## [1] 0.3895949 0.4687107 0.5249482 0.5761214 0.6205908 0.6333232 0.6455643
## [8] 0.6515818
RSS
## [1] 106.48001 92.36583 82.30886 73.19262 65.29034 62.88321 60.57504
## [8] 59.34128
# housing_data_slim has 298 rows, so n = 298 here (not 313)
AIC <- nrow(housing_data_slim)*log(RSS/nrow(housing_data_slim)) + (1:8)*2
round(AIC, 2)
## [1] -304.68 -345.06 -377.41 -410.39 -442.44 -451.63 -460.78 -464.91
par(mfrow=c(1,1))
plot(AIC,main="AIC with only 6 levels of Zip and 6 levels of Neighborhood")
The regsubsets search above reaches an adjusted R-squared of 65.2% and a residual sum of squares of 59.3 at 8 variables.
Based on the data4 regsubsets results, we select the important covariates.
Model 4 covariates: Age, SqFt, Bathrooms, Zip_Ind, Neighborhood_Indi
model4 <- lm(SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + Neighborhood_Indi, data = housing_data_slim)
summary(model4)
##
## Call:
## lm(formula = SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind +
## Neighborhood_Indi, data = housing_data_slim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57192 -0.21099 0.03266 0.27120 1.00406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.078e+01 3.653e-01 29.519 < 2e-16 ***
## Age -3.756e-03 8.040e-04 -4.671 4.62e-06 ***
## SqFt 2.942e-04 3.527e-05 8.341 3.24e-15 ***
## Bathrooms 1.271e-01 3.591e-02 3.540 0.000467 ***
## Zip_Indfour 8.739e-01 2.893e-01 3.020 0.002753 **
## Zip_Indone -7.790e-02 3.376e-01 -0.231 0.817665
## Zip_Indothers 2.079e-01 2.638e-01 0.788 0.431372
## Zip_Indsix 6.685e-01 3.020e-01 2.214 0.027645 *
## Zip_Indthree 7.761e-01 3.438e-01 2.258 0.024727 *
## Neighborhood_IndiNE2 2.203e-01 3.688e-01 0.597 0.550817
## Neighborhood_IndiNE3 8.987e-01 3.450e-01 2.605 0.009683 **
## Neighborhood_IndiNE5 6.914e-01 3.104e-01 2.228 0.026691 *
## Neighborhood_IndiNE6 7.908e-01 3.186e-01 2.482 0.013650 *
## Neighborhood_Indiothers 5.392e-01 2.249e-01 2.397 0.017177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4529 on 284 degrees of freedom
## Multiple R-squared: 0.6672, Adjusted R-squared: 0.6519
## F-statistic: 43.79 on 13 and 284 DF, p-value: < 2.2e-16
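Because the response is log(SalePrice), a prediction has to be exponentiated to get back to dollars, and each coefficient acts multiplicatively on price. The sketch below applies the coefficients printed above to the first house in the data (Age 60, 1384 sq ft, 2 bathrooms, both indicators equal to others). The coefficient values are copied by hand from the summary, so they carry its rounding; the vector names are chosen for this example only.

```r
# Coefficients copied from the model4 summary (response is log(SalePrice)).
coefs <- c(intercept = 10.78, Age = -3.756e-03, SqFt = 2.942e-04,
           Bathrooms = 0.1271, Zip_Indothers = 0.2079,
           Neighborhood_Indiothers = 0.5392)

# First row of the data: Age 60, 1384 sq ft, 2 bathrooms, "others" for both
# Zip_Ind and Neighborhood_Indi (dummy variables equal to 1).
x <- c(intercept = 1, Age = 60, SqFt = 1384, Bathrooms = 2,
       Zip_Indothers = 1, Neighborhood_Indiothers = 1)

log_price <- sum(coefs * x[names(coefs)])  # prediction on the log scale
price <- exp(log_price)                    # back-transformed to dollars

# Multiplicative effect of one extra bathroom, all else held fixed.
bath_effect <- exp(coefs[["Bathrooms"]])

cat("predicted log price:", round(log_price, 3), "\n")
cat("predicted price:", round(price), "\n")
cat("price multiplier per extra bathroom:", round(bath_effect, 3), "\n")
```

Under these coefficients one additional bathroom multiplies the predicted price by exp(0.1271), roughly a 13.6% increase. Note that exponentiating a log-scale prediction is a simple back-transformation and tends to sit closer to the median than the mean of the price distribution.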
plot(model4)
KCV4<-cv.lm(data=housing_data_slim, model4, m=3, seed=123)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 21.5 21.5 104.93 < 2e-16 ***
## SqFt 1 55.9 55.9 272.62 < 2e-16 ***
## Bathrooms 1 8.4 8.4 40.89 6.6e-10 ***
## Zip_Ind 5 28.9 5.8 28.14 < 2e-16 ***
## Neighborhood_Indi 5 2.1 0.4 2.02 0.075 .
## Residuals 284 58.3 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(data = housing_data_slim, model4, m = 3, seed = 123):
##
## As there is >1 explanatory variable, cross-validation
## predicted values for a fold are not a linear function
## of corresponding overall predicted values. Lines that
## are shown for the different folds are approximate
##
## fold 1
## Observations in test set: 99
## 4 10 11 15 16 18 20 22
## Predicted 11.81 12.347 11.944 12.393 12.1354 13.238 11.9269 11.793
## cvpred 11.81 12.333 11.915 12.297 12.1366 13.157 11.9780 11.817
## SalePrice 11.92 12.181 11.290 12.992 12.1548 13.422 11.9117 12.150
## CV residual 0.11 -0.152 -0.625 0.696 0.0182 0.266 -0.0663 0.333
## 23 25 28 36 38 40 45 46 48
## Predicted 11.822 11.665 12.160 11.80 12.200 10.743 11.34 11.951 12.501
## cvpred 11.816 11.707 12.205 11.81 12.262 10.655 10.94 11.946 12.487
## SalePrice 12.170 11.451 12.001 11.39 12.084 9.903 12.21 12.787 12.601
## CV residual 0.354 -0.256 -0.204 -0.42 -0.177 -0.752 1.27 0.842 0.115
## 56 60 61 66 69 79 81 85 86
## Predicted 12.364 12.884 12.456 11.681 12.100 13.751 14.050 12.130 11.883
## cvpred 12.259 12.736 12.360 11.693 12.103 13.872 14.118 12.134 11.922
## SalePrice 11.864 13.591 12.654 12.151 12.734 13.240 14.017 11.835 11.258
## CV residual -0.395 0.855 0.294 0.459 0.631 -0.632 -0.101 -0.299 -0.664
## 88 90 94 95 100 102 103 104 109
## Predicted 12.146 11.749 12.252 13.84 13.383 13.074 12.30 12.46 11.752
## cvpred 12.132 11.726 12.385 13.74 13.559 13.315 12.54 12.73 11.763
## SalePrice 12.278 11.951 11.736 13.30 13.448 12.700 12.23 12.32 11.225
## CV residual 0.147 0.225 -0.649 -0.44 -0.111 -0.615 -0.31 -0.41 -0.538
## 111 114 116 118 122 125 126 129 130
## Predicted 11.639 11.685 11.993 12.714 12.293 11.560 11.742 11.980 12.314
## cvpred 11.649 11.719 11.927 12.690 12.268 11.580 11.780 12.125 12.423
## SalePrice 11.814 11.607 11.581 12.936 12.667 10.714 11.983 11.290 12.938
## CV residual 0.165 -0.112 -0.346 0.246 0.399 -0.866 0.202 -0.835 0.516
## 136 140 142 148 153 154 159 160 164
## Predicted 11.70 11.75 12.370 11.536 12.1490 11.742 12.644 13.6749 12.301
## cvpred 11.72 11.76 12.338 11.546 12.2211 11.759 12.622 13.5622 12.317
## SalePrice 11.95 10.22 12.206 11.138 12.1389 12.072 12.914 13.6158 12.605
## CV residual 0.23 -1.54 -0.131 -0.409 -0.0822 0.313 0.292 0.0536 0.288
## 165 171 172 174 175 179 180 181 182
## Predicted 12.115 11.535 11.794 11.790 11.955 11.852 11.786 11.979 11.768
## cvpred 12.159 11.780 11.803 11.852 11.958 11.885 11.801 11.987 11.786
## SalePrice 12.595 11.512 11.156 11.435 11.905 11.590 11.430 11.775 11.225
## CV residual 0.435 -0.268 -0.646 -0.417 -0.053 -0.295 -0.372 -0.212 -0.561
## 190 192 193 194 197 199 212 213 216
## Predicted 13.753 11.799 12.576 12.2311 12.71 13.305 11.970 11.807 11.830
## cvpred 13.666 11.809 12.447 12.2871 12.60 13.202 12.029 11.749 11.835
## SalePrice 13.769 11.925 12.899 12.2620 13.65 13.705 11.891 12.231 11.951
## CV residual 0.104 0.116 0.452 -0.0251 1.05 0.503 -0.138 0.482 0.116
## 221 222 226 234 244 245 247 249 259
## Predicted 12.1787 11.797 11.913 12.694 11.9695 12.224 12.143 12.2 11.89
## cvpred 12.1618 11.824 11.959 12.678 11.9968 12.242 12.192 12.3 11.90
## SalePrice 12.1172 12.128 11.608 12.401 11.9184 12.155 12.061 12.1 10.69
## CV residual -0.0445 0.304 -0.351 -0.277 -0.0784 -0.087 -0.131 -0.2 -1.21
## 260 262 263 265 268 269 270 273 274
## Predicted 11.829 12.039 11.138 11.815 12.202 12.4 11.561 12.547 11.7503
## cvpred 11.850 12.095 11.053 11.814 12.192 12.4 11.586 12.635 11.7431
## SalePrice 11.212 11.683 11.839 12.087 12.014 12.7 11.898 12.388 11.7668
## CV residual -0.638 -0.412 0.786 0.273 -0.178 0.3 0.313 -0.247 0.0237
## 279 280 282 283 285 286 290 292
## Predicted 12.416 11.622 12.05901 11.71092 13.983 11.995 12.284 11.881
## cvpred 12.404 11.631 12.13052 11.74335 13.809 12.023 12.275 11.856
## SalePrice 13.395 11.327 12.12757 11.74721 14.130 12.278 13.028 12.150
## CV residual 0.991 -0.304 -0.00295 0.00385 0.322 0.255 0.753 0.293
## 293 294
## Predicted 12.102 12.167
## cvpred 12.169 12.161
## SalePrice 12.044 12.780
## CV residual -0.126 0.618
##
## Sum of squares = 22.6 Mean square = 0.23 n = 99
##
## fold 2
## Observations in test set: 100
## 2 5 9 17 24 26 27 29 30
## Predicted 13.471 11.972 12.23 12.18 11.837 11.666 12.016 12.16 12.451
## cvpred 13.490 11.969 12.28 12.23 11.860 11.677 12.018 11.70 12.003
## SalePrice 13.788 11.736 11.17 12.74 11.653 11.327 11.156 12.72 12.560
## CV residual 0.298 -0.233 -1.11 0.51 -0.208 -0.351 -0.862 1.02 0.557
## 31 32 33 34 35 37 41 43 49
## Predicted 13.128 13.34 12.141 12.15 11.8245 12.203 10.71 10.7944 12.951
## cvpred 12.610 12.85 12.181 12.20 11.8292 12.189 10.87 10.9429 13.038
## SalePrice 13.253 13.31 12.297 11.74 11.9184 12.532 9.68 10.9151 13.365
## CV residual 0.644 0.46 0.116 -0.46 0.0892 0.343 -1.19 -0.0278 0.326
## 50 57 63 68 74 77 78 80 82
## Predicted 12.586 13.278 13.329 12.256 11.729 12.55 13.75 12.919 13.018
## cvpred 12.652 13.261 13.330 11.769 11.731 12.56 13.73 12.948 12.976
## SalePrice 13.006 13.050 13.209 12.211 11.982 12.79 13.82 13.262 13.218
## CV residual 0.355 -0.211 -0.121 0.442 0.251 0.23 0.09 0.314 0.242
## 89 91 97 98 99 106 110 112
## Predicted 11.824 11.85 12.7951 12.69 12.6118 11.942 11.7804 11.7535
## cvpred 11.885 11.88 12.8453 12.76 12.6828 11.938 11.7738 11.7415
## SalePrice 11.002 10.45 12.9342 12.58 12.7426 12.044 11.7361 11.8130
## CV residual -0.883 -1.43 0.0889 -0.18 0.0598 0.105 -0.0377 0.0715
## 115 117 119 120 123 128 133 135 138
## Predicted 11.969 12.24 11.8605 12.92 11.69 11.785 11.785 12.495 12.01
## cvpred 11.969 12.27 11.8611 13.00 11.68 11.812 11.835 12.545 12.02
## SalePrice 11.884 12.91 11.8494 12.63 10.45 11.935 12.424 12.861 10.99
## CV residual -0.085 0.64 -0.0117 -0.37 -1.23 0.123 0.589 0.316 -1.03
## 139 141 145 146 150 155 157 161
## Predicted 11.95 11.887 11.22084 11.699 11.567 13.05 12.4761 11.314
## cvpred 11.96 11.908 11.07765 11.512 11.407 13.06 12.4865 11.349
## SalePrice 10.37 11.635 11.08598 11.884 12.128 13.61 12.5602 11.608
## CV residual -1.58 -0.273 0.00833 0.373 0.721 0.55 0.0737 0.259
## 162 168 173 177 183 184 189 191
## Predicted 12.1612 12.034 12.241 11.952 12.193 12.439 11.6641 11.930
## cvpred 12.1486 12.041 12.274 11.971 12.175 12.457 11.7541 11.923
## SalePrice 12.2303 11.735 11.842 11.831 12.946 12.995 11.6869 11.608
## CV residual 0.0816 -0.305 -0.431 -0.139 0.771 0.537 -0.0672 -0.315
## 195 196 198 201 203 204 206 207 208
## Predicted 12.185 12.04 11.80108 11.784 12.40 12.426 11.612 11.691 12.405
## cvpred 12.184 12.09 11.80300 11.798 12.40 12.484 11.645 11.718 12.412
## SalePrice 11.983 11.05 11.80932 12.144 12.58 12.633 12.297 11.884 12.524
## CV residual -0.201 -1.04 0.00632 0.346 0.18 0.149 0.652 0.166 0.112
## 209 211 214 218 224 225 229 230 231
## Predicted 11.329 12.09 12.067 14.179 12.10 11.862 13.3055 12.40 12.59
## cvpred 11.172 12.14 12.101 14.257 12.11 11.893 13.3582 12.40 12.15
## SalePrice 11.878 12.35 12.384 13.790 11.97 11.736 13.2963 12.58 12.52
## CV residual 0.705 0.21 0.284 -0.467 -0.14 -0.157 -0.0618 0.18 0.37
## 232 233 237 238 239 240 242 243
## Predicted 10.8390 12.0 11.777 11.680 11.9429 12.2473 12.312 12.085
## cvpred 10.9812 12.4 11.774 11.672 11.9257 12.2759 12.287 12.093
## SalePrice 10.9133 12.0 12.139 12.020 11.8565 12.3014 11.362 11.884
## CV residual -0.0679 -0.4 0.365 0.348 -0.0692 0.0255 -0.924 -0.209
## 246 253 254 258 261 272 277 278
## Predicted 12.3051 12.6656 11.742 12.674 13.761 12.840 12.011 12.169
## cvpred 12.2865 12.6657 11.761 12.661 13.824 12.324 11.995 12.206
## SalePrice 12.3863 12.6440 11.983 13.305 13.377 12.995 12.128 12.588
## CV residual 0.0998 -0.0217 0.222 0.643 -0.447 0.671 0.133 0.382
## 284 288 289 291 296 297
## Predicted 12.33 12.489 11.783 14.57 12.15957 12.175
## cvpred 12.37 12.485 11.788 14.64 12.14461 12.164
## SalePrice 12.24 12.760 11.983 13.20 12.13886 12.310
## CV residual -0.13 0.275 0.195 -1.43 -0.00575 0.146
##
## Sum of squares = 26.1 Mean square = 0.26 n = 100
##
## fold 3
## Observations in test set: 99
## 1 3 6 7 8 12 13 14
## Predicted 11.97 12.693 11.985 12.419 11.6380 11.676 12.150 11.9814
## cvpred 11.93 12.672 11.966 12.445 11.6111 11.665 12.156 11.9885
## SalePrice 12.26 12.948 12.465 12.821 11.6263 10.872 11.813 11.9512
## CV residual 0.33 0.276 0.498 0.376 0.0151 -0.792 -0.343 -0.0373
## 19 21 39 42 44 47 51 52
## Predicted 11.613 12.4088 11.499 10.991 10.927 12.123 12.400 13.066
## cvpred 11.593 12.4304 11.731 10.902 10.830 12.099 12.386 13.096
## SalePrice 11.849 12.4568 10.840 11.708 10.977 12.560 12.065 13.459
## CV residual 0.256 0.0264 -0.891 0.806 0.147 0.461 -0.321 0.363
## 53 54 55 58 59 62 64 65 67
## Predicted 11.685 12.104 11.70 12.581 12.6264 13.2751 13.74 12.247 11.955
## cvpred 11.659 12.082 11.69 12.580 12.6345 13.3392 13.44 12.267 11.939
## SalePrice 11.951 12.405 11.81 12.301 12.6115 13.2963 14.45 13.251 12.848
## CV residual 0.292 0.323 0.12 -0.278 -0.0229 -0.0429 1.01 0.984 0.909
## 70 71 72 73 75 76 83 84 87
## Predicted 11.89 11.801 11.99 12.041 11.824 12.606 13.443 11.987 11.870
## cvpred 11.88 11.816 11.99 12.050 11.820 12.564 13.237 11.964 11.864
## SalePrice 12.49 11.857 12.21 12.403 12.073 12.707 13.346 11.835 11.408
## CV residual 0.61 0.041 0.22 0.353 0.253 0.143 0.109 -0.129 -0.457
## 92 93 96 101 105 107 108 113
## Predicted 12.228 12.602 13.0519 13.7287 13.147 11.8991 12.054 12.6333
## cvpred 11.981 12.375 13.0357 13.8450 13.078 11.8598 12.061 12.5978
## SalePrice 12.445 12.506 13.1224 13.7820 12.975 11.8776 11.350 12.5099
## CV residual 0.464 0.131 0.0867 -0.0631 -0.103 0.0177 -0.711 -0.0879
## 121 124 127 131 132 134 137 143
## Predicted 11.64 12.098 12.025 12.3385 12.680 12.91 11.999 11.3052
## cvpred 11.62 12.063 11.987 12.3613 12.464 13.04 11.985 11.4548
## SalePrice 11.29 12.692 12.324 12.4049 13.275 11.92 11.884 11.4773
## CV residual -0.33 0.629 0.337 0.0436 0.811 -1.13 -0.101 0.0225
## 144 147 149 151 152 156 158 163 166
## Predicted 11.1728 11.372 12.32 11.037 11.37 12.77 13.584 12.399 11.523
## cvpred 11.3185 11.578 12.56 11.223 11.54 12.81 13.657 12.374 11.233
## SalePrice 11.2960 12.035 11.23 11.082 10.39 13.06 13.420 11.884 11.728
## CV residual -0.0224 0.457 -1.33 -0.141 -1.14 0.25 -0.237 -0.489 0.495
## 167 169 170 176 178 185 186 187 188
## Predicted 12.006 12.279 11.774 12.217 11.775 11.823 12.303 12.362 14.540
## cvpred 11.982 12.243 11.768 12.177 11.729 11.833 12.274 12.383 14.726
## SalePrice 11.720 12.181 11.513 11.842 11.408 12.530 12.142 12.835 13.825
## CV residual -0.262 -0.062 -0.255 -0.335 -0.321 0.697 -0.133 0.451 -0.901
## 200 202 205 210 215 217 219 220 223
## Predicted 13.32 11.692 11.953 13.15 11.658 12.114 12.927 13.5986 12.970
## cvpred 13.36 11.681 11.932 12.94 11.645 12.083 12.934 13.6721 12.978
## SalePrice 13.30 11.142 12.177 13.38 11.513 12.196 12.612 13.6352 12.843
## CV residual -0.06 -0.539 0.246 0.44 -0.132 0.113 -0.323 -0.0369 -0.136
## 227 228 235 236 241 248 250 251
## Predicted 11.890 12.6595 11.927 12.782 11.758 11.858 12.4927 12.0800
## cvpred 11.886 12.6880 11.851 12.788 11.744 12.014 12.4730 12.0432
## SalePrice 11.608 12.6603 11.608 13.452 11.518 12.168 12.4875 12.1145
## CV residual -0.278 -0.0277 -0.243 0.664 -0.226 0.154 0.0145 0.0713
## 252 255 256 257 264 266 267 271
## Predicted 11.882 11.909 12.5677 11.921 12.1937 12.669 11.693 12.273
## cvpred 11.887 11.819 12.5235 11.930 12.1765 12.645 11.644 12.288
## SalePrice 12.301 12.572 12.4969 11.562 12.2061 12.808 12.201 12.612
## CV residual 0.414 0.754 -0.0266 -0.368 0.0296 0.162 0.557 0.323
## 275 276 281 287 295 298
## Predicted 11.920 11.9759 11.9176 12.644 11.972 12.430
## cvpred 11.881 11.9512 11.8828 12.709 11.948 12.456
## SalePrice 11.775 11.9083 11.9184 12.953 11.608 12.154
## CV residual -0.106 -0.0428 0.0356 0.243 -0.339 -0.302
##
## Sum of squares = 19.3 Mean square = 0.2 n = 99
##
## Overall (Sum over all 99 folds)
## ms
## 0.228
n<-dim(housing_data_slim)[1]
MSPE <- sum( ((SalePrice)-KCV4$cvpred)^2 )/n
## Warning in (SalePrice) - KCV4$cvpred: longer object length is not a
## multiple of shorter object length
PRESS <- sum(((SalePrice)-KCV4$cvpred)^2)
## Warning in (SalePrice) - KCV4$cvpred: longer object length is not a
## multiple of shorter object length
Pred_R_squared <- 1-sum(((SalePrice)-KCV4$cvpred)^2)/sum(((SalePrice)-mean((SalePrice)))^2)
## Warning in (SalePrice) - KCV4$cvpred: longer object length is not a
## multiple of shorter object length
MSPE
## [1] 1.38e+11
PRESS
## [1] 4.1e+13
Pred_R_squared
## [1] -1.32
The QQ plot shows that the normality assumption is not satisfied: the residual distribution is heavy-tailed. We therefore check the Box-Cox plot for the best power transformation of the response y.
Another point to note about the model: Age and SqFt are continuous variables, Bathrooms is a discrete count, and Zip_Ind and Neighborhood_Indi are categorical variables.
Therefore, among the covariates, a transformation is only possible on the continuous ones, i.e., Age and SqFt.
As of now the Box-Cox plot also does not converge: the maximising λ reported below is 2, which is the upper end of the default search range of boxcox (-2 to 2).
boxcox(model4)
bcx<-boxcox(model4)
(lam <- bcx$x[which.max(bcx$y)])
## [1] 2
housing_data_slim$SalePrice <- (housing_data_slim$SalePrice ^ lam - 1) / lam
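For reference, the line above applies the λ ≠ 0 branch of the Box-Cox family to the response. A short statement of the full family, written with y for SalePrice, shows that the log transformation used later in this report is the λ = 0 member:

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\log y, & \lambda = 0
\end{cases}
```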
After transformation
housing_data_slim <- read.csv('Project.csv', header = TRUE)
Bedrooms <- housing_data_slim$Bedrooms
Bathrooms <- housing_data_slim$Bathrooms
Stories <- housing_data_slim$Stories
Sqft <- housing_data_slim$SqFt
LotSqFt <- housing_data_slim$LotSqFt
Zip_Ind <- housing_data_slim$Zip_Ind
Neighborhood_Indi <- housing_data_slim$Neighborhood_Indi
Age <- (2019-housing_data_slim$Year)
housing_data_slim <- cbind(housing_data_slim,Age)
housing_data_slim$SalePrice <- log(housing_data_slim$SalePrice)
str(housing_data_slim)
## 'data.frame': 298 obs. of 15 variables:
## $ ï..Index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Address : Factor w/ 298 levels "1006 Rutledge Ave",..: 33 217 229 235 257 266 279 39 274 45 ...
## $ DateSold : Factor w/ 78 levels "10/1/2019","10/10/2019",..: 71 56 50 3 59 13 18 76 27 3 ...
## $ Zip : int 45230 45230 45230 45230 45230 45255 45255 45217 45229 45237 ...
## $ Neighborhood : Factor w/ 62 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ SalePrice : num 12.3 13.8 12.9 11.9 11.7 ...
## $ Year : int 1959 2004 1987 1937 1959 1941 1992 1946 1910 1936 ...
## $ Bedrooms : int 3 7 3 3 3 3 4 4 5 5 ...
## $ Bathrooms : int 2 5 4 1 2 2 2 2 3 3 ...
## $ Stories : int 2 2 2 2 2 1 2 1 2 3 ...
## $ SqFt : int 1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
## $ LotSqFt : int 6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
## $ Zip_Ind : Factor w/ 6 levels "five","four",..: 4 4 4 4 4 4 4 1 4 4 ...
## $ Neighborhood_Indi: Factor w/ 6 levels "NE1","NE2","NE3",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Age : num 60 15 32 82 60 78 27 73 109 83 ...
SalePrice<-housing_data_slim$SalePrice
model4 <- lm(SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + Neighborhood_Indi, data = housing_data_slim)
summary(model4)
##
## Call:
## lm(formula = SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind +
## Neighborhood_Indi, data = housing_data_slim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5719 -0.2110 0.0327 0.2712 1.0041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08e+01 3.65e-01 29.52 < 2e-16 ***
## Age -3.76e-03 8.04e-04 -4.67 4.6e-06 ***
## SqFt 2.94e-04 3.53e-05 8.34 3.2e-15 ***
## Bathrooms 1.27e-01 3.59e-02 3.54 0.00047 ***
## Zip_Indfour 8.74e-01 2.89e-01 3.02 0.00275 **
## Zip_Indone -7.79e-02 3.38e-01 -0.23 0.81767
## Zip_Indothers 2.08e-01 2.64e-01 0.79 0.43137
## Zip_Indsix 6.68e-01 3.02e-01 2.21 0.02765 *
## Zip_Indthree 7.76e-01 3.44e-01 2.26 0.02473 *
## Neighborhood_IndiNE2 2.20e-01 3.69e-01 0.60 0.55082
## Neighborhood_IndiNE3 8.99e-01 3.45e-01 2.60 0.00968 **
## Neighborhood_IndiNE5 6.91e-01 3.10e-01 2.23 0.02669 *
## Neighborhood_IndiNE6 7.91e-01 3.19e-01 2.48 0.01365 *
## Neighborhood_Indiothers 5.39e-01 2.25e-01 2.40 0.01718 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.453 on 284 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.652
## F-statistic: 43.8 on 13 and 284 DF, p-value: <2e-16
F-Test
The model p-value (< 2 * 10^(-16)) is below the 5 percent significance level, so we reject the null hypothesis that all slope coefficients are zero and conclude that at least one covariate has a nonzero slope.
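As a rough consistency check, the overall F statistic can be recovered from the reported R-squared and degrees of freedom. The sketch below assumes the rounded values printed by summary(model4), with k = 13 slope coefficients and n = 298 observations, so the result is approximate:

```latex
H_0:\ \beta_1 = \beta_2 = \cdots = \beta_{13} = 0
\qquad \text{vs.} \qquad
H_1:\ \beta_j \neq 0 \ \text{for at least one } j

F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}
  = \frac{0.667 / 13}{0.333 / 284}
  \approx 43.8
```

This agrees with the F-statistic of 43.8 on 13 and 284 degrees of freedom in the model summary.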
T-test
The t-test tests whether the jth covariate has a nonzero slope, given the other covariates in the model.
In the coefficient table, the levels Zip_Indone, Zip_Indothers, and Neighborhood_IndiNE2 have p-values greater than 5%.
Therefore, we fail to reject the null hypothesis for these levels.
Standard Error
Zip_Indone, Zip_Indothers, and Neighborhood_IndiNE2 have a lot of wiggle room, because each estimate is smaller in absolute value than 2 * its standard error.
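The t-test and the standard-error rule of thumb can be checked by hand from the coefficient table. The sketch below copies four rounded estimates and standard errors from summary(model4) (so the results are approximate) and recomputes the t values, two-sided p-values, and 95% intervals. The data frame `coefs` and its column names are illustrative and not part of the original analysis.

```r
# Estimates and standard errors copied (rounded) from the summary(model4) table.
coefs <- data.frame(
  term     = c("Zip_Indone", "Zip_Indothers",
               "Neighborhood_IndiNE2", "Neighborhood_IndiNE3"),
  estimate = c(-0.0779, 0.208, 0.220, 0.899),
  std_err  = c(0.338, 0.264, 0.369, 0.345)
)
df_resid <- 284  # residual degrees of freedom reported by summary(model4)

# t statistic and two-sided p-value for H0: beta_j = 0
coefs$t_value <- coefs$estimate / coefs$std_err
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = df_resid)

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
coefs$lower <- coefs$estimate - t_crit * coefs$std_err
coefs$upper <- coefs$estimate + t_crit * coefs$std_err

# TRUE when the interval contains zero, i.e. the level is not significant at 5%
coefs$covers_zero <- coefs$lower < 0 & coefs$upper > 0

print(coefs, digits = 3)
```

An interval that covers zero corresponds to an estimate smaller than roughly twice its standard error, which is the "wiggle room" criterion used above.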
KCV4<-cv.lm(data=housing_data_slim, model4, m=3, seed=123)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 21.5 21.5 104.93 < 2e-16 ***
## SqFt 1 55.9 55.9 272.62 < 2e-16 ***
## Bathrooms 1 8.4 8.4 40.89 6.6e-10 ***
## Zip_Ind 5 28.9 5.8 28.14 < 2e-16 ***
## Neighborhood_Indi 5 2.1 0.4 2.02 0.075 .
## Residuals 284 58.3 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(data = housing_data_slim, model4, m = 3, seed = 123):
##
## As there is >1 explanatory variable, cross-validation
## predicted values for a fold are not a linear function
## of corresponding overall predicted values. Lines that
## are shown for the different folds are approximate
##
## fold 1
## Observations in test set: 99
## 4 10 11 15 16 18 20 22
## Predicted 11.81 12.347 11.944 12.393 12.1354 13.238 11.9269 11.793
## cvpred 11.81 12.333 11.915 12.297 12.1366 13.157 11.9780 11.817
## SalePrice 11.92 12.181 11.290 12.992 12.1548 13.422 11.9117 12.150
## CV residual 0.11 -0.152 -0.625 0.696 0.0182 0.266 -0.0663 0.333
## 23 25 28 36 38 40 45 46 48
## Predicted 11.822 11.665 12.160 11.80 12.200 10.743 11.34 11.951 12.501
## cvpred 11.816 11.707 12.205 11.81 12.262 10.655 10.94 11.946 12.487
## SalePrice 12.170 11.451 12.001 11.39 12.084 9.903 12.21 12.787 12.601
## CV residual 0.354 -0.256 -0.204 -0.42 -0.177 -0.752 1.27 0.842 0.115
## 56 60 61 66 69 79 81 85 86
## Predicted 12.364 12.884 12.456 11.681 12.100 13.751 14.050 12.130 11.883
## cvpred 12.259 12.736 12.360 11.693 12.103 13.872 14.118 12.134 11.922
## SalePrice 11.864 13.591 12.654 12.151 12.734 13.240 14.017 11.835 11.258
## CV residual -0.395 0.855 0.294 0.459 0.631 -0.632 -0.101 -0.299 -0.664
## 88 90 94 95 100 102 103 104 109
## Predicted 12.146 11.749 12.252 13.84 13.383 13.074 12.30 12.46 11.752
## cvpred 12.132 11.726 12.385 13.74 13.559 13.315 12.54 12.73 11.763
## SalePrice 12.278 11.951 11.736 13.30 13.448 12.700 12.23 12.32 11.225
## CV residual 0.147 0.225 -0.649 -0.44 -0.111 -0.615 -0.31 -0.41 -0.538
## 111 114 116 118 122 125 126 129 130
## Predicted 11.639 11.685 11.993 12.714 12.293 11.560 11.742 11.980 12.314
## cvpred 11.649 11.719 11.927 12.690 12.268 11.580 11.780 12.125 12.423
## SalePrice 11.814 11.607 11.581 12.936 12.667 10.714 11.983 11.290 12.938
## CV residual 0.165 -0.112 -0.346 0.246 0.399 -0.866 0.202 -0.835 0.516
## 136 140 142 148 153 154 159 160 164
## Predicted 11.70 11.75 12.370 11.536 12.1490 11.742 12.644 13.6749 12.301
## cvpred 11.72 11.76 12.338 11.546 12.2211 11.759 12.622 13.5622 12.317
## SalePrice 11.95 10.22 12.206 11.138 12.1389 12.072 12.914 13.6158 12.605
## CV residual 0.23 -1.54 -0.131 -0.409 -0.0822 0.313 0.292 0.0536 0.288
## 165 171 172 174 175 179 180 181 182
## Predicted 12.115 11.535 11.794 11.790 11.955 11.852 11.786 11.979 11.768
## cvpred 12.159 11.780 11.803 11.852 11.958 11.885 11.801 11.987 11.786
## SalePrice 12.595 11.512 11.156 11.435 11.905 11.590 11.430 11.775 11.225
## CV residual 0.435 -0.268 -0.646 -0.417 -0.053 -0.295 -0.372 -0.212 -0.561
## 190 192 193 194 197 199 212 213 216
## Predicted 13.753 11.799 12.576 12.2311 12.71 13.305 11.970 11.807 11.830
## cvpred 13.666 11.809 12.447 12.2871 12.60 13.202 12.029 11.749 11.835
## SalePrice 13.769 11.925 12.899 12.2620 13.65 13.705 11.891 12.231 11.951
## CV residual 0.104 0.116 0.452 -0.0251 1.05 0.503 -0.138 0.482 0.116
## 221 222 226 234 244 245 247 249 259
## Predicted 12.1787 11.797 11.913 12.694 11.9695 12.224 12.143 12.2 11.89
## cvpred 12.1618 11.824 11.959 12.678 11.9968 12.242 12.192 12.3 11.90
## SalePrice 12.1172 12.128 11.608 12.401 11.9184 12.155 12.061 12.1 10.69
## CV residual -0.0445 0.304 -0.351 -0.277 -0.0784 -0.087 -0.131 -0.2 -1.21
## 260 262 263 265 268 269 270 273 274
## Predicted 11.829 12.039 11.138 11.815 12.202 12.4 11.561 12.547 11.7503
## cvpred 11.850 12.095 11.053 11.814 12.192 12.4 11.586 12.635 11.7431
## SalePrice 11.212 11.683 11.839 12.087 12.014 12.7 11.898 12.388 11.7668
## CV residual -0.638 -0.412 0.786 0.273 -0.178 0.3 0.313 -0.247 0.0237
## 279 280 282 283 285 286 290 292
## Predicted 12.416 11.622 12.05901 11.71092 13.983 11.995 12.284 11.881
## cvpred 12.404 11.631 12.13052 11.74335 13.809 12.023 12.275 11.856
## SalePrice 13.395 11.327 12.12757 11.74721 14.130 12.278 13.028 12.150
## CV residual 0.991 -0.304 -0.00295 0.00385 0.322 0.255 0.753 0.293
## 293 294
## Predicted 12.102 12.167
## cvpred 12.169 12.161
## SalePrice 12.044 12.780
## CV residual -0.126 0.618
##
## Sum of squares = 22.6 Mean square = 0.23 n = 99
##
## fold 2
## Observations in test set: 100
## 2 5 9 17 24 26 27 29 30
## Predicted 13.471 11.972 12.23 12.18 11.837 11.666 12.016 12.16 12.451
## cvpred 13.490 11.969 12.28 12.23 11.860 11.677 12.018 11.70 12.003
## SalePrice 13.788 11.736 11.17 12.74 11.653 11.327 11.156 12.72 12.560
## CV residual 0.298 -0.233 -1.11 0.51 -0.208 -0.351 -0.862 1.02 0.557
## 31 32 33 34 35 37 41 43 49
## Predicted 13.128 13.34 12.141 12.15 11.8245 12.203 10.71 10.7944 12.951
## cvpred 12.610 12.85 12.181 12.20 11.8292 12.189 10.87 10.9429 13.038
## SalePrice 13.253 13.31 12.297 11.74 11.9184 12.532 9.68 10.9151 13.365
## CV residual 0.644 0.46 0.116 -0.46 0.0892 0.343 -1.19 -0.0278 0.326
## 50 57 63 68 74 77 78 80 82
## Predicted 12.586 13.278 13.329 12.256 11.729 12.55 13.75 12.919 13.018
## cvpred 12.652 13.261 13.330 11.769 11.731 12.56 13.73 12.948 12.976
## SalePrice 13.006 13.050 13.209 12.211 11.982 12.79 13.82 13.262 13.218
## CV residual 0.355 -0.211 -0.121 0.442 0.251 0.23 0.09 0.314 0.242
## 89 91 97 98 99 106 110 112
## Predicted 11.824 11.85 12.7951 12.69 12.6118 11.942 11.7804 11.7535
## cvpred 11.885 11.88 12.8453 12.76 12.6828 11.938 11.7738 11.7415
## SalePrice 11.002 10.45 12.9342 12.58 12.7426 12.044 11.7361 11.8130
## CV residual -0.883 -1.43 0.0889 -0.18 0.0598 0.105 -0.0377 0.0715
## 115 117 119 120 123 128 133 135 138
## Predicted 11.969 12.24 11.8605 12.92 11.69 11.785 11.785 12.495 12.01
## cvpred 11.969 12.27 11.8611 13.00 11.68 11.812 11.835 12.545 12.02
## SalePrice 11.884 12.91 11.8494 12.63 10.45 11.935 12.424 12.861 10.99
## CV residual -0.085 0.64 -0.0117 -0.37 -1.23 0.123 0.589 0.316 -1.03
## 139 141 145 146 150 155 157 161
## Predicted 11.95 11.887 11.22084 11.699 11.567 13.05 12.4761 11.314
## cvpred 11.96 11.908 11.07765 11.512 11.407 13.06 12.4865 11.349
## SalePrice 10.37 11.635 11.08598 11.884 12.128 13.61 12.5602 11.608
## CV residual -1.58 -0.273 0.00833 0.373 0.721 0.55 0.0737 0.259
## 162 168 173 177 183 184 189 191
## Predicted 12.1612 12.034 12.241 11.952 12.193 12.439 11.6641 11.930
## cvpred 12.1486 12.041 12.274 11.971 12.175 12.457 11.7541 11.923
## SalePrice 12.2303 11.735 11.842 11.831 12.946 12.995 11.6869 11.608
## CV residual 0.0816 -0.305 -0.431 -0.139 0.771 0.537 -0.0672 -0.315
## 195 196 198 201 203 204 206 207 208
## Predicted 12.185 12.04 11.80108 11.784 12.40 12.426 11.612 11.691 12.405
## cvpred 12.184 12.09 11.80300 11.798 12.40 12.484 11.645 11.718 12.412
## SalePrice 11.983 11.05 11.80932 12.144 12.58 12.633 12.297 11.884 12.524
## CV residual -0.201 -1.04 0.00632 0.346 0.18 0.149 0.652 0.166 0.112
## 209 211 214 218 224 225 229 230 231
## Predicted 11.329 12.09 12.067 14.179 12.10 11.862 13.3055 12.40 12.59
## cvpred 11.172 12.14 12.101 14.257 12.11 11.893 13.3582 12.40 12.15
## SalePrice 11.878 12.35 12.384 13.790 11.97 11.736 13.2963 12.58 12.52
## CV residual 0.705 0.21 0.284 -0.467 -0.14 -0.157 -0.0618 0.18 0.37
## 232 233 237 238 239 240 242 243
## Predicted 10.8390 12.0 11.777 11.680 11.9429 12.2473 12.312 12.085
## cvpred 10.9812 12.4 11.774 11.672 11.9257 12.2759 12.287 12.093
## SalePrice 10.9133 12.0 12.139 12.020 11.8565 12.3014 11.362 11.884
## CV residual -0.0679 -0.4 0.365 0.348 -0.0692 0.0255 -0.924 -0.209
## 246 253 254 258 261 272 277 278
## Predicted 12.3051 12.6656 11.742 12.674 13.761 12.840 12.011 12.169
## cvpred 12.2865 12.6657 11.761 12.661 13.824 12.324 11.995 12.206
## SalePrice 12.3863 12.6440 11.983 13.305 13.377 12.995 12.128 12.588
## CV residual 0.0998 -0.0217 0.222 0.643 -0.447 0.671 0.133 0.382
## 284 288 289 291 296 297
## Predicted 12.33 12.489 11.783 14.57 12.15957 12.175
## cvpred 12.37 12.485 11.788 14.64 12.14461 12.164
## SalePrice 12.24 12.760 11.983 13.20 12.13886 12.310
## CV residual -0.13 0.275 0.195 -1.43 -0.00575 0.146
##
## Sum of squares = 26.1 Mean square = 0.26 n = 100
##
## fold 3
## Observations in test set: 99
## 1 3 6 7 8 12 13 14
## Predicted 11.97 12.693 11.985 12.419 11.6380 11.676 12.150 11.9814
## cvpred 11.93 12.672 11.966 12.445 11.6111 11.665 12.156 11.9885
## SalePrice 12.26 12.948 12.465 12.821 11.6263 10.872 11.813 11.9512
## CV residual 0.33 0.276 0.498 0.376 0.0151 -0.792 -0.343 -0.0373
## 19 21 39 42 44 47 51 52
## Predicted 11.613 12.4088 11.499 10.991 10.927 12.123 12.400 13.066
## cvpred 11.593 12.4304 11.731 10.902 10.830 12.099 12.386 13.096
## SalePrice 11.849 12.4568 10.840 11.708 10.977 12.560 12.065 13.459
## CV residual 0.256 0.0264 -0.891 0.806 0.147 0.461 -0.321 0.363
## 53 54 55 58 59 62 64 65 67
## Predicted 11.685 12.104 11.70 12.581 12.6264 13.2751 13.74 12.247 11.955
## cvpred 11.659 12.082 11.69 12.580 12.6345 13.3392 13.44 12.267 11.939
## SalePrice 11.951 12.405 11.81 12.301 12.6115 13.2963 14.45 13.251 12.848
## CV residual 0.292 0.323 0.12 -0.278 -0.0229 -0.0429 1.01 0.984 0.909
## 70 71 72 73 75 76 83 84 87
## Predicted 11.89 11.801 11.99 12.041 11.824 12.606 13.443 11.987 11.870
## cvpred 11.88 11.816 11.99 12.050 11.820 12.564 13.237 11.964 11.864
## SalePrice 12.49 11.857 12.21 12.403 12.073 12.707 13.346 11.835 11.408
## CV residual 0.61 0.041 0.22 0.353 0.253 0.143 0.109 -0.129 -0.457
## 92 93 96 101 105 107 108 113
## Predicted 12.228 12.602 13.0519 13.7287 13.147 11.8991 12.054 12.6333
## cvpred 11.981 12.375 13.0357 13.8450 13.078 11.8598 12.061 12.5978
## SalePrice 12.445 12.506 13.1224 13.7820 12.975 11.8776 11.350 12.5099
## CV residual 0.464 0.131 0.0867 -0.0631 -0.103 0.0177 -0.711 -0.0879
## 121 124 127 131 132 134 137 143
## Predicted 11.64 12.098 12.025 12.3385 12.680 12.91 11.999 11.3052
## cvpred 11.62 12.063 11.987 12.3613 12.464 13.04 11.985 11.4548
## SalePrice 11.29 12.692 12.324 12.4049 13.275 11.92 11.884 11.4773
## CV residual -0.33 0.629 0.337 0.0436 0.811 -1.13 -0.101 0.0225
## 144 147 149 151 152 156 158 163 166
## Predicted 11.1728 11.372 12.32 11.037 11.37 12.77 13.584 12.399 11.523
## cvpred 11.3185 11.578 12.56 11.223 11.54 12.81 13.657 12.374 11.233
## SalePrice 11.2960 12.035 11.23 11.082 10.39 13.06 13.420 11.884 11.728
## CV residual -0.0224 0.457 -1.33 -0.141 -1.14 0.25 -0.237 -0.489 0.495
## 167 169 170 176 178 185 186 187 188
## Predicted 12.006 12.279 11.774 12.217 11.775 11.823 12.303 12.362 14.540
## cvpred 11.982 12.243 11.768 12.177 11.729 11.833 12.274 12.383 14.726
## SalePrice 11.720 12.181 11.513 11.842 11.408 12.530 12.142 12.835 13.825
## CV residual -0.262 -0.062 -0.255 -0.335 -0.321 0.697 -0.133 0.451 -0.901
## 200 202 205 210 215 217 219 220 223
## Predicted 13.32 11.692 11.953 13.15 11.658 12.114 12.927 13.5986 12.970
## cvpred 13.36 11.681 11.932 12.94 11.645 12.083 12.934 13.6721 12.978
## SalePrice 13.30 11.142 12.177 13.38 11.513 12.196 12.612 13.6352 12.843
## CV residual -0.06 -0.539 0.246 0.44 -0.132 0.113 -0.323 -0.0369 -0.136
## 227 228 235 236 241 248 250 251
## Predicted 11.890 12.6595 11.927 12.782 11.758 11.858 12.4927 12.0800
## cvpred 11.886 12.6880 11.851 12.788 11.744 12.014 12.4730 12.0432
## SalePrice 11.608 12.6603 11.608 13.452 11.518 12.168 12.4875 12.1145
## CV residual -0.278 -0.0277 -0.243 0.664 -0.226 0.154 0.0145 0.0713
## 252 255 256 257 264 266 267 271
## Predicted 11.882 11.909 12.5677 11.921 12.1937 12.669 11.693 12.273
## cvpred 11.887 11.819 12.5235 11.930 12.1765 12.645 11.644 12.288
## SalePrice 12.301 12.572 12.4969 11.562 12.2061 12.808 12.201 12.612
## CV residual 0.414 0.754 -0.0266 -0.368 0.0296 0.162 0.557 0.323
## 275 276 281 287 295 298
## Predicted 11.920 11.9759 11.9176 12.644 11.972 12.430
## cvpred 11.881 11.9512 11.8828 12.709 11.948 12.456
## SalePrice 11.775 11.9083 11.9184 12.953 11.608 12.154
## CV residual -0.106 -0.0428 0.0356 0.243 -0.339 -0.302
##
## Sum of squares = 19.3 Mean square = 0.2 n = 99
##
## Overall (Sum over all 99 folds)
## ms
## 0.228
n<-dim(housing_data_slim)[1]
MSPE4 <- sum( ((SalePrice)-KCV4$cvpred)^2 )/n
PRESS4 <- sum(((SalePrice)-KCV4$cvpred)^2)
Pred_R_squared4 <- 1-sum(((SalePrice)-KCV4$cvpred)^2)/sum(((SalePrice)-mean((SalePrice)))^2)
MSPE4
## [1] 0.228
PRESS4
## [1] 68
Pred_R_squared4
## [1] 0.611
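The three summary numbers are tied together by simple arithmetic: the fold sums of squares 22.6 + 26.1 + 19.3 = 68 give PRESS4, and 68 / 298 ≈ 0.228 gives MSPE4. The sketch below shows the same calculation end to end with a manual 3-fold cross-validation in base R. It is an illustration on the built-in mtcars data rather than the housing data, it does not use cv.lm, and all object names are made up for the example. The document's PRESS label is kept here, although strictly PRESS refers to leave-one-out prediction errors.

```r
# Manual 3-fold cross-validation in base R (illustration on mtcars).
set.seed(123)
dat <- mtcars
n <- nrow(dat)
k <- 3
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

cvpred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fit <- lm(mpg ~ wt + hp, data = dat[!test, ])        # fit on the other folds
  cvpred[test] <- predict(fit, newdata = dat[test, ])  # predict the held-out fold
}

# The response must come from the same rows and be on the same scale as cvpred;
# a length mismatch here is what triggers R's recycling warning.
y <- dat$mpg
stopifnot(length(y) == length(cvpred))

PRESS <- sum((y - cvpred)^2)
MSPE <- PRESS / n
pred_r_squared <- 1 - PRESS / sum((y - mean(y))^2)

c(PRESS = PRESS, MSPE = MSPE, Pred_R_squared = pred_r_squared)
```

Taking the response from the same data frame that was cross-validated keeps the two vectors aligned in both length and scale.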
sapply(housing_data_slim,function(x) length(x))
## ï..Index Address DateSold Zip
## 298 298 298 298
## Neighborhood SalePrice Year Bedrooms
## 298 298 298 298
## Bathrooms Stories SqFt LotSqFt
## 298 298 298 298
## Zip_Ind Neighborhood_Indi Age
## 298 298 298
boxcox(model4)
Transforming the response variable helps keep unequal variance and non-normality in check.
However, after the Box-Cox transformation there is no significant change in the model. The log transformation of SalePrice suggested by our descriptive analysis was therefore appropriate, although the non-normality seen in the QQ plot is still not resolved.
vif(model4)
## Age SqFt Bathrooms
## 1.21 1.90 2.23
## Zip_Indfour Zip_Indone Zip_Indothers
## 5.44 5.37 19.54
## Zip_Indsix Zip_Indthree Neighborhood_IndiNE2
## 5.53 13.67 7.64
## Neighborhood_IndiNE3 Neighborhood_IndiNE5 Neighborhood_IndiNE6
## 1.72 7.94 7.05
## Neighborhood_Indiothers
## 11.37
The Variance Inflation Factors above are greater than 10 for Zip_Indothers (19.54), Zip_Indthree (13.67), and Neighborhood_Indiothers (11.37). Zip_Indtwo and Neighborhood_IndiNE4 were also flagged, but they do not appear among the levels of the model shown here. Instead of fitting a ridge regression, these categories can be removed from the data set. I have kept the removed records in a separate file for reference.
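To make the threshold of 10 concrete, a VIF can be converted back into the R-squared obtained when column j of the design matrix is regressed on the remaining columns. The worked line uses the largest value from the output above:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\quad\Longrightarrow\quad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}

\text{Zip\_Indothers: } R_j^2 = 1 - \frac{1}{19.54} \approx 0.949
```

A VIF of 10 corresponds to R_j^2 = 0.9, so the flagged columns are at least 90% explained by the other covariates.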
out = lm.ridge(SalePrice ~ Age + SqFt + Bathrooms + Zip_Ind + Neighborhood_Indi,lambda=.1,data = housing_data_slim)
out
## Age SqFt
## 10.792873 -0.003756 0.000294
## Bathrooms Zip_Indfour Zip_Indone
## 0.127199 0.867885 -0.086595
## Zip_Indothers Zip_Indsix Zip_Indthree
## 0.202131 0.662472 0.770132
## Neighborhood_IndiNE2 Neighborhood_IndiNE3 Neighborhood_IndiNE5
## 0.210186 0.894220 0.687092
## Neighborhood_IndiNE6 Neighborhood_Indiothers
## 0.786450 0.534870
vif(out)
Because glmnet could not be installed (the installation gave an error), we used MASS::lm.ridge instead.
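The fit above uses a single fixed penalty, lambda = 0.1. One possible way to choose the penalty with MASS::lm.ridge is to fit a grid of values and take the one with the smallest generalized cross-validation (GCV) score. The sketch below is an illustration on the built-in mtcars data, not the housing data, and the grid from 0 to 10 is an arbitrary assumption.

```r
library(MASS)  # provides lm.ridge

# Illustration on the built-in mtcars data, not the housing data.
lambdas <- seq(0, 10, by = 0.1)  # arbitrary grid of penalties
ridge_fit <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars, lambda = lambdas)

# lm.ridge stores one GCV score per lambda; pick the minimiser.
best_index <- which.min(ridge_fit$GCV)
best_lambda <- ridge_fit$lambda[best_index]
best_lambda

# coef() returns one row of original-scale coefficients per lambda.
coef(ridge_fit)[best_index, ]
```

With a grid of penalties, the selected row of coefficients can be compared against the ordinary least squares estimates to see how much shrinkage was applied.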
observations_for_pred <- data.frame(
  Bathrooms = c(1, 2, 4, 4),
  Age = c(25, 32, 29, 36),
  SqFt = c(2900, 2500, 1750, 1350),
  Zip_Ind = c("one", "one", "four", "three"),
  Neighborhood_Indi = c("NE1", "NE5", "NE2", "NE1")
)
predict(model4, observations_for_pred, interval = "prediction", level = 0.95, type = "response")
## fit lwr upr
## 1 11.6 10.6 12.6
## 2 12.3 11.2 13.3
## 3 12.8 11.7 13.9
## 4 12.3 11.2 13.4
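Because the model was fitted to log(SalePrice), the fitted values and interval limits above are on the log scale. A minimal sketch of the back-transformation to dollars follows. The numbers are copied from the rounded predict() output, so the dollar amounts are approximate, and `pred_log` and `pred_dollars` are illustrative names.

```r
# Log-scale prediction intervals copied from the predict() output above.
pred_log <- data.frame(
  fit = c(11.6, 12.3, 12.8, 12.3),
  lwr = c(10.6, 11.2, 11.7, 11.2),
  upr = c(12.6, 13.3, 13.9, 13.4)
)

# SalePrice was modelled as log(SalePrice), so exp() maps back to dollars.
pred_dollars <- round(exp(pred_log))
pred_dollars
```

For the first house, exp(11.6) is roughly $109,000. The back-transformed interval is not symmetric around the fitted value, and exp(fit) is better read as an estimate of the median price than of the mean.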
max(hatvalues(model4))
## [1] 0.338
min(hatvalues(model4))
## [1] 0.00496
x_new = c(1,25,2528,2,0,1,0,0,0,0,0,0,0,1)
x_new_1 = c(1,50,252,2,0,0,0,1,0,0,0,1,0,0)
x_new_2 = c(1,52,25285,4,0,0,0,0,1,0,1,0,0,0)
x_new_3 = c(1,92,3590,5,0,0,0,1,0,0,0,1,0,0)
X=model.matrix(model4)
t(x_new)%*%solve(t(X)%*%X)%*%x_new
## [,1]
## [1,] 0.235
t(x_new_1)%*%solve(t(X)%*%X)%*%x_new_1
## [,1]
## [1,] 0.176
# x_new_2 below exceeds max(hatvalues(model4)) = 0.338, so it is an extrapolation
t(x_new_2)%*%solve(t(X)%*%X)%*%x_new_2
## [,1]
## [1,] 3.56
t(x_new_3)%*%solve(t(X)%*%X)%*%x_new_3
## [,1]
## [1,] 0.182
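The hidden-extrapolation check above compares the leverage of a new point, x0' (X'X)^(-1) x0, with the largest leverage among the training rows. The sketch below wraps that comparison in a small helper. It is an illustration on the built-in mtcars data, not the housing model, and `is_extrapolation` is a hypothetical helper name rather than a function from any package.

```r
# Illustration on the built-in mtcars data, not the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: leverage of a new design row x0 (intercept first)
# compared with the largest leverage among the training rows.
is_extrapolation <- function(model, x0) {
  X <- model.matrix(model)
  h_new <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)
  h_new > max(hatvalues(model))
}

inside <- c(1, mean(mtcars$wt), mean(mtcars$hp))  # centre of the data
outside <- c(1, 50, 5000)                         # far beyond observed wt and hp

is_extrapolation(fit, inside)   # FALSE
is_extrapolation(fit, outside)  # TRUE
```

The entries of x0 must follow the column order of model.matrix(model), including the dummy columns for factor levels, as in the x_new vectors above.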
summary(model4)
##
## Call:
## lm(formula = SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind +
## Neighborhood_Indi, data = housing_data_slim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5719 -0.2110 0.0327 0.2712 1.0041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08e+01 3.65e-01 29.52 < 2e-16 ***
## Age -3.76e-03 8.04e-04 -4.67 4.6e-06 ***
## SqFt 2.94e-04 3.53e-05 8.34 3.2e-15 ***
## Bathrooms 1.27e-01 3.59e-02 3.54 0.00047 ***
## Zip_Indfour 8.74e-01 2.89e-01 3.02 0.00275 **
## Zip_Indone -7.79e-02 3.38e-01 -0.23 0.81767
## Zip_Indothers 2.08e-01 2.64e-01 0.79 0.43137
## Zip_Indsix 6.68e-01 3.02e-01 2.21 0.02765 *
## Zip_Indthree 7.76e-01 3.44e-01 2.26 0.02473 *
## Neighborhood_IndiNE2 2.20e-01 3.69e-01 0.60 0.55082
## Neighborhood_IndiNE3 8.99e-01 3.45e-01 2.60 0.00968 **
## Neighborhood_IndiNE5 6.91e-01 3.10e-01 2.23 0.02669 *
## Neighborhood_IndiNE6 7.91e-01 3.19e-01 2.48 0.01365 *
## Neighborhood_Indiothers 5.39e-01 2.25e-01 2.40 0.01718 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.453 on 284 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.652
## F-statistic: 43.8 on 13 and 284 DF, p-value: <2e-16