Title: “Cincinnati Housing Model”

Author: “Ajinkya Prashant Dalvi”

Date: “12/24/2019”

Part1: Defining the Data

Executive Summary

To predict house prices from the available covariates by building the best multiple linear regression model.

To understand the interactions between the regressor variables.

We have a total of 313 observations in the Cincinnati Housing data set. The data was collected by the class: each classmate added a minimum of 15 entries to a shared Google Sheet as part of a data-gathering task, using the “Zillow” website as the source.

We have one response variable and five covariates in the final model.

List of independent variables:

X1:Age

X2:SqFt

X3:Bathrooms

X4:Zip

X5:Neighborhood

Dependent Variable:

y1:SalePrice

Our goal is to understand and study the regression model derived from this dataset. We will apply the statistical methods learned in class to obtain the best regression model. The objective is to build the model with the minimum number of variables without compromising prediction accuracy. We will also study how the regressor variables depend on each other through covariance analysis.

log(SalePrice) = 10.8 - 3.76 * 10^-3 * Age + 2.94 * 10^-4 * SqFt + 0.127 * Bathrooms + 0.874 * Zip_Indfour - 0.0079 * Zip_Indone + 0.208 * Zip_Indothers + 0.668 * Zip_Indsix + 0.776 * Zip_Indthree + 0.201 * Neighborhood_IndiNE2 + 0.899 * Neighborhood_IndiNE3 + 0.691 * Neighborhood_IndiNE5 + 0.53 * Neighborhood_Indiothers + 0.79 * Neighborhood_IndiNE6
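As a rough illustration of how an equation of this form would be used, the sketch below plugs one house into a subset of the coefficients above. It assumes the fitted response is on the log scale (the later models use log(SalePrice)), that the house falls in the Zip_Indothers group, and that its neighborhood is the baseline level. The house values and the variable names `coefs`, `house`, `log_price` and `price` are illustrative only.

```r
# Coefficients copied from the equation above (only the terms needed here).
coefs <- c(intercept = 10.8, Age = -3.76e-3, SqFt = 2.94e-4,
           Bathrooms = 0.127, Zip_Indothers = 0.208)

# Hypothetical house: 60 years old, 1384 sq ft, 2 bathrooms,
# Zip group "others", neighborhood at its baseline level (assumption).
house <- c(Age = 60, SqFt = 1384, Bathrooms = 2)

# Linear predictor on the log scale.
log_price <- coefs[["intercept"]] +
  coefs[["Age"]] * house[["Age"]] +
  coefs[["SqFt"]] * house[["SqFt"]] +
  coefs[["Bathrooms"]] * house[["Bathrooms"]] +
  coefs[["Zip_Indothers"]]

# Back-transform to dollars (assumes the response was log(SalePrice)).
price <- exp(log_price)
print(c(log_price = log_price, price = price))
```

Exponentiating the linear predictor gives a point estimate on the dollar scale. It does not correct for the retransformation bias of a log-linear model.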

Data Preparation And Cleansing

Prepare data set:

• The data set was downloaded from the class google sheet into Excel CSV.

• Data columns were formatted as applicable in Excel CSV.

• Duplicate records were identified and removed using Excel's Remove Duplicates feature on the address column.

Eliminate bad data based on the following criteria:

• Street addresses that included apartment numbers

• Street addresses outside the I-275 loop

• Listings with a high number of stories that Zillow identified as multi-family dwellings

• Records with more than 2 obvious errors (e.g. a wrong year or an unrealistic square footage in any measurement column), since the data collector could not be trusted

• Sale dates more than 3 months old, to reduce extrapolation effects

• Missing values due to poor data collection methods
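The cleaning above was done in Excel. A minimal sketch of equivalent steps in base R is shown below, assuming a toy data frame named `listings` with made-up rows. The apartment pattern is an illustrative guess, not the rule actually used, and only three of the criteria are covered.

```r
# Toy data standing in for the class sheet (all values are made up).
listings <- data.frame(
  Address   = c("1 Main St", "1 Main St", "2 Oak Ave Apt 3", "4 Elm Rd", "5 Pine Ct"),
  Year      = c(1959, 1959, 1987, 2004, NA),
  SqFt      = c(1384, 1384, 900, 2634, 1500),
  SalePrice = c(212000, 212000, 95000, 420000, 180000),
  stringsAsFactors = FALSE
)

# 1. Drop duplicate addresses, keeping the first occurrence.
cleaned <- listings[!duplicated(listings$Address), ]

# 2. Drop addresses that mention an apartment or unit (hypothetical pattern).
is_apartment <- grepl("\\b(apt|unit)\\b|#", cleaned$Address,
                      ignore.case = TRUE, perl = TRUE)
cleaned <- cleaned[!is_apartment, ]

# 3. Drop rows with a missing value in any column.
cleaned <- cleaned[complete.cases(cleaned), ]

print(cleaned)
```

Doing these steps in code rather than by hand keeps the exclusion rules reproducible if the sheet is downloaded again.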

Add neighborhood variable to data set:

• The file of 2019 to date sales was downloaded from the Hamilton County Ohio Auditor https://www.hamiltoncountyauditor.org/transfer_download_menu.asp

• Data columns were split or concatenated to match formatting between files. VLOOKUP was used to add the neighborhood based on the street address.

• Missing neighborhoods were manually collected from Zillow and Google. Most of the missing values were in Clermont County and therefore not available from the Hamilton County Auditor.
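The VLOOKUP step corresponds to a left join on the street address. The sketch below shows the idea with base R's `merge`, assuming two toy data frames (`sales` and `auditor`) whose rows are invented for illustration. Addresses are upper-cased first because the sheet mixes cases.

```r
# Toy stand-ins for the class sheet and the auditor file (values are made up).
sales <- data.frame(
  Address   = c("1337 Voll Rd", "6265 Salem Rd", "12 Unknown Ln"),
  SalePrice = c(212000, 150000, 99000),
  stringsAsFactors = FALSE
)
auditor <- data.frame(
  Address      = c("1337 VOLL RD", "6265 SALEM RD"),
  Neighborhood = c("ANDERSON TOWNSHIP", "ANDERSON TOWNSHIP"),
  stringsAsFactors = FALSE
)

# Normalise the join key so that case differences do not block a match.
sales$Address <- toupper(sales$Address)

# Left join: keep every sale, attach a neighborhood where one is found.
joined <- merge(sales, auditor, by = "Address", all.x = TRUE)

# Addresses that still need a manual lookup.
missing <- joined[is.na(joined$Neighborhood), "Address"]
print(missing)
```

The unmatched addresses are the ones that would have to be filled in by hand, as was done for the Clermont County records.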

Libraries Used For this model

library(MASS)

library(car)

library(psych)

library(dplyr)

library(DAAG)

library(leaps)

Loading the dataset


library(MASS)

library(psych)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(DAAG)
## Loading required package: lattice
## 
## Attaching package: 'DAAG'
## The following object is masked from 'package:psych':
## 
##     cities
## The following object is masked from 'package:MASS':
## 
##     hills
library(leaps)


neighbourhood_data <- read.csv('Project_1.csv', header = TRUE)

head(neighbourhood_data)
##   ï..Index            Address   DateSold Year   Zip SalePrice Bedrooms
## 1        1       1337 Voll Rd  9/27/2019 1959 45230    212000        3
## 2        2 5786 Brookstone Dr   8/9/2019 2004 45230    972500        7
## 3        3   6160 Woodlark Dr  8/28/2019 1987 45230    420000        3
## 4        4      6265 Salem Rd 10/11/2019 1937 45230    150000        3
## 5        5      7099 Petri Dr  9/13/2019 1959 45230    125001        3
## 6        6     7621 FOREST RD 10/24/2019 1941 45255    259000        3
##   Bathrooms Stories SqFt LotSqFt      Neighborhood
## 1         2       2 1384    6011 ANDERSON TOWNSHIP
## 2         5       2 4628   34412 ANDERSON TOWNSHIP
## 3         4       2 2634   17424 ANDERSON TOWNSHIP
## 4         1       2 1580   23958 ANDERSON TOWNSHIP
## 5         2       2 1404    8276 ANDERSON TOWNSHIP
## 6         2       1 1678   48918 ANDERSON TOWNSHIP
attach(neighbourhood_data)

str(neighbourhood_data)
## 'data.frame':    313 obs. of  12 variables:
##  $ ï..Index    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Address     : Factor w/ 313 levels "1006 Rutledge Ave",..: 33 232 244 250 272 281 294 39 289 45 ...
##  $ DateSold    : Factor w/ 78 levels "10/1/2019","10/10/2019",..: 71 56 50 3 59 13 18 76 27 3 ...
##  $ Year        : int  1959 2004 1987 1937 1959 1941 1992 1946 1910 1936 ...
##  $ Zip         : int  45230 45230 45230 45230 45230 45255 45255 45217 45229 45237 ...
##  $ SalePrice   : int  212000 972500 420000 150000 125001 259000 370000 112000 70752 195000 ...
##  $ Bedrooms    : int  3 7 3 3 3 3 4 4 5 5 ...
##  $ Bathrooms   : int  2 5 4 1 2 2 2 2 3 3 ...
##  $ Stories     : int  2 2 2 2 2 1 2 1 2 3 ...
##  $ SqFt        : int  1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
##  $ LotSqFt     : int  6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
##  $ Neighborhood: Factor w/ 63 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...

Based on domain knowledge, Zip should be a nominal factor rather than a number.

neighbourhood_data$Zip <- as.factor(neighbourhood_data$Zip)
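To see what the conversion changes, the sketch below builds the design matrix for a small invented `toy` data frame. Once Zip is a factor, each level other than the first becomes its own indicator column, which is why the model output lists terms such as Zip45202 instead of a single Zip slope.

```r
# Toy Zip column (values chosen for illustration only).
toy <- data.frame(Zip = c(45230, 45230, 45255, 45217, 45217, 45255))

# Treat Zip as a nominal factor, as in the report.
toy$Zip <- as.factor(toy$Zip)

# One indicator column per level, with the first level as the baseline.
X <- model.matrix(~ Zip, data = toy)
print(colnames(X))
```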

The year built carries little meaning as a predictor on its own, so we transform it by subtracting the build year from the current year (2019) to obtain the property age.

Append Age to the data frame

Age <- (2019-neighbourhood_data$Year)

neighbourhood_data <- cbind(neighbourhood_data,Age)

head(neighbourhood_data)
##   ï..Index            Address   DateSold Year   Zip SalePrice Bedrooms
## 1        1       1337 Voll Rd  9/27/2019 1959 45230    212000        3
## 2        2 5786 Brookstone Dr   8/9/2019 2004 45230    972500        7
## 3        3   6160 Woodlark Dr  8/28/2019 1987 45230    420000        3
## 4        4      6265 Salem Rd 10/11/2019 1937 45230    150000        3
## 5        5      7099 Petri Dr  9/13/2019 1959 45230    125001        3
## 6        6     7621 FOREST RD 10/24/2019 1941 45255    259000        3
##   Bathrooms Stories SqFt LotSqFt      Neighborhood Age
## 1         2       2 1384    6011 ANDERSON TOWNSHIP  60
## 2         5       2 4628   34412 ANDERSON TOWNSHIP  15
## 3         4       2 2634   17424 ANDERSON TOWNSHIP  32
## 4         1       2 1580   23958 ANDERSON TOWNSHIP  82
## 5         2       2 1404    8276 ANDERSON TOWNSHIP  60
## 6         2       1 1678   48918 ANDERSON TOWNSHIP  78

Assign the variables

SalePrice <- neighbourhood_data$SalePrice

Bedrooms <- neighbourhood_data$Bedrooms

Bathrooms <- neighbourhood_data$Bathrooms

Stories <- neighbourhood_data$Stories

SqFt <- neighbourhood_data$SqFt

LotSqFt <- neighbourhood_data$LotSqFt

Neighborhood <- neighbourhood_data$Neighborhood

Using our domain knowledge, we exclude the Index, Address and DateSold columns, along with Year, which is now replaced by Age.

housing_data <- neighbourhood_data[,5:13]

Final Structure of the dataset

str(housing_data)
## 'data.frame':    313 obs. of  9 variables:
##  $ Zip         : Factor w/ 42 levels "45002","45202",..: 25 25 25 25 25 42 42 16 24 30 ...
##  $ SalePrice   : int  212000 972500 420000 150000 125001 259000 370000 112000 70752 195000 ...
##  $ Bedrooms    : int  3 7 3 3 3 3 4 4 5 5 ...
##  $ Bathrooms   : int  2 5 4 1 2 2 2 2 3 3 ...
##  $ Stories     : int  2 2 2 2 2 1 2 1 2 3 ...
##  $ SqFt        : int  1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
##  $ LotSqFt     : int  6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
##  $ Neighborhood: Factor w/ 63 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ Age         : num  60 15 32 82 60 78 27 73 109 83 ...

Check for null or empty values

colSums(is.na(housing_data))
##          Zip    SalePrice     Bedrooms    Bathrooms      Stories 
##            0            0            0            0            0 
##         SqFt      LotSqFt Neighborhood          Age 
##            0            0            0            0

Descriptive Statistics of housing_data

pairs.panels(housing_data)

The plot suggests that Bedrooms, Bathrooms and Stories are ordinal variables. SqFt and Bathrooms show a strong correlation, and SqFt and LotSqFt are also related. The other variables are not highly correlated.

The response variable, SalePrice, is not normally distributed.

The skew of this distribution can be reduced by taking the log.
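The effect of the log transform on a right-skewed variable can be illustrated with simulated data. The sketch below does not use the housing data: it draws hypothetical log-normal prices and compares a simple moment-based skewness before and after taking logs. The helper `skewness` is defined here and does not come from a package.

```r
set.seed(1)

# Simulated right-skewed "prices" (log-normal; parameters are arbitrary).
price <- exp(rnorm(500, mean = 12, sd = 0.6))

# Simple moment-based sample skewness (helper defined for this example).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)       # clearly positive for log-normal data
skew_log <- skewness(log(price))  # close to zero after the transform

print(c(raw = skew_raw, log = skew_log))
```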

Part2: Building Model

Model with response log(SalePrice) and all remaining variables as covariates.

model1 <- lm(log(SalePrice) ~ ., data=housing_data)

summary(model1)
## 
## Call:
## lm(formula = log(SalePrice) ~ ., data = housing_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1789 -0.1061  0.0000  0.1407  1.0797 
## 
## Coefficients: (15 not defined because of singularities)
##                                                   Estimate Std. Error
## (Intercept)                                      1.138e+01  3.962e-01
## Zip45202                                        -2.190e-01  7.256e-01
## Zip45203                                         5.769e-01  4.726e-01
## Zip45204                                         1.140e-01  7.090e-01
## Zip45205                                        -2.895e-01  6.419e-01
## Zip45206                                        -6.020e-03  4.299e-01
## Zip45207                                        -2.081e-01  5.523e-01
## Zip45208                                        -1.241e+00  8.227e-01
## Zip45209                                         6.810e-01  3.989e-01
## Zip45211                                        -1.040e-02  5.838e-01
## Zip45212                                        -7.703e-01  5.647e-01
## Zip45213                                         3.641e-01  4.722e-01
## Zip45214                                         2.887e-01  4.659e-01
## Zip45215                                         1.338e+00  6.261e-01
## Zip45216                                         1.888e-01  4.689e-01
## Zip45217                                        -5.271e-01  3.993e-01
## Zip45219                                        -9.632e-01  7.703e-01
## Zip45220                                        -1.163e+00  8.688e-01
## Zip45223                                        -8.985e-01  4.742e-01
## Zip45224                                         8.373e-02  4.682e-01
## Zip45225                                        -6.584e-01  5.457e-01
## Zip45226                                        -1.563e+00  8.461e-01
## Zip45227                                        -1.734e+00  9.116e-01
## Zip45229                                        -5.306e-01  4.816e-01
## Zip45230                                         1.829e-01  3.988e-01
## Zip45231                                         1.704e-01  5.381e-01
## Zip45232                                        -5.240e-01  4.727e-01
## Zip45233                                         4.878e-01  6.543e-01
## Zip45236                                         9.459e-01  6.027e-01
## Zip45237                                        -8.268e-01  4.448e-01
## Zip45238                                        -2.279e-01  6.076e-01
## Zip45239                                         2.081e-01  7.605e-01
## Zip45240                                        -2.314e-01  4.652e-01
## Zip45242                                         1.168e-01  4.324e-01
## Zip45243                                         1.358e+00  5.662e-01
## Zip45244                                         3.416e-01  4.643e-01
## Zip45245                                         4.190e-01  4.626e-01
## Zip45246                                        -1.058e-01  5.359e-01
## Zip45248                                         2.288e-01  6.919e-01
## Zip45249                                         5.166e-01  4.558e-01
## Zip45251                                         5.166e-02  7.593e-01
## Zip45255                                         3.887e-01  4.233e-01
## Bedrooms                                         8.942e-03  3.454e-02
## Bathrooms                                        1.279e-01  3.465e-02
## Stories                                          7.164e-02  5.056e-02
## SqFt                                             2.528e-04  4.865e-05
## LotSqFt                                          1.728e-07  2.812e-06
## NeighborhoodAVONDALE                            -7.661e-02  3.249e-01
## NeighborhoodBOND HILL                            2.429e-01  3.155e-01
## NeighborhoodCAMP WASHINGTON                             NA         NA
## NeighborhoodCHEVIOT                             -5.494e-01  5.199e-01
## NeighborhoodCLEVES                                      NA         NA
## NeighborhoodCLIFTON                              1.546e+00  7.961e-01
## NeighborhoodCLIFTON HTS-UNIVERSITY HTS-FAIRVIEW  1.046e+00  6.936e-01
## NeighborhoodCOLERAIN TOWNSHIP                   -3.160e-01  5.396e-01
## NeighborhoodCOLLEGE HILL                                NA         NA
## NeighborhoodCOLUMBIA TOWNSHIP                   -2.459e+00  6.358e-01
## NeighborhoodCOLUMBIA TUSCULUM                    2.186e+00  7.782e-01
## NeighborhoodCORRYVILLE                           9.129e-01  7.130e-01
## NeighborhoodDEER PARK                           -6.583e-01  5.205e-01
## NeighborhoodDELHI TOWNSHIP                      -1.748e-01  5.080e-01
## NeighborhoodEAST END                             1.901e+00  8.473e-01
## NeighborhoodEAST PRICE HILL                     -5.641e-01  5.388e-01
## NeighborhoodEAST WALNUT HILLS                    7.647e-01  3.347e-01
## NeighborhoodEVANSTON                             4.641e-01  3.500e-01
## NeighborhoodFOREST PARK                                 NA         NA
## NeighborhoodForestville                         -3.002e-01  2.587e-01
## NeighborhoodGLENDALE                             6.212e-01  5.400e-01
## NeighborhoodGREEN TOWNSHIP                       4.092e-02  5.038e-01
## NeighborhoodHARTWELL                                    NA         NA
## NeighborhoodHYDE PARK                            2.032e+00  7.259e-01
## NeighborhoodINDIAN HILL                         -4.481e-01  4.305e-01
## NeighborhoodKENNEDY HEIGHTS                      3.497e-01  3.818e-01
## NeighborhoodLINCOLN  HEIGHTS                    -1.292e-01  6.844e-01
## NeighborhoodLINWOOD                              2.029e+00  7.918e-01
## NeighborhoodMack South                          -7.124e-01  6.552e-01
## NeighborhoodMADEIRA                             -7.037e-01  5.643e-01
## NeighborhoodMADISONVILLE                         2.064e+00  8.429e-01
## NeighborhoodMARIEMONT                            2.483e+00  9.282e-01
## NeighborhoodMIAMI TOWNSHIP                       1.018e-01  6.931e-01
## NeighborhoodMONTGOMERY                           1.077e-01  3.339e-01
## NeighborhoodMOUNT ADAMS                          9.374e-01  6.622e-01
## NeighborhoodMOUNT AIRY                          -4.579e-01  6.857e-01
## NeighborhoodMOUNT AUBURN                         5.556e-01  6.629e-01
## NeighborhoodMOUNT LOOKOUT                        2.048e+00  7.462e-01
## NeighborhoodMOUNT WASHINGTON                    -2.345e-01  1.535e-01
## NeighborhoodNORTH AVONDALE                       5.564e-01  2.663e-01
## NeighborhoodNORTH COLLEGE HILL                  -5.647e-01  7.625e-01
## NeighborhoodNORTHSIDE                            8.968e-01  2.924e-01
## NeighborhoodNORWOOD                              1.140e+00  4.676e-01
## NeighborhoodOAKLEY                                      NA         NA
## NeighborhoodOVER-THE-RHINE                       1.217e+00  6.671e-01
## NeighborhoodPLEASANT RIDGE                              NA         NA
## NeighborhoodREADING                             -1.025e+00  6.252e-01
## NeighborhoodROSELAWN                                    NA         NA
## NeighborhoodSaylor Park                         -5.561e-01  6.607e-01
## NeighborhoodSOUTH CUMMINSVILLE                          NA         NA
## NeighborhoodSPRINGDALE                                  NA         NA
## NeighborhoodSPRINGFIELD TOWNSHIP                        NA         NA
## NeighborhoodST. BERNARD                                 NA         NA
## NeighborhoodSYCAMORE TOWNSHIP                   -5.686e-01  3.885e-01
## NeighborhoodSYMMES TOWNSHIP                             NA         NA
## NeighborhoodUnion Township                      -3.484e-01  2.408e-01
## NeighborhoodWALNUT HILLS                                NA         NA
## NeighborhoodWEST END                                    NA         NA
## NeighborhoodWEST PRICE HILL                     -5.717e-03  4.852e-01
## NeighborhoodWESTWOOD                            -1.783e-01  4.304e-01
## NeighborhoodWITHAMSVILLE                        -4.636e-01  3.774e-01
## NeighborhoodWYOMING                             -6.082e-01  4.581e-01
## Age                                             -3.290e-03  9.082e-04
##                                                 t value Pr(>|t|)    
## (Intercept)                                      28.711  < 2e-16 ***
## Zip45202                                         -0.302 0.763050    
## Zip45203                                          1.221 0.223533    
## Zip45204                                          0.161 0.872397    
## Zip45205                                         -0.451 0.652494    
## Zip45206                                         -0.014 0.988841    
## Zip45207                                         -0.377 0.706668    
## Zip45208                                         -1.509 0.132803    
## Zip45209                                          1.707 0.089256 .  
## Zip45211                                         -0.018 0.985802    
## Zip45212                                         -1.364 0.173911    
## Zip45213                                          0.771 0.441419    
## Zip45214                                          0.620 0.536116    
## Zip45215                                          2.137 0.033728 *  
## Zip45216                                          0.403 0.687692    
## Zip45217                                         -1.320 0.188253    
## Zip45219                                         -1.250 0.212496    
## Zip45220                                         -1.339 0.182034    
## Zip45223                                         -1.895 0.059432 .  
## Zip45224                                          0.179 0.858235    
## Zip45225                                         -1.207 0.228916    
## Zip45226                                         -1.848 0.066013 .  
## Zip45227                                         -1.902 0.058522 .  
## Zip45229                                         -1.102 0.271779    
## Zip45230                                          0.459 0.646947    
## Zip45231                                          0.317 0.751765    
## Zip45232                                         -1.108 0.268877    
## Zip45233                                          0.745 0.456787    
## Zip45236                                          1.570 0.117962    
## Zip45237                                         -1.859 0.064396 .  
## Zip45238                                         -0.375 0.707980    
## Zip45239                                          0.274 0.784675    
## Zip45240                                         -0.497 0.619432    
## Zip45242                                          0.270 0.787379    
## Zip45243                                          2.399 0.017283 *  
## Zip45244                                          0.736 0.462681    
## Zip45245                                          0.906 0.366158    
## Zip45246                                         -0.198 0.843618    
## Zip45248                                          0.331 0.741261    
## Zip45249                                          1.133 0.258350    
## Zip45251                                          0.068 0.945821    
## Zip45255                                          0.918 0.359445    
## Bedrooms                                          0.259 0.795970    
## Bathrooms                                         3.691 0.000283 ***
## Stories                                           1.417 0.157912    
## SqFt                                              5.196 4.67e-07 ***
## LotSqFt                                           0.061 0.951054    
## NeighborhoodAVONDALE                             -0.236 0.813806    
## NeighborhoodBOND HILL                             0.770 0.442242    
## NeighborhoodCAMP WASHINGTON                          NA       NA    
## NeighborhoodCHEVIOT                              -1.057 0.291758    
## NeighborhoodCLEVES                                   NA       NA    
## NeighborhoodCLIFTON                               1.943 0.053362 .  
## NeighborhoodCLIFTON HTS-UNIVERSITY HTS-FAIRVIEW   1.508 0.133105    
## NeighborhoodCOLERAIN TOWNSHIP                    -0.586 0.558796    
## NeighborhoodCOLLEGE HILL                             NA       NA    
## NeighborhoodCOLUMBIA TOWNSHIP                    -3.867 0.000145 ***
## NeighborhoodCOLUMBIA TUSCULUM                     2.809 0.005428 ** 
## NeighborhoodCORRYVILLE                            1.280 0.201803    
## NeighborhoodDEER PARK                            -1.265 0.207323    
## NeighborhoodDELHI TOWNSHIP                       -0.344 0.731136    
## NeighborhoodEAST END                              2.243 0.025888 *  
## NeighborhoodEAST PRICE HILL                      -1.047 0.296252    
## NeighborhoodEAST WALNUT HILLS                     2.285 0.023303 *  
## NeighborhoodEVANSTON                              1.326 0.186309    
## NeighborhoodFOREST PARK                              NA       NA    
## NeighborhoodForestville                          -1.161 0.247110    
## NeighborhoodGLENDALE                              1.150 0.251290    
## NeighborhoodGREEN TOWNSHIP                        0.081 0.935351    
## NeighborhoodHARTWELL                                 NA       NA    
## NeighborhoodHYDE PARK                             2.800 0.005575 ** 
## NeighborhoodINDIAN HILL                          -1.041 0.299069    
## NeighborhoodKENNEDY HEIGHTS                       0.916 0.360747    
## NeighborhoodLINCOLN  HEIGHTS                     -0.189 0.850396    
## NeighborhoodLINWOOD                               2.563 0.011058 *  
## NeighborhoodMack South                           -1.087 0.278113    
## NeighborhoodMADEIRA                              -1.247 0.213716    
## NeighborhoodMADISONVILLE                          2.448 0.015142 *  
## NeighborhoodMARIEMONT                             2.675 0.008042 ** 
## NeighborhoodMIAMI TOWNSHIP                        0.147 0.883407    
## NeighborhoodMONTGOMERY                            0.323 0.747348    
## NeighborhoodMOUNT ADAMS                           1.416 0.158308    
## NeighborhoodMOUNT AIRY                           -0.668 0.505011    
## NeighborhoodMOUNT AUBURN                          0.838 0.402863    
## NeighborhoodMOUNT LOOKOUT                         2.744 0.006567 ** 
## NeighborhoodMOUNT WASHINGTON                     -1.528 0.128056    
## NeighborhoodNORTH AVONDALE                        2.089 0.037833 *  
## NeighborhoodNORTH COLLEGE HILL                   -0.741 0.459737    
## NeighborhoodNORTHSIDE                             3.067 0.002435 ** 
## NeighborhoodNORWOOD                               2.438 0.015555 *  
## NeighborhoodOAKLEY                                   NA       NA    
## NeighborhoodOVER-THE-RHINE                        1.824 0.069475 .  
## NeighborhoodPLEASANT RIDGE                           NA       NA    
## NeighborhoodREADING                              -1.639 0.102630    
## NeighborhoodROSELAWN                                 NA       NA    
## NeighborhoodSaylor Park                          -0.842 0.400861    
## NeighborhoodSOUTH CUMMINSVILLE                       NA       NA    
## NeighborhoodSPRINGDALE                               NA       NA    
## NeighborhoodSPRINGFIELD TOWNSHIP                     NA       NA    
## NeighborhoodST. BERNARD                              NA       NA    
## NeighborhoodSYCAMORE TOWNSHIP                    -1.463 0.144800    
## NeighborhoodSYMMES TOWNSHIP                          NA       NA    
## NeighborhoodUnion Township                       -1.447 0.149383    
## NeighborhoodWALNUT HILLS                             NA       NA    
## NeighborhoodWEST END                                 NA       NA    
## NeighborhoodWEST PRICE HILL                      -0.012 0.990609    
## NeighborhoodWESTWOOD                             -0.414 0.679046    
## NeighborhoodWITHAMSVILLE                         -1.229 0.220579    
## NeighborhoodWYOMING                              -1.328 0.185633    
## Age                                              -3.622 0.000364 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3775 on 218 degrees of freedom
## Multiple R-squared:  0.8304, Adjusted R-squared:  0.7572 
## F-statistic: 11.35 on 94 and 218 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))

plot(model1)
## Warning: not plotting observations with leverage one:
##   14, 25, 39, 53, 65, 71, 77, 123, 142, 144, 150, 232, 233, 234, 235, 240, 243, 246, 254, 272, 290, 298, 306, 312

## Warning: not plotting observations with leverage one:
##   14, 25, 39, 53, 65, 71, 77, 123, 142, 144, 150, 232, 233, 234, 235, 240, 243, 246, 254, 272, 290, 298, 306, 312
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow = c(1,1))

The adjusted R-squared is 75.7% and the overall F-test is significant (p-value < 2.2e-16), but the p-values for most of the Neighborhood and Zip levels are not significant. Some parameter estimates are also NA: 15 coefficients are not defined because of singularities.

The residual analysis is also unsatisfactory: the LINE assumptions, in particular normality and equal variance, do not hold well.
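NA estimates of this kind typically appear when one set of indicator columns is an exact linear combination of another, for example when a neighborhood lies entirely inside one Zip code. The sketch below reproduces the effect on an invented `toy` data frame in which `Hood` is fully determined by `Zip`. It is an illustration of the mechanism, not a diagnosis of the housing data.

```r
# Toy data: every Zip maps to exactly one neighborhood (values are made up).
toy <- data.frame(
  Zip  = factor(c("A", "A", "B", "B", "C", "C", "C", "A")),
  Hood = factor(c("X", "X", "Y", "Y", "Z", "Z", "Z", "X")),
  y    = c(1.0, 1.2, 2.1, 1.9, 3.2, 2.8, 3.1, 0.9)
)

# The Hood indicators duplicate the Zip indicators, so lm() cannot
# estimate them separately and reports NA for the redundant terms.
fit <- lm(y ~ Zip + Hood, data = toy)
print(coef(fit))

# Number of coefficients left undefined.
n_undefined <- sum(is.na(coef(fit)))
print(n_undefined)
```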

str(housing_data)
## 'data.frame':    313 obs. of  9 variables:
##  $ Zip         : Factor w/ 42 levels "45002","45202",..: 25 25 25 25 25 42 42 16 24 30 ...
##  $ SalePrice   : int  212000 972500 420000 150000 125001 259000 370000 112000 70752 195000 ...
##  $ Bedrooms    : int  3 7 3 3 3 3 4 4 5 5 ...
##  $ Bathrooms   : int  2 5 4 1 2 2 2 2 3 3 ...
##  $ Stories     : int  2 2 2 2 2 1 2 1 2 3 ...
##  $ SqFt        : int  1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
##  $ LotSqFt     : int  6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
##  $ Neighborhood: Factor w/ 63 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ Age         : num  60 15 32 82 60 78 27 73 109 83 ...

Since Neighborhood has 63 levels and Zip has 42 levels, instead of evaluating one variable at a time against SalePrice, we run regsubsets with its default exhaustive method over all the variables.

By default, regsubsets reports the best subsets of up to 8 variables (nvmax = 8). The response variable is SalePrice, with all other covariates except Zip. Running regsubsets with both Neighborhood and Zip stalled for a long time while processing. Hence, we split Zip and Neighborhood into two separate regsubsets runs, keeping all other variables in each.
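The cap of 8 comes from the `nvmax` argument of `regsubsets`, which can be raised when larger subsets are of interest. The sketch below shows this on the built-in `mtcars` data, used here only as a stand-in with 10 predictors. It is not part of the housing analysis.

```r
library(leaps)

# Default call: best subsets are reported for sizes 1 to 8 (nvmax = 8).
fit_default <- regsubsets(mpg ~ ., data = mtcars)

# Raising nvmax reports every size up to the number of predictors.
fit_all <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

n_default <- length(summary(fit_default)$adjr2)
n_all     <- length(summary(fit_all)$adjr2)
print(c(default = n_default, all = n_all))
```

With factors that have many levels, each level counts as a separate candidate column, so the exhaustive search can become slow. This is consistent with the stall described above.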

data1 <- regsubsets((SalePrice) ~ Bedrooms + Bathrooms + Stories+ SqFt + Age+ LotSqFt+ Neighborhood, data = housing_data, really.big=T)


Adj_R_sq <- summary(data1)$adjr2

RSS <- summary(data1)$rss

Adj_R_sq
## [1] 0.4738299 0.5251060 0.5720530 0.6104688 0.6354247 0.6574585 0.6766102
## [8] 0.6877442
RSS
## [1] 9.267728e+12 8.337678e+12 7.489197e+12 6.794847e+12 6.338877e+12
## [6] 5.936376e+12 5.586155e+12 5.376143e+12

Here the adjusted R-squared reaches 68.8%, but the residual sum of squares is on the order of 10^12, so this model is of little use. Hence, a log transformation is applied to SalePrice.

After taking log(SalePrice), regsubsets was run again for the same model.

data2 <- regsubsets(log(SalePrice) ~ Bedrooms + Bathrooms + Stories+ SqFt + Age+ LotSqFt+ Neighborhood, data = housing_data, really.big=T)

mb2 <-summary(data2)

Adj_R_sq <- summary(data2)$adjr2

RSS <- summary(data2)$rss

Adj_R_sq
## [1] 0.3847604 0.4493202 0.4869157 0.5192738 0.5495711 0.5763099 0.5955645
## [8] 0.6140310
RSS
## [1] 112.32841 100.21801  93.07480  86.92274  81.18009  76.11226  72.41592
## [8]  68.88283
AIC <- 313*log(RSS/313) + (1:8)*2

AIC
## [1] -318.7550 -352.4617 -373.6063 -393.0104 -412.4039 -430.5801 -444.1622
## [8] -457.8182
par(mfrow=c(1,1))

plot(AIC,main="AIC plot without Zip")

Here the adjusted R-squared is 61.4% and the residual sum of squares is 68.9. Comparing subset sizes, the adjusted R-squared is highest and the AIC is lowest at 8 terms, so 8 covariates will be used.

Another notable insight is that a few Neighborhood levels carry more weight than LotSqFt and Age. Therefore, these covariates will be dropped in the subsequent models.
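The AIC values above follow the form n * log(RSS / n) + 2 * k, where n = 313 is the number of observations and k is the subset size. The sketch below recomputes them from the RSS values printed above and picks the size with the lowest AIC. The names `rss`, `k`, `aic` and `best_k` are chosen for this example.

```r
n <- 313

# RSS for subset sizes 1 to 8, copied from the regsubsets output above.
rss <- c(112.32841, 100.21801, 93.07480, 86.92274,
         81.18009, 76.11226, 72.41592, 68.88283)

# AIC up to an additive constant: n * log(RSS / n) + 2 * k.
k   <- seq_along(rss)
aic <- n * log(rss / n) + 2 * k

# Subset size with the smallest AIC.
best_k <- which.min(aic)
print(round(aic, 4))
print(best_k)
```

Because the additive constant is dropped, these values are only meaningful for comparing subsets fitted to the same response on the same data.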

data3 <- regsubsets(log(SalePrice) ~ Bedrooms + Bathrooms + Stories+ SqFt + Age+ LotSqFt+ Zip, data = housing_data,really.big=T)

mb3 <-summary(data3)

Adj_R_sq <- mb3$adjr2

RSS <- mb3$rss

Adj_R_sq
## [1] 0.3847604 0.4493202 0.5080247 0.5412928 0.5699893 0.5966092 0.6224446
## [8] 0.6363916
RSS
## [1] 112.32841 100.21801  89.24558  82.94136  77.50014  72.46566  67.60291
## [8]  64.89219
AIC <- 313*log(RSS/313) + (1:8)*2

AIC
## [1] -318.7550 -352.4617 -386.7559 -407.6857 -426.9240 -445.9473 -465.6888
## [8] -476.4980
par(mfrow=c(1,1))

plot(AIC,main="AIC plot without Neighborhood")

Here the adjusted R-squared is 63.6% and the residual sum of squares is 64.9. Again, the adjusted R-squared is highest and the AIC is lowest at 8 terms.

Notable insight: a few Zip levels carry more weight than LotSqFt and Age. Therefore, these covariates will be dropped in the subsequent models.
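One way to keep only the influential levels of a many-level factor is to collapse the rest into a single "others" level. How the grouped indicator in the next data file was actually built is not shown in this report, so the sketch below is only an assumption about the general approach. The Zip values and the `keep` list are hypothetical.

```r
# Toy Zip factor (values chosen for illustration only).
zip <- factor(c("45230", "45230", "45255", "45208",
                "45208", "45208", "45243", "45217"))

# Hypothetical list of levels to keep as their own group.
keep <- c("45208", "45243")

# Collapse every other level into "others".
zip_ind <- factor(ifelse(as.character(zip) %in% keep,
                         as.character(zip), "others"))

print(table(zip_ind))
```

Collapsing levels this way reduces the number of indicator columns, which makes subset selection faster and avoids levels with only one or two observations.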

New Model

housing_data_slim <- read.csv('Project_2.csv', header = TRUE)

attach(housing_data_slim)
## The following objects are masked _by_ .GlobalEnv:
## 
##     Bathrooms, Bedrooms, Neighborhood, SalePrice, SqFt, Stories
## The following objects are masked from neighbourhood_data:
## 
##     Address, Bathrooms, Bedrooms, DateSold, ï..Index, LotSqFt,
##     Neighborhood, SalePrice, SqFt, Stories, Year, Zip
housing_data_slim$Zip_Ind
##   [1] others others others others others others others five   others others
##  [11] others others others others others others others others others others
##  [21] others others others others others others others others six    six   
##  [31] six    six    others others others others others six    others one   
##  [41] one    one    one    one    others others others others others others
##  [51] others others others others others three  three  three  three  three 
##  [61] three  three  three  four   others others others six    others others
##  [71] others others others others others others others four   four   four  
##  [81] four   four   four   others others others others others others others
##  [91] others four   four   four   three  three  three  three  three  six   
## [101] three  six    six    six    six    others others others others others
## [111] others others others others others five   others others others others
## [121] others others others others others others others others others others
## [131] others four   others others others others others others others others
## [141] others others five   five   five   five   five   five   five   five  
## [151] five   five   others others others others others others others others
## [161] others others others others others one    others others others others
## [171] one    others others others others others others others others others
## [181] others others others others others others others others one    three 
## [191] others others three  others others others three  others three  three 
## [201] others others others others others others others others five   four  
## [211] others others five   others others others others three  three  three 
## [221] others others three  others others others others three  three  others
## [231] six    one    others others others others others others others others
## [241] others four   others others others others others five   others others
## [251] others others others others others others others others others others
## [261] three  others one    others others others others others others others
## [271] others six    four   others others others others others others others
## [281] others others others others others others others others others others
## [291] others others others others others others others others
## Levels: five four one others six three
head(housing_data_slim)
##   ï..Index            Address   DateSold Year   Zip      Neighborhood
## 1        1       1337 Voll Rd  9/27/2019 1959 45230 ANDERSON TOWNSHIP
## 2        2 5786 Brookstone Dr   8/9/2019 2004 45230 ANDERSON TOWNSHIP
## 3        3   6160 Woodlark Dr  8/28/2019 1987 45230 ANDERSON TOWNSHIP
## 4        4      6265 Salem Rd 10/11/2019 1937 45230 ANDERSON TOWNSHIP
## 5        5      7099 Petri Dr  9/13/2019 1959 45230 ANDERSON TOWNSHIP
## 6        6     7621 FOREST RD 10/24/2019 1941 45255 ANDERSON TOWNSHIP
##   SalePrice Bedrooms Bathrooms Stories SqFt LotSqFt Zip_Ind
## 1    212000        3         2       2 1384    6011  others
## 2    972500        7         5       2 4628   34412  others
## 3    420000        3         4       2 2634   17424  others
## 4    150000        3         1       2 1580   23958  others
## 5    125001        3         2       2 1404    8276  others
## 6    259000        3         2       1 1678   48918  others
##   Neighborhood_Indi
## 1            others
## 2            others
## 3            others
## 4            others
## 5            others
## 6            others
Bedrooms <- housing_data_slim$Bedrooms

Bathrooms <- housing_data_slim$Bathrooms

Stories <- housing_data_slim$Stories

Sqft <- housing_data_slim$SqFt

LotSqFt <- housing_data_slim$LotSqFt

Zip_Ind <- housing_data_slim$Zip_Ind

Neighborhood_Indi <- housing_data_slim$Neighborhood_Indi

Age <- (2019-housing_data_slim$Year)

housing_data_slim <- cbind(housing_data_slim,Age)

housing_data_slim$SalePrice <- log(housing_data_slim$SalePrice)

str(housing_data_slim)
## 'data.frame':    298 obs. of  15 variables:
##  $ ï..Index         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Address          : Factor w/ 298 levels "1006 Rutledge Ave",..: 33 217 229 235 257 266 279 39 274 45 ...
##  $ DateSold         : Factor w/ 78 levels "10/1/2019","10/10/2019",..: 71 56 50 3 59 13 18 76 27 3 ...
##  $ Year             : int  1959 2004 1987 1937 1959 1941 1992 1946 1910 1936 ...
##  $ Zip              : int  45230 45230 45230 45230 45230 45255 45255 45217 45229 45237 ...
##  $ Neighborhood     : Factor w/ 62 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ SalePrice        : num  12.3 13.8 12.9 11.9 11.7 ...
##  $ Bedrooms         : int  3 7 3 3 3 3 4 4 5 5 ...
##  $ Bathrooms        : int  2 5 4 1 2 2 2 2 3 3 ...
##  $ Stories          : int  2 2 2 2 2 1 2 1 2 3 ...
##  $ SqFt             : int  1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
##  $ LotSqFt          : int  6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
##  $ Zip_Ind          : Factor w/ 6 levels "five","four",..: 4 4 4 4 4 4 4 1 4 4 ...
##  $ Neighborhood_Indi: Factor w/ 6 levels "NE1","NE2","NE3",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Age              : num  60 15 32 82 60 78 27 73 109 83 ...

After analysis of the above regsubsets output, a few of the critical values of Neighborhood were converted into indicator variables.

Similarly, dummy variables were created for the critical values of Zip.
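
As a rough sketch of the recoding described above, the snippet below collapses all but a chosen set of ZIP codes into an "others" level. The ZIP values and the set of "critical" codes are made up for illustration, and `collapse_levels` is a hypothetical helper rather than the code used for this report, where the Zip_Ind and Neighborhood_Indi columns are already present in the CSV.

```r
# Hypothetical helper: keep the levels listed in `keep`, lump the rest into "others".
collapse_levels <- function(x, keep, other = "others") {
  x <- as.character(x)
  x[!(x %in% keep)] <- other
  factor(x)
}

# Toy ZIP codes (illustrative values only, not taken from the data set).
zip <- c(45230, 45255, 45208, 45208, 45243, 45230, 45243)
critical_zips <- c("45208", "45243")   # assumed "critical" ZIPs for this sketch

zip_ind <- collapse_levels(zip, keep = critical_zips)
table(zip_ind)
```

The same helper could be applied to a neighborhood column by passing the neighborhood names to keep.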

After these steps, regsubsets was run with the indicator variables and the other covariates:

data4 <- regsubsets(SalePrice ~ SqFt+Bedrooms + Bathrooms + Stories+ LotSqFt + Age+Zip_Ind + Neighborhood_Indi ,data = housing_data_slim,really.big=T)

mb4 <-summary(data4)

Adj_R_sq <- mb4$adjr2

RSS <- mb4$rss

Adj_R_sq
## [1] 0.3895949 0.4687107 0.5249482 0.5761214 0.6205908 0.6333232 0.6455643
## [8] 0.6515818
RSS
## [1] 106.48001  92.36583  82.30886  73.19262  65.29034  62.88321  60.57504
## [8]  59.34128
AIC <- 313*log(RSS/313) + (1:8)*2

AIC
## [1] -335.4910 -377.9996 -412.0818 -446.8230 -480.5834 -490.3412 -500.0462
## [8] -504.4871
par(mfrow=c(1,1))

plot(AIC,main="AIC with only 6 levels of Zip and 5 levels of neighborhood")

The eight-variable subset above has an adjusted R-square of 65.2% and a residual sum of squares of 59.3.
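
The AIC line above hard-codes a sample size of 313, whereas str() reports 298 observations in housing_data_slim. A minimal sketch of the same formula with the sample size passed explicitly, using the RSS values printed above, might look like this. `subset_aic` is a hypothetical helper name, chosen so that it does not mask the built-in AIC() function.

```r
# RSS values for subset sizes 1..8, copied from the regsubsets output above.
rss <- c(106.48001, 92.36583, 82.30886, 73.19262,
         65.29034, 62.88321, 60.57504, 59.34128)

# Hypothetical helper: AIC (up to an additive constant) for a Gaussian linear
# model with p predictors, n observations and residual sum of squares rss.
subset_aic <- function(rss, n, p = seq_along(rss)) {
  n * log(rss / n) + 2 * p
}

n_obs <- 298                      # rows reported by str(housing_data_slim)
aic_298 <- subset_aic(rss, n_obs)
which.min(aic_298)                # subset size with the lowest AIC
```

With n = 298 the values shift, but the eight-variable subset still has the lowest AIC, so the choice of subset size is unaffected.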

Based on the data4 regsubsets results, we now select the important covariates.

Model 4 covariates: Age, SqFt, Bathrooms, Zip_Ind, Neighborhood_Indi

model4 <- lm(SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + Neighborhood_Indi, data = housing_data_slim)

summary(model4)
## 
## Call:
## lm(formula = SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + 
##     Neighborhood_Indi, data = housing_data_slim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57192 -0.21099  0.03266  0.27120  1.00406 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.078e+01  3.653e-01  29.519  < 2e-16 ***
## Age                     -3.756e-03  8.040e-04  -4.671 4.62e-06 ***
## SqFt                     2.942e-04  3.527e-05   8.341 3.24e-15 ***
## Bathrooms                1.271e-01  3.591e-02   3.540 0.000467 ***
## Zip_Indfour              8.739e-01  2.893e-01   3.020 0.002753 ** 
## Zip_Indone              -7.790e-02  3.376e-01  -0.231 0.817665    
## Zip_Indothers            2.079e-01  2.638e-01   0.788 0.431372    
## Zip_Indsix               6.685e-01  3.020e-01   2.214 0.027645 *  
## Zip_Indthree             7.761e-01  3.438e-01   2.258 0.024727 *  
## Neighborhood_IndiNE2     2.203e-01  3.688e-01   0.597 0.550817    
## Neighborhood_IndiNE3     8.987e-01  3.450e-01   2.605 0.009683 ** 
## Neighborhood_IndiNE5     6.914e-01  3.104e-01   2.228 0.026691 *  
## Neighborhood_IndiNE6     7.908e-01  3.186e-01   2.482 0.013650 *  
## Neighborhood_Indiothers  5.392e-01  2.249e-01   2.397 0.017177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4529 on 284 degrees of freedom
## Multiple R-squared:  0.6672, Adjusted R-squared:  0.6519 
## F-statistic: 43.79 on 13 and 284 DF,  p-value: < 2.2e-16
plot(model4)

KCV4<-cv.lm(data=housing_data_slim, model4, m=3, seed=123)
## Analysis of Variance Table
## 
## Response: SalePrice
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## Age                 1   21.5    21.5  104.93 < 2e-16 ***
## SqFt                1   55.9    55.9  272.62 < 2e-16 ***
## Bathrooms           1    8.4     8.4   40.89 6.6e-10 ***
## Zip_Ind             5   28.9     5.8   28.14 < 2e-16 ***
## Neighborhood_Indi   5    2.1     0.4    2.02   0.075 .  
## Residuals         284   58.3     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(data = housing_data_slim, model4, m = 3, seed = 123): 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

## 
## fold 1 
## Observations in test set: 99 
##                 4     10     11     15      16     18      20     22
## Predicted   11.81 12.347 11.944 12.393 12.1354 13.238 11.9269 11.793
## cvpred      11.81 12.333 11.915 12.297 12.1366 13.157 11.9780 11.817
## SalePrice   11.92 12.181 11.290 12.992 12.1548 13.422 11.9117 12.150
## CV residual  0.11 -0.152 -0.625  0.696  0.0182  0.266 -0.0663  0.333
##                 23     25     28    36     38     40    45     46     48
## Predicted   11.822 11.665 12.160 11.80 12.200 10.743 11.34 11.951 12.501
## cvpred      11.816 11.707 12.205 11.81 12.262 10.655 10.94 11.946 12.487
## SalePrice   12.170 11.451 12.001 11.39 12.084  9.903 12.21 12.787 12.601
## CV residual  0.354 -0.256 -0.204 -0.42 -0.177 -0.752  1.27  0.842  0.115
##                 56     60     61     66     69     79     81     85     86
## Predicted   12.364 12.884 12.456 11.681 12.100 13.751 14.050 12.130 11.883
## cvpred      12.259 12.736 12.360 11.693 12.103 13.872 14.118 12.134 11.922
## SalePrice   11.864 13.591 12.654 12.151 12.734 13.240 14.017 11.835 11.258
## CV residual -0.395  0.855  0.294  0.459  0.631 -0.632 -0.101 -0.299 -0.664
##                 88     90     94    95    100    102   103   104    109
## Predicted   12.146 11.749 12.252 13.84 13.383 13.074 12.30 12.46 11.752
## cvpred      12.132 11.726 12.385 13.74 13.559 13.315 12.54 12.73 11.763
## SalePrice   12.278 11.951 11.736 13.30 13.448 12.700 12.23 12.32 11.225
## CV residual  0.147  0.225 -0.649 -0.44 -0.111 -0.615 -0.31 -0.41 -0.538
##                111    114    116    118    122    125    126    129    130
## Predicted   11.639 11.685 11.993 12.714 12.293 11.560 11.742 11.980 12.314
## cvpred      11.649 11.719 11.927 12.690 12.268 11.580 11.780 12.125 12.423
## SalePrice   11.814 11.607 11.581 12.936 12.667 10.714 11.983 11.290 12.938
## CV residual  0.165 -0.112 -0.346  0.246  0.399 -0.866  0.202 -0.835  0.516
##               136   140    142    148     153    154    159     160    164
## Predicted   11.70 11.75 12.370 11.536 12.1490 11.742 12.644 13.6749 12.301
## cvpred      11.72 11.76 12.338 11.546 12.2211 11.759 12.622 13.5622 12.317
## SalePrice   11.95 10.22 12.206 11.138 12.1389 12.072 12.914 13.6158 12.605
## CV residual  0.23 -1.54 -0.131 -0.409 -0.0822  0.313  0.292  0.0536  0.288
##                165    171    172    174    175    179    180    181    182
## Predicted   12.115 11.535 11.794 11.790 11.955 11.852 11.786 11.979 11.768
## cvpred      12.159 11.780 11.803 11.852 11.958 11.885 11.801 11.987 11.786
## SalePrice   12.595 11.512 11.156 11.435 11.905 11.590 11.430 11.775 11.225
## CV residual  0.435 -0.268 -0.646 -0.417 -0.053 -0.295 -0.372 -0.212 -0.561
##                190    192    193     194   197    199    212    213    216
## Predicted   13.753 11.799 12.576 12.2311 12.71 13.305 11.970 11.807 11.830
## cvpred      13.666 11.809 12.447 12.2871 12.60 13.202 12.029 11.749 11.835
## SalePrice   13.769 11.925 12.899 12.2620 13.65 13.705 11.891 12.231 11.951
## CV residual  0.104  0.116  0.452 -0.0251  1.05  0.503 -0.138  0.482  0.116
##                 221    222    226    234     244    245    247  249   259
## Predicted   12.1787 11.797 11.913 12.694 11.9695 12.224 12.143 12.2 11.89
## cvpred      12.1618 11.824 11.959 12.678 11.9968 12.242 12.192 12.3 11.90
## SalePrice   12.1172 12.128 11.608 12.401 11.9184 12.155 12.061 12.1 10.69
## CV residual -0.0445  0.304 -0.351 -0.277 -0.0784 -0.087 -0.131 -0.2 -1.21
##                260    262    263    265    268  269    270    273     274
## Predicted   11.829 12.039 11.138 11.815 12.202 12.4 11.561 12.547 11.7503
## cvpred      11.850 12.095 11.053 11.814 12.192 12.4 11.586 12.635 11.7431
## SalePrice   11.212 11.683 11.839 12.087 12.014 12.7 11.898 12.388 11.7668
## CV residual -0.638 -0.412  0.786  0.273 -0.178  0.3  0.313 -0.247  0.0237
##                279    280      282      283    285    286    290    292
## Predicted   12.416 11.622 12.05901 11.71092 13.983 11.995 12.284 11.881
## cvpred      12.404 11.631 12.13052 11.74335 13.809 12.023 12.275 11.856
## SalePrice   13.395 11.327 12.12757 11.74721 14.130 12.278 13.028 12.150
## CV residual  0.991 -0.304 -0.00295  0.00385  0.322  0.255  0.753  0.293
##                293    294
## Predicted   12.102 12.167
## cvpred      12.169 12.161
## SalePrice   12.044 12.780
## CV residual -0.126  0.618
## 
## Sum of squares = 22.6    Mean square = 0.23    n = 99 
## 
## fold 2 
## Observations in test set: 100 
##                  2      5     9    17     24     26     27    29     30
## Predicted   13.471 11.972 12.23 12.18 11.837 11.666 12.016 12.16 12.451
## cvpred      13.490 11.969 12.28 12.23 11.860 11.677 12.018 11.70 12.003
## SalePrice   13.788 11.736 11.17 12.74 11.653 11.327 11.156 12.72 12.560
## CV residual  0.298 -0.233 -1.11  0.51 -0.208 -0.351 -0.862  1.02  0.557
##                 31    32     33    34      35     37    41      43     49
## Predicted   13.128 13.34 12.141 12.15 11.8245 12.203 10.71 10.7944 12.951
## cvpred      12.610 12.85 12.181 12.20 11.8292 12.189 10.87 10.9429 13.038
## SalePrice   13.253 13.31 12.297 11.74 11.9184 12.532  9.68 10.9151 13.365
## CV residual  0.644  0.46  0.116 -0.46  0.0892  0.343 -1.19 -0.0278  0.326
##                 50     57     63     68     74    77    78     80     82
## Predicted   12.586 13.278 13.329 12.256 11.729 12.55 13.75 12.919 13.018
## cvpred      12.652 13.261 13.330 11.769 11.731 12.56 13.73 12.948 12.976
## SalePrice   13.006 13.050 13.209 12.211 11.982 12.79 13.82 13.262 13.218
## CV residual  0.355 -0.211 -0.121  0.442  0.251  0.23  0.09  0.314  0.242
##                 89    91      97    98      99    106     110     112
## Predicted   11.824 11.85 12.7951 12.69 12.6118 11.942 11.7804 11.7535
## cvpred      11.885 11.88 12.8453 12.76 12.6828 11.938 11.7738 11.7415
## SalePrice   11.002 10.45 12.9342 12.58 12.7426 12.044 11.7361 11.8130
## CV residual -0.883 -1.43  0.0889 -0.18  0.0598  0.105 -0.0377  0.0715
##                115   117     119   120   123    128    133    135   138
## Predicted   11.969 12.24 11.8605 12.92 11.69 11.785 11.785 12.495 12.01
## cvpred      11.969 12.27 11.8611 13.00 11.68 11.812 11.835 12.545 12.02
## SalePrice   11.884 12.91 11.8494 12.63 10.45 11.935 12.424 12.861 10.99
## CV residual -0.085  0.64 -0.0117 -0.37 -1.23  0.123  0.589  0.316 -1.03
##               139    141      145    146    150   155     157    161
## Predicted   11.95 11.887 11.22084 11.699 11.567 13.05 12.4761 11.314
## cvpred      11.96 11.908 11.07765 11.512 11.407 13.06 12.4865 11.349
## SalePrice   10.37 11.635 11.08598 11.884 12.128 13.61 12.5602 11.608
## CV residual -1.58 -0.273  0.00833  0.373  0.721  0.55  0.0737  0.259
##                 162    168    173    177    183    184     189    191
## Predicted   12.1612 12.034 12.241 11.952 12.193 12.439 11.6641 11.930
## cvpred      12.1486 12.041 12.274 11.971 12.175 12.457 11.7541 11.923
## SalePrice   12.2303 11.735 11.842 11.831 12.946 12.995 11.6869 11.608
## CV residual  0.0816 -0.305 -0.431 -0.139  0.771  0.537 -0.0672 -0.315
##                195   196      198    201   203    204    206    207    208
## Predicted   12.185 12.04 11.80108 11.784 12.40 12.426 11.612 11.691 12.405
## cvpred      12.184 12.09 11.80300 11.798 12.40 12.484 11.645 11.718 12.412
## SalePrice   11.983 11.05 11.80932 12.144 12.58 12.633 12.297 11.884 12.524
## CV residual -0.201 -1.04  0.00632  0.346  0.18  0.149  0.652  0.166  0.112
##                209   211    214    218   224    225     229   230   231
## Predicted   11.329 12.09 12.067 14.179 12.10 11.862 13.3055 12.40 12.59
## cvpred      11.172 12.14 12.101 14.257 12.11 11.893 13.3582 12.40 12.15
## SalePrice   11.878 12.35 12.384 13.790 11.97 11.736 13.2963 12.58 12.52
## CV residual  0.705  0.21  0.284 -0.467 -0.14 -0.157 -0.0618  0.18  0.37
##                 232  233    237    238     239     240    242    243
## Predicted   10.8390 12.0 11.777 11.680 11.9429 12.2473 12.312 12.085
## cvpred      10.9812 12.4 11.774 11.672 11.9257 12.2759 12.287 12.093
## SalePrice   10.9133 12.0 12.139 12.020 11.8565 12.3014 11.362 11.884
## CV residual -0.0679 -0.4  0.365  0.348 -0.0692  0.0255 -0.924 -0.209
##                 246     253    254    258    261    272    277    278
## Predicted   12.3051 12.6656 11.742 12.674 13.761 12.840 12.011 12.169
## cvpred      12.2865 12.6657 11.761 12.661 13.824 12.324 11.995 12.206
## SalePrice   12.3863 12.6440 11.983 13.305 13.377 12.995 12.128 12.588
## CV residual  0.0998 -0.0217  0.222  0.643 -0.447  0.671  0.133  0.382
##               284    288    289   291      296    297
## Predicted   12.33 12.489 11.783 14.57 12.15957 12.175
## cvpred      12.37 12.485 11.788 14.64 12.14461 12.164
## SalePrice   12.24 12.760 11.983 13.20 12.13886 12.310
## CV residual -0.13  0.275  0.195 -1.43 -0.00575  0.146
## 
## Sum of squares = 26.1    Mean square = 0.26    n = 100 
## 
## fold 3 
## Observations in test set: 99 
##                 1      3      6      7       8     12     13      14
## Predicted   11.97 12.693 11.985 12.419 11.6380 11.676 12.150 11.9814
## cvpred      11.93 12.672 11.966 12.445 11.6111 11.665 12.156 11.9885
## SalePrice   12.26 12.948 12.465 12.821 11.6263 10.872 11.813 11.9512
## CV residual  0.33  0.276  0.498  0.376  0.0151 -0.792 -0.343 -0.0373
##                 19      21     39     42     44     47     51     52
## Predicted   11.613 12.4088 11.499 10.991 10.927 12.123 12.400 13.066
## cvpred      11.593 12.4304 11.731 10.902 10.830 12.099 12.386 13.096
## SalePrice   11.849 12.4568 10.840 11.708 10.977 12.560 12.065 13.459
## CV residual  0.256  0.0264 -0.891  0.806  0.147  0.461 -0.321  0.363
##                 53     54    55     58      59      62    64     65     67
## Predicted   11.685 12.104 11.70 12.581 12.6264 13.2751 13.74 12.247 11.955
## cvpred      11.659 12.082 11.69 12.580 12.6345 13.3392 13.44 12.267 11.939
## SalePrice   11.951 12.405 11.81 12.301 12.6115 13.2963 14.45 13.251 12.848
## CV residual  0.292  0.323  0.12 -0.278 -0.0229 -0.0429  1.01  0.984  0.909
##                70     71    72     73     75     76     83     84     87
## Predicted   11.89 11.801 11.99 12.041 11.824 12.606 13.443 11.987 11.870
## cvpred      11.88 11.816 11.99 12.050 11.820 12.564 13.237 11.964 11.864
## SalePrice   12.49 11.857 12.21 12.403 12.073 12.707 13.346 11.835 11.408
## CV residual  0.61  0.041  0.22  0.353  0.253  0.143  0.109 -0.129 -0.457
##                 92     93      96     101    105     107    108     113
## Predicted   12.228 12.602 13.0519 13.7287 13.147 11.8991 12.054 12.6333
## cvpred      11.981 12.375 13.0357 13.8450 13.078 11.8598 12.061 12.5978
## SalePrice   12.445 12.506 13.1224 13.7820 12.975 11.8776 11.350 12.5099
## CV residual  0.464  0.131  0.0867 -0.0631 -0.103  0.0177 -0.711 -0.0879
##               121    124    127     131    132   134    137     143
## Predicted   11.64 12.098 12.025 12.3385 12.680 12.91 11.999 11.3052
## cvpred      11.62 12.063 11.987 12.3613 12.464 13.04 11.985 11.4548
## SalePrice   11.29 12.692 12.324 12.4049 13.275 11.92 11.884 11.4773
## CV residual -0.33  0.629  0.337  0.0436  0.811 -1.13 -0.101  0.0225
##                 144    147   149    151   152   156    158    163    166
## Predicted   11.1728 11.372 12.32 11.037 11.37 12.77 13.584 12.399 11.523
## cvpred      11.3185 11.578 12.56 11.223 11.54 12.81 13.657 12.374 11.233
## SalePrice   11.2960 12.035 11.23 11.082 10.39 13.06 13.420 11.884 11.728
## CV residual -0.0224  0.457 -1.33 -0.141 -1.14  0.25 -0.237 -0.489  0.495
##                167    169    170    176    178    185    186    187    188
## Predicted   12.006 12.279 11.774 12.217 11.775 11.823 12.303 12.362 14.540
## cvpred      11.982 12.243 11.768 12.177 11.729 11.833 12.274 12.383 14.726
## SalePrice   11.720 12.181 11.513 11.842 11.408 12.530 12.142 12.835 13.825
## CV residual -0.262 -0.062 -0.255 -0.335 -0.321  0.697 -0.133  0.451 -0.901
##               200    202    205   210    215    217    219     220    223
## Predicted   13.32 11.692 11.953 13.15 11.658 12.114 12.927 13.5986 12.970
## cvpred      13.36 11.681 11.932 12.94 11.645 12.083 12.934 13.6721 12.978
## SalePrice   13.30 11.142 12.177 13.38 11.513 12.196 12.612 13.6352 12.843
## CV residual -0.06 -0.539  0.246  0.44 -0.132  0.113 -0.323 -0.0369 -0.136
##                227     228    235    236    241    248     250     251
## Predicted   11.890 12.6595 11.927 12.782 11.758 11.858 12.4927 12.0800
## cvpred      11.886 12.6880 11.851 12.788 11.744 12.014 12.4730 12.0432
## SalePrice   11.608 12.6603 11.608 13.452 11.518 12.168 12.4875 12.1145
## CV residual -0.278 -0.0277 -0.243  0.664 -0.226  0.154  0.0145  0.0713
##                252    255     256    257     264    266    267    271
## Predicted   11.882 11.909 12.5677 11.921 12.1937 12.669 11.693 12.273
## cvpred      11.887 11.819 12.5235 11.930 12.1765 12.645 11.644 12.288
## SalePrice   12.301 12.572 12.4969 11.562 12.2061 12.808 12.201 12.612
## CV residual  0.414  0.754 -0.0266 -0.368  0.0296  0.162  0.557  0.323
##                275     276     281    287    295    298
## Predicted   11.920 11.9759 11.9176 12.644 11.972 12.430
## cvpred      11.881 11.9512 11.8828 12.709 11.948 12.456
## SalePrice   11.775 11.9083 11.9184 12.953 11.608 12.154
## CV residual -0.106 -0.0428  0.0356  0.243 -0.339 -0.302
## 
## Sum of squares = 19.3    Mean square = 0.2    n = 99 
## 
## Overall (Sum over all 99 folds) 
##    ms 
## 0.228
n<-dim(housing_data_slim)[1]

MSPE <- sum( ((SalePrice)-KCV4$cvpred)^2 )/n
## Warning in (SalePrice) - KCV4$cvpred: longer object length is not a
## multiple of shorter object length
PRESS <- sum(((SalePrice)-KCV4$cvpred)^2)
## Warning in (SalePrice) - KCV4$cvpred: longer object length is not a
## multiple of shorter object length
Pred_R_squared <- 1-sum(((SalePrice)-KCV4$cvpred)^2)/sum(((SalePrice)-mean((SalePrice)))^2)
## Warning in (SalePrice) - KCV4$cvpred: longer object length is not a
## multiple of shorter object length
MSPE
## [1] 1.38e+11
PRESS
## [1] 4.1e+13
Pred_R_squared
## [1] -1.32
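
The warnings above indicate that SalePrice and KCV4$cvpred have different lengths, and the earlier attach() message reports SalePrice as masked by the global environment. The printed MSPE, PRESS and predictive R-squared are therefore probably not computed against the log-price column of housing_data_slim. One way to avoid this is to take the observed response from the same data frame as the cross-validated predictions. The sketch below illustrates the three formulas with a hand-rolled 3-fold cross-validation on simulated data; the data, the `kfold_pred` helper and the fold assignment are assumptions for illustration and do not reproduce cv.lm.

```r
set.seed(123)

# Simulated stand-in for the housing data (illustrative only).
n <- 150
d <- data.frame(x1 = runif(n, 0, 100), x2 = rnorm(n))
d$y <- 12 - 0.004 * d$x1 + 0.3 * d$x2 + rnorm(n, sd = 0.4)

# Hypothetical helper: out-of-fold predictions from k-fold cross-validation.
kfold_pred <- function(formula, data, k = 3) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test <- fold == i
    fit <- lm(formula, data = data[!test, ])
    pred[test] <- predict(fit, newdata = data[test, ])
  }
  pred
}

cvpred <- kfold_pred(y ~ x1 + x2, d, k = 3)

# Observed and predicted values come from the same rows, so lengths always match.
press <- sum((d$y - cvpred)^2)
mspe <- press / nrow(d)
pred_r_squared <- 1 - press / sum((d$y - mean(d$y))^2)
c(PRESS = press, MSPE = mspe, Pred_R_squared = pred_r_squared)
```

If, as the DAAG documentation describes, cv.lm returns the input data frame with a cvpred column added, the analogous computation here would use KCV4$SalePrice in place of the bare SalePrice.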

The QQ plot shows that the normality assumption is not satisfied: the residuals have a heavy-tailed distribution. We therefore check the Box-Cox plot for the best transformation of y.

Another noticeable insight about the model is that Age and SqFt are continuous variables, Bathrooms is a discrete count, and Zip_Ind and Neighborhood_Indi are categorical variables.

Therefore, transformation is only possible on the continuous covariates, i.e., Age and SqFt.

As of now, the Box-Cox plot also does not converge: the selected lambda of 2 (shown below) lies on the upper edge of boxcox's default search range of -2 to 2.

boxcox(model4)

bcx<-boxcox(model4)

(lam <- bcx$x[which.max(bcx$y)])
## [1] 2
housing_data_slim$SalePrice <- (housing_data_slim$SalePrice  ^ lam - 1) / lam
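
The reported lambda of 2 equals the upper end of the default grid used by MASS::boxcox (seq(-2, 2, 1/10)), which is one reason to widen the grid before trusting the estimate. A minimal sketch of that check, on simulated data rather than the housing data, could look like this; the grid limits and variable names are assumptions chosen for illustration.

```r
library(MASS)  # provides boxcox(); ships with standard R distributions

set.seed(1)
# Simulated positive response (illustrative only, not the housing data).
d <- data.frame(x = runif(200, 1, 10))
d$y <- exp(1 + 0.2 * d$x + rnorm(200, sd = 0.3))

# y = TRUE stores the response so boxcox() does not need to refit the model.
fit <- lm(y ~ x, data = d, y = TRUE)

# Widen the search grid beyond the default seq(-2, 2, 1/10) and skip the plot.
grid <- seq(-4, 4, by = 0.1)
bc <- boxcox(fit, lambda = grid, plotit = FALSE)
lam <- bc$x[which.max(bc$y)]

# A maximum on the edge of the grid means the search range was too narrow.
at_boundary <- lam %in% range(grid)
c(lambda = lam, at_boundary = at_boundary)
```

If the maximum still lands on the edge of a wider grid, the profile likelihood has no interior optimum in that range and the transformation is better chosen on other grounds.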

Final Model

After transformation

housing_data_slim <- read.csv('Project.csv', header = TRUE)
Bedrooms <- housing_data_slim$Bedrooms
Bathrooms <- housing_data_slim$Bathrooms
Stories <- housing_data_slim$Stories
Sqft <- housing_data_slim$SqFt
LotSqFt <- housing_data_slim$LotSqFt
Zip_Ind <- housing_data_slim$Zip_Ind
Neighborhood_Indi <- housing_data_slim$Neighborhood_Indi
Age <- (2019-housing_data_slim$Year)
housing_data_slim <- cbind(housing_data_slim,Age)
housing_data_slim$SalePrice <- log(housing_data_slim$SalePrice)
str(housing_data_slim)
## 'data.frame':    298 obs. of  15 variables:
##  $ ï..Index         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Address          : Factor w/ 298 levels "1006 Rutledge Ave",..: 33 217 229 235 257 266 279 39 274 45 ...
##  $ DateSold         : Factor w/ 78 levels "10/1/2019","10/10/2019",..: 71 56 50 3 59 13 18 76 27 3 ...
##  $ Zip              : int  45230 45230 45230 45230 45230 45255 45255 45217 45229 45237 ...
##  $ Neighborhood     : Factor w/ 62 levels "ANDERSON TOWNSHIP",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ SalePrice        : num  12.3 13.8 12.9 11.9 11.7 ...
##  $ Year             : int  1959 2004 1987 1937 1959 1941 1992 1946 1910 1936 ...
##  $ Bedrooms         : int  3 7 3 3 3 3 4 4 5 5 ...
##  $ Bathrooms        : int  2 5 4 1 2 2 2 2 3 3 ...
##  $ Stories          : int  2 2 2 2 2 1 2 1 2 3 ...
##  $ SqFt             : int  1384 4628 2634 1580 1404 1678 2504 1142 2480 2542 ...
##  $ LotSqFt          : int  6011 34412 17424 23958 8276 48918 27443 4922 5619 13939 ...
##  $ Zip_Ind          : Factor w/ 6 levels "five","four",..: 4 4 4 4 4 4 4 1 4 4 ...
##  $ Neighborhood_Indi: Factor w/ 6 levels "NE1","NE2","NE3",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Age              : num  60 15 32 82 60 78 27 73 109 83 ...
SalePrice<-housing_data_slim$SalePrice
model4 <- lm(SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + Neighborhood_Indi, data = housing_data_slim)
summary(model4)
## 
## Call:
## lm(formula = SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + 
##     Neighborhood_Indi, data = housing_data_slim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5719 -0.2110  0.0327  0.2712  1.0041 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.08e+01   3.65e-01   29.52  < 2e-16 ***
## Age                     -3.76e-03   8.04e-04   -4.67  4.6e-06 ***
## SqFt                     2.94e-04   3.53e-05    8.34  3.2e-15 ***
## Bathrooms                1.27e-01   3.59e-02    3.54  0.00047 ***
## Zip_Indfour              8.74e-01   2.89e-01    3.02  0.00275 ** 
## Zip_Indone              -7.79e-02   3.38e-01   -0.23  0.81767    
## Zip_Indothers            2.08e-01   2.64e-01    0.79  0.43137    
## Zip_Indsix               6.68e-01   3.02e-01    2.21  0.02765 *  
## Zip_Indthree             7.76e-01   3.44e-01    2.26  0.02473 *  
## Neighborhood_IndiNE2     2.20e-01   3.69e-01    0.60  0.55082    
## Neighborhood_IndiNE3     8.99e-01   3.45e-01    2.60  0.00968 ** 
## Neighborhood_IndiNE5     6.91e-01   3.10e-01    2.23  0.02669 *  
## Neighborhood_IndiNE6     7.91e-01   3.19e-01    2.48  0.01365 *  
## Neighborhood_Indiothers  5.39e-01   2.25e-01    2.40  0.01718 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.453 on 284 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.652 
## F-statistic: 43.8 on 13 and 284 DF,  p-value: <2e-16

F-Test

Since the p-value of the model (below 2 * 10^(-16)) is less than 5 percent, we reject the null hypothesis and conclude that at least one coefficient is nonzero.

T-test

The t-test checks whether the jth covariate has a nonzero slope.

In the summary above, the levels Zip_Indone and Zip_Indothers of the categorical variable Zip_Ind, and the level Neighborhood_IndiNE2 of Neighborhood_Indi, have p-values greater than 5%.

Therefore, we fail to reject the null hypothesis for these levels.

Standard Error

The estimates for Zip_Indone, Zip_Indothers and Neighborhood_IndiNE2 have high wiggle room, because the absolute value of each estimate is less than 2 * Std. Error.
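
The three checks above can be read directly from a summary() object. The sketch below shows one way to do this on simulated data; the data and column names are assumptions for illustration, and only base R functions are used.

```r
set.seed(42)
# Simulated data (illustrative only): x1 matters, x2 is pure noise.
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.8 * d$x1 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)

# Overall F-test: p-value from the F statistic and its degrees of freedom.
f <- s$fstatistic
f_pvalue <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Per-coefficient t-tests and the "2 * Std. Error" rule of thumb.
coefs <- as.data.frame(s$coefficients)
coefs$not_significant <- coefs[["Pr(>|t|)"]] > 0.05
coefs$wide_interval <- abs(coefs[["Estimate"]]) < 2 * coefs[["Std. Error"]]
coefs[, c("Estimate", "Std. Error", "not_significant", "wide_interval")]
```

The two flags nearly coincide, since an absolute estimate below 2 * Std. Error is the same as an absolute t value below 2.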

KCV4<-cv.lm(data=housing_data_slim, model4, m=3, seed=123)
## Analysis of Variance Table
## 
## Response: SalePrice
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## Age                 1   21.5    21.5  104.93 < 2e-16 ***
## SqFt                1   55.9    55.9  272.62 < 2e-16 ***
## Bathrooms           1    8.4     8.4   40.89 6.6e-10 ***
## Zip_Ind             5   28.9     5.8   28.14 < 2e-16 ***
## Neighborhood_Indi   5    2.1     0.4    2.02   0.075 .  
## Residuals         284   58.3     0.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(data = housing_data_slim, model4, m = 3, seed = 123): 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

## 
## fold 1 
## Observations in test set: 99 
##                 4     10     11     15      16     18      20     22
## Predicted   11.81 12.347 11.944 12.393 12.1354 13.238 11.9269 11.793
## cvpred      11.81 12.333 11.915 12.297 12.1366 13.157 11.9780 11.817
## SalePrice   11.92 12.181 11.290 12.992 12.1548 13.422 11.9117 12.150
## CV residual  0.11 -0.152 -0.625  0.696  0.0182  0.266 -0.0663  0.333
##                 23     25     28    36     38     40    45     46     48
## Predicted   11.822 11.665 12.160 11.80 12.200 10.743 11.34 11.951 12.501
## cvpred      11.816 11.707 12.205 11.81 12.262 10.655 10.94 11.946 12.487
## SalePrice   12.170 11.451 12.001 11.39 12.084  9.903 12.21 12.787 12.601
## CV residual  0.354 -0.256 -0.204 -0.42 -0.177 -0.752  1.27  0.842  0.115
##                 56     60     61     66     69     79     81     85     86
## Predicted   12.364 12.884 12.456 11.681 12.100 13.751 14.050 12.130 11.883
## cvpred      12.259 12.736 12.360 11.693 12.103 13.872 14.118 12.134 11.922
## SalePrice   11.864 13.591 12.654 12.151 12.734 13.240 14.017 11.835 11.258
## CV residual -0.395  0.855  0.294  0.459  0.631 -0.632 -0.101 -0.299 -0.664
##                 88     90     94    95    100    102   103   104    109
## Predicted   12.146 11.749 12.252 13.84 13.383 13.074 12.30 12.46 11.752
## cvpred      12.132 11.726 12.385 13.74 13.559 13.315 12.54 12.73 11.763
## SalePrice   12.278 11.951 11.736 13.30 13.448 12.700 12.23 12.32 11.225
## CV residual  0.147  0.225 -0.649 -0.44 -0.111 -0.615 -0.31 -0.41 -0.538
##                111    114    116    118    122    125    126    129    130
## Predicted   11.639 11.685 11.993 12.714 12.293 11.560 11.742 11.980 12.314
## cvpred      11.649 11.719 11.927 12.690 12.268 11.580 11.780 12.125 12.423
## SalePrice   11.814 11.607 11.581 12.936 12.667 10.714 11.983 11.290 12.938
## CV residual  0.165 -0.112 -0.346  0.246  0.399 -0.866  0.202 -0.835  0.516
##               136   140    142    148     153    154    159     160    164
## Predicted   11.70 11.75 12.370 11.536 12.1490 11.742 12.644 13.6749 12.301
## cvpred      11.72 11.76 12.338 11.546 12.2211 11.759 12.622 13.5622 12.317
## SalePrice   11.95 10.22 12.206 11.138 12.1389 12.072 12.914 13.6158 12.605
## CV residual  0.23 -1.54 -0.131 -0.409 -0.0822  0.313  0.292  0.0536  0.288
##                165    171    172    174    175    179    180    181    182
## Predicted   12.115 11.535 11.794 11.790 11.955 11.852 11.786 11.979 11.768
## cvpred      12.159 11.780 11.803 11.852 11.958 11.885 11.801 11.987 11.786
## SalePrice   12.595 11.512 11.156 11.435 11.905 11.590 11.430 11.775 11.225
## CV residual  0.435 -0.268 -0.646 -0.417 -0.053 -0.295 -0.372 -0.212 -0.561
##                190    192    193     194   197    199    212    213    216
## Predicted   13.753 11.799 12.576 12.2311 12.71 13.305 11.970 11.807 11.830
## cvpred      13.666 11.809 12.447 12.2871 12.60 13.202 12.029 11.749 11.835
## SalePrice   13.769 11.925 12.899 12.2620 13.65 13.705 11.891 12.231 11.951
## CV residual  0.104  0.116  0.452 -0.0251  1.05  0.503 -0.138  0.482  0.116
##                 221    222    226    234     244    245    247  249   259
## Predicted   12.1787 11.797 11.913 12.694 11.9695 12.224 12.143 12.2 11.89
## cvpred      12.1618 11.824 11.959 12.678 11.9968 12.242 12.192 12.3 11.90
## SalePrice   12.1172 12.128 11.608 12.401 11.9184 12.155 12.061 12.1 10.69
## CV residual -0.0445  0.304 -0.351 -0.277 -0.0784 -0.087 -0.131 -0.2 -1.21
##                260    262    263    265    268  269    270    273     274
## Predicted   11.829 12.039 11.138 11.815 12.202 12.4 11.561 12.547 11.7503
## cvpred      11.850 12.095 11.053 11.814 12.192 12.4 11.586 12.635 11.7431
## SalePrice   11.212 11.683 11.839 12.087 12.014 12.7 11.898 12.388 11.7668
## CV residual -0.638 -0.412  0.786  0.273 -0.178  0.3  0.313 -0.247  0.0237
##                279    280      282      283    285    286    290    292
## Predicted   12.416 11.622 12.05901 11.71092 13.983 11.995 12.284 11.881
## cvpred      12.404 11.631 12.13052 11.74335 13.809 12.023 12.275 11.856
## SalePrice   13.395 11.327 12.12757 11.74721 14.130 12.278 13.028 12.150
## CV residual  0.991 -0.304 -0.00295  0.00385  0.322  0.255  0.753  0.293
##                293    294
## Predicted   12.102 12.167
## cvpred      12.169 12.161
## SalePrice   12.044 12.780
## CV residual -0.126  0.618
## 
## Sum of squares = 22.6    Mean square = 0.23    n = 99 
## 
## fold 2 
## Observations in test set: 100 
##                  2      5     9    17     24     26     27    29     30
## Predicted   13.471 11.972 12.23 12.18 11.837 11.666 12.016 12.16 12.451
## cvpred      13.490 11.969 12.28 12.23 11.860 11.677 12.018 11.70 12.003
## SalePrice   13.788 11.736 11.17 12.74 11.653 11.327 11.156 12.72 12.560
## CV residual  0.298 -0.233 -1.11  0.51 -0.208 -0.351 -0.862  1.02  0.557
##                 31    32     33    34      35     37    41      43     49
## Predicted   13.128 13.34 12.141 12.15 11.8245 12.203 10.71 10.7944 12.951
## cvpred      12.610 12.85 12.181 12.20 11.8292 12.189 10.87 10.9429 13.038
## SalePrice   13.253 13.31 12.297 11.74 11.9184 12.532  9.68 10.9151 13.365
## CV residual  0.644  0.46  0.116 -0.46  0.0892  0.343 -1.19 -0.0278  0.326
##                 50     57     63     68     74    77    78     80     82
## Predicted   12.586 13.278 13.329 12.256 11.729 12.55 13.75 12.919 13.018
## cvpred      12.652 13.261 13.330 11.769 11.731 12.56 13.73 12.948 12.976
## SalePrice   13.006 13.050 13.209 12.211 11.982 12.79 13.82 13.262 13.218
## CV residual  0.355 -0.211 -0.121  0.442  0.251  0.23  0.09  0.314  0.242
##                 89    91      97    98      99    106     110     112
## Predicted   11.824 11.85 12.7951 12.69 12.6118 11.942 11.7804 11.7535
## cvpred      11.885 11.88 12.8453 12.76 12.6828 11.938 11.7738 11.7415
## SalePrice   11.002 10.45 12.9342 12.58 12.7426 12.044 11.7361 11.8130
## CV residual -0.883 -1.43  0.0889 -0.18  0.0598  0.105 -0.0377  0.0715
##                115   117     119   120   123    128    133    135   138
## Predicted   11.969 12.24 11.8605 12.92 11.69 11.785 11.785 12.495 12.01
## cvpred      11.969 12.27 11.8611 13.00 11.68 11.812 11.835 12.545 12.02
## SalePrice   11.884 12.91 11.8494 12.63 10.45 11.935 12.424 12.861 10.99
## CV residual -0.085  0.64 -0.0117 -0.37 -1.23  0.123  0.589  0.316 -1.03
##               139    141      145    146    150   155     157    161
## Predicted   11.95 11.887 11.22084 11.699 11.567 13.05 12.4761 11.314
## cvpred      11.96 11.908 11.07765 11.512 11.407 13.06 12.4865 11.349
## SalePrice   10.37 11.635 11.08598 11.884 12.128 13.61 12.5602 11.608
## CV residual -1.58 -0.273  0.00833  0.373  0.721  0.55  0.0737  0.259
##                 162    168    173    177    183    184     189    191
## Predicted   12.1612 12.034 12.241 11.952 12.193 12.439 11.6641 11.930
## cvpred      12.1486 12.041 12.274 11.971 12.175 12.457 11.7541 11.923
## SalePrice   12.2303 11.735 11.842 11.831 12.946 12.995 11.6869 11.608
## CV residual  0.0816 -0.305 -0.431 -0.139  0.771  0.537 -0.0672 -0.315
##                195   196      198    201   203    204    206    207    208
## Predicted   12.185 12.04 11.80108 11.784 12.40 12.426 11.612 11.691 12.405
## cvpred      12.184 12.09 11.80300 11.798 12.40 12.484 11.645 11.718 12.412
## SalePrice   11.983 11.05 11.80932 12.144 12.58 12.633 12.297 11.884 12.524
## CV residual -0.201 -1.04  0.00632  0.346  0.18  0.149  0.652  0.166  0.112
##                209   211    214    218   224    225     229   230   231
## Predicted   11.329 12.09 12.067 14.179 12.10 11.862 13.3055 12.40 12.59
## cvpred      11.172 12.14 12.101 14.257 12.11 11.893 13.3582 12.40 12.15
## SalePrice   11.878 12.35 12.384 13.790 11.97 11.736 13.2963 12.58 12.52
## CV residual  0.705  0.21  0.284 -0.467 -0.14 -0.157 -0.0618  0.18  0.37
##                 232  233    237    238     239     240    242    243
## Predicted   10.8390 12.0 11.777 11.680 11.9429 12.2473 12.312 12.085
## cvpred      10.9812 12.4 11.774 11.672 11.9257 12.2759 12.287 12.093
## SalePrice   10.9133 12.0 12.139 12.020 11.8565 12.3014 11.362 11.884
## CV residual -0.0679 -0.4  0.365  0.348 -0.0692  0.0255 -0.924 -0.209
##                 246     253    254    258    261    272    277    278
## Predicted   12.3051 12.6656 11.742 12.674 13.761 12.840 12.011 12.169
## cvpred      12.2865 12.6657 11.761 12.661 13.824 12.324 11.995 12.206
## SalePrice   12.3863 12.6440 11.983 13.305 13.377 12.995 12.128 12.588
## CV residual  0.0998 -0.0217  0.222  0.643 -0.447  0.671  0.133  0.382
##               284    288    289   291      296    297
## Predicted   12.33 12.489 11.783 14.57 12.15957 12.175
## cvpred      12.37 12.485 11.788 14.64 12.14461 12.164
## SalePrice   12.24 12.760 11.983 13.20 12.13886 12.310
## CV residual -0.13  0.275  0.195 -1.43 -0.00575  0.146
## 
## Sum of squares = 26.1    Mean square = 0.26    n = 100 
## 
## fold 3 
## Observations in test set: 99 
##                 1      3      6      7       8     12     13      14
## Predicted   11.97 12.693 11.985 12.419 11.6380 11.676 12.150 11.9814
## cvpred      11.93 12.672 11.966 12.445 11.6111 11.665 12.156 11.9885
## SalePrice   12.26 12.948 12.465 12.821 11.6263 10.872 11.813 11.9512
## CV residual  0.33  0.276  0.498  0.376  0.0151 -0.792 -0.343 -0.0373
##                 19      21     39     42     44     47     51     52
## Predicted   11.613 12.4088 11.499 10.991 10.927 12.123 12.400 13.066
## cvpred      11.593 12.4304 11.731 10.902 10.830 12.099 12.386 13.096
## SalePrice   11.849 12.4568 10.840 11.708 10.977 12.560 12.065 13.459
## CV residual  0.256  0.0264 -0.891  0.806  0.147  0.461 -0.321  0.363
##                 53     54    55     58      59      62    64     65     67
## Predicted   11.685 12.104 11.70 12.581 12.6264 13.2751 13.74 12.247 11.955
## cvpred      11.659 12.082 11.69 12.580 12.6345 13.3392 13.44 12.267 11.939
## SalePrice   11.951 12.405 11.81 12.301 12.6115 13.2963 14.45 13.251 12.848
## CV residual  0.292  0.323  0.12 -0.278 -0.0229 -0.0429  1.01  0.984  0.909
##                70     71    72     73     75     76     83     84     87
## Predicted   11.89 11.801 11.99 12.041 11.824 12.606 13.443 11.987 11.870
## cvpred      11.88 11.816 11.99 12.050 11.820 12.564 13.237 11.964 11.864
## SalePrice   12.49 11.857 12.21 12.403 12.073 12.707 13.346 11.835 11.408
## CV residual  0.61  0.041  0.22  0.353  0.253  0.143  0.109 -0.129 -0.457
##                 92     93      96     101    105     107    108     113
## Predicted   12.228 12.602 13.0519 13.7287 13.147 11.8991 12.054 12.6333
## cvpred      11.981 12.375 13.0357 13.8450 13.078 11.8598 12.061 12.5978
## SalePrice   12.445 12.506 13.1224 13.7820 12.975 11.8776 11.350 12.5099
## CV residual  0.464  0.131  0.0867 -0.0631 -0.103  0.0177 -0.711 -0.0879
##               121    124    127     131    132   134    137     143
## Predicted   11.64 12.098 12.025 12.3385 12.680 12.91 11.999 11.3052
## cvpred      11.62 12.063 11.987 12.3613 12.464 13.04 11.985 11.4548
## SalePrice   11.29 12.692 12.324 12.4049 13.275 11.92 11.884 11.4773
## CV residual -0.33  0.629  0.337  0.0436  0.811 -1.13 -0.101  0.0225
##                 144    147   149    151   152   156    158    163    166
## Predicted   11.1728 11.372 12.32 11.037 11.37 12.77 13.584 12.399 11.523
## cvpred      11.3185 11.578 12.56 11.223 11.54 12.81 13.657 12.374 11.233
## SalePrice   11.2960 12.035 11.23 11.082 10.39 13.06 13.420 11.884 11.728
## CV residual -0.0224  0.457 -1.33 -0.141 -1.14  0.25 -0.237 -0.489  0.495
##                167    169    170    176    178    185    186    187    188
## Predicted   12.006 12.279 11.774 12.217 11.775 11.823 12.303 12.362 14.540
## cvpred      11.982 12.243 11.768 12.177 11.729 11.833 12.274 12.383 14.726
## SalePrice   11.720 12.181 11.513 11.842 11.408 12.530 12.142 12.835 13.825
## CV residual -0.262 -0.062 -0.255 -0.335 -0.321  0.697 -0.133  0.451 -0.901
##               200    202    205   210    215    217    219     220    223
## Predicted   13.32 11.692 11.953 13.15 11.658 12.114 12.927 13.5986 12.970
## cvpred      13.36 11.681 11.932 12.94 11.645 12.083 12.934 13.6721 12.978
## SalePrice   13.30 11.142 12.177 13.38 11.513 12.196 12.612 13.6352 12.843
## CV residual -0.06 -0.539  0.246  0.44 -0.132  0.113 -0.323 -0.0369 -0.136
##                227     228    235    236    241    248     250     251
## Predicted   11.890 12.6595 11.927 12.782 11.758 11.858 12.4927 12.0800
## cvpred      11.886 12.6880 11.851 12.788 11.744 12.014 12.4730 12.0432
## SalePrice   11.608 12.6603 11.608 13.452 11.518 12.168 12.4875 12.1145
## CV residual -0.278 -0.0277 -0.243  0.664 -0.226  0.154  0.0145  0.0713
##                252    255     256    257     264    266    267    271
## Predicted   11.882 11.909 12.5677 11.921 12.1937 12.669 11.693 12.273
## cvpred      11.887 11.819 12.5235 11.930 12.1765 12.645 11.644 12.288
## SalePrice   12.301 12.572 12.4969 11.562 12.2061 12.808 12.201 12.612
## CV residual  0.414  0.754 -0.0266 -0.368  0.0296  0.162  0.557  0.323
##                275     276     281    287    295    298
## Predicted   11.920 11.9759 11.9176 12.644 11.972 12.430
## cvpred      11.881 11.9512 11.8828 12.709 11.948 12.456
## SalePrice   11.775 11.9083 11.9184 12.953 11.608 12.154
## CV residual -0.106 -0.0428  0.0356  0.243 -0.339 -0.302
## 
## Sum of squares = 19.3    Mean square = 0.2    n = 99 
## 
## Overall (Sum over all 99 folds) 
##    ms 
## 0.228
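# Number of observations used in the cross-validation
n<-dim(housing_data_slim)[1]
# Mean squared prediction error: average squared gap between observed and cross-validated values
MSPE4 <- sum( ((SalePrice)-KCV4$cvpred)^2 )/n
# PRESS: total squared cross-validated prediction error
PRESS4 <- sum(((SalePrice)-KCV4$cvpred)^2)
# Predictive R-squared: 1 - PRESS / total sum of squares
Pred_R_squared4 <- 1-sum(((SalePrice)-KCV4$cvpred)^2)/sum(((SalePrice)-mean((SalePrice)))^2)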
MSPE4
## [1] 0.228
PRESS4
## [1] 68
Pred_R_squared4
## [1] 0.611
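
The three quantities above are linked: the fold sums of squares in the cross-validation output (22.6 + 26.1 + 19.3) add up to the PRESS of 68, and 68 / 298 gives the MSPE of 0.228. The following is a minimal base-R sketch of the same calculation. It assumes the built-in `mtcars` data and a made-up two-predictor model rather than the housing data, and it builds the folds by hand instead of using the cross-validation helper that produced `KCV4`.

```r
# Illustrative only: mtcars and the formula mpg ~ wt + hp are assumptions,
# not the housing data or model4.
set.seed(1)
dat <- mtcars
k <- 3
# Randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

cvpred <- numeric(nrow(dat))
for (j in seq_len(k)) {
  test <- fold == j
  # Fit on the other folds, predict the held-out fold
  fit_j <- lm(mpg ~ wt + hp, data = dat[!test, ])
  cvpred[test] <- predict(fit_j, newdata = dat[test, ])
}

press   <- sum((dat$mpg - cvpred)^2)                        # PRESS
mspe    <- press / nrow(dat)                                # MSPE
pred_r2 <- 1 - press / sum((dat$mpg - mean(dat$mpg))^2)     # predictive R-squared
c(PRESS = press, MSPE = mspe, Pred_R2 = pred_r2)
```

A predictive R-squared close to the ordinary R-squared (0.611 versus 0.667 here) suggests the model is not badly overfit.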
sapply(housing_data_slim,function(x) length(x))
##          ï..Index           Address          DateSold               Zip 
##               298               298               298               298 
##      Neighborhood         SalePrice              Year          Bedrooms 
##               298               298               298               298 
##         Bathrooms           Stories              SqFt           LotSqFt 
##               298               298               298               298 
##           Zip_Ind Neighborhood_Indi               Age 
##               298               298               298
boxcox(model4)

Transforming the response variable can help keep unequal variance and non-normality in check.

However, the Box-Cox analysis did not suggest a transformation that changes the model significantly. The decision made during the descriptive analysis to use the log of SalePrice therefore appears appropriate, although the non-normality seen in the Q-Q plot is still not resolved.
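
The Box-Cox plot is usually read by locating the lambda that maximizes the profile log-likelihood and checking which values fall inside its approximate 95% interval. The sketch below shows one way to extract those numbers instead of reading them off the plot. It assumes the built-in `mtcars` data and an illustrative model, not `model4`.

```r
library(MASS)

# Illustrative model on built-in data (an assumption, not the housing model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# plotit = FALSE returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

lambda_hat
ci
```

If the interval contains 1, the data give little reason to transform the response further; if it contains 0, a log transformation is a reasonable choice.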

vif(model4)
##                     Age                    SqFt               Bathrooms 
##                    1.21                    1.90                    2.23 
##             Zip_Indfour              Zip_Indone           Zip_Indothers 
##                    5.44                    5.37                   19.54 
##              Zip_Indsix            Zip_Indthree    Neighborhood_IndiNE2 
##                    5.53                   13.67                    7.64 
##    Neighborhood_IndiNE3    Neighborhood_IndiNE5    Neighborhood_IndiNE6 
##                    1.72                    7.94                    7.05 
## Neighborhood_Indiothers 
##                   11.37

After checking the Variance Inflation Factors, we observe that the VIFs for Zip_Indothers (19.54), Zip_Indthree (13.67), and Neighborhood_Indiothers (11.37) are greater than 10. Instead of doing ridge regression, these categories can be removed from the dataset. I’ve kept this data in a separate file for reference.
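
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others, and the same values appear on the diagonal of the inverse correlation matrix of the predictors. The sketch below computes per-column VIFs this way and flags those above 10. It assumes the built-in `mtcars` data and an illustrative formula; it is not the `vif()` call used above.

```r
# Illustrative model with a factor, so dummy columns appear as in model4
fit <- lm(mpg ~ wt + hp + disp + factor(cyl), data = mtcars)

# Design matrix without the intercept column
X <- model.matrix(fit)[, -1, drop = FALSE]

# VIFs are the diagonal of the inverse correlation matrix of the predictors
vif_manual <- diag(solve(cor(X)))
round(vif_manual, 2)

# Columns exceeding the usual rule-of-thumb threshold of 10
names(vif_manual)[vif_manual > 10]
```

Dropping a dummy level merges those observations into the reference category, so removing high-VIF levels changes what the remaining coefficients are compared against.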

out = lm.ridge(SalePrice ~ Age + SqFt + Bathrooms + Zip_Ind + Neighborhood_Indi,lambda=.1,data = housing_data_slim)

out
##                                             Age                    SqFt 
##               10.792873               -0.003756                0.000294 
##               Bathrooms             Zip_Indfour              Zip_Indone 
##                0.127199                0.867885               -0.086595 
##           Zip_Indothers              Zip_Indsix            Zip_Indthree 
##                0.202131                0.662472                0.770132 
##    Neighborhood_IndiNE2    Neighborhood_IndiNE3    Neighborhood_IndiNE5 
##                0.210186                0.894220                0.687092 
##    Neighborhood_IndiNE6 Neighborhood_Indiothers 
##                0.786450                0.534870

vif(out)

Because glmnet could not be installed (the installation gave an error), we used MASS::lm.ridge instead.
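
The ridge fit above uses a single fixed penalty, lambda = 0.1. One common alternative is to fit a grid of penalties and pick the one with the smallest generalized cross-validation (GCV) score, which `lm.ridge` stores in its result. The sketch below illustrates that idea on the built-in `mtcars` data; the grid and the formula are assumptions chosen for illustration.

```r
library(MASS)

# Grid of candidate penalties (an assumed range for illustration)
lambdas <- seq(0, 10, by = 0.1)

# Fit the whole ridge path on illustrative data
ridge_path <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars, lambda = lambdas)

# Penalty with the smallest GCV score
best_lambda <- ridge_path$lambda[which.min(ridge_path$GCV)]

# Refit at the chosen penalty and show coefficients on the original scale
ridge_best <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars, lambda = best_lambda)
best_lambda
coef(ridge_best)
```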

Prediction Dataset

observations_for_pred=data.frame(Bathrooms=c(1,2,4,4), Age=c(25,32,29,36),SqFt = c(2900,2500,1750,1350),Zip_Ind = c("one","one","four","three"),Neighborhood_Indi = c("NE1","NE5","NE2","NE1"))
predict(model4,observations_for_pred,interval="prediction", level=0.95, type="response")
##    fit  lwr  upr
## 1 11.6 10.6 12.6
## 2 12.3 11.2 13.3
## 3 12.8 11.7 13.9
## 4 12.3 11.2 13.4
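
Because the model is fitted to the log of SalePrice, the fitted values and interval limits above are on the log scale. Assuming the natural log of the price in dollars was used, exponentiating returns approximate dollar amounts. The sketch below simply re-enters the rounded values printed above, so the results carry that rounding.

```r
# Rounded log-scale predictions copied from the output above
pred_log <- matrix(
  c(11.6, 10.6, 12.6,
    12.3, 11.2, 13.3,
    12.8, 11.7, 13.9,
    12.3, 11.2, 13.4),
  ncol = 3, byrow = TRUE,
  dimnames = list(as.character(1:4), c("fit", "lwr", "upr"))
)

# Back-transform to the dollar scale (assumes a natural-log response)
pred_dollars <- round(exp(pred_log))
pred_dollars
```

The back-transformed fit is better read as a typical (median-like) price than as a mean, and the intervals become asymmetric around it on the dollar scale.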
max(hatvalues(model4))
## [1] 0.338
min(hatvalues(model4))
## [1] 0.00496
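# Each vector follows the column order of model.matrix(model4):
# intercept, Age, SqFt, Bathrooms, five Zip_Ind dummies, five Neighborhood_Indi dummies
x_new = c(1,25,2528,2,0,1,0,0,0,0,0,0,0,1)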
x_new_1 = c(1,50,252,2,0,0,0,1,0,0,0,1,0,0)
x_new_2 = c(1,52,25285,4,0,0,0,0,1,0,1,0,0,0)
x_new_3 = c(1,92,3590,5,0,0,0,1,0,0,0,1,0,0)
X=model.matrix(model4)


t(x_new)%*%solve(t(X)%*%X)%*%x_new
##       [,1]
## [1,] 0.235
t(x_new_1)%*%solve(t(X)%*%X)%*%x_new_1
##       [,1]
## [1,] 0.176
# x_new_2 gives a value above max(hatvalues(model4)) = 0.338, so predicting at it is an extrapolation
t(x_new_2)%*%solve(t(X)%*%X)%*%x_new_2
##      [,1]
## [1,] 3.56
t(x_new_3)%*%solve(t(X)%*%X)%*%x_new_3
##       [,1]
## [1,] 0.182
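
The check above compares the leverage of a new point, x' (X'X)^-1 x, with the largest hat value in the training data: a value above that maximum signals hidden extrapolation. The sketch below wraps the comparison in a small helper. The helper name `leverage_of`, the `mtcars` data, and the two test points are all assumptions made for illustration.

```r
# Illustrative model on built-in data (not model4)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
XtX_inv <- solve(crossprod(X))   # (X'X)^-1
h_max <- max(hatvalues(fit))     # largest leverage among the training rows

# Hypothetical helper: leverage of a new point given as (1, wt, hp)
leverage_of <- function(x_new) as.numeric(t(x_new) %*% XtX_inv %*% x_new)

x_inside  <- c(1, mean(mtcars$wt), mean(mtcars$hp))  # centre of the data
x_outside <- c(1, 10, 600)                           # far beyond the observed ranges

leverage_of(x_inside)  > h_max   # inside the data cloud
leverage_of(x_outside) > h_max   # flagged as extrapolation
```

The new vector must list its entries in the same order as the columns of the model matrix, including the leading 1 for the intercept.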
summary(model4)
## 
## Call:
## lm(formula = SalePrice ~ (Age + SqFt) + Bathrooms + Zip_Ind + 
##     Neighborhood_Indi, data = housing_data_slim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5719 -0.2110  0.0327  0.2712  1.0041 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.08e+01   3.65e-01   29.52  < 2e-16 ***
## Age                     -3.76e-03   8.04e-04   -4.67  4.6e-06 ***
## SqFt                     2.94e-04   3.53e-05    8.34  3.2e-15 ***
## Bathrooms                1.27e-01   3.59e-02    3.54  0.00047 ***
## Zip_Indfour              8.74e-01   2.89e-01    3.02  0.00275 ** 
## Zip_Indone              -7.79e-02   3.38e-01   -0.23  0.81767    
## Zip_Indothers            2.08e-01   2.64e-01    0.79  0.43137    
## Zip_Indsix               6.68e-01   3.02e-01    2.21  0.02765 *  
## Zip_Indthree             7.76e-01   3.44e-01    2.26  0.02473 *  
## Neighborhood_IndiNE2     2.20e-01   3.69e-01    0.60  0.55082    
## Neighborhood_IndiNE3     8.99e-01   3.45e-01    2.60  0.00968 ** 
## Neighborhood_IndiNE5     6.91e-01   3.10e-01    2.23  0.02669 *  
## Neighborhood_IndiNE6     7.91e-01   3.19e-01    2.48  0.01365 *  
## Neighborhood_Indiothers  5.39e-01   2.25e-01    2.40  0.01718 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.453 on 284 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.652 
## F-statistic: 43.8 on 13 and 284 DF,  p-value: <2e-16
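
Since the response is the log of SalePrice, each coefficient can be read as an approximate percentage change in price per unit change in the predictor, using 100 * (exp(b) - 1). The sketch below applies this to the three numeric predictors, with the estimates copied from the summary above. It assumes the natural log was used for the response.

```r
# Estimates copied from the summary(model4) output above
coefs <- c(Age = -3.76e-03, SqFt = 2.94e-04, Bathrooms = 1.27e-01)

# Percentage change in price per one-unit increase, holding the rest fixed
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 2)
```

Under that assumption, an additional bathroom is associated with a price roughly 13.5% higher, and each additional year of age with a price roughly 0.4% lower, holding the other variables fixed.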