Introduction

The goal of this project is to predict house prices in the city of Cincinnati by examining the potential predictive power of a number of variables on sold prices. Our assumption prior to starting this project was that the sold prices of houses should be correlated with, and perhaps predicted by, various external indicators. For example, the notion was that a house with more bedrooms and bathrooms should sell for more than a house with fewer bedrooms and bathrooms. Our data currently has 381 observations; we plan to add more observations to enhance the model. Through several iterations of this analysis we hope to eliminate unnecessary and irrelevant regressor variables and end up with a simplified regression equation that gives the best possible prediction. We began with the following 16 covariates and 1 response variable:

1. Sold Price: our response variable
2. Zip Code
3. Sold Date
4. Number of bedrooms
5. Number of full bathrooms
6. Number of half bathrooms
7. Stories
8. Year Built
9. Sqft
10. Lot Size
11. Total number of rooms
12. School Distance
13. Parking Capacity
14. Parking Size
15. Basement Size
16. Lawn
17. Patio

We created 18 different models using different transformations and finally decided to settle on the 6th model.

Our observed multiple R-squared is 0.7584 and adjusted R-squared is 0.7538.

The prediction R-squared is 0.744.

The final model is fit on the Box-Cox transformed response y2 = (Sold_price^0.303 - 1)/0.303, so the fitted equation is:

y2 = 9.206e+01 + 7.154e+00*No_of_fullbath - 3.691e+00*No_of_bedrooms + 3.832e+00*Parking_Capacity - 1.767e+00*Zipcode_bin + 1.223e+00*Total_rooms + 1.019e-04*med_price + 6.490e+00*No_of_halfbath

Data Preparation

#Loading the required packages
library(readxl)      # To read in the data
library(tidyverse)   # For general data manipulation and regression analysis
## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(knitr)       # To format tables
library(DAAG)
## Loading required package: lattice
library(ggplot2)     # To visualize data
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:DAAG':
## 
##     cities
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(qwraps2)
## 
## Attaching package: 'qwraps2'
## The following object is masked from 'package:psych':
## 
##     logit
library(naniar)
library(formattable) # To format tables into currency
library(dplyr)

Loading the dataset

data <- read_xlsx("dataf.xlsx")
#removing the columns with URL and teammates' names
#data <- data[,-c(1,2,3)] #already done on excel
data

Replacing all common missing-value strings in the data with NA

data <- data %>%replace_with_na_all(condition = ~.x %in% common_na_strings)

Coercing covariates to proper datatypes

data$Zipcode <- as.factor(data$Zipcode)
data$Lawn <- as.factor(data$Lawn)
data$Patio <- as.factor(data$Patio)
data$Year_built <- as.numeric(data$Year_built)
data$Parking_Capacity <- as.numeric(data$Parking_Capacity)
data$Lotsize_sqft <- as.numeric(data$Lotsize_sqft)
data$Sold_price <- as.numeric(data$Sold_price)
data$Sqft <- as.numeric(data$Sqft)
data$Total_rooms <- as.numeric(data$Total_rooms)

Describing the data

We see there are 381 total rows, and some of the variables have fewer non-missing rows. We need to check which variables are missing values and what percentage is missing.

Checking for duplicate rows.

data[!duplicated(data[1:17]),]  # keep rows that are not duplicated across all 17 columns

We don’t find any duplicate rows. That’s good.

#checking for null values in the data
colSums(is.na(data))
##       Sold_price          Zipcode        Sold_date   No_of_bedrooms 
##                0                0                0                0 
##   No_of_fullbath          Stories       Year_built             Sqft 
##                0                0                0                0 
##     Lotsize_sqft   No_of_halfbath      Total_rooms  School_distance 
##                0               17               17                0 
## Parking_Capacity     Parking_Size    Basement_size             Lawn 
##               20              105              258               17 
##            Patio 
##               17

Parking_Size and Basement_size have a major chunk of data missing. Let’s calculate the percentage of missing data.

sum(is.na(data$Parking_Size))*100/nrow(data)
## [1] 27.55906

Parking_Size has about 27.6% missing values. We can consider imputing the missing values here.

sum(is.na(data$Basement_size))*100/nrow(data)
## [1] 67.71654

Basement_size has about 67.7% missing values, which is a bad sign for a predictive model, and it is not advisable to impute such a large proportion of missing values. So we'll drop this column from our dataset.

data <- data[,-15]  # drop Basement_size (column 15)

Meanwhile, we impute the missing values in Parking_Size with the median, and we'll do the same for Total_rooms since it has only a small percentage of missing data.

Imputing median into missing rows.

data$Parksize_imp <- ifelse(is.na(data$Parking_Size), median(data$Parking_Size, na.rm=TRUE), data$Parking_Size)
data$Total_rooms <- ifelse(is.na(data$Total_rooms), median(data$Total_rooms, na.rm=TRUE), data$Total_rooms)

We'll replace the missing values in the other columns with 0, since if data about half bathrooms, parking capacity, lawn, or patio is missing, a reasonable assumption is that the feature is not present.

data[is.na(data)] <- 0
colSums(is.na(data))
##       Sold_price          Zipcode        Sold_date   No_of_bedrooms 
##                0                0                0                0 
##   No_of_fullbath          Stories       Year_built             Sqft 
##                0                0                0                0 
##     Lotsize_sqft   No_of_halfbath      Total_rooms  School_distance 
##                0                0                0                0 
## Parking_Capacity     Parking_Size             Lawn            Patio 
##                0                0                0                0 
##     Parksize_imp 
##                0

We don’t have any missing values in our dataset now.

Analysis

Proceeding with the analysis, we can strengthen it by adding data on the median price of houses per square foot. The size of the house surely plays a major role in its price, and we take into consideration the data provided by Zillow on monthly median prices per square foot by zip code, across years.

#loading the median price per squarefeet dataset
price <-read_xlsx("MedianValuePerSqft.xlsx")
price

This dataset includes pricing data going back to the 1970s, which is not useful for our analysis. We also want to filter it to only the zip codes of interest, namely Cincinnati zip codes.

price_f <- filter(price, Zip %in% unique(data$Zipcode))

With this information, we can calculate an average of the median prices from the last 11 months (January 2019 - November 2019) for each zip code. We will create a new variable for the mean price.

price_avgmed <- data.frame(ID=price_f[,2], Means=rowMeans(price_f[,7:16]))
price_avgmed

Merging both datasets

data$Zipcode <- as.character(data$Zipcode)
price_avgmed$Zip <- as.character(price_avgmed$Zip)
data1 <- left_join(x = data, y = price_avgmed, by = c("Zipcode" = "Zip"))

Using the mean of the median prices, we can calculate a new column called med_price, a rough price of the house obtained by multiplying the size of the house (Sqft) by the average median price per square foot (Means).

data2 <- data1 %>% mutate(med_price = Sqft * Means)
data2
data <- data2[,-18]  # drop the Means helper column (column 18)
data$Zipcode <- as.factor(data$Zipcode )

Removing the Sold_date variable

data <- data[,-3]  # Sold_date is column 3
colSums(is.na(data))
##       Sold_price          Zipcode   No_of_bedrooms   No_of_fullbath 
##                0                0                0                0 
##          Stories       Year_built             Sqft     Lotsize_sqft 
##                0                0                0                0 
##   No_of_halfbath      Total_rooms  School_distance Parking_Capacity 
##                0                0                0                0 
##     Parking_Size             Lawn            Patio     Parksize_imp 
##                0                0                0                0 
##        med_price 
##                0

Instead of using the variable Year_built directly, we can use it to derive the age of the house, which seems a more reasonable factor to put into our model.

#Getting age of the house
data$age <- 2019 - data$Year_built

Data Preview

dim(data)
## [1] 381  18
summary(data)
##    Sold_price         Zipcode    No_of_bedrooms  No_of_fullbath 
##  Min.   :  16000   45208  : 30   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 137800   45230  : 26   1st Qu.:3.000   1st Qu.:1.000  
##  Median : 202000   45245  : 23   Median :3.000   Median :2.000  
##  Mean   : 299996   45219  : 19   Mean   :3.247   Mean   :1.992  
##  3rd Qu.: 327900   45202  : 17   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :2700000   45217  : 17   Max.   :7.000   Max.   :7.000  
##                    (Other):249                                  
##     Stories        Year_built        Sqft       Lotsize_sqft   
##  Min.   :1.000   Min.   :1866   Min.   : 609   Min.   :   609  
##  1st Qu.:1.000   1st Qu.:1916   1st Qu.:1344   1st Qu.:  4791  
##  Median :2.000   Median :1946   Median :1776   Median :  7579  
##  Mean   :1.929   Mean   :1946   Mean   :2062   Mean   : 13131  
##  3rd Qu.:2.000   3rd Qu.:1978   3rd Qu.:2374   3rd Qu.: 12197  
##  Max.   :9.000   Max.   :2017   Max.   :8426   Max.   :150718  
##                                                                
##  No_of_halfbath    Total_rooms     School_distance Parking_Capacity
##  Min.   :0.0000   Min.   : 2.000   Min.   :0.100   Min.   :0.000   
##  1st Qu.:0.0000   1st Qu.: 6.000   1st Qu.:0.400   1st Qu.:1.000   
##  Median :1.0000   Median : 7.000   Median :0.600   Median :1.000   
##  Mean   :0.5984   Mean   : 7.115   Mean   :0.805   Mean   :1.346   
##  3rd Qu.:1.0000   3rd Qu.: 8.000   3rd Qu.:1.000   3rd Qu.:2.000   
##  Max.   :2.0000   Max.   :16.000   Max.   :4.100   Max.   :5.000   
##                                                                    
##   Parking_Size   Lawn     Patio      Parksize_imp      med_price      
##  Min.   :   0   0  :175   0  :110   Min.   :   0.0   Min.   :  47913  
##  1st Qu.:   0   1  :205   1  :270   1st Qu.: 250.0   1st Qu.: 145515  
##  Median : 228   Yes:  1   Yes:  1   Median : 329.5   Median : 209853  
##  Mean   : 237                       Mean   : 327.8   Mean   : 277377  
##  3rd Qu.: 410                       3rd Qu.: 410.0   3rd Qu.: 321988  
##  Max.   :1733                       Max.   :1733.0   Max.   :1733260  
##                                                                       
##       age        
##  Min.   :  2.00  
##  1st Qu.: 41.00  
##  Median : 73.00  
##  Mean   : 73.29  
##  3rd Qu.:103.00  
##  Max.   :153.00  
## 

Visualizing the data.

ggplot(data, aes(Sold_price)) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(data$Sold_price),
                            sd = sd(data$Sold_price))) + 
  ggtitle("Plot of SalePrice Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The price data is positively skewed, but considering that most of the general population is lower to middle income, it is expected that most houses will be relatively cheap, with only a few costly houses.
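As a quick numeric check of the skewness, the psych package loaded above provides skew(); a clearly positive value confirms the right skew (a minimal sketch):

psych::skew(data$Sold_price)  # > 0 indicates positive (right) skew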

boxplot(data$Sold_price)

That means most of the houses should also be smaller in size, and we can expect the size of the house to show similar behavior. Let's check it.

ggplot(data, aes(Sqft)) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "blue",
                args = list(mean = mean(data$Sqft),
                            sd = sd(data$Sqft))) + 
  ggtitle("Plot of House Size Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is the expected behavior.

boxplot(data$Sqft)

ggplot(data, aes(Lotsize_sqft)) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(data$Lotsize_sqft),
                            sd = sd(data$Lotsize_sqft))) + 
  ggtitle("Plot of Lotsize_sqft Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s check the age of the houses in Cincinnati and their distribution.

ggplot(data, aes(age)) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "blue",
                args = list(mean = mean(data$age),
                            sd = sd(data$age))) + 
  ggtitle("Plot of age Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The ages of the houses, surprisingly, follow a roughly normal distribution.

pairs.panels(data[c("Sold_price","No_of_bedrooms","No_of_fullbath","No_of_halfbath","Stories","Year_built","Sqft","Lotsize_sqft","Parking_Size","Parking_Capacity","Lawn","Patio","Zipcode","age","med_price")],hist.col = "green",gap=0)

Breaking the panel into smaller clusters of variables to understand the relationships better.

pairs.panels(data[c("Sold_price","No_of_bedrooms","No_of_fullbath","No_of_halfbath","Stories")],hist.col = "blue",gap=0)

pairs.panels(data[c("Sold_price","Sqft","Lotsize_sqft","Parking_Size","Parking_Capacity")],hist.col = "red",gap=0)

pairs.panels(data[c("Sold_price","Lawn","Patio","Zipcode","age","med_price")],hist.col = "green",gap=0)

We have a lot of zip codes in our dataset. What we can do is categorize the zip codes into bins according to the quality of the neighborhood, i.e., best to worst.

This website provides a ranking of the best to worst zip codes in Cincinnati, which we use to create the bins.

# Rank-based zip code bins (1 = best neighborhoods, 5 = worst); unlisted zips stay 0
zip_bins <- list(
  `1` = c(45040, 45242, 45243, 45209, 45249, 45208, 45226, 45202, 45069, 45241, 41075, 45255),
  `2` = c(45174, 41017, 45230, 45236, 45039, 45220, 45140, 45065, 45213, 41048, 45246),
  `3` = c(41011, 41091, 45223, 41073, 41076, 45150, 45212, 45248, 45036, 41042, 45215, 41071, 45050, 45245, 45014, 41005),
  `4` = c(45206, 45224, 45238, 41094, 45247, 41018, 41001, 41059, 45211, 45011, 45152, 45240, 45217, 45044, 45034, 45002, 41051, 45239, 45218, 45103, 45252),
  `5` = c(47025, 45204, 41015, 45231, 41014, 45013, 45157, 45030, 41016, 45005, 45102, 45216, 45237)
)
data$Zipcode_bin <- 0
for (bin in names(zip_bins)) {
  data$Zipcode_bin[data$Zipcode %in% zip_bins[[bin]]] <- as.numeric(bin)
}
data$Zipcode_bin
##   [1] 3 3 3 3 5 1 1 1 0 1 0 0 0 2 0 5 5 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 4 1 2
##  [38] 1 2 2 0 1 0 1 3 1 1 2 0 2 4 3 1 5 0 3 0 4 2 2 4 2 1 2 0 1 2 1 4 2 0 0 0 0
##  [75] 2 5 1 0 1 3 0 2 4 3 1 2 2 2 2 2 1 1 1 1 1 1 1 4 4 4 4 3 1 3 1 4 1 4 5 4 1
## [112] 2 5 1 0 1 3 4 0 0 2 1 0 5 1 0 3 3 4 3 3 3 3 0 4 3 3 1 1 0 1 1 1 1 0 0 0 0
## [149] 2 0 2 2 2 1 3 4 1 1 3 4 1 2 0 1 1 1 4 4 1 1 0 2 2 2 4 0 0 0 0 0 0 0 0 0 0
## [186] 0 3 2 4 3 1 0 4 2 1 3 1 3 1 1 3 0 4 4 2 1 1 1 1 1 1 1 1 1 1 1 5 4 5 4 4 4
## [223] 4 4 4 4 4 4 4 4 4 4 4 4 0 0 3 3 1 0 3 0 3 3 3 3 3 0 0 2 0 0 0 0 2 2 2 0 2
## [260] 2 0 4 1 2 1 4 1 3 4 4 0 4 4 1 4 3 0 4 1 0 2 2 3 0 3 1 5 5 2 4 4 4 0 0 0 1
## [297] 3 3 1 3 1 3 3 3 3 1 4 2 5 0 0 5 0 3 2 3 5 1 4 4 4 0 4 0 1 1 3 0 4 4 4 1 3
## [334] 4 2 1 1 1 1 1 1 1 1 1 1 1 5 5 1 4 4 4 0 1 2 2 2 0 3 1 1 2 4 4 1 4 1 1 0 1
## [371] 2 1 1 4 1 4 4 0 2 3 1

Modelling

We’ll now fit our first model.

model1<-(lm(Sold_price~No_of_fullbath+No_of_bedrooms+Sqft+Parksize_imp+Parking_Capacity+Zipcode_bin+Stories+Total_rooms+med_price+No_of_halfbath+School_distance+Lotsize_sqft+age, data = data))

summary(model1)
## 
## Call:
## lm(formula = Sold_price ~ No_of_fullbath + No_of_bedrooms + Sqft + 
##     Parksize_imp + Parking_Capacity + Zipcode_bin + Stories + 
##     Total_rooms + med_price + No_of_halfbath + School_distance + 
##     Lotsize_sqft + age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -682963  -59114    -145   51804 1358597 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.731e+04  4.886e+04  -1.378   0.1692    
## No_of_fullbath    6.171e+04  1.487e+04   4.151 4.12e-05 ***
## No_of_bedrooms   -2.521e+04  1.207e+04  -2.089   0.0374 *  
## Sqft             -4.854e+01  1.972e+01  -2.461   0.0143 *  
## Parksize_imp      9.990e+01  6.058e+01   1.649   0.1000 .  
## Parking_Capacity -4.847e+02  1.272e+04  -0.038   0.9696    
## Zipcode_bin      -3.101e+03  5.558e+03  -0.558   0.5772    
## Stories          -6.675e+03  8.952e+03  -0.746   0.4564    
## Total_rooms       7.940e+03  5.768e+03   1.377   0.1695    
## med_price         1.244e+00  8.292e-02  14.996  < 2e-16 ***
## No_of_halfbath    7.376e+03  1.593e+04   0.463   0.6437    
## School_distance  -1.698e+04  1.379e+04  -1.231   0.2192    
## Lotsize_sqft      9.936e-01  5.608e-01   1.772   0.0773 .  
## age               1.064e+02  2.793e+02   0.381   0.7034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159500 on 367 degrees of freedom
## Multiple R-squared:  0.7657, Adjusted R-squared:  0.7574 
## F-statistic: 92.26 on 13 and 367 DF,  p-value: < 2.2e-16

The p-value of the F-statistic is less than 0.05, which means the model is significant overall. The adjusted R-squared of 0.7574 suggests a strong model.

Our model should satisfy the following four (LINE) criteria:

1) Linearity
2) Independent errors
3) Normality
4) Equal variance
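As a numeric complement to the diagnostic plots below, one could also test the normality and equal-variance assumptions directly on the residuals. A minimal sketch, assuming the lmtest package is installed (it is not used elsewhere in this report):

shapiro.test(residuals(model1))  # H0: residuals are normally distributed
lmtest::bptest(model1)           # Breusch-Pagan test; H0: constant error variance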

Let’s see the model statistics.

plot(model1)

We can make the following observations from this model:

1) The residuals vs. fitted values plot indicates nonconstant variance.
2) The normal Q-Q plot indicates heavy tails, but an otherwise symmetrical distribution.
3) Observations 373, 372, and 215 appear to be identifiable outliers.

Let's try some transformations on our model, which may help it satisfy the LINE assumptions.

We saw that Sold_price and Sqft (size of house) were positively skewed. Let's try a log transformation of the skewed variables.

Log transformation

(MSRes=summary(model1)$sigma^2)
## [1] 25424537620
#log transformation of Sold_price
ggplot(data, aes(log(Sold_price))) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(log(data$Sold_price)),
                            sd = sd(log(data$Sold_price)))) + 
  ggtitle("Plot of log(SalePrice) Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#log transformation of Sqft
ggplot(data, aes(log(Sqft))) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(log(data$Sqft)),
                            sd = sd(log(data$Sqft)))) + 
  ggtitle("Plot of log(Sqft) Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(log(Lotsize_sqft))) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(log(data$Lotsize_sqft)),
                            sd = sd(log(data$Lotsize_sqft)))) + 
  ggtitle("Plot of log(Lotsize_sqft) Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(log(School_distance))) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(log(data$School_distance)),
                            sd = sd(log(data$School_distance)))) + 
  ggtitle("Plot of log(School_distance) Distribution") +
  theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s try modelling with our transformed variables and see if that works for us.

model2<-(lm(log(Sold_price)~No_of_fullbath+No_of_bedrooms+log(Sqft)+Parksize_imp+Parking_Capacity+Zipcode_bin+Stories+Total_rooms+med_price+No_of_halfbath+log(School_distance)+log(Lotsize_sqft)+age, data = data))

summary(model2)
## 
## Call:
## lm(formula = log(Sold_price) ~ No_of_fullbath + No_of_bedrooms + 
##     log(Sqft) + Parksize_imp + Parking_Capacity + Zipcode_bin + 
##     Stories + Total_rooms + med_price + No_of_halfbath + log(School_distance) + 
##     log(Lotsize_sqft) + age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85040 -0.20083  0.03226  0.27620  1.50606 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.040e+01  7.520e-01  13.835  < 2e-16 ***
## No_of_fullbath        1.532e-01  4.209e-02   3.639 0.000313 ***
## No_of_bedrooms       -1.018e-01  3.534e-02  -2.880 0.004214 ** 
## log(Sqft)             1.521e-01  1.130e-01   1.347 0.178899    
## Parksize_imp          1.142e-04  1.750e-04   0.653 0.514418    
## Parking_Capacity      8.226e-02  3.643e-02   2.258 0.024526 *  
## Zipcode_bin          -4.998e-02  1.599e-02  -3.126 0.001914 ** 
## Stories               1.712e-02  2.618e-02   0.654 0.513505    
## Total_rooms           2.819e-02  1.651e-02   1.707 0.088584 .  
## med_price             1.738e-06  2.016e-07   8.619  < 2e-16 ***
## No_of_halfbath        1.856e-01  4.598e-02   4.036 6.63e-05 ***
## log(School_distance) -2.478e-02  3.270e-02  -0.758 0.449124    
## log(Lotsize_sqft)    -7.914e-03  3.437e-02  -0.230 0.818021    
## age                  -9.195e-04  7.930e-04  -1.160 0.246997    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4576 on 367 degrees of freedom
## Multiple R-squared:  0.6741, Adjusted R-squared:  0.6625 
## F-statistic: 58.39 on 13 and 367 DF,  p-value: < 2.2e-16

So the log transformation doesn't quite work on our data: the R-squared has decreased considerably, and model2 appears weaker than model1. (Strictly, R-squared values are not directly comparable across different response transformations, since the models explain variance on different scales, but the drop here is large.)

plot(model2)

Transformations

Box-Cox Transformation

#using aliases for convenience
attach(data)
y<-Sold_price
x1<-No_of_fullbath
x2<-No_of_bedrooms
x3<-Sqft
x4<-Parksize_imp
x5<-Parking_Capacity
x6<-Zipcode_bin
x13<-Zipcode
x7<-Stories
x8<-Total_rooms
x9<-med_price
x10<-No_of_halfbath
x11<-School_distance
x12<-Lotsize_sqft
x14<-age
model3<-(lm(y ~ x1+x2+x3+x4+x5
            +x6+x7+x8+x9+x10+x11+x12+x14
            , data = data))
summary(model3)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11 + x12 + x14, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -682963  -59114    -145   51804 1358597 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.731e+04  4.886e+04  -1.378   0.1692    
## x1           6.171e+04  1.487e+04   4.151 4.12e-05 ***
## x2          -2.521e+04  1.207e+04  -2.089   0.0374 *  
## x3          -4.854e+01  1.972e+01  -2.461   0.0143 *  
## x4           9.990e+01  6.058e+01   1.649   0.1000 .  
## x5          -4.847e+02  1.272e+04  -0.038   0.9696    
## x6          -3.101e+03  5.558e+03  -0.558   0.5772    
## x7          -6.675e+03  8.952e+03  -0.746   0.4564    
## x8           7.940e+03  5.768e+03   1.377   0.1695    
## x9           1.244e+00  8.292e-02  14.996  < 2e-16 ***
## x10          7.376e+03  1.593e+04   0.463   0.6437    
## x11         -1.698e+04  1.379e+04  -1.231   0.2192    
## x12          9.936e-01  5.608e-01   1.772   0.0773 .  
## x14          1.064e+02  2.793e+02   0.381   0.7034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159500 on 367 degrees of freedom
## Multiple R-squared:  0.7657, Adjusted R-squared:  0.7574 
## F-statistic: 92.26 on 13 and 367 DF,  p-value: < 2.2e-16
(MSRes=summary(model3)$sigma^2)
## [1] 25424537620
bc<-MASS::boxcox(y ~ x1+x2+x3+x4+x5
                 +x6+x7+x8+x9+x10+x11+x12+x14)

(lambda <- bc$x[which.max(bc$y)])
## [1] 0.3030303
data$y2 <- (data$Sold_price ^ lambda - 1) / lambda
model4<-(lm(y2 ~ x1+x2+x3+x4+x5
            +x6+x7+x8+x9+x10+x11+x12+x14,data=data))

summary(model4)
## 
## Call:
## lm(formula = y2 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11 + x12 + x14, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.166  -8.489  -0.509  10.078  69.722 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.291e+01  5.274e+00  17.618  < 2e-16 ***
## x1           7.605e+00  1.605e+00   4.740 3.06e-06 ***
## x2          -3.053e+00  1.303e+00  -2.344 0.019622 *  
## x3          -4.149e-03  2.129e-03  -1.949 0.052049 .  
## x4           9.989e-03  6.538e-03   1.528 0.127418    
## x5           2.800e+00  1.373e+00   2.040 0.042084 *  
## x6          -1.747e+00  5.998e-01  -2.913 0.003796 ** 
## x7           3.405e-01  9.662e-01   0.352 0.724714    
## x8           1.597e+00  6.225e-01   2.565 0.010700 *  
## x9           1.166e-04  8.950e-06  13.026  < 2e-16 ***
## x10          5.849e+00  1.720e+00   3.402 0.000744 ***
## x11         -2.094e+00  1.489e+00  -1.407 0.160379    
## x12         -7.300e-05  6.053e-05  -1.206 0.228575    
## x14         -2.115e-02  3.015e-02  -0.702 0.483380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.21 on 367 degrees of freedom
## Multiple R-squared:  0.7669, Adjusted R-squared:  0.7586 
## F-statistic: 92.87 on 13 and 367 DF,  p-value: < 2.2e-16
( MSRes=summary(model4)$sigma^2 )
## [1] 296.1689
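Since model4 (and the later model6) predict the transformed response y2 = (Sold_price^lambda - 1)/lambda with lambda ≈ 0.303, predictions must eventually be back-transformed to dollars. A minimal helper for this, with the hypothetical name boxcox_inv:

# Inverse of the one-parameter Box-Cox transform used above
boxcox_inv <- function(z, lambda) (lambda * z + 1)^(1 / lambda)
all.equal(boxcox_inv(data$y2, lambda), data$Sold_price)  # sanity check; should be TRUE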

Model4 seems good, and its mean squared residual also looks okay. Next we drop the regressors that were not significant in model4 and refit the reduced model (model6).

model6<-(lm(y2 ~ No_of_fullbath+No_of_bedrooms+Parking_Capacity+Zipcode_bin+Total_rooms+med_price+No_of_halfbath,data=data))

summary(model6)
## 
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity + 
##     Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.663  -8.379  -0.069  10.464  67.800 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.206e+01  3.828e+00  24.051  < 2e-16 ***
## No_of_fullbath    7.154e+00  1.538e+00   4.652 4.58e-06 ***
## No_of_bedrooms   -3.691e+00  1.239e+00  -2.979 0.003079 ** 
## Parking_Capacity  3.832e+00  1.002e+00   3.825 0.000153 ***
## Zipcode_bin      -1.767e+00  5.959e-01  -2.965 0.003225 ** 
## Total_rooms       1.223e+00  5.975e-01   2.047 0.041349 *  
## med_price         1.019e-04  6.044e-06  16.866  < 2e-16 ***
## No_of_halfbath    6.490e+00  1.649e+00   3.937 9.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.38 on 373 degrees of freedom
## Multiple R-squared:  0.7584, Adjusted R-squared:  0.7538 
## F-statistic: 167.2 on 7 and 373 DF,  p-value: < 2.2e-16

Let’s check the performance of this model.

plot(model6)

Issues

One option would be to use the original Zipcode factor instead of Zipcode_bin. Before that, we check for variance inflation factors.

#install.packages("glmnet")
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 3.0-1
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:formattable':
## 
##     area
## The following object is masked from 'package:DAAG':
## 
##     hills
## The following object is masked from 'package:dplyr':
## 
##     select
# Generalized VIF for lm objects (essentially the implementation of car::vif)
vif <- function(mod, ...) {
    if (any(is.na(coef(mod)))) 
        stop ("there are aliased coefficients in the model")
    v <- vcov(mod)
    assign <- attr(model.matrix(mod), "assign")
    if (names(coefficients(mod)[1]) == "(Intercept)") {
        v <- v[-1, -1]
        assign <- assign[-1]
    }
    else warning("No intercept: vifs may not be sensible.")
    terms <- labels(terms(mod))
    n.terms <- length(terms)
    if (n.terms < 2) stop("model contains fewer than 2 terms")
    R <- cov2cor(v)
    detR <- det(R)
    result <- matrix(0, n.terms, 3)
    rownames(result) <- terms
    colnames(result) <- c("GVIF", "Df", "GVIF^(1/(2*Df))")
    for (term in 1:n.terms) {
        subs <- which(assign == term)
        result[term, 1] <- det(as.matrix(R[subs, subs])) *
            det(as.matrix(R[-subs, -subs])) / detR
        result[term, 2] <- length(subs)
    }
    if (all(result[, 2] == 1)) result <- result[, 1]
    else result[, 3] <- result[, 1]^(1/(2 * result[, 2]))
    result
}
vif(model6)
##   No_of_fullbath   No_of_bedrooms Parking_Capacity      Zipcode_bin 
##         2.513842         1.620320         1.263289         1.097569 
##      Total_rooms        med_price   No_of_halfbath 
##         1.892211         2.199216         1.183733

We can see that all VIF values are well below 10, so variance inflation is not a concern for our regressors, and we can say our dataset doesn't suffer from serious multicollinearity.

We don’t need to perform ridge regression.
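Had the VIFs been high, ridge regression would be the natural remedy. Since glmnet is already loaded, a minimal sketch (not needed here) would look like:

# Ridge regression sketch; alpha = 0 selects the ridge penalty
X <- model.matrix(y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity +
                    Zipcode_bin + Total_rooms + med_price + No_of_halfbath,
                  data = data)[, -1]        # drop the intercept column
ridge_cv <- cv.glmnet(X, data$y2, alpha = 0)
coef(ridge_cv, s = "lambda.min")            # coefficients at the CV-selected penalty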

Variable Selection

We will use stepwise regression to find the best model.

1) Forward Selection

attach(data)
## The following objects are masked from data (pos = 6):
## 
##     age, Lawn, Lotsize_sqft, med_price, No_of_bedrooms, No_of_fullbath,
##     No_of_halfbath, Parking_Capacity, Parking_Size, Parksize_imp,
##     Patio, School_distance, Sold_price, Sqft, Stories, Total_rooms,
##     Year_built, Zipcode, Zipcode_bin
y<-Sold_price
x1<-No_of_fullbath
x2<-No_of_bedrooms
x3<-Sqft
x4<-Parksize_imp
x5<-Parking_Capacity
x6<-Zipcode_bin
x13<-Zipcode
x7<-Stories
x8<-Total_rooms
x9<-med_price
x10<-No_of_halfbath
x11<-School_distance
x12<-Lotsize_sqft
x14<-age
add1(lm(y~1,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12+x1,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12+x1+x3,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12+x1+x3+x9,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
model7<- lm(y~x12+x1+x3+x9,data=data)
summary(model7)
## 
## Call:
## lm(formula = y ~ x12 + x1 + x3 + x9, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -719150  -61286   -2491   55711 1363917 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.181e+04  2.178e+04  -3.756  0.00020 ***
## x12          1.238e+00  5.022e-01   2.465  0.01416 *  
## x1           6.342e+04  1.334e+04   4.754 2.85e-06 ***
## x3          -5.409e+01  1.663e+01  -3.253  0.00124 ** 
## x9           1.264e+00  7.241e-02  17.461  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 159900 on 376 degrees of freedom
## Multiple R-squared:  0.7586, Adjusted R-squared:  0.7561 
## F-statistic: 295.5 on 4 and 376 DF,  p-value: < 2.2e-16

2) Backward Selection

drop1(lm(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14,data=data), test="F")
drop1(lm(y~x1+x2+x3+x4+x6+x7+x8+x9+x10+x11+x12+x14,data=data), test="F")
drop1(lm(y~x1+x2+x4+x3+x6+x7+x8+x9+x10+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x3+x8+x9+x6+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x3+x8+x9+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x6+x9+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x8+x9+x11,data=data), test="F")
drop1(lm(y~x1+x2+x4+x9+x11,data=data), test="F")
drop1(lm(y~x1+x2+x4+x9,data=data), test="F")
model8<-lm(y~x1+x2+x4+x9,data=data)
summary(model8)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x4 + x9, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -699320  -58165   -1467   53112 1392390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.085e+04  3.097e+04  -1.965  0.05015 .  
## x1           5.676e+04  1.336e+04   4.250  2.7e-05 ***
## x2          -2.886e+04  1.026e+04  -2.812  0.00518 ** 
## x4           9.843e+01  4.320e+01   2.279  0.02325 *  
## x9           1.115e+00  5.185e-02  21.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160200 on 376 degrees of freedom
## Multiple R-squared:  0.7578, Adjusted R-squared:  0.7552 
## F-statistic: 294.1 on 4 and 376 DF,  p-value: < 2.2e-16

Forward and backward selection arrive at slightly different four-regressor models (y ~ x12 + x1 + x3 + x9 vs. y ~ x1 + x2 + x4 + x9), but with nearly identical fit (adjusted R-squared of 0.7561 vs. 0.7552).
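For reference, base R's step() automates this search using AIC rather than F-tests; a minimal sketch of the same forward search (not run here):

# AIC-based forward selection over the same candidate regressors
null_mod <- lm(y ~ 1, data = data)
full_mod <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x14,
               data = data)
step(null_mod, scope = formula(full_mod), direction = "forward")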

Model Validation

This is our final model: model6, fit on the Box-Cox transformed response y2.

summary(model6)
## 
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity + 
##     Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.663  -8.379  -0.069  10.464  67.800 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.206e+01  3.828e+00  24.051  < 2e-16 ***
## No_of_fullbath    7.154e+00  1.538e+00   4.652 4.58e-06 ***
## No_of_bedrooms   -3.691e+00  1.239e+00  -2.979 0.003079 ** 
## Parking_Capacity  3.832e+00  1.002e+00   3.825 0.000153 ***
## Zipcode_bin      -1.767e+00  5.959e-01  -2.965 0.003225 ** 
## Total_rooms       1.223e+00  5.975e-01   2.047 0.041349 *  
## med_price         1.019e-04  6.044e-06  16.866  < 2e-16 ***
## No_of_halfbath    6.490e+00  1.649e+00   3.937 9.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.38 on 373 degrees of freedom
## Multiple R-squared:  0.7584, Adjusted R-squared:  0.7538 
## F-statistic: 167.2 on 7 and 373 DF,  p-value: < 2.2e-16
MSRes=summary(model6)$sigma^2
MSRes
## [1] 302.0616
summary(model6)$sigma
## [1] 17.37992
library(DAAG)

Testing

df<-read_xlsx("testing.xlsx")
df
colSums(is.na(df))
##            house       Sold_price          Zipcode        Sold_date 
##                0                0                0                5 
##   No_of_bedrooms   No_of_fullbath          Stories       Year_built 
##                0                0                1                0 
##             Sqft     Lotsize_sqft   No_of_halfbath      Total_rooms 
##                0                4                0                0 
##  School_distance Parking_Capacity     Parking_Size    Basement_size 
##                5                0                8               10 
##             Lawn            Patio 
##                8                8

We drop Basement_size here because our model wasn't trained on it.

df
df$Parking_Capacity <- ifelse(is.na(df$Parking_Capacity), median(df$Parking_Capacity, na.rm=TRUE), df$Parking_Capacity)
df$Parking_Size <- ifelse(is.na(df$Parking_Size), median(df$Parking_Size, na.rm=TRUE), df$Parking_Size)
df <- df %>%replace_with_na_all(condition = ~.x %in% common_na_strings)
df$Zipcode <- as.factor(df$Zipcode)

df$Year_built <- as.numeric(df$Year_built)
df$Parking_Capacity <- as.numeric(df$Parking_Capacity)

df$Sold_price <- as.numeric(df$Sold_price)

df$Total_rooms <- as.numeric(df$Total_rooms)
price <-read_xlsx("MedianValuePerSqft.xlsx")
price
price_ft <- filter(price, Zip %in% unique(df$Zipcode))
price_avgmedt <- data.frame(ID=price_ft[,2], Means=rowMeans(price_ft[,7:16]))
price_avgmedt
df$Zipcode <- as.character(df$Zipcode)
price_avgmedt$Zip <- as.character(price_avgmedt$Zip)
df1 <- left_join(x = df, y = price_avgmedt, by = c("Zipcode" = "Zip"))
df2 <- df1 %>% mutate(med_price = Sqft * Means)
df2
df <- subset(df2, select = -Means)  # drop the Means helper column by name; the test file's extra 'house' column shifts positional indices
df$Zipcode <- as.factor(df$Zipcode )
df
df <- subset(df, select = -Sold_date)  # remove Sold_date, as in training
df
df$med_price <- ifelse(is.na(df$med_price), median(df$med_price, na.rm=TRUE), df$med_price)
df$Zipcode_bin <- 0

# Assign the same rank-based bins as the training data, reusing the zip_bins lookup
for (bin in names(zip_bins)) {
  df$Zipcode_bin[df$Zipcode %in% zip_bins[[bin]]] <- as.numeric(bin)
}
df
myvars <- df[c("No_of_fullbath", "No_of_bedrooms","Parking_Capacity","Zipcode_bin","Total_rooms","med_price","No_of_halfbath")]
myvars<-data.frame(myvars)
myvars
testing1 = predict(model6, myvars, interval = c("confidence"), level = 0.95, type="response")
testing1
##         fit      lwr      upr
## 1  162.6170 158.4699 166.7640
## 2  145.6362 139.5373 151.7350
## 3  123.5510 117.1402 129.9618
## 4  137.0497 133.3702 140.7292
## 5  163.0880 157.5779 168.5981
## 6  116.7299 108.8786 124.5813
## 7  175.3661 168.1912 182.5411
## 8  159.4850 154.5416 164.4284
## 9  143.1182 138.3019 147.9344
## 10 126.6522 122.9773 130.3271
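These fitted values are on the Box-Cox transformed (y2) scale, since that is what model6 predicts. To express the point predictions in dollars we can invert the transformation, reusing the boxcox_inv helper sketched earlier (the back-transformed interval endpoints are only approximate on the dollar scale):

round(boxcox_inv(testing1, lambda))  # element-wise inverse transform of fit, lwr, upr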
model6
## 
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity + 
##     Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
## 
## Coefficients:
##      (Intercept)    No_of_fullbath    No_of_bedrooms  Parking_Capacity  
##       92.0586083         7.1544521        -3.6913642         3.8318037  
##      Zipcode_bin       Total_rooms         med_price    No_of_halfbath  
##       -1.7665257         1.2231665         0.0001019         6.4902319
summary(model6)
## 
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity + 
##     Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.663  -8.379  -0.069  10.464  67.800 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.206e+01  3.828e+00  24.051  < 2e-16 ***
## No_of_fullbath    7.154e+00  1.538e+00   4.652 4.58e-06 ***
## No_of_bedrooms   -3.691e+00  1.239e+00  -2.979 0.003079 ** 
## Parking_Capacity  3.832e+00  1.002e+00   3.825 0.000153 ***
## Zipcode_bin      -1.767e+00  5.959e-01  -2.965 0.003225 ** 
## Total_rooms       1.223e+00  5.975e-01   2.047 0.041349 *  
## med_price         1.019e-04  6.044e-06  16.866  < 2e-16 ***
## No_of_halfbath    6.490e+00  1.649e+00   3.937 9.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.38 on 373 degrees of freedom
## Multiple R-squared:  0.7584, Adjusted R-squared:  0.7538 
## F-statistic: 167.2 on 7 and 373 DF,  p-value: < 2.2e-16
#sapply(df, names)
ytest <- df['Sold_price']
ytest_t <- (ytest ^ lambda - 1) / lambda  # actual test prices on the Box-Cox (y2) scale
ytest_t
ypred<-testing1[,1]
ypred
##        1        2        3        4        5        6        7        8 
## 162.6170 145.6362 123.5510 137.0497 163.0880 116.7299 175.3661 159.4850 
##        9       10 
## 143.1182 126.6522
prediction_error = ytest_t-testing1[,1]
prediction_error
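As a one-number summary of these residuals, one could compute the test-set RMSE on the transformed scale (a sketch; prediction_error is a one-column data frame here):

sqrt(mean(prediction_error[[1]]^2))  # RMSE on the y2 scale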
PRESS <- function(linear.model) {
  #' calculate the predictive residuals
  pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
  #' calculate the PRESS
  PRESS <- sum(pr^2)
  return(PRESS)
}
MSPE <- function(linear.model) {
  #' calculate the MSPE =PRESS/sample size
  return(PRESS(linear.model)/length(residuals(linear.model)))
}
pred_r_squared <- function(linear.model) {
  #' Use anova() to get the sum of squares for the linear model
  lm.anova <- anova(linear.model)
  #' Calculate the total sum of squares
  tss <- sum(lm.anova$'Sum Sq')
  # Calculate the predictive R^2
  pred.r.squared <- 1-PRESS(linear.model)/(tss)
  
  return(pred.r.squared)
}
MSPE(model6)
## [1] 313.3384
pred_r_squared(model6)
## [1] 0.7439672

We get a prediction R-squared of about 74.4%, which is decent, though not outstanding. There is always scope for improvement.