The goal of this project is to predict house prices in the city of Cincinnati by examining the potential predictive power of a number of variables on the sold prices. Our assumption prior to starting this project was that the sold prices of houses should be correlated with, and perhaps predicted by, various external indicators. For example, the notion was that a house with more bedrooms and bathrooms should sell for a higher price than a house with fewer bedrooms and bathrooms. Currently our data has 381 observations; we plan to add more observations to enhance our model. Through several iterations of this analysis we hope to eliminate unnecessary and irrelevant regressor variables and end up with a simplified regression equation that gives the best possible prediction. We began with the following response variable and 16 covariates:
1. Sold Price: our response variable
2. Zip Code
3. Sold Date
4. Number of bedrooms
5. Number of full bathrooms
6. Number of half bathrooms
7. Stories
8. Year Built
9. Sqft
10. Lot Size
11. Total number of rooms
12. School Distance
13. Parking Capacity
14. Parking Size
15. Basement Size
16. Lawn
17. Patio
We created 18 different models using different transformations and finally settled on the 6th model.
Our observed Multiple R-squared is 0.7584 and Adjusted R-squared is 0.7538.
The prediction R-squared is 0.7439672.
The final model equation, on the Box-Cox transformed scale (lambda ≈ 0.303), is:
(Sold_price^0.303 - 1)/0.303 = 9.206e+01 + 7.154e+00*No_of_fullbath - 3.691e+00*No_of_bedrooms + 3.832e+00*Parking_Capacity - 1.767e+00*Zipcode_bin + 1.223e+00*Total_rooms + 1.019e-04*med_price + 6.490e+00*No_of_halfbath
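As an illustration, once model6 below has been fit, a prediction for a single house can be computed on the transformed scale and mapped back to dollars. This is a minimal sketch; the feature values here are invented for illustration:
new_house <- data.frame(No_of_fullbath = 2, No_of_bedrooms = 3, Parking_Capacity = 2,
                        Zipcode_bin = 1, Total_rooms = 7, med_price = 250000,
                        No_of_halfbath = 1)
pred_t <- predict(model6, new_house) # prediction on the Box-Cox scale
(0.303 * pred_t + 1)^(1 / 0.303) # back-transform to dollars (lambda ~ 0.303)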
#Loading the required packages
library(readxl) # To read in the data
library(tidyverse) # For general data manipulation and regression analysis
## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(knitr) # To format tables
library(DAAG) # For additional regression utilities (e.g. cross-validation)
## Loading required package: lattice
library(ggplot2) # To visualize data
library(psych) # For pairs.panels correlation matrices
##
## Attaching package: 'psych'
## The following object is masked from 'package:DAAG':
##
## cities
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(qwraps2) # For formatted summary statistics
##
## Attaching package: 'qwraps2'
## The following object is masked from 'package:psych':
##
## logit
library(naniar) # For replace_with_na_all and common_na_strings
library(formattable) # To format tables into currency
library(dplyr)
data <- read_xlsx("dataf.xlsx")
# Removing the columns with the URL and teammates' names
#data <- data[,-c(1,2,3)] # already done in Excel
data
data <- data %>% replace_with_na_all(condition = ~.x %in% common_na_strings) # convert common textual NA markers to real NAs
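naniar’s built-in common_na_strings is a character vector of typical textual markers for missingness (variants of “NA”, “n/a”, “NULL”, and similar); printing it shows exactly which values get replaced:
common_na_strings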
data$Zipcode <- as.factor(data$Zipcode)
data$Lawn <- as.factor(data$Lawn)
data$Patio <- as.factor(data$Patio)
data$Year_built <- as.numeric(data$Year_built)
data$Parking_Capacity <- as.numeric(data$Parking_Capacity)
data$Lotsize_sqft <- as.numeric(data$Lotsize_sqft)
data$Sold_price <- as.numeric(data$Sold_price)
data$Sqft <- as.numeric(data$Sqft)
data$Total_rooms <- as.numeric(data$Total_rooms)
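The per-column conversions above can also be written more compactly. A sketch assuming the dplyr version loaded above (mutate_at is available in dplyr 0.8.x):
# Equivalent bulk type conversions
num_cols <- c("Year_built", "Parking_Capacity", "Lotsize_sqft", "Sold_price", "Sqft", "Total_rooms")
data <- data %>%
  mutate_at(vars(one_of(num_cols)), as.numeric) %>%
  mutate_at(vars(Zipcode, Lawn, Patio), as.factor)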
We see there are 381 total rows, and some variables have fewer non-missing entries. We need to check which variables have missing values and what percentage is missing.
data[!duplicated(data[1:17]),]
We don’t find any duplicate rows. That’s good.
#checking for null values in the data
colSums(is.na(data))
## Sold_price Zipcode Sold_date No_of_bedrooms
## 0 0 0 0
## No_of_fullbath Stories Year_built Sqft
## 0 0 0 0
## Lotsize_sqft No_of_halfbath Total_rooms School_distance
## 0 17 17 0
## Parking_Capacity Parking_Size Basement_size Lawn
## 20 105 258 17
## Patio
## 17
Parking_Size and Basement_size have a major chunk of data missing. Let’s calculate the percentage of missing data.
sum(is.na(data$Parking_Size))*100/nrow(data)
## [1] 27.55906
Parking_Size has about 27.6% missing values. We can consider imputing the missing values here.
sum(is.na(data$Basement_size))*100/nrow(data)
## [1] 67.71654
Basement_size is missing for about 68% of observations, which is a bad sign for a predictive model, and imputing that many missing values is not advisable. So we’ll drop this column from our dataset.
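Before dropping it, the missing percentage of every column can be checked at once with one line of base R:
round(colMeans(is.na(data)) * 100, 1) # percent missing per column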
data <- data[,-15] # drop Basement_size (column 15)
Meanwhile, we impute the missing values in Parking_Size with the median; we do the same for Total_rooms, since it has only a small percentage of missing data.
data$Parksize_imp <- ifelse(is.na(data$Parking_Size), median(data$Parking_Size, na.rm=TRUE), data$Parking_Size)
data$Total_rooms <- ifelse(is.na(data$Total_rooms), median(data$Total_rooms, na.rm=TRUE), data$Total_rooms)
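Since median imputation recurs in this analysis, a small helper keeps it consistent. This is a sketch; impute_median is our own name, not a package function:
impute_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)
# e.g. data$Parksize_imp <- impute_median(data$Parking_Size) gives the same result as above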
We’ll replace the missing values in the other columns with 0: if data about half bathrooms, parking capacity, lawn, or patio is missing, a reasonable assumption is that the feature is simply not present.
data[is.na(data)] <- 0
colSums(is.na(data))
## Sold_price Zipcode Sold_date No_of_bedrooms
## 0 0 0 0
## No_of_fullbath Stories Year_built Sqft
## 0 0 0 0
## Lotsize_sqft No_of_halfbath Total_rooms School_distance
## 0 0 0 0
## Parking_Capacity Parking_Size Lawn Patio
## 0 0 0 0
## Parksize_imp
## 0
We don’t have any missing values in our dataset now.
Proceeding with the analysis, we can strengthen it by adding data on the median house price per square foot. The size of the house surely plays a major role in its price, so we take into consideration Zillow’s data of median prices per square foot by zip code, reported monthly across years.
# Loading the median price per square foot dataset
price <-read_xlsx("MedianValuePerSqft.xlsx")
price
This dataset includes pricing data going back to the 1970s, which is not useful for our analysis. We also want to filter it down to only the zip codes of interest, i.e., the Cincinnati-area zip codes in our data.
price_f <- filter(price, Zip %in% unique(data$Zipcode))
With this information, we can calculate an average of the median prices from the last 11 months (January 2019 - November 2019) for each zip code. We will create a new variable for the mean price.
price_avgmed <- data.frame(ID=price_f[,2], Means=rowMeans(price_f[,7:16]))
price_avgmed
data$Zipcode <- as.character(data$Zipcode)
price_avgmed$Zip <- as.character(price_avgmed$Zip)
data1 <- left_join(x = data, y = price_avgmed, by = c("Zipcode" = "Zip"))
Using the mean of the median prices, we can compute a new column called med_price: an estimate of the house price obtained by multiplying the house size (Sqft) by the average median price per square foot (Means).
data2 <- data1 %>% mutate(med_price = Sqft * Means)
data2
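For example, a 2,000 Sqft house in a zip code whose Means is $150 per square foot gets med_price = 2000 * 150 = $300,000.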
data <- dplyr::select(data2, -Means) # drop the raw Means column, keeping med_price
data$Zipcode <- as.factor(data$Zipcode)
data <- dplyr::select(data, -Sold_date) # drop Sold_date
colSums(is.na(data))
## Sold_price Zipcode No_of_bedrooms No_of_fullbath
## 0 0 0 0
## Stories Year_built Sqft Lotsize_sqft
## 0 0 0 0
## No_of_halfbath Total_rooms School_distance Parking_Capacity
## 0 0 0 0
## Parking_Size Lawn Patio Parksize_imp
## 0 0 0 0
## med_price
## 0
Instead of using the variable Year_built directly, we can use it to derive the age of the house, which seems a more reasonable predictor for our model.
#Getting age of the house
data$age <- 2019 - data$Year_built
dim(data)
## [1] 381 18
summary(data)
## Sold_price Zipcode No_of_bedrooms No_of_fullbath
## Min. : 16000 45208 : 30 Min. :1.000 Min. :1.000
## 1st Qu.: 137800 45230 : 26 1st Qu.:3.000 1st Qu.:1.000
## Median : 202000 45245 : 23 Median :3.000 Median :2.000
## Mean : 299996 45219 : 19 Mean :3.247 Mean :1.992
## 3rd Qu.: 327900 45202 : 17 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :2700000 45217 : 17 Max. :7.000 Max. :7.000
## (Other):249
## Stories Year_built Sqft Lotsize_sqft
## Min. :1.000 Min. :1866 Min. : 609 Min. : 609
## 1st Qu.:1.000 1st Qu.:1916 1st Qu.:1344 1st Qu.: 4791
## Median :2.000 Median :1946 Median :1776 Median : 7579
## Mean :1.929 Mean :1946 Mean :2062 Mean : 13131
## 3rd Qu.:2.000 3rd Qu.:1978 3rd Qu.:2374 3rd Qu.: 12197
## Max. :9.000 Max. :2017 Max. :8426 Max. :150718
##
## No_of_halfbath Total_rooms School_distance Parking_Capacity
## Min. :0.0000 Min. : 2.000 Min. :0.100 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 6.000 1st Qu.:0.400 1st Qu.:1.000
## Median :1.0000 Median : 7.000 Median :0.600 Median :1.000
## Mean :0.5984 Mean : 7.115 Mean :0.805 Mean :1.346
## 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :2.0000 Max. :16.000 Max. :4.100 Max. :5.000
##
## Parking_Size Lawn Patio Parksize_imp med_price
## Min. : 0 0 :175 0 :110 Min. : 0.0 Min. : 47913
## 1st Qu.: 0 1 :205 1 :270 1st Qu.: 250.0 1st Qu.: 145515
## Median : 228 Yes: 1 Yes: 1 Median : 329.5 Median : 209853
## Mean : 237 Mean : 327.8 Mean : 277377
## 3rd Qu.: 410 3rd Qu.: 410.0 3rd Qu.: 321988
## Max. :1733 Max. :1733.0 Max. :1733260
##
## age
## Min. : 2.00
## 1st Qu.: 41.00
## Median : 73.00
## Mean : 73.29
## 3rd Qu.:103.00
## Max. :153.00
##
ggplot(data, aes(Sold_price)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "red",
args = list(mean = mean(data$Sold_price),
sd = sd(data$Sold_price))) +
ggtitle("Plot of SalePrice Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The price data is positively skewed. This is expected: most of the population is in the low- to middle-income range and buys cheaper houses, while only a few houses are very expensive, forming the right tail.
boxplot(data$Sold_price)
That means most of the houses should also be smaller in size, and we can expect house size to show similar skew. Let’s check.
ggplot(data, aes(Sqft)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = mean(data$Sqft),
sd = sd(data$Sqft))) +
ggtitle("Plot of House Size Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As expected, house size shows the same positively skewed behavior.
boxplot(data$Sqft)
ggplot(data, aes(Lotsize_sqft)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "red",
args = list(mean = mean(data$Lotsize_sqft),
sd = sd(data$Lotsize_sqft))) +
ggtitle("Plot of Lotsize_sqft Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(age)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = mean(data$age),
sd = sd(data$age))) +
ggtitle("Plot of age Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Surprisingly, the ages of the houses follow an approximately normal distribution.
pairs.panels(data[c("Sold_price","No_of_bedrooms","No_of_fullbath","No_of_halfbath","Stories","Year_built","Sqft","Lotsize_sqft","Parking_Size","Parking_Capacity","Lawn","Patio","Zipcode","age","med_price")],hist.col = "green",gap=0)
pairs.panels(data[c("Sold_price","No_of_bedrooms","No_of_fullbath","No_of_halfbath","Stories")],hist.col = "blue",gap=0)
pairs.panels(data[c("Sold_price","Sqft","Lotsize_sqft","Parking_Size","Parking_Capacity")],hist.col = "red",gap=0)
pairs.panels(data[c("Sold_price","Lawn","Patio","Zipcode","age","med_price")],hist.col = "green",gap=0)
A website ranking the best to worst zip codes in Cincinnati is used to create bins of the data (bin 1 for the best group through bin 5 for the worst; unranked zip codes stay at 0).
# Zip code bins from the ranking above (1 = best group, 5 = worst, 0 = unranked)
bin1 <- c(45040, 45242, 45243, 45209, 45249, 45208, 45226, 45202, 45069, 45241, 41075, 45255)
bin2 <- c(45174, 41017, 45230, 45236, 45039, 45220, 45140, 45065, 45213, 41048, 45246)
bin3 <- c(41011, 41091, 45223, 41073, 41076, 45150, 45212, 45248, 45036, 41042, 45215, 41071, 45050, 45245, 45014, 41005)
bin4 <- c(45206, 45224, 45238, 41094, 45247, 41018, 41001, 41059, 45211, 45011, 45152, 45240, 45217, 45044, 45034, 45002, 41051, 45239, 45218, 45103, 45252)
bin5 <- c(47025, 45204, 41015, 45231, 41014, 45013, 45157, 45030, 41016, 45005, 45102, 45216, 45237)
data$Zipcode_bin <- 0
data$Zipcode_bin[data$Zipcode %in% bin1] <- 1
data$Zipcode_bin[data$Zipcode %in% bin2] <- 2
data$Zipcode_bin[data$Zipcode %in% bin3] <- 3
data$Zipcode_bin[data$Zipcode %in% bin4] <- 4
data$Zipcode_bin[data$Zipcode %in% bin5] <- 5
data$Zipcode_bin
## [1] 3 3 3 3 5 1 1 1 0 1 0 0 0 2 0 5 5 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 4 1 2
## [38] 1 2 2 0 1 0 1 3 1 1 2 0 2 4 3 1 5 0 3 0 4 2 2 4 2 1 2 0 1 2 1 4 2 0 0 0 0
## [75] 2 5 1 0 1 3 0 2 4 3 1 2 2 2 2 2 1 1 1 1 1 1 1 4 4 4 4 3 1 3 1 4 1 4 5 4 1
## [112] 2 5 1 0 1 3 4 0 0 2 1 0 5 1 0 3 3 4 3 3 3 3 0 4 3 3 1 1 0 1 1 1 1 0 0 0 0
## [149] 2 0 2 2 2 1 3 4 1 1 3 4 1 2 0 1 1 1 4 4 1 1 0 2 2 2 4 0 0 0 0 0 0 0 0 0 0
## [186] 0 3 2 4 3 1 0 4 2 1 3 1 3 1 1 3 0 4 4 2 1 1 1 1 1 1 1 1 1 1 1 5 4 5 4 4 4
## [223] 4 4 4 4 4 4 4 4 4 4 4 4 0 0 3 3 1 0 3 0 3 3 3 3 3 0 0 2 0 0 0 0 2 2 2 0 2
## [260] 2 0 4 1 2 1 4 1 3 4 4 0 4 4 1 4 3 0 4 1 0 2 2 3 0 3 1 5 5 2 4 4 4 0 0 0 1
## [297] 3 3 1 3 1 3 3 3 3 1 4 2 5 0 0 5 0 3 2 3 5 1 4 4 4 0 4 0 1 1 3 0 4 4 4 1 3
## [334] 4 2 1 1 1 1 1 1 1 1 1 1 1 5 5 1 4 4 4 0 1 2 2 2 0 3 1 1 2 4 4 1 4 1 1 0 1
## [371] 2 1 1 4 1 4 4 0 2 3 1
We’ll now fit our first model.
model1<-(lm(Sold_price~No_of_fullbath+No_of_bedrooms+Sqft+Parksize_imp+Parking_Capacity+Zipcode_bin+Stories+Total_rooms+med_price+No_of_halfbath+School_distance+Lotsize_sqft+age, data = data))
summary(model1)
##
## Call:
## lm(formula = Sold_price ~ No_of_fullbath + No_of_bedrooms + Sqft +
## Parksize_imp + Parking_Capacity + Zipcode_bin + Stories +
## Total_rooms + med_price + No_of_halfbath + School_distance +
## Lotsize_sqft + age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -682963 -59114 -145 51804 1358597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.731e+04 4.886e+04 -1.378 0.1692
## No_of_fullbath 6.171e+04 1.487e+04 4.151 4.12e-05 ***
## No_of_bedrooms -2.521e+04 1.207e+04 -2.089 0.0374 *
## Sqft -4.854e+01 1.972e+01 -2.461 0.0143 *
## Parksize_imp 9.990e+01 6.058e+01 1.649 0.1000 .
## Parking_Capacity -4.847e+02 1.272e+04 -0.038 0.9696
## Zipcode_bin -3.101e+03 5.558e+03 -0.558 0.5772
## Stories -6.675e+03 8.952e+03 -0.746 0.4564
## Total_rooms 7.940e+03 5.768e+03 1.377 0.1695
## med_price 1.244e+00 8.292e-02 14.996 < 2e-16 ***
## No_of_halfbath 7.376e+03 1.593e+04 0.463 0.6437
## School_distance -1.698e+04 1.379e+04 -1.231 0.2192
## Lotsize_sqft 9.936e-01 5.608e-01 1.772 0.0773 .
## age 1.064e+02 2.793e+02 0.381 0.7034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 159500 on 367 degrees of freedom
## Multiple R-squared: 0.7657, Adjusted R-squared: 0.7574
## F-statistic: 92.26 on 13 and 367 DF, p-value: < 2.2e-16
The p-value of the F-statistic is below 0.05, which means the model is significant overall. The adjusted R-squared of 0.7574 suggests a strong model.
Our model should satisfy the following four assumptions:
1) Linearity
2) Independent errors
3) Normality of errors
4) Equal variance
Let’s look at the diagnostic plots.
plot(model1)
We can make the following observations from this model:
1) The residuals vs. fitted values plot indicates nonconstant variance.
2) The normal Q-Q plot indicates that the distribution has heavy tails but is otherwise symmetrical.
3) Observations 373, 372, and 215 appear to be the most identifiable outliers (examined with Cook’s distance below).
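To follow up on the flagged observations, Cook’s distance gives a quick measure of each point’s influence on the fit (a base-R sketch):
cooks <- cooks.distance(model1)
head(sort(cooks, decreasing = TRUE)) # the most influential observations first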
We saw that Sold_price and Sqft (house size) were positively skewed. Let’s try a log transformation of the skewed variables.
(MSRes=summary(model1)$sigma^2) # mean squared residual of model1
## [1] 25424537620
#log transformation of Sold_price
ggplot(data, aes(log(Sold_price))) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "red",
args = list(mean = mean(log(data$Sold_price)),
sd = sd(log(data$Sold_price)))) +
ggtitle("Plot of log(SalePrice) Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#log transformation of Sqft
ggplot(data, aes(log(Sqft))) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "red",
args = list(mean = mean(log(data$Sqft)),
sd = sd(log(data$Sqft)))) +
ggtitle("Plot of log(Sqft) Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(log(Lotsize_sqft))) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "red",
args = list(mean = mean(log(data$Lotsize_sqft)),
sd = sd(log(data$Lotsize_sqft)))) +
ggtitle("Plot of log(Lotsize_sqft) Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(log(School_distance))) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "red",
args = list(mean = mean(log(data$School_distance)),
sd = sd(log(data$School_distance)))) +
ggtitle("Plot of log(School_distance) Distribution") +
theme(plot.title = element_text(size = 11))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s try modelling with our transformed variables and see if that works for us.
model2<-(lm(log(Sold_price)~No_of_fullbath+No_of_bedrooms+log(Sqft)+Parksize_imp+Parking_Capacity+Zipcode_bin+Stories+Total_rooms+med_price+No_of_halfbath+log(School_distance)+log(Lotsize_sqft)+age, data = data))
summary(model2)
##
## Call:
## lm(formula = log(Sold_price) ~ No_of_fullbath + No_of_bedrooms +
## log(Sqft) + Parksize_imp + Parking_Capacity + Zipcode_bin +
## Stories + Total_rooms + med_price + No_of_halfbath + log(School_distance) +
## log(Lotsize_sqft) + age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85040 -0.20083 0.03226 0.27620 1.50606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.040e+01 7.520e-01 13.835 < 2e-16 ***
## No_of_fullbath 1.532e-01 4.209e-02 3.639 0.000313 ***
## No_of_bedrooms -1.018e-01 3.534e-02 -2.880 0.004214 **
## log(Sqft) 1.521e-01 1.130e-01 1.347 0.178899
## Parksize_imp 1.142e-04 1.750e-04 0.653 0.514418
## Parking_Capacity 8.226e-02 3.643e-02 2.258 0.024526 *
## Zipcode_bin -4.998e-02 1.599e-02 -3.126 0.001914 **
## Stories 1.712e-02 2.618e-02 0.654 0.513505
## Total_rooms 2.819e-02 1.651e-02 1.707 0.088584 .
## med_price 1.738e-06 2.016e-07 8.619 < 2e-16 ***
## No_of_halfbath 1.856e-01 4.598e-02 4.036 6.63e-05 ***
## log(School_distance) -2.478e-02 3.270e-02 -0.758 0.449124
## log(Lotsize_sqft) -7.914e-03 3.437e-02 -0.230 0.818021
## age -9.195e-04 7.930e-04 -1.160 0.246997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4576 on 367 degrees of freedom
## Multiple R-squared: 0.6741, Adjusted R-squared: 0.6625
## F-statistic: 58.39 on 13 and 367 DF, p-value: < 2.2e-16
So the log transformation doesn’t work well on our data: the adjusted R-squared drops considerably (0.6625 vs. 0.7574), and model2 is weaker than model1.
plot(model2)
Box-Cox Transformation
#using aliases for convenience
attach(data)
y<-Sold_price
x1<-No_of_fullbath
x2<-No_of_bedrooms
x3<-Sqft
x4<-Parksize_imp
x5<-Parking_Capacity
x6<-Zipcode_bin
x13<-Zipcode
x7<-Stories
x8<-Total_rooms
x9<-med_price
x10<-No_of_halfbath
x11<-School_distance
x12<-Lotsize_sqft
x14<-age
model3<-(lm(y ~ x1+x2+x3+x4+x5
+x6+x7+x8+x9+x10+x11+x12+x14
, data = data))
summary(model3)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
## x10 + x11 + x12 + x14, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -682963 -59114 -145 51804 1358597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.731e+04 4.886e+04 -1.378 0.1692
## x1 6.171e+04 1.487e+04 4.151 4.12e-05 ***
## x2 -2.521e+04 1.207e+04 -2.089 0.0374 *
## x3 -4.854e+01 1.972e+01 -2.461 0.0143 *
## x4 9.990e+01 6.058e+01 1.649 0.1000 .
## x5 -4.847e+02 1.272e+04 -0.038 0.9696
## x6 -3.101e+03 5.558e+03 -0.558 0.5772
## x7 -6.675e+03 8.952e+03 -0.746 0.4564
## x8 7.940e+03 5.768e+03 1.377 0.1695
## x9 1.244e+00 8.292e-02 14.996 < 2e-16 ***
## x10 7.376e+03 1.593e+04 0.463 0.6437
## x11 -1.698e+04 1.379e+04 -1.231 0.2192
## x12 9.936e-01 5.608e-01 1.772 0.0773 .
## x14 1.064e+02 2.793e+02 0.381 0.7034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 159500 on 367 degrees of freedom
## Multiple R-squared: 0.7657, Adjusted R-squared: 0.7574
## F-statistic: 92.26 on 13 and 367 DF, p-value: < 2.2e-16
(MSRes=summary(model3)$sigma^2)
## [1] 25424537620
bc<-MASS::boxcox(y ~ x1+x2+x3+x4+x5
+x6+x7+x8+x9+x10+x11+x12+x14)
(lambda <- bc$x[which.max(bc$y)])
## [1] 0.3030303
data$y2 <- (data$Sold_price ^ lambda - 1) / lambda # Box-Cox transform of the response
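For interpreting results later, the inverse transform maps a value on this scale back to dollars. A small helper we define ourselves (valid since lambda != 0):
inv_boxcox <- function(z, lambda) (lambda * z + 1)^(1 / lambda)
inv_boxcox(data$y2[1], lambda) # recovers Sold_price for the first observation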
model4<-(lm(y2 ~ x1+x2+x3+x4+x5
+x6+x7+x8+x9+x10+x11+x12+x14,data=data))
summary(model4)
##
## Call:
## lm(formula = y2 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
## x10 + x11 + x12 + x14, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.166 -8.489 -0.509 10.078 69.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.291e+01 5.274e+00 17.618 < 2e-16 ***
## x1 7.605e+00 1.605e+00 4.740 3.06e-06 ***
## x2 -3.053e+00 1.303e+00 -2.344 0.019622 *
## x3 -4.149e-03 2.129e-03 -1.949 0.052049 .
## x4 9.989e-03 6.538e-03 1.528 0.127418
## x5 2.800e+00 1.373e+00 2.040 0.042084 *
## x6 -1.747e+00 5.998e-01 -2.913 0.003796 **
## x7 3.405e-01 9.662e-01 0.352 0.724714
## x8 1.597e+00 6.225e-01 2.565 0.010700 *
## x9 1.166e-04 8.950e-06 13.026 < 2e-16 ***
## x10 5.849e+00 1.720e+00 3.402 0.000744 ***
## x11 -2.094e+00 1.489e+00 -1.407 0.160379
## x12 -7.300e-05 6.053e-05 -1.206 0.228575
## x14 -2.115e-02 3.015e-02 -0.702 0.483380
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.21 on 367 degrees of freedom
## Multiple R-squared: 0.7669, Adjusted R-squared: 0.7586
## F-statistic: 92.87 on 13 and 367 DF, p-value: < 2.2e-16
( MSRes=summary(model4)$sigma^2 )
## [1] 296.1689
This model seems good, and the mean squared residual also looks reasonable; note that it is on the transformed scale, so it is not directly comparable with model1’s. Next, we drop the regressors that were insignificant in model4 and refit on the transformed response.
model6<-(lm(y2 ~ No_of_fullbath+No_of_bedrooms+Parking_Capacity+Zipcode_bin+Total_rooms+med_price+No_of_halfbath,data=data))
summary(model6)
##
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity +
## Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.663 -8.379 -0.069 10.464 67.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.206e+01 3.828e+00 24.051 < 2e-16 ***
## No_of_fullbath 7.154e+00 1.538e+00 4.652 4.58e-06 ***
## No_of_bedrooms -3.691e+00 1.239e+00 -2.979 0.003079 **
## Parking_Capacity 3.832e+00 1.002e+00 3.825 0.000153 ***
## Zipcode_bin -1.767e+00 5.959e-01 -2.965 0.003225 **
## Total_rooms 1.223e+00 5.975e-01 2.047 0.041349 *
## med_price 1.019e-04 6.044e-06 16.866 < 2e-16 ***
## No_of_halfbath 6.490e+00 1.649e+00 3.937 9.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.38 on 373 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7538
## F-statistic: 167.2 on 7 and 373 DF, p-value: < 2.2e-16
Let’s check the performance of this model.
plot(model6)
## Issues
Let’s check the model for multicollinearity using variance inflation factors.
#install.packages("glmnet")
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 3.0-1
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:formattable':
##
## area
## The following object is masked from 'package:DAAG':
##
## hills
## The following object is masked from 'package:dplyr':
##
## select
# Generalized VIF, adapted from car::vif for lm objects
vif <- function(mod, ...) {
if (any(is.na(coef(mod))))
stop ("there are aliased coefficients in the model")
v <- vcov(mod)
assign <- attr(model.matrix(mod), "assign")
if (names(coefficients(mod)[1]) == "(Intercept)") {
v <- v[-1, -1]
assign <- assign[-1]
}
else warning("No intercept: vifs may not be sensible.")
terms <- labels(terms(mod))
n.terms <- length(terms)
if (n.terms < 2) stop("model contains fewer than 2 terms")
R <- cov2cor(v)
detR <- det(R)
result <- matrix(0, n.terms, 3)
rownames(result) <- terms
colnames(result) <- c("GVIF", "Df", "GVIF^(1/(2*Df))")
for (term in 1:n.terms) {
subs <- which(assign == term)
result[term, 1] <- det(as.matrix(R[subs, subs])) *
det(as.matrix(R[-subs, -subs])) / detR
result[term, 2] <- length(subs)
}
if (all(result[, 2] == 1)) result <- result[, 1]
else result[, 3] <- result[, 1]^(1/(2 * result[, 2]))
result
}
vif(model6)
## No_of_fullbath No_of_bedrooms Parking_Capacity Zipcode_bin
## 2.513842 1.620320 1.263289 1.097569
## Total_rooms med_price No_of_halfbath
## 1.892211 2.199216 1.183733
All the VIF values are well below 10, so the variance inflation factors do not indicate problematic multicollinearity in our predictors.
We don’t need to perform ridge regression.
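For reference, had multicollinearity been an issue, a ridge fit with the glmnet package loaded above could look like this (a sketch; alpha = 0 selects the ridge penalty in cv.glmnet):
X <- model.matrix(y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity +
                  Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)[, -1]
ridge_cv <- cv.glmnet(X, data$y2, alpha = 0) # cross-validated ridge regression
coef(ridge_cv, s = "lambda.min") # coefficients at the best penalty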
We will now use stepwise regression to search for the best model.
1) Forward Selection
attach(data)
## The following objects are masked from data (pos = 6):
##
## age, Lawn, Lotsize_sqft, med_price, No_of_bedrooms, No_of_fullbath,
## No_of_halfbath, Parking_Capacity, Parking_Size, Parksize_imp,
## Patio, School_distance, Sold_price, Sqft, Stories, Total_rooms,
## Year_built, Zipcode, Zipcode_bin
y<-Sold_price
x1<-No_of_fullbath
x2<-No_of_bedrooms
x3<-Sqft
x4<-Parksize_imp
x5<-Parking_Capacity
x6<-Zipcode_bin
x13<-Zipcode
x7<-Stories
x8<-Total_rooms
x9<-med_price
x10<-No_of_halfbath
x11<-School_distance
x12<-Lotsize_sqft
x14<-age
add1(lm(y~1,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12+x1,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12+x1+x3,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
add1(lm(y~x12+x1+x3+x9,data=data), y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14, test="F")
model7<- lm(y~x12+x1+x3+x9,data=data)
summary(model7)
##
## Call:
## lm(formula = y ~ x12 + x1 + x3 + x9, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -719150 -61286 -2491 55711 1363917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.181e+04 2.178e+04 -3.756 0.00020 ***
## x12 1.238e+00 5.022e-01 2.465 0.01416 *
## x1 6.342e+04 1.334e+04 4.754 2.85e-06 ***
## x3 -5.409e+01 1.663e+01 -3.253 0.00124 **
## x9 1.264e+00 7.241e-02 17.461 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 159900 on 376 degrees of freedom
## Multiple R-squared: 0.7586, Adjusted R-squared: 0.7561
## F-statistic: 295.5 on 4 and 376 DF, p-value: < 2.2e-16
2) Backward Elimination
drop1(lm(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x14,data=data), test="F")
drop1(lm(y~x1+x2+x3+x4+x6+x7+x8+x9+x10+x11+x12+x14,data=data), test="F")
drop1(lm(y~x1+x2+x4+x3+x6+x7+x8+x9+x10+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x3+x8+x9+x6+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x3+x8+x9+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x6+x9+x11+x12,data=data), test="F")
drop1(lm(y~x1+x2+x4+x8+x9+x11,data=data), test="F")
drop1(lm(y~x1+x2+x4+x9+x11,data=data), test="F")
drop1(lm(y~x1+x2+x4+x9,data=data), test="F")
model8<-lm(y~x1+x2+x4+x9,data=data)
summary(model8)
##
## Call:
## lm(formula = y ~ x1 + x2 + x4 + x9, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -699320 -58165 -1467 53112 1392390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.085e+04 3.097e+04 -1.965 0.05015 .
## x1 5.676e+04 1.336e+04 4.250 2.7e-05 ***
## x2 -2.886e+04 1.026e+04 -2.812 0.00518 **
## x4 9.843e+01 4.320e+01 2.279 0.02325 *
## x9 1.115e+00 5.185e-02 21.500 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 160200 on 376 degrees of freedom
## Multiple R-squared: 0.7578, Adjusted R-squared: 0.7552
## F-statistic: 294.1 on 4 and 376 DF, p-value: < 2.2e-16
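The two manual searches can also be automated with step() from base R; note that step() selects by AIC, whereas the add1()/drop1() calls above use F-tests, so the selected model may differ slightly (a sketch under that caveat):
full_scope <- y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x14
step(lm(y ~ 1, data = data), scope = full_scope, direction = "forward", trace = FALSE) # forward selection by AIC
step(lm(full_scope, data = data), direction = "backward", trace = FALSE) # backward elimination by AIC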
Forward selection and backward elimination each arrive at a four-variable model, and both have adjusted R-squared values (0.7561 and 0.7552) essentially the same as model6’s 0.7538, so stepwise selection does not improve on it.
We therefore keep model6 as our final model.
summary(model6)
##
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity +
## Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.663 -8.379 -0.069 10.464 67.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.206e+01 3.828e+00 24.051 < 2e-16 ***
## No_of_fullbath 7.154e+00 1.538e+00 4.652 4.58e-06 ***
## No_of_bedrooms -3.691e+00 1.239e+00 -2.979 0.003079 **
## Parking_Capacity 3.832e+00 1.002e+00 3.825 0.000153 ***
## Zipcode_bin -1.767e+00 5.959e-01 -2.965 0.003225 **
## Total_rooms 1.223e+00 5.975e-01 2.047 0.041349 *
## med_price 1.019e-04 6.044e-06 16.866 < 2e-16 ***
## No_of_halfbath 6.490e+00 1.649e+00 3.937 9.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.38 on 373 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7538
## F-statistic: 167.2 on 7 and 373 DF, p-value: < 2.2e-16
MSRes=summary(model6)$sigma^2
MSRes
## [1] 302.0616
summary(model6)$sigma
## [1] 17.37992
library(DAAG)
df <- read_xlsx("testing.xlsx") # held-out test data
df
colSums(is.na(df))
## house Sold_price Zipcode Sold_date
## 0 0 0 5
## No_of_bedrooms No_of_fullbath Stories Year_built
## 0 0 1 0
## Sqft Lotsize_sqft No_of_halfbath Total_rooms
## 0 4 0 0
## School_distance Parking_Capacity Parking_Size Basement_size
## 5 0 8 10
## Lawn Patio
## 8 8
We ignore Basement_size here because our model wasn’t trained on it.
df
df$Parking_Capacity <- ifelse(is.na(df$Parking_Capacity), median(df$Parking_Capacity, na.rm=TRUE), df$Parking_Capacity)
df$Parking_Size <- ifelse(is.na(df$Parking_Size), median(df$Parking_Size, na.rm=TRUE), df$Parking_Size)
df <- df %>% replace_with_na_all(condition = ~.x %in% common_na_strings)
df$Zipcode <- as.factor(df$Zipcode)
df$Year_built <- as.numeric(df$Year_built)
df$Parking_Capacity <- as.numeric(df$Parking_Capacity)
df$Sold_price <- as.numeric(df$Sold_price)
df$Total_rooms <- as.numeric(df$Total_rooms)
price <-read_xlsx("MedianValuePerSqft.xlsx")
price
price_ft <- filter(price, Zip %in% unique(df$Zipcode))
price_avgmedt <- data.frame(ID=price_ft[,2], Means=rowMeans(price_ft[,7:16]))
price_avgmedt
df$Zipcode <- as.character(df$Zipcode)
price_avgmedt$Zip <- as.character(price_avgmedt$Zip)
df1 <- left_join(x = df, y = price_avgmedt, by = c("Zipcode" = "Zip"))
df2 <- df1 %>% mutate(med_price = Sqft * Means)
df2
df <- dplyr::select(df2, -Means) # drop the raw Means column, keeping med_price
df$Zipcode <- as.factor(df$Zipcode)
df
# Drop Sold_date by name: the test file has an extra leading 'house' column,
# so positional indices like [,-3] would drop the wrong column here
df <- dplyr::select(df, -Sold_date)
df
df$med_price <- ifelse(is.na(df$med_price), median(df$med_price, na.rm=TRUE), df$med_price)
df$Zipcode_bin <- 0
df$Zipcode_bin[df$Zipcode %in% bin1] <- 1
df$Zipcode_bin[df$Zipcode %in% bin2] <- 2
df$Zipcode_bin[df$Zipcode %in% bin3] <- 3
df$Zipcode_bin[df$Zipcode %in% bin4] <- 4
df$Zipcode_bin[df$Zipcode %in% bin5] <- 5
df
myvars <- df[c("No_of_fullbath", "No_of_bedrooms","Parking_Capacity","Zipcode_bin","Total_rooms","med_price","No_of_halfbath")]
myvars<-data.frame(myvars)
myvars
testing1 = predict(model6, myvars, interval = "confidence", level = 0.95, type = "response") # fitted values on the Box-Cox scale
testing1
## fit lwr upr
## 1 162.6170 158.4699 166.7640
## 2 145.6362 139.5373 151.7350
## 3 123.5510 117.1402 129.9618
## 4 137.0497 133.3702 140.7292
## 5 163.0880 157.5779 168.5981
## 6 116.7299 108.8786 124.5813
## 7 175.3661 168.1912 182.5411
## 8 159.4850 154.5416 164.4284
## 9 143.1182 138.3019 147.9344
## 10 126.6522 122.9773 130.3271
model6
##
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity +
## Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
##
## Coefficients:
## (Intercept) No_of_fullbath No_of_bedrooms Parking_Capacity
## 92.0586083 7.1544521 -3.6913642 3.8318037
## Zipcode_bin Total_rooms med_price No_of_halfbath
## -1.7665257 1.2231665 0.0001019 6.4902319
summary(model6)
##
## Call:
## lm(formula = y2 ~ No_of_fullbath + No_of_bedrooms + Parking_Capacity +
## Zipcode_bin + Total_rooms + med_price + No_of_halfbath, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.663 -8.379 -0.069 10.464 67.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.206e+01 3.828e+00 24.051 < 2e-16 ***
## No_of_fullbath 7.154e+00 1.538e+00 4.652 4.58e-06 ***
## No_of_bedrooms -3.691e+00 1.239e+00 -2.979 0.003079 **
## Parking_Capacity 3.832e+00 1.002e+00 3.825 0.000153 ***
## Zipcode_bin -1.767e+00 5.959e-01 -2.965 0.003225 **
## Total_rooms 1.223e+00 5.975e-01 2.047 0.041349 *
## med_price 1.019e-04 6.044e-06 16.866 < 2e-16 ***
## No_of_halfbath 6.490e+00 1.649e+00 3.937 9.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.38 on 373 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7538
## F-statistic: 167.2 on 7 and 373 DF, p-value: < 2.2e-16
#sapply(df, names)
ytest <- df['Sold_price']
ytest_t <- (ytest ^ lambda - 1) / lambda
ytest_t
ypred<-testing1[,1]
ypred
## 1 2 3 4 5 6 7 8
## 162.6170 145.6362 123.5510 137.0497 163.0880 116.7299 175.3661 159.4850
## 9 10
## 143.1182 126.6522
prediction_error = ytest_t-testing1[,1]
prediction_error
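Using the inv_boxcox helper sketched earlier, the predicted values can also be expressed back in dollars:
inv_boxcox(testing1[, "fit"], lambda) # predicted sale prices in dollars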
PRESS <- function(linear.model) {
#' calculate the predictive residuals
pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
#' calculate the PRESS
PRESS <- sum(pr^2)
return(PRESS)
}
MSPE <- function(linear.model) {
#' calculate the MSPE =PRESS/sample size
return(PRESS(linear.model)/length(residuals(linear.model)))
}
pred_r_squared <- function(linear.model) {
#' Use anova() to get the sum of squares for the linear model
lm.anova <- anova(linear.model)
#' Calculate the total sum of squares
tss <- sum(lm.anova$'Sum Sq')
# Calculate the predictive R^2
pred.r.squared <- 1-PRESS(linear.model)/(tss)
return(pred.r.squared)
}
MSPE(model6)
## [1] 313.3384
pred_r_squared(model6)
## [1] 0.7439672
We get a prediction R-squared of about 74.4%, which is decent but not outstanding; there’s always scope for improvement.