Use data set House Property. Find out which all columns / features impact changes in Price of house.
Create a linear model for all coulmns with corrrelation.
Predict the Price of house with some dummy values.
The data for these sales comes from the official public records of home sales in the King County area, Washington State.
The data sets contains 21613 rows. Each represents a home sold from May 2014 through May 2015.
http://your.kingcounty.gov/assessor/eRealProperty/ResGlossaryOfTerms.html.
Size: 21613 obs. of 21 variables:
Attributes:
id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartments interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A variable for whether the apartment was overlooking the waterfront or not
view - An index from 0 to 4 of how good the view of the property was
condition - An index from 1 to 5 on the condition of the apartment,
grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
sqft_above - The square footage of the interior housing space that is above ground level
sqft_basement - The square footage of the interior housing space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last renovation
zipcode - What zipcode area the house is in
lat - Lattitude
long - Longitude
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Functions
Graph_Boxplot <- function (input, na.rm = TRUE){
Plot <- ggplot(dfrModel, aes(x="", y=input)) +
geom_boxplot(aes(fill=input), color="green") +
labs(title="Outliers")
Plot
}
Dataset
setwd("D:/Welingkar/Trim 4/Machine Learning/Assignment/Assignment 2")
dfrModel <- read.csv("./house-data.csv", header=T, stringsAsFactors=F)
head(dfrModel)
## id date price bedrooms bathrooms sqft_living
## 1 7129300520 20141013T000000 221900 3 1.00 1180
## 2 6414100192 20141209T000000 538000 3 2.25 2570
## 3 5631500400 20150225T000000 180000 2 1.00 770
## 4 2487200875 20141209T000000 604000 4 3.00 1960
## 5 1954400510 20150218T000000 510000 3 2.00 1680
## 6 7237550310 20140512T000000 1225000 4 4.50 5420
## sqft_lot floors waterfront view condition grade sqft_above sqft_basement
## 1 5650 1 0 0 3 7 1180 0
## 2 7242 2 0 0 3 7 2170 400
## 3 10000 1 0 0 3 6 770 0
## 4 5000 1 0 0 5 7 1050 910
## 5 8080 1 0 0 3 8 1680 0
## 6 101930 1 0 0 3 11 3890 1530
## yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 1955 0 98178 47.5112 -122.257 1340 5650
## 2 1951 1991 98125 47.7210 -122.319 1690 7639
## 3 1933 0 98028 47.7379 -122.233 2720 8062
## 4 1965 0 98136 47.5208 -122.393 1360 5000
## 5 1987 0 98074 47.6168 -122.045 1800 7503
## 6 2001 0 98053 47.6561 -122.005 4760 101930
Observation
Id is unknown, hence we drop it.
dfrModel <- select(dfrModel, -c(id, date))
head(dfrModel)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 221900 3 1.00 1180 5650 1 0 0
## 2 538000 3 2.25 2570 7242 2 0 0
## 3 180000 2 1.00 770 10000 1 0 0
## 4 604000 4 3.00 1960 5000 1 0 0
## 5 510000 3 2.00 1680 8080 1 0 0
## 6 1225000 4 4.50 5420 101930 1 0 0
## condition grade sqft_above sqft_basement yr_built yr_renovated zipcode
## 1 3 7 1180 0 1955 0 98178
## 2 3 7 2170 400 1951 1991 98125
## 3 3 6 770 0 1933 0 98028
## 4 5 7 1050 910 1965 0 98136
## 5 3 8 1680 0 1987 0 98074
## 6 3 11 3890 1530 2001 0 98053
## lat long sqft_living15 sqft_lot15
## 1 47.5112 -122.257 1340 5650
## 2 47.7210 -122.319 1690 7639
## 3 47.7379 -122.233 2720 8062
## 4 47.5208 -122.393 1360 5000
## 5 47.6168 -122.045 1800 7503
## 6 47.6561 -122.005 4760 101930
Summary
#summary(dfrModel)
lapply(dfrModel, FUN=summary)
## $price
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75000 321950 450000 540088 645000 7700000
##
## $bedrooms
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 3.000 3.371 4.000 33.000
##
## $bathrooms
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.750 2.250 2.115 2.500 8.000
##
## $sqft_living
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 1427 1910 2080 2550 13540
##
## $sqft_lot
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520 5040 7618 15107 10688 1651359
##
## $floors
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.500 1.494 2.000 3.500
##
## $waterfront
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.007542 0.000000 1.000000
##
## $view
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2343 0.0000 4.0000
##
## $condition
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.409 4.000 5.000
##
## $grade
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 7.000 7.657 8.000 13.000
##
## $sqft_above
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 1190 1560 1788 2210 9410
##
## $sqft_basement
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 291.5 560.0 4820.0
##
## $yr_built
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1900 1951 1975 1971 1997 2015
##
## $yr_renovated
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 84.4 0.0 2015.0
##
## $zipcode
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 98001 98033 98065 98078 98118 98199
##
## $lat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.16 47.47 47.57 47.56 47.68 47.78
##
## $long
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -122.5 -122.3 -122.2 -122.2 -122.1 -121.3
##
## $sqft_living15
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 399 1490 1840 1987 2360 6210
##
## $sqft_lot15
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 651 5100 7620 12768 10083 871200
#detect_outliers(dfrModel$bedrooms)
#detect_outliers(dfrModel$sqft_living15)
lapply(dfrModel, FUN=detect_outliers)
## $price
## [1] 1146
##
## $bedrooms
## [1] 546
##
## $bathrooms
## [1] 571
##
## $sqft_living
## [1] 572
##
## $sqft_lot
## [1] 2425
##
## $floors
## [1] 0
##
## $waterfront
## [1] 163
##
## $view
## [1] 2124
##
## $condition
## [1] 30
##
## $grade
## [1] 1911
##
## $sqft_above
## [1] 611
##
## $sqft_basement
## [1] 496
##
## $yr_built
## [1] 0
##
## $yr_renovated
## [1] 914
##
## $zipcode
## [1] 0
##
## $lat
## [1] 2
##
## $long
## [1] 256
##
## $sqft_living15
## [1] 544
##
## $sqft_lot15
## [1] 2194
** Obsevations**
No. of outliers are more.
So for this model we are going without outliers.
lapply(dfrModel, FUN=Graph_Boxplot)
## $price
##
## $bedrooms
##
## $bathrooms
##
## $sqft_living
##
## $sqft_lot
##
## $floors
##
## $waterfront
##
## $view
##
## $condition
##
## $grade
##
## $sqft_above
##
## $sqft_basement
##
## $yr_built
##
## $yr_renovated
##
## $zipcode
##
## $lat
##
## $long
##
## $sqft_living15
##
## $sqft_lot15
Observation
Many predictor variables numeric categoric variable. Convert these to factor variables
dfrModel1<-dfrModel
dfrModel1$waterfront <- as.factor(dfrModel$waterfront)
levels(dfrModel1$waterfront) <- c("No-Waterfront", "Yes-Waterfront")
dfrModel1$view <- as.factor(dfrModel$view)
levels(dfrModel1$view) <- c("Bad View", "Average View", "Good View", "Very Good View", "Excellent View")
dfrModel1$condition <- as.factor(dfrModel$condition)
levels(dfrModel1$condition) <- c("Bad Condition", "Average Condition", "Good Condition", "Very Good Condition", "Excellent Condition")
dfrModel1$grade <- as.factor(dfrModel$grade)
head(dfrModel1)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront
## 1 221900 3 1.00 1180 5650 1 No-Waterfront
## 2 538000 3 2.25 2570 7242 2 No-Waterfront
## 3 180000 2 1.00 770 10000 1 No-Waterfront
## 4 604000 4 3.00 1960 5000 1 No-Waterfront
## 5 510000 3 2.00 1680 8080 1 No-Waterfront
## 6 1225000 4 4.50 5420 101930 1 No-Waterfront
## view condition grade sqft_above sqft_basement yr_built
## 1 Bad View Good Condition 7 1180 0 1955
## 2 Bad View Good Condition 7 2170 400 1951
## 3 Bad View Good Condition 6 770 0 1933
## 4 Bad View Excellent Condition 7 1050 910 1965
## 5 Bad View Good Condition 8 1680 0 1987
## 6 Bad View Good Condition 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(dfrModel$price, as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## price bedrooms bathrooms sqft_living sqft_lot
## 1.00000000 0.30834960 0.52513750 0.70203505 0.08966086
## floors waterfront view condition grade
## 0.25679388 0.26636943 0.39729349 0.03636179 0.66743425
## sqft_above sqft_basement yr_built yr_renovated zipcode
## 0.60556730 0.32381602 0.05401153 0.12643379 -0.05320285
## lat long sqft_living15 sqft_lot15
## 0.30700348 0.02162624 0.58537890 0.08244715
** Observations**
1) Removing all the columns whichi are having less correlation
Data Cleaning 2
dfrModel <- select(dfrModel, -c(sqft_lot, floors, waterfront, condition, yr_built, yr_renovated, zipcode, long, sqft_lot15, sqft_living15, sqft_above,sqft_basement, bedrooms))
head(dfrModel)
## price bathrooms sqft_living view grade lat
## 1 221900 1.00 1180 0 7 47.5112
## 2 538000 2.25 2570 0 7 47.7210
## 3 180000 1.00 770 0 6 47.7379
## 4 604000 3.00 1960 0 7 47.5208
## 5 510000 2.00 1680 0 8 47.6168
## 6 1225000 4.50 5420 0 11 47.6561
Summary
#summary(dfrModel)
lapply(dfrModel, FUN=summary)
## $price
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75000 321950 450000 540088 645000 7700000
##
## $bathrooms
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.750 2.250 2.115 2.500 8.000
##
## $sqft_living
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 1427 1910 2080 2550 13540
##
## $view
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2343 0.0000 4.0000
##
## $grade
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 7.000 7.657 8.000 13.000
##
## $lat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.16 47.47 47.57 47.56 47.68 47.78
#detect_outliers(dfrModel$bedrooms)
#detect_outliers(dfrModel$sqft_living15)
lapply(dfrModel, FUN=detect_outliers)
## $price
## [1] 1146
##
## $bathrooms
## [1] 571
##
## $sqft_living
## [1] 572
##
## $view
## [1] 2124
##
## $grade
## [1] 1911
##
## $lat
## [1] 2
lapply(dfrModel, FUN=Graph_Boxplot)
## $price
##
## $bathrooms
##
## $sqft_living
##
## $view
##
## $grade
##
## $lat
Observation
Outliers present in many features.
We are removing outliers in this model.
head(dfrModel)
## price bathrooms sqft_living view grade lat
## 1 221900 1.00 1180 0 7 47.5112
## 2 538000 2.25 2570 0 7 47.7210
## 3 180000 1.00 770 0 6 47.7379
## 4 604000 3.00 1960 0 7 47.5208
## 5 510000 2.00 1680 0 8 47.6168
## 6 1225000 4.50 5420 0 11 47.6561
Removing Outliers
dfrModel$price <- Remove_Outliers(dfrModel$price)
dfrModel$bathrooms <- Remove_Outliers(dfrModel$bathrooms)
dfrModel$sqft_living <- Remove_Outliers(dfrModel$sqft_living)
dfrModel$lat <- Remove_Outliers(dfrModel$lat)
dfrModel <- dfrModel[complete.cases(dfrModel), ]
head(dfrModel)
## price bathrooms sqft_living view grade lat
## 1 221900 1.00 1180 0 7 47.5112
## 2 538000 2.25 2570 0 7 47.7210
## 3 180000 1.00 770 0 6 47.7379
## 4 604000 3.00 1960 0 7 47.5208
## 5 510000 2.00 1680 0 8 47.6168
## 7 257500 2.25 1715 0 7 47.3097
Observations
Outliers have been removed
Visualize
dfrGraph <- gather(dfrModel, variable, value, -price)
head(dfrGraph)
## price variable value
## 1 221900 bathrooms 1.00
## 2 538000 bathrooms 2.25
## 3 180000 bathrooms 1.00
## 4 604000 bathrooms 3.00
## 5 510000 bathrooms 2.00
## 6 257500 bathrooms 2.25
ggplot(dfrGraph) +
geom_jitter(aes(value,price, colour=variable)) +
geom_smooth(aes(value,price, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Price With Other Features")
Observation
We see an impact of all the features with ‘Price’.
Find Best Multi Linear Model
Choose the best linear model by using step(). Choose a model by AIC in a Stepwise Algorithm
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
#?step()
stpModel=step(lm(data=dfrModel, price~.), trace=0, steps=10000)
stpSummary <- summary(stpModel)
stpSummary
##
## Call:
## lm(formula = price ~ bathrooms + sqft_living + view + grade +
## lat, data = dfrModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -524162 -80541 -13143 62448 698156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.833e+07 2.972e+05 -95.320 < 2e-16 ***
## bathrooms -1.211e+04 1.897e+03 -6.383 1.77e-10 ***
## sqft_living 1.085e+02 1.944e+00 55.826 < 2e-16 ***
## view 4.823e+04 1.416e+03 34.060 < 2e-16 ***
## grade 6.413e+04 1.254e+03 51.156 < 2e-16 ***
## lat 5.913e+05 6.264e+03 94.404 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124300 on 20070 degrees of freedom
## Multiple R-squared: 0.626, Adjusted R-squared: 0.6259
## F-statistic: 6719 on 5 and 20070 DF, p-value: < 2.2e-16
Observation
5 Variables are Independent cariables which are affecting price
Multiple R square value = 0.62 (average to good model).
Make Final Multi Linear Model
#x1 <- dfrModel$bedrooms
x2 <- dfrModel$bathrooms
x3 <- dfrModel$view
x4 <- dfrModel$grade
x5 <- dfrModel$lat
#x6 <- dfrModel$sqft_above
x7 <- dfrModel$sqft_living
y <- dfrModel$price
slmModel <- lm(y~x2+x3+x4+x5+x7, data=dfrModel)
Observation
Model successfully created and no errors.
Show Model
# print summary
summary(slmModel)
##
## Call:
## lm(formula = y ~ x2 + x3 + x4 + x5 + x7, data = dfrModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -524162 -80541 -13143 62448 698156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.833e+07 2.972e+05 -95.320 < 2e-16 ***
## x2 -1.211e+04 1.897e+03 -6.383 1.77e-10 ***
## x3 4.823e+04 1.416e+03 34.060 < 2e-16 ***
## x4 6.413e+04 1.254e+03 51.156 < 2e-16 ***
## x5 5.913e+05 6.264e+03 94.404 < 2e-16 ***
## x7 1.085e+02 1.944e+00 55.826 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124300 on 20070 degrees of freedom
## Multiple R-squared: 0.626, Adjusted R-squared: 0.6259
## F-statistic: 6719 on 5 and 20070 DF, p-value: < 2.2e-16
Test Data
# find mpg of a person with weight 3.0
dfrTest <- data.frame(x2=c(1, 1.5, 2.25),x3=c(0, 1, 0), x4=c(7,10,7), x5=c(47.5112, 47.721, 47.721), x7=c(1180, 1300, 2570))
dfrTest
## x2 x3 x4 x5 x7
## 1 1.00 0 7 47.5112 1180
## 2 1.50 1 10 47.7210 1300
## 3 2.25 0 7 47.7210 2570
#names(dfrTest) <- c("x1","x2","x3")
#dfrTest
Observation
Test Data successfully created.
Predict
result <- predict(slmModel, dfrTest)
print(result)
## 1 2 3
## 333999.5 705654.9 593766.3