This is a final Project for the statistical programming in R (STAT612) course. The dataset fot this project is taken from Kaggle and it is about House Sales in King Country, USA.
The data set was obtained from www.kaggle.com website. It is Kaggle dataset which is about House Sales in King County of Washington State, USA. the link for this dataset is https://www.kaggle.com/gabriellima/house-sales-in-king-county-usa/data
Most of the time, housing prices are subject to different market forces. Sometimes the prices rise, other times the prices falls. The market forces that affect the housing prices may include interest rates, economic factors (such as GDP, employment, manufacturing, prices of goods), import/export and government subsidies. These forces are out of our control and can not be easily predictable. As the result, this paper does not explore the effect of those different external factors on the price of houses. Instead, we will be focusing to explore effect of various internal factors such as number of bedrooms, bathrooms, view, condition, grade, location, square foot, etc. - on the housing prices.
The data set contains the prices of houses against a various parameters that may or may not affect the house price. The objective of the study is to use statistical analysis in order to find out the dependence of these variables on the price of houses. it is to assess which parameters highly affect the housing prices and which variables have minimal affect on the price of houses. The statistical tools that we will be focusing to use are Correlation, box plot, various scatter plot and bar plots. in addition to that geospatial representation of those house sale price were plotted on ESRI maps to see which prices are higher on which part of the study area using the longitude and latitude points. Over all, important insights between the variables were drawn from boxplots, histogram, scatter, corrgrams and geospatial mapping.
The data for these sales comes from the official public records of home sales in the King County area, Washington State, USA. The data sets contains 21613 rows and 21 columns. Each represents a home sold from May 2014 through May 2015. Below is a breakdown of the variables involved:
[, 1] id - Unique ID for each home sold.
[, 2] date - Date of the home sale.
[, 3] price - Price of each home sold.
[, 4] bedrooms - Number of bedrooms.
[, 5] bathrooms - Number of bathrooms, where - 0.5 accounts for a room with a toilet but no shower.
[, 6] sqft_living - Square footage of the apartments interior living space.
[, 7] sqft_lot - Square footage of the land space.
[, 8] floors - Number of floors.
[, 9] waterfront - A variable for whether the apartment was overlooking the waterfront or not.
[, 10] view - An index from 0 to 4 of how good the view of the property was.
[, 11] condition - An index from 1 to 5 on the condition of the apartment.
[, 12] grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
[, 13] sqft_above - The square footage of the interior housing space that is above ground level.
[, 14] sqft_basement - The square footage of the interior housing space that is below ground level.
[, 15] yr_built - The year the house was initially built.
[, 16] yr_renovated - The year of the house’s last renovation.
[, 17] zipcode - What zipcode area the house is in.
[, 18] lat - Lattitude.
[, 19] long - Longitude.
[, 20] sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors.
[, 21] sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors.
options(scipen = 999)
options(warn=-1)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(forcats)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(lattice)
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(leaps)
library(tidyr)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(dplyr)
library(ggplot2)
library(corrgram)
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
##
## baseball
## The following object is masked from 'package:lattice':
##
## panel.fill
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
##
## here
## The following object is masked from 'package:base':
##
## date
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(date)
library(FactoMineR)
library(tree)
library(corrplot)
## corrplot 0.84 loaded
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(rpart)
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
##
## ozone
## The following object is masked from 'package:purrr':
##
## map
library(rbokeh)
library(stringr)
library(leaflet)
KC_Data <- read.csv("kc_house_data.csv", sep=",", header=T, stringsAsFactors=F)
KC_Data <-na.omit(KC_Data)
head(KC_Data)
## id date price bedrooms bathrooms sqft_living
## 1 7129300520 20141013T000000 221900 3 1.00 1180
## 2 6414100192 20141209T000000 538000 3 2.25 2570
## 3 5631500400 20150225T000000 180000 2 1.00 770
## 4 2487200875 20141209T000000 604000 4 3.00 1960
## 5 1954400510 20150218T000000 510000 3 2.00 1680
## 6 7237550310 20140512T000000 1230000 4 4.50 5420
## sqft_lot floors waterfront view condition grade sqft_above sqft_basement
## 1 5650 1 0 0 3 7 1180 0
## 2 7242 2 0 0 3 7 2170 400
## 3 10000 1 0 0 3 6 770 0
## 4 5000 1 0 0 5 7 1050 910
## 5 8080 1 0 0 3 8 1680 0
## 6 101930 1 0 0 3 11 3890 1530
## yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 1955 0 98178 47.5112 -122.257 1340 5650
## 2 1951 1991 98125 47.7210 -122.319 1690 7639
## 3 1933 0 98028 47.7379 -122.233 2720 8062
## 4 1965 0 98136 47.5208 -122.393 1360 5000
## 5 1987 0 98074 47.6168 -122.045 1800 7503
## 6 2001 0 98053 47.6561 -122.005 4760 101930
glimpse(KC_Data)
## Observations: 21,613
## Variables: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, ...
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T...
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1230000,...
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, ...
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0...
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ...
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003,...
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47....
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...
dim(KC_Data)
## [1] 21613 21
We have 21613 observations (rows) and 21 columns (variables) in our data set.
This dataset has 21 variables including price (our target) and 21,613 observation. The goal of this data processing is to predict accurately the price by using available predictor variables.
Variables such as bathrooms, bedrooms, floors, condition, waterfront, view and grade should be converted in to factor variables (dummy variables) in our data set. For example, Condition has three maximum factors, while grade has at least five. It is better to convert them in to dummy variables as they are not continous numeric variables.
Converting bathrooms, bedrooms, floors, condition, waterfront, view and grade in to factor variables:
KC_Data$bedrooms <- as.factor(KC_Data$bedrooms)
KC_Data$bathrooms <- as.factor(KC_Data$bathrooms)
KC_Data$waterfront <- as.factor(KC_Data$waterfront)
KC_Data$view <- as.factor(KC_Data$view)
KC_Data$grade <- as.factor(KC_Data$grade)
KC_Data$floors <- as.factor(KC_Data$floors)
KC_Data$condition <- as.factor(KC_Data$condition)
table(is.na(KC_Data))
##
## FALSE
## 453873
attach(KC_Data)
KC_Data=data.frame(price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15)
#We dont need variables such as ID, zipcode,date1qAqqqqq `
attach(KC_Data)
## The following objects are masked from KC_Data (pos = 3):
##
## bathrooms, bedrooms, condition, floors, grade, price,
## sqft_above, sqft_basement, sqft_living, sqft_living15,
## sqft_lot, sqft_lot15, view, waterfront, yr_built, yr_renovated
We are including a few plots to help us visualize the breakdown of some of the variables we believe will be significant to housing prices.
### 21.1. assessing Price vs. bedrooms using boxplots and bar chart
Using simple Box plot
## Price vs. bedrooms ->> There is relationship between price and bedrooms (significant relationship exists)
boxplot1=boxplot(price~bedrooms, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. bedrooms", xlab="bedrooms", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = bedrooms, y = price, fill = bedrooms, main = "Price vs. bedrooms" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)
Using bar chart in ggplot
KC_Data %>%
mutate(bedrooms = as.factor(bedrooms)) %>%
group_by(bedrooms) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(bedrooms = reorder(bedrooms,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = bedrooms,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = bedrooms, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'bedrooms',
y = 'Median Price',
title = 'bedrooms and Median Price') +
coord_flip() +
theme_bw()
Price and bedrooms have nice correlation. As number of bedrooms increases price also increases.
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = bathrooms, y = price, fill = bathrooms,main = "Price vs. Bathrooms" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
coord_flip()
Using simple Box plot
## Price vs. Bathrooms ->> Nice correlation, as # of bahtrooms increases [median of bar plot], price increases as well, with one exception when bathroom=7
boxplot2=boxplot(price~bathrooms, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. Bathrooms", xlab="Bathrooms", ylab="Price")
Using bar chart in ggplot
KC_Data %>%
mutate(bathrooms = as.factor(bathrooms)) %>%
group_by(bathrooms) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(bathrooms = reorder(bathrooms,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = bathrooms,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = bathrooms, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 3.5, colour = 'yellow',
fontface = 'bold') +
labs(x = 'bathrooms',
y = 'Median Price',
title = 'bathrooms and Median Price') +
coord_flip() +
theme_bw()
Price of house and its associated number of bathrooms have nice correlation. As number of bahtrooms increases (median of bar plot), price increases as well
Using simple Box plot
## Price vs. Grade ->> Nice correlation, grade increases [median of bar plot], price increases as well
boxplot3=boxplot(price~grade, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. Grade", xlab="Grade", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = grade, y = price, fill = grade,main = "Price vs. Grade" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)
Using bar chart in ggplot
KC_Data %>%
mutate(grade = as.factor(grade)) %>%
group_by(grade) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(grade = reorder(grade,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = grade,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = grade, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'grade',
y = 'Median Price',
title = 'grade and Median Price') +
coord_flip() +
theme_bw()
**Price and Grade have also nice correlation. As grade increases (median of bar plot), price also increases.
Using simple Box plot
## Price vs. View ->> Nice correlation, view increases [median of bar plot], price increases as well
boxplot4=boxplot(price~view, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. View", xlab="View", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = view, y = price, fill = view, main = "Price vs. View" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)
Using bar chart in ggplot
KC_Data %>%
mutate(view = as.factor(view)) %>%
group_by(view) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(view = reorder(view,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = view,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = view, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'view',
y = 'Median Price',
title = 'view and Median Price') +
coord_flip() +
theme_bw()
Price and View has nice correlation. AS view increases (median of bar plot), the price of house also increases.
Using simple Box plot
## Price vs. condition ->> This is almost no relationship between price and condition
boxplot5=boxplot(price~condition, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. condition", xlab="condition", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = condition, y = price, fill = condition, main = "Price vs. condition" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)
Using bar chart in ggplot
KC_Data %>%
mutate(condition = as.factor(condition)) %>%
group_by(condition) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(condition = reorder(condition,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = condition,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = condition, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'condition',
y = 'Median Price',
title = 'condition and Median Price') +
coord_flip() +
theme_bw()
** there is almost very little or no relationship between price and condition. the relation ship that we see is almost insignificant.**
Using simple Box plot
## Price vs. floors ->> This is almost no relationship between price and floors (insignificant relationship exists)
boxplot6=boxplot(price~floors, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. floors", xlab="floors", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = floors, y = price, fill = floors, main = "Price vs. floors" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)
Using bar chart in ggplot
KC_Data %>%
mutate(floors = as.factor(floors)) %>%
group_by(floors) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(floors = reorder(floors,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = floors,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = floors, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'floors',
y = 'Median Price',
title = 'floors and Median Price') +
coord_flip() +
theme_bw()
** the relationship that we see between floors and price is almost insignificant. However it shows some sort of positive correlation **
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = waterfront, y = price, fill = waterfront, main = "Price vs. waterfront" )) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)
Using bar chart in ggplot
KC_Data %>%
mutate(waterfront = as.factor(waterfront)) %>%
group_by(waterfront) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(waterfront = reorder(waterfront,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = waterfront,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = waterfront, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'waterfront',
y = 'Median Price',
title = 'waterfront and Median Price') +
coord_flip() +
theme_bw()
Using bar chart in ggplot
KC_Data %>%
group_by(yr_renovated) %>%
dplyr::summarise(Median_Price = median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(yr_renovated = reorder(yr_renovated,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
head(10) %>%
ggplot(aes(x = yr_renovated,y = Median_Price)) +
geom_bar(stat='identity',colour="white",fill = "blue") +
geom_text(aes(x = yr_renovated, y = 1, label = paste0("(",Median_Price,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'year renovated',
y = 'Median Price',
title = 'Year renovated and Median Price') +
coord_flip() +
theme_bw()
Year renovated doesn’t affect the price of the house.
We plot the Price Plot , unfortunately the graph does not reveal much.
case 1 Price Plot
KC_Data %>%
ggplot(aes(x = price)) +
geom_histogram(alpha = 0.8,fill = "blue") +
labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
case 2 Price Plot
KC_Data %>%
ggplot(aes(x = price)) +
geom_histogram(alpha = 0.8,fill = "blue") +
scale_x_continuous(limits=c(0,2e6)) +
labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
case 3 - Price Plot
KC_Data %>%
ggplot(aes(x = price)) +
geom_histogram(alpha = 0.8,fill = "blue") +
scale_x_continuous(limits=c(0,1e6)) +
labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_living)) %>%
ggplot(aes(x=sqft_living,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(x=sqft_living,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Living)")+
ylab("Price")
Price and Sqft_living have nice correlation. As sqft_living increases price also increases.
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_living)) %>%
ggplot(aes(x=sqft_living15,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(x=sqft_living15,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Living15)")+
ylab("Price")
Price and Sqft_living15 have nice correlation. As sqft_living15 increases price also increases.
KC_Data %>%
filter(!is.na(sqft_living15)) %>%
filter(!is.na(sqft_living)) %>%
ggplot(aes(x=sqft_living15,y=sqft_living))+
geom_point(color = "blue")+
stat_smooth(aes(x=sqft_living15,y=sqft_living),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Living15)")+
ylab("sqft_living")
** sqft_living and sqft_living15 has high correlation**
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_lot)) %>%
ggplot(aes(x=sqft_lot,y=price))+
geom_point(color = "blue")+
scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
stat_smooth(aes(x=sqft_lot,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Lot)")+
ylab("Price")
** Sqft_lot and Price have very insignificant relationship. But still have silght increment of price with increment of Sqft_lot**
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_lot15)) %>%
ggplot(aes(x=sqft_lot15,y=price))+
geom_point(color = "blue")+
scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot15))) +
stat_smooth(aes(x=sqft_lot15,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Lot15)")+
ylab("Price")
** Sqft_lot15 and Price have very insignificant relationship. But still have silght increment of price with increment of Sqft_lot15**
KC_Data %>%
filter(!is.na(sqft_lot15)) %>%
filter(!is.na(sqft_lot)) %>%
ggplot(aes(x=sqft_lot,y=sqft_lot15))+
geom_point(color = "blue")+
scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
stat_smooth(aes(x=sqft_lot,y=sqft_lot15),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Lot)")+
ylab("sqft_lot15")
** sqft_lot and sqft_lot15 has high positive correlation**
## Price vs. Lat ->> This is more like a normal dist relationship, price peaks around when lat= 47.64 and declines afterwards, but this can be modeled easily. we would say Lat explains the price as well.
boxplot5=boxplot(price~lat, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. Lat", xlab="Lat", ylab="Price")
or
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(lat)) %>%
ggplot(aes(x=lat,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(x=lat,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("lat")+
ylab("Price")
** Price vs. Lat looks to have more likely a normal distribution relationship. The house price peaks around when lat= 47.64 and declines afterwards. Generally, we would say that Lat explains the price well.**
## Price vs. sqft_basement
boxplot6=boxplot(price~sqft_basement, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. sqft_basement", xlab="sqft_basement", ylab="Price")
OR
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_basement)) %>%
ggplot(aes(x=sqft_basement,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(sqft_basement,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("sqft_basement")+
ylab("Price")
** Price and sqft_basement have good correlation. We would say sqft_basement explains the price well.**
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(date)) %>%
ggplot(aes(x=date,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(date,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("date")+
ylab("Price")
Price and date have almost no relationship. Thus, date doesn’t explain house price.
ggplot(data = KC_Data) +
geom_point(mapping = aes(x = sqft_above, y = price, color = price))
viridis::scale_color_viridis(discrete=TRUE)
## <ggproto object: Class ScaleDiscrete, Scale, gg>
## aesthetics: colour
## axis_order: function
## break_info: function
## break_positions: function
## breaks: waiver
## call: call
## clone: function
## dimension: function
## drop: TRUE
## expand: waiver
## get_breaks: function
## get_breaks_minor: function
## get_labels: function
## get_limits: function
## guide: legend
## is_discrete: function
## is_empty: function
## labels: waiver
## limits: NULL
## make_sec_title: function
## make_title: function
## map: function
## map_df: function
## n.breaks.cache: NULL
## na.translate: TRUE
## na.value: NA
## name: waiver
## palette: function
## palette.cache: NULL
## position: left
## range: <ggproto object: Class RangeDiscrete, Range, gg>
## range: NULL
## reset: function
## train: function
## super: <ggproto object: Class RangeDiscrete, Range, gg>
## reset: function
## scale_name: viridis
## train: function
## train_df: function
## transform: function
## transform_df: function
## super: <ggproto object: Class ScaleDiscrete, Scale, gg>
ggplot(data = KC_Data) +
geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
geom_smooth(mapping = aes(x = sqft_living, y = price))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(data = KC_Data) +
geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
geom_smooth(mapping = aes(x = sqft_living, y = price, linetype = waterfront))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
based on above visualization procedures variables such as number of bedrooms, number of bathrooms, grade, view of the houses, condition of the house, whether the has have water front or not, sqft_basement, sqft_living, sqft_living15, lat, sqft_lot and sqft_lot15 affect positively the price of the houses eventhough their degree may differ.
**sqft_living, sqft_living, bathrooms, grade, view, bedrooms, condition and sqft_basement are factors that have effect on the prices of the houses .
Testing correlation of the variables against price
vctCorr = numeric(0)
for (i in names(KC_Data))
{
cor.result <- cor(KC_Data$price, as.numeric(KC_Data[,i]))
vctCorr <- c(vctCorr, cor.result)
}
KC_DatarCorr <- vctCorr
names(KC_DatarCorr) <- names(KC_Data)
KC_DatarCorr
## price bedrooms bathrooms sqft_living sqft_lot
## 1.00000000 0.31489546 0.52450698 0.70204372 0.08965521
## floors waterfront view condition grade
## 0.25678570 0.26633051 0.39734647 0.03639192 0.66755773
## sqft_above sqft_basement yr_built yr_renovated sqft_living15
## 0.60556551 0.32383736 0.05398183 0.12644223 0.58537401
## sqft_lot15
## 0.08245555
ggcorr(KC_Data, hjust = 0.8, layout.exp = 1) +
ggtitle("Correlation between house variables")
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loaded glmnet 2.0-16
n=length(grade)
z=sample(n,n/2)
x <- model.matrix(price~., data = KC_Data) [,-1] # Predictor matrix, with intercept column removed
y <- KC_Data$price # y vector with just the actual values of the response
ridge <- glmnet(x[z,], y[z], alpha=0 ) # Object with many Ridge regressions for various values of ???
plot(ridge) # plot of false slopes
set.seed(1) # To get repeatable results
cv.ridge=cv.glmnet(x,y,alpha=0) # 10-Fold is the default
cbind("Lambda"=cv.ridge$lambda, "10-Fold MSE"=cv.ridge$cvm)
## Lambda 10-Fold MSE
## [1,] 257898381.71 134930583513
## [2,] 234987400.30 133862551304
## [3,] 214111767.33 133722345441
## [4,] 195090668.05 133603090409
## [5,] 177759350.81 133472537957
## [6,] 161967700.02 133329652562
## [7,] 147578935.97 133173310731
## [8,] 134468430.05 133002294956
## [9,] 122522625.35 132815287526
## [10,] 111638053.01 132610864288
## [11,] 101720436.07 132387488446
## [12,] 92683872.89 132143504505
## [13,] 84450093.08 131877132498
## [14,] 76947779.57 131586462678
## [15,] 70111950.92 131269450202
## [16,] 63883398.44 130923891638
## [17,] 58208173.40 130547500264
## [18,] 53037119.71 130137794854
## [19,] 48325448.17 129692168281
## [20,] 44032348.55 129207875960
## [21,] 40120636.07 128682041197
## [22,] 36556429.34 128111664404
## [23,] 33308856.91 127493636206
## [24,] 30349789.87 126824755010
## [25,] 27653598.19 126101749680
## [26,] 25196928.74 125321307912
## [27,] 22958503.04 124480110890
## [28,] 20918932.91 123574874738
## [29,] 19060552.57 122602399166
## [30,] 17367265.61 121559623535
## [31,] 15824405.60 120443690372
## [32,] 14418609.02 119252016033
## [33,] 13137699.54 117982367890
## [34,] 11970582.52 116632946990
## [35,] 10907148.96 115202474629
## [36,] 9938187.91 113690280817
## [37,] 9055306.69 112096392000
## [38,] 8250858.20 110421614974
## [39,] 7517874.70 108667613389
## [40,] 6850007.43 106836969776
## [41,] 6241471.65 104933141127
## [42,] 5686996.51 102960821576
## [43,] 5181779.42 100925519454
## [44,] 4721444.42 98833783570
## [45,] 4302004.31 96693106653
## [46,] 3919826.10 94511847249
## [47,] 3571599.55 92299124676
## [48,] 3254308.49 90064687188
## [49,] 2965204.70 87818756784
## [50,] 2701784.08 85571855477
## [51,] 2461765.03 83334618911
## [52,] 2243068.61 81117603992
## [53,] 2043800.58 78931097475
## [54,] 1862234.97 76784932429
## [55,] 1696799.15 74688318973
## [56,] 1546060.19 72649695012
## [57,] 1408712.46 70676601691
## [58,] 1283566.33 68775519166
## [59,] 1169537.83 66949732452
## [60,] 1065639.32 65208144817
## [61,] 970970.86 63552206989
## [62,] 884712.48 61984176011
## [63,] 806117.06 60505344957
## [64,] 734503.85 59116032295
## [65,] 669252.55 57815663548
## [66,] 609798.00 56602867799
## [67,] 555625.22 55475346589
## [68,] 506265.01 54432201009
## [69,] 461289.82 53469231979
## [70,] 420310.11 52582891498
## [71,] 382970.91 51769455613
## [72,] 348948.83 51026578427
## [73,] 317949.18 50348212580
## [74,] 289703.45 49730653178
## [75,] 263967.00 49169868396
## [76,] 240516.90 48662005374
## [77,] 219150.04 48206426001
## [78,] 199681.35 47793569750
## [79,] 181942.21 47422267242
## [80,] 165778.97 47090017826
## [81,] 151051.62 46794911484
## [82,] 137632.61 46529621115
## [83,] 125405.71 46293704972
## [84,] 114265.01 46084711673
## [85,] 104114.02 45901863558
## [86,] 94864.82 45739324821
## [87,] 86437.29 45596601456
## [88,] 78758.44 45471311459
## [89,] 71761.76 45363603472
## [90,] 65386.64 45270537923
## [91,] 59577.87 45191239640
## [92,] 54285.14 45123467498
## [93,] 49462.60 45068472115
## [94,] 45068.47 45016620960
## [95,] 41064.72 44978964685
## [96,] 37416.64 44944239209
## [97,] 34092.65 44916366762
## [98,] 31063.95 44893829664
## [99,] 28304.32 44876277740
plot(cv.ridge, LW = 4) # Plot all lambdas vs. MSEs
coef(cv.ridge)
## 75 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2516095.93258032
## bedrooms1 -5370.75193326
## bedrooms2 18634.97777342
## bedrooms3 -4858.98005490
## bedrooms4 -4971.26990417
## bedrooms5 8818.44222001
## bedrooms6 -16438.54892017
## bedrooms7 -60487.45948929
## bedrooms8 94898.62598990
## bedrooms9 -14755.54075373
## bedrooms10 -49815.80191362
## bedrooms11 -122600.68305411
## bedrooms33 125217.81943486
## bathrooms0.5 -86033.99358139
## bathrooms0.75 -49367.16809985
## bathrooms1 -18725.77913584
## bathrooms1.25 32261.12384956
## bathrooms1.5 -13926.51571203
## bathrooms1.75 -14192.94947690
## bathrooms2 -12179.36726667
## bathrooms2.25 -89.40661514
## bathrooms2.5 -21476.44692110
## bathrooms2.75 -3366.61405324
## bathrooms3 22626.82430663
## bathrooms3.25 92078.49441772
## bathrooms3.5 60361.33952563
## bathrooms3.75 158967.66526206
## bathrooms4 150222.06230055
## bathrooms4.25 231589.67608589
## bathrooms4.5 168770.00383168
## bathrooms4.75 429509.01991359
## bathrooms5 291919.64741588
## bathrooms5.25 388781.12984220
## bathrooms5.5 563595.66869511
## bathrooms5.75 480794.51355362
## bathrooms6 806353.83456227
## bathrooms6.25 571001.62775195
## bathrooms6.5 165800.48938604
## bathrooms6.75 398132.22953792
## bathrooms7.5 -104549.13117987
## bathrooms7.75 2669076.52293248
## bathrooms8 1310690.56326269
## sqft_living 57.00702034
## sqft_lot -0.03108303
## floors1.5 32656.68796382
## floors2 15607.56794965
## floors2.5 143815.64976705
## floors3 78985.17474939
## floors3.5 192405.14018510
## waterfront1 377249.36400571
## view1 93582.67601218
## view2 58601.93125170
## view3 101361.83723950
## view4 243792.23130311
## condition2 -39211.32973495
## condition3 -16690.35513809
## condition4 4495.17244829
## condition5 47775.41823446
## grade3 -76281.52907214
## grade4 -94432.02925161
## grade5 -114634.75615719
## grade6 -90110.62060560
## grade7 -59421.77389177
## grade8 -8865.23718039
## grade9 72480.51035199
## grade10 176217.15857050
## grade11 317803.97967054
## grade12 582228.43889469
## grade13 1178977.60212396
## sqft_above 51.11521512
## sqft_basement 66.95684611
## yr_built -1180.55571653
## yr_renovated 36.78039953
## sqft_living15 58.84783004
## sqft_lot15 -0.28925602
best.lambda=cv.ridge$lambda.min
best.lambda
## [1] 28304.32
log(best.lambda) # Best lambda and log(lambda)
## [1] 10.25077
min.mse=min(cv.ridge$cvm) # Lowest 10FCV MSE
cbind("Best Lambda"=best.lambda,"Log(Lambda)"=log(best.lambda),"Best 10FCV MSE"=min.mse)
## Best Lambda Log(Lambda) Best 10FCV MSE
## [1,] 28304.32 10.25077 44876277740
Yhat=predict(cv.ridge,cv.ridge$lambda.min, newx=x[-z,])
mean((Yhat-y[-z])^2) #mean squared prediction error
## [1] 38372777876
lasso <- glmnet(x[z,], y[z], alpha=1 ) # Object with many lass regressions for various values
plot(lasso, LW = 4)
cv.lasso=cv.glmnet(x[z,],y[z],alpha=1,lambda=seq(0,1000,1))
cv.lasso$lambda.min
## [1] 1000
min(cv.lasso$cvm)
## [1] 48582602001
plot(cv.lasso)
lasso=glmnet(x[z,],y[z],alpha=1,lambda=cv.lasso$lambda.min)
coef(lasso)
## 75 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5592256.6932775
## bedrooms1 29142.4114994
## bedrooms2 44582.8757816
## bedrooms3 11204.3132496
## bedrooms4 -10301.6517977
## bedrooms5 -6735.2429381
## bedrooms6 -75033.3211138
## bedrooms7 -95435.8440753
## bedrooms8 74601.6177672
## bedrooms9 .
## bedrooms10 -94737.9480153
## bedrooms11 -147463.7580439
## bedrooms33 68135.7069312
## bathrooms0.5 -41289.0096711
## bathrooms0.75 .
## bathrooms1 .
## bathrooms1.25 .
## bathrooms1.5 .
## bathrooms1.75 -85.6237681
## bathrooms2 .
## bathrooms2.25 10867.0735554
## bathrooms2.5 -12095.4727618
## bathrooms2.75 .
## bathrooms3 27063.6946147
## bathrooms3.25 67810.6562627
## bathrooms3.5 56396.3559092
## bathrooms3.75 164292.5845633
## bathrooms4 104344.9221321
## bathrooms4.25 152520.3942531
## bathrooms4.5 168609.4328294
## bathrooms4.75 402272.2539910
## bathrooms5 406815.6575097
## bathrooms5.25 462178.3324392
## bathrooms5.5 789360.3849534
## bathrooms5.75 1191576.2163702
## bathrooms6 424020.4166560
## bathrooms6.25 -8358.2867302
## bathrooms6.5 .
## bathrooms6.75 .
## bathrooms7.5 .
## bathrooms7.75 3847993.5109901
## bathrooms8 .
## sqft_living 131.7307989
## sqft_lot .
## floors1.5 13999.6696300
## floors2 25740.0543535
## floors2.5 150945.6716215
## floors3 142685.0693930
## floors3.5 199078.8337391
## waterfront1 603675.9221005
## view1 72282.4579116
## view2 51826.2177537
## view3 78036.4616273
## view4 205635.8380991
## condition2 -21358.0350066
## condition3 .
## condition4 9044.0146733
## condition5 63151.8736554
## grade3 .
## grade4 -142754.6152769
## grade5 -208809.4292845
## grade6 -165938.3223359
## grade7 -82640.9507693
## grade8 .
## grade9 134214.4769132
## grade10 286778.7924714
## grade11 545253.1393748
## grade12 1024619.5767465
## grade13 1230720.8898316
## sqft_above .
## sqft_basement 31.7113766
## yr_built -2767.4147174
## yr_renovated 23.7637265
## sqft_living15 40.8987616
## sqft_lot15 -0.5876256
Some of the cofficients are zero. When lambda moves from lambda.min to higher value, it deletes more and more variables that are not significant to the model.
Yhat=predict(lasso, cv.lasso$lambda.min,newx=x[-z,])
mean((Yhat-y[-z])^2) #mean squared prediction error
## [1] 40830431353
reg = lm(price~. , data = KC_Data)
x=model.matrix(reg)
pc = princomp(x)
summary(pc)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 46658.0200180 16857.2906824 1298.1327633016
## Proportion of Variance 0.8837189 0.1153552 0.0006840687
## Cumulative Proportion 0.8837189 0.9990740 0.9997581004
## Comp.4 Comp.5 Comp.6
## Standard deviation 527.9624730920 410.44466891231 384.7583523302
## Proportion of Variance 0.0001131534 0.00006838656 0.0000600949
## Cumulative Proportion 0.9998712538 0.99993964040 0.9999997353
## Comp.7 Comp.8
## Standard deviation 25.4819441865983 0.6016539457258527
## Proportion of Variance 0.0000002635886 0.0000000001469451
## Cumulative Proportion 0.9999999988850 0.9999999990318967
## Comp.9 Comp.10
## Standard deviation 0.5698229400903827 0.5545971834730218
## Proportion of Variance 0.0000000001318079 0.0000000001248581
## Cumulative Proportion 0.9999999991637046 0.9999999992885628
## Comp.11 Comp.12
## Standard deviation 0.4649557407025032 0.42554282477323729
## Proportion of Variance 0.0000000000877576 0.00000000007351028
## Cumulative Proportion 0.9999999993763203 0.99999999944983065
## Comp.13 Comp.14
## Standard deviation 0.37056826591278252 0.35244691627342339
## Proportion of Variance 0.00000000005574398 0.00000000005042535
## Cumulative Proportion 0.99999999950557461 0.99999999955599994
## Comp.15 Comp.16
## Standard deviation 0.34650007417147627 0.32745887740036950
## Proportion of Variance 0.00000000004873805 0.00000000004352864
## Cumulative Proportion 0.99999999960473795 0.99999999964826669
## Comp.17 Comp.18
## Standard deviation 0.3086504953817114 0.2919804199716121
## Proportion of Variance 0.0000000000386719 0.0000000000346074
## Cumulative Proportion 0.9999999996869385 0.9999999997215460
## Comp.19 Comp.20
## Standard deviation 0.27248803220252882 0.27028235100358827
## Proportion of Variance 0.00000000003014091 0.00000000002965493
## Cumulative Proportion 0.99999999975168685 0.99999999978134180
## Comp.21 Comp.22
## Standard deviation 0.26124689065291201 0.24268515428602053
## Proportion of Variance 0.00000000002770536 0.00000000002390826
## Cumulative Proportion 0.99999999980904719 0.99999999983295540
## Comp.23 Comp.24
## Standard deviation 0.22171521990253179 0.20801372997483372
## Proportion of Variance 0.00000000001995504 0.00000000001756489
## Cumulative Proportion 0.99999999985291044 0.99999999987047528
## Comp.25 Comp.26
## Standard deviation 0.20211465886349297 0.18693097128036817
## Proportion of Variance 0.00000000001658277 0.00000000001418483
## Cumulative Proportion 0.99999999988705812 0.99999999990124289
## Comp.27 Comp.28
## Standard deviation 0.17290917388362265 0.16066106848673875
## Proportion of Variance 0.00000000001213662 0.00000000001047811
## Cumulative Proportion 0.99999999991337951 0.99999999992385769
## Comp.29 Comp.30
## Standard deviation 0.147023293347794987 0.140148582208489580
## Proportion of Variance 0.000000000008774731 0.000000000007973315
## Cumulative Proportion 0.999999999932632333 0.999999999940605733
## Comp.31 Comp.32
## Standard deviation 0.135889778406887790 0.128991346351518071
## Proportion of Variance 0.000000000007496095 0.000000000006754336
## Cumulative Proportion 0.999999999948101737 0.999999999954856111
## Comp.33 Comp.34
## Standard deviation 0.128006737313422048 0.119199617834721161
## Proportion of Variance 0.000000000006651616 0.000000000005767814
## Cumulative Proportion 0.999999999961507791 0.999999999967275510
## Comp.35 Comp.36
## Standard deviation 0.104754398846136965 0.099828500940133430
## Proportion of Variance 0.000000000004454573 0.000000000004045485
## Cumulative Proportion 0.999999999971730169 0.999999999975775600
## Comp.37 Comp.38
## Standard deviation 0.095059916733053390 0.08581102150772275
## Proportion of Variance 0.000000000003668228 0.00000000000298915
## Cumulative Proportion 0.999999999979443888 0.99999999998243294
## Comp.39 Comp.40
## Standard deviation 0.082233772051430454 0.079767919580845184
## Proportion of Variance 0.000000000002745124 0.000000000002582962
## Cumulative Proportion 0.999999999985178079 0.999999999987761123
## Comp.41 Comp.42
## Standard deviation 0.068304580968538420 0.063640596806720004
## Proportion of Variance 0.000000000001893918 0.000000000001644107
## Cumulative Proportion 0.999999999989654942 0.999999999991299071
## Comp.43 Comp.44
## Standard deviation 0.061998692113397445 0.060650152534340362
## Proportion of Variance 0.000000000001560366 0.000000000001493225
## Cumulative Proportion 0.999999999992859490 0.999999999994352740
## Comp.45 Comp.46
## Standard deviation 0.055346835930788860 0.0449130280785412991
## Proportion of Variance 0.000000000001243504 0.0000000000008188535
## Cumulative Proportion 0.999999999995596189 0.9999999999964150899
## Comp.47 Comp.48
## Standard deviation 0.0355800705233606102 0.0332535619482823772
## Proportion of Variance 0.0000000000005138959 0.0000000000004488879
## Cumulative Proportion 0.9999999999969289011 0.9999999999973778753
## Comp.49 Comp.50
## Standard deviation 0.0316380977814122513 0.0266110230503557900
## Proportion of Variance 0.0000000000004063331 0.0000000000002874648
## Cumulative Proportion 0.9999999999977842169 0.9999999999980716536
## Comp.51 Comp.52
## Standard deviation 0.0257761301732649939 0.0250733893359092568
## Proportion of Variance 0.0000000000002697099 0.0000000000002552041
## Cumulative Proportion 0.9999999999983413268 0.9999999999985965671
## Comp.53 Comp.54
## Standard deviation 0.0211825838371817246 0.0203893699829955193
## Proportion of Variance 0.0000000000001821459 0.0000000000001687598
## Cumulative Proportion 0.9999999999987786437 0.9999999999989473976
## Comp.55 Comp.56
## Standard deviation 0.0189685271037664770 0.0184007922202056827
## Proportion of Variance 0.0000000000001460591 0.0000000000001374468
## Cumulative Proportion 0.9999999999990935029 0.9999999999992309485
## Comp.57 Comp.58
## Standard deviation 0.0180061340420249634 0.016434618139512809
## Proportion of Variance 0.0000000000001316141 0.000000000000109643
## Cumulative Proportion 0.9999999999993626210 0.999999999999472200
## Comp.59 Comp.60
## Standard deviation 0.01407740851977356059 0.01333926086919276326
## Proportion of Variance 0.00000000000008044646 0.00000000000007223124
## Cumulative Proportion 0.99999999999955269114 0.99999999999962485564
## Comp.61 Comp.62
## Standard deviation 0.01249874420348607088 0.01231012946445568003
## Proportion of Variance 0.00000000000006341533 0.00000000000006151581
## Cumulative Proportion 0.99999999999968824937 0.99999999999974986675
## Comp.63 Comp.64
## Standard deviation 0.01004509430118860625 0.00958563444475138068
## Proportion of Variance 0.00000000000004096091 0.00000000000003729952
## Cumulative Proportion 0.99999999999979083398 0.99999999999982802645
## Comp.65 Comp.66
## Standard deviation 0.009484161994080772 0.009275901463541736
## Proportion of Variance 0.000000000000036514 0.000000000000034928
## Cumulative Proportion 0.999999999999864553 0.999999999999899525
## Comp.67 Comp.68
## Standard deviation 0.00858172819813234067 0.00679645489967816655
## Proportion of Variance 0.00000000000002989586 0.00000000000001875108
## Cumulative Proportion 0.99999999999992938982 0.99999999999994815258
## Comp.69 Comp.70
## Standard deviation 0.00667004244791527534 0.00616756610960241414
## Proportion of Variance 0.00000000000001806004 0.00000000000001544149
## Cumulative Proportion 0.99999999999996624922 0.99999999999998168132
## Comp.71 Comp.72
## Standard deviation 0.00593240950454403202 0.002556749582798334853
## Proportion of Variance 0.00000000000001428643 0.000000000000002653615
## Cumulative Proportion 0.99999999999999600320 0.999999999999998556710
## Comp.73 Comp.74
## Standard deviation 0.00186514536970121074 0.000153556135151463349310
## Proportion of Variance 0.00000000000000141217 0.000000000000000009571849
## Cumulative Proportion 1.00000000000000000000 1.000000000000000000000000
## Comp.75
## Standard deviation 0.00000000000000022204460492502100358875455343721
## Proportion of Variance 0.00000000000000000000000000000000000000002001437
## Cumulative Proportion 1.00000000000000000000000000000000000000000000000
screeplot(pc)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:stats':
##
## loadings
pcreg=pcr(price~., data = KC_Data, scale=TRUE)
summary(pcreg)
## Data: X dimension: 21613 74
## Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 74
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 7.549 11.25 14.05 16.45 18.78 20.94 22.94
## price 37.276 55.05 55.59 55.66 60.53 60.60 60.60
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 24.91 26.85 28.76 30.63 32.40 34.08 35.72
## price 60.99 61.14 61.14 61.27 61.51 61.69 61.89
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## X 37.34 38.93 40.51 42.07 43.58 45.08
## price 61.92 61.96 61.98 62.33 62.81 63.10
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps
## X 46.57 48.05 49.52 50.96 52.39 53.82
## price 63.52 63.64 63.67 63.69 63.80 63.84
## 27 comps 28 comps 29 comps 30 comps 31 comps 32 comps
## X 55.22 56.62 58.0 59.39 60.77 62.14
## price 63.97 64.04 64.1 64.11 64.12 64.12
## 33 comps 34 comps 35 comps 36 comps 37 comps 38 comps
## X 63.50 64.86 66.22 67.57 68.92 70.26
## price 64.12 64.12 64.12 64.19 64.19 64.24
## 39 comps 40 comps 41 comps 42 comps 43 comps 44 comps
## X 71.60 72.94 74.27 75.60 76.91 78.22
## price 64.25 64.25 64.26 64.27 64.27 64.28
## 45 comps 46 comps 47 comps 48 comps 49 comps 50 comps
## X 79.53 80.80 82.05 83.28 84.49 85.68
## price 64.31 64.31 64.34 64.51 64.73 64.76
## 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps
## X 86.84 87.99 89.12 90.22 91.31 92.32
## price 65.22 65.69 65.71 65.72 65.75 65.75
## 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps
## X 93.29 94.21 95.04 95.82 96.57 97.23
## price 65.79 65.92 66.11 66.52 67.46 67.47
## 63 comps 64 comps 65 comps 66 comps 67 comps 68 comps
## X 97.84 98.35 98.76 99.13 99.46 99.73
## price 67.51 67.67 69.09 69.10 69.61 69.70
## 69 comps 70 comps 71 comps 72 comps 73 comps 74 comps
## X 99.99 100.00 100.00 100.00 100.00 100.00
## price 69.87 69.87 69.87 69.87 69.87 69.87
pcr.fit=pcr(price~., data = KC_Data, validation="CV")
# 10-Fold CV # Use validation="LOO" for LOOCV
summary(pcr.fit)
## Data: X dimension: 21613 74
## Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 74
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 367371 365792 365808 264296 258200 257223 256718
## adjCV 367371 365790 365805 264293 258193 257214 256706
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 248872 248760 248453 248044 245099 245087 245088
## adjCV 248859 248746 248440 248029 245083 245070 245072
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 245056 244850 244575 244464 244442 243345
## adjCV 245040 244832 244555 244444 244421 243282
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## CV 242855 241967 241258 241176 238450 238181
## adjCV 242831 241944 241233 241152 238408 238152
## 26 comps 27 comps 28 comps 29 comps 30 comps 31 comps
## CV 238138 237808 235342 235354 229745 229022
## adjCV 238110 237782 235308 235328 229679 228984
## 32 comps 33 comps 34 comps 35 comps 36 comps 37 comps
## CV 227380 224744 222795 221196 220867 220401
## adjCV 227513 224685 222755 221131 220820 220356
## 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## CV 220193 219973 219764 219435 216609 215980
## adjCV 220160 219928 219713 219408 216357 215825
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps
## CV 213113 212486 212515 212492 211504 211340
## adjCV 213016 212414 212441 212515 211347 211271
## 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 210079 206451 206433 206372 206543 206518
## adjCV 211166 205616 206255 206209 206415 206399
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps
## CV 206679 207151 207174 207003 207014 207178
## adjCV 206546 207086 206968 206796 206881 207079
## 62 comps 63 comps 64 comps 65 comps 66 comps 67 comps
## CV 207301 207277 207252 207299 207817 208450
## adjCV 207070 207046 207050 207097 207533 208150
## 68 comps 69 comps 70 comps 71 comps 72 comps 73 comps
## CV 211462 211444 211939 213362 11041426 718949859129
## adjCV 211122 210891 211309 212606 10474829 682043450957
## 74 comps
## CV 2653260791041
## adjCV 2517118487482
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 88.3719 99.9074 99.98 99.99 99.99 100.00 100.00
## price 0.8676 0.8734 48.27 50.65 51.03 51.25 54.19
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 100.00 100.00 100.00 100.00 100.0 100.00 100.00
## price 54.24 54.36 54.51 55.59 55.6 55.61 55.62
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## X 100.0 100.00 100.00 100.00 100.00 100.00
## price 55.7 55.81 55.86 55.87 56.29 56.45
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 56.78 57.04 57.07 58.04 58.15 58.17
## 27 comps 28 comps 29 comps 30 comps 31 comps 32 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 58.29 59.18 59.19 61.16 61.39 61.97
## 33 comps 34 comps 35 comps 36 comps 37 comps 38 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 62.83 63.46 64.02 64.13 64.28 64.39
## 39 comps 40 comps 41 comps 42 comps 43 comps 44 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 64.51 64.56 64.67 65.73 65.99 66.76
## 45 comps 46 comps 47 comps 48 comps 49 comps 50 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 66.98 66.98 67.04 67.43 67.45 67.57
## 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 69.06 69.12 69.21 69.22 69.22 69.23
## 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 69.24 69.35 69.37 69.38 69.38 69.43
## 63 comps 64 comps 65 comps 66 comps 67 comps 68 comps
## X 100.00 100.00 100.00 100.00 100.0 100.0
## price 69.43 69.43 69.46 69.49 69.5 69.5
## 69 comps 70 comps 71 comps 72 comps 73 comps 74 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## price 69.73 69.79 69.85 69.87 69.87 69.88
The approach of the main components is another statistical technique to find the variables of the model that explain the maximum variance of the dependent variable.
fit1 = pcr(price ~., data = KC_Data, validation="CV", ncomp=1)
fit2 = pcr(price ~., data = KC_Data, validation="CV", ncomp=2)
fit3 = pcr(price ~., data = KC_Data, validation="CV", ncomp=3)
summary(fit1)
## Data: X dimension: 21613 74
## Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 1
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps
## CV 367371 365799
## adjCV 367371 365797
##
## TRAINING: % variance explained
## 1 comps
## X 88.3719
## price 0.8676
summary(fit2)
## Data: X dimension: 21613 74
## Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 2
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps
## CV 367371 365789 365804
## adjCV 367371 365787 365801
##
## TRAINING: % variance explained
## 1 comps 2 comps
## X 88.3719 99.9074
## price 0.8676 0.8734
summary(fit3)
## Data: X dimension: 21613 74
## Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 3
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps
## CV 367371 365817 365880 264358
## adjCV 367371 365813 365873 264351
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 88.3719 99.9074 99.98
## price 0.8676 0.8734 48.27
The optimal value M of the number of explanatory variables is 27. However, we note that from 24 variables, the RMSE, as well as variance of the model are close to the optimal value. As a result, M = 24 will be used to measure our MSE on the test set.
pcr.fit1 = pcr(price ~ ., data=KC_Data[z,], validation="CV", ncomp= 1)
pcr.pred1 = predict(pcr.fit1, data=KC_Data[-z,],ncomp= 1)
mean((pcr.pred1[2] - price[-z]) ^2)
## [1] 122998024054
pcr.fit2 = pcr(price ~ ., data=KC_Data[z,], validation="CV", ncomp= 2)
pcr.pred2 = predict(pcr.fit2, data=KC_Data[-z,],ncomp= 2)
mean((pcr.pred2[2] - price[-z]) ^2)
## [1] 122994791969
pcr.fit3= pcr(price ~ ., data=KC_Data[z,], validation="CV", ncomp= 3)
pcr.pred3 = predict(pcr.fit3, data=KC_Data[-z,],ncomp= 3)
mean((pcr.pred3[2] - price[-z]) ^2)
## [1] 151170678033
The last technique used will be regression trees. Let’s start with a single regression tree.
library(tree)
tree.fit = tree(price~., data = KC_Data[z,])
summary(tree.fit)
##
## Regression tree:
## tree(formula = price ~ ., data = KC_Data[z, ])
## Variables actually used in tree construction:
## [1] "grade" "sqft_living" "yr_built" "sqft_living15"
## [5] "view" "waterfront"
## Number of terminal nodes: 12
## Residual mean deviance: 52360000000 = 565200000000000 / 10790
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2473000 -129200 -27920 0 94260 2376000
plot(tree.fit)
text(tree.fit, pretty=0)
test = KC_Data[-z,]
tree.pred = predict(tree.fit, test)
mean((test$price - tree.pred)^2)
## [1] 55551085715
Using cross validation, it is possible to determine an optimal tree size. R provides a function for this cv.tree. The RMSE is much better than for a single tree, which seems logical. This method gets the best score on the RMSE test.
null = lm( price ~1-id -lat -long -date-zipcode, data=KC_Data )
full = lm( price ~.-id -lat -long -date-zipcode, data=KC_Data )
step( null, scope=list(lower=null, upper=full), direction="both" )
## Start: AIC=553903.4
## price ~ 1 - id - lat - long - date - zipcode
##
## Df Sum of Sq RSS AIC
## + grade 11 1515998291373849 1400649369462915 538072
## + sqft_living 1 1437514676706840 1479132984129924 539231
## + sqft_above 1 1069562647275666 1847085013561098 544032
## + bathrooms 29 1042325216709221 1874322444127544 544404
## + sqft_living15 1 999426441953220 1917221218883545 544838
## + view 4 490794829832554 2425852831004210 549929
## + bedrooms 12 310713435849958 2605934224986807 551493
## + sqft_basement 1 305870690879498 2610776969957267 551511
## + floors 5 246868011501070 2669779649335694 552002
## + waterfront 1 206883479315148 2709764181521616 552315
## + yr_renovated 1 46630304823592 2870017356013172 553557
## + sqft_lot 1 23444178016610 2893203482820154 553731
## + sqft_lot15 1 19830045706864 2896817615129900 553758
## + condition 4 20054859680333 2896592801156432 553762
## + yr_built 1 8499220511338 2908148440325426 553842
## <none> 2916647660836764 553903
##
## Step: AIC=538072.3
## price ~ grade
##
## Df Sum of Sq RSS AIC
## + sqft_living 1 180694699306282 1219954670156634 535089
## + yr_built 1 153374152309404 1247275217153512 535568
## + view 4 150028504263395 1250620865199520 535632
## + bathrooms 29 133859404646558 1266789964816357 535959
## + sqft_basement 1 126070360309279 1274579009153636 536036
## + waterfront 1 109331367950298 1291318001512617 536318
## + condition 4 53518746069880 1347130623393036 537238
## + sqft_living15 1 48264268501300 1352385100961616 537316
## + bedrooms 12 41194341905811 1359455027557104 537451
## + yr_renovated 1 39028675971590 1361620693491325 537464
## + floors 5 38704216806119 1361945152656796 537477
## + sqft_above 1 31004262115021 1369645107347894 537591
## + sqft_lot15 1 756750136712 1399892619326204 538063
## <none> 1400649369462915 538072
## + sqft_lot 1 76662760726 1400572706702189 538073
## - grade 11 1515998291373849 2916647660836764 553903
##
## Step: AIC=535089.1
## price ~ grade + sqft_living
##
## Df Sum of Sq RSS AIC
## + yr_built 1 151708080689686 1068246589466947 532221
## + view 4 114412915649021 1105541754507613 532969
## + waterfront 1 95833172270098 1124121497886536 533323
## + bathrooms 29 71670494822023 1148284175334610 533839
## + floors 5 41849081220236 1178105588936398 534345
## + condition 4 38625984614258 1181328685542376 534402
## + yr_renovated 1 28355840004071 1191598830152563 534583
## + sqft_above 1 28274019164135 1191680650992499 534584
## + sqft_basement 1 28274019164135 1191680650992498 534584
## + bedrooms 12 19714951308530 1200239718848104 534761
## + sqft_lot15 1 7052732171683 1212901937984951 534966
## + sqft_lot 1 3694294555220 1216260375601413 535026
## + sqft_living15 1 775511043807 1219179159112827 535077
## <none> 1219954670156634 535089
## - sqft_living 1 180694699306281 1400649369462915 538072
## - grade 11 259178313973290 1479132984129924 539231
##
## Step: AIC=532221
## price ~ grade + sqft_living + yr_built
##
## Df Sum of Sq RSS AIC
## + waterfront 1 82623310624817 985623278842130 530483
## + view 4 80129938355435 988116651111512 530544
## + bathrooms 29 51216977927506 1017029611539442 531217
## + floors 5 16647631590724 1051598957876224 531892
## + bedrooms 12 14568301904934 1053678287562014 531948
## + condition 4 5278532170358 1062968057296589 532122
## + sqft_lot15 1 4870046900518 1063376542566429 532124
## + yr_renovated 1 4582325623914 1063664263843033 532130
## + sqft_above 1 3960207429901 1064286382037046 532143
## + sqft_basement 1 3960207429901 1064286382037046 532143
## + sqft_lot 1 2866238469399 1065380350997548 532165
## + sqft_living15 1 1919196714053 1066327392752894 532184
## <none> 1068246589466947 532221
## - yr_built 1 151708080689686 1219954670156634 535089
## - sqft_living 1 179028627686564 1247275217153512 535568
## - grade 11 317877395776347 1386123985243294 537829
##
## Step: AIC=530483.2
## price ~ grade + sqft_living + yr_built + waterfront
##
## Df Sum of Sq RSS AIC
## + bathrooms 29 48547045149738 937076233692392 529449
## + view 4 24633883964556 960989394877575 529944
## + floors 5 15148510618529 970474768223602 530158
## + bedrooms 12 10207511688104 975415767154027 530282
## + sqft_lot15 1 5292296694224 980330982147906 530369
## + condition 4 5141419397231 980481859444899 530378
## + sqft_lot 1 2874249384382 982749029457748 530422
## + sqft_above 1 2819403938373 982803874903757 530423
## + sqft_basement 1 2819403938373 982803874903757 530423
## + yr_renovated 1 2065732697952 983557546144178 530440
## + sqft_living15 1 1666004338394 983957274503736 530449
## <none> 985623278842130 530483
## - waterfront 1 82623310624817 1068246589466947 532221
## - yr_built 1 138498219044405 1124121497886536 533323
## - sqft_living 1 166559377614045 1152182656456176 533856
## - grade 11 302128454010880 1287751732853010 536240
##
## Step: AIC=529449.5
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms
##
## Df Sum of Sq RSS AIC
## + view 4 21681460837136 915394772855256 528952
## + floors 5 10106227005061 926970006687331 529225
## + bedrooms 12 9386236979090 927689996713302 529256
## + sqft_living15 1 5113143746029 931963089946363 529333
## + condition 4 5089508845984 931986724846408 529340
## + sqft_lot15 1 4657386381207 932418847311185 529344
## + sqft_lot 1 2578409722102 934497823970290 529392
## + sqft_above 1 2024780875953 935051452816439 529405
## + sqft_basement 1 2024780875953 935051452816439 529405
## + yr_renovated 1 1182796665413 935893437026979 529424
## <none> 937076233692392 529449
## - bathrooms 29 48547045149738 985623278842130 530483
## - sqft_living 1 66542023207909 1003618256900301 530930
## - waterfront 1 79953377847050 1017029611539442 531217
## - yr_built 1 120261432140348 1057337665832740 532057
## - grade 11 256039295387065 1193115529079457 534648
##
## Step: AIC=528951.6
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view
##
## Df Sum of Sq RSS AIC
## + floors 5 9818828786067 905575944069190 528728
## + bedrooms 12 7826233527895 907568539327361 528790
## + sqft_lot15 1 4790687193391 910604085661865 528840
## + condition 4 4811616528834 910583156326422 528846
## + sqft_living15 1 3228266511306 912166506343950 528877
## + sqft_lot 1 2714879273255 912679893582002 528889
## + yr_renovated 1 941416476222 914453356379034 528931
## + sqft_above 1 405710042621 914989062812636 528944
## + sqft_basement 1 405710042621 914989062812636 528944
## <none> 915394772855256 528952
## - view 4 21681460837136 937076233692392 529449
## - waterfront 1 27598647185030 942993420040287 529592
## - bathrooms 29 45594622022318 960989394877575 529944
## - sqft_living 1 59996076321357 975390849176614 530322
## - yr_built 1 103996682099375 1019391454954632 531275
## - grade 11 237400307786424 1152795080641680 533913
##
## Step: AIC=528728.5
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors
##
## Df Sum of Sq RSS AIC
## + bedrooms 12 7105445557653 898470498511537 528582
## + sqft_living15 1 5387516463004 900188427606186 528602
## + condition 4 5193620490213 900382323578976 528612
## + sqft_lot15 1 4084561901652 901491382167538 528633
## + sqft_lot 1 2273897171343 903302046897847 528676
## + sqft_above 1 1483867502440 904092076566750 528695
## + sqft_basement 1 1483867502440 904092076566750 528695
## + yr_renovated 1 716443965073 904859500104117 528713
## <none> 905575944069190 528728
## - floors 5 9818828786067 915394772855256 528952
## - view 4 21394062618141 926970006687331 529225
## - waterfront 1 27304408756537 932880352825727 529368
## - bathrooms 29 40784956144341 946360900213531 529623
## - sqft_living 1 65325960345477 970901904414667 530232
## - yr_built 1 92891335146363 998467279215552 530837
## - grade 11 219049590302479 1124625534371669 533389
##
## Step: AIC=528582.2
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors + bedrooms
##
## Df Sum of Sq RSS AIC
## + sqft_living15 1 5415042304870 893055456206667 528454
## + condition 4 5344486757922 893126011753615 528461
## + sqft_lot15 1 4754063487350 893716435024187 528470
## + sqft_lot 1 2752248141423 895718250370114 528518
## + sqft_above 1 1194595223121 897275903288415 528555
## + sqft_basement 1 1194595223121 897275903288415 528555
## + yr_renovated 1 614866368578 897855632142958 528569
## <none> 898470498511537 528582
## - bedrooms 12 7105445557653 905575944069190 528728
## - floors 5 9098040815824 907568539327361 528790
## - view 4 20038851432438 918509349943975 529051
## - waterfront 1 26025087308756 924495585820293 529197
## - bathrooms 29 40517305162424 938987803673961 529478
## - sqft_living 1 67528336180621 965998834692158 530146
## - yr_built 1 93079351195127 991549849706663 530711
## - grade 11 196704257145249 1095174755656786 532839
##
## Step: AIC=528453.6
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors + bedrooms + sqft_living15
##
## Df Sum of Sq RSS AIC
## + sqft_lot15 1 5463948974416 887591507232252 528323
## + condition 4 5673650983132 887381805223536 528324
## + sqft_lot 1 2884571790546 890170884416121 528386
## + sqft_above 1 2308867245084 890746588961583 528400
## + sqft_basement 1 2308867245084 890746588961583 528400
## + yr_renovated 1 768008885086 892287447321581 528437
## <none> 893055456206667 528454
## - sqft_living15 1 5415042304870 898470498511537 528582
## - bedrooms 12 7132971399519 900188427606186 528602
## - floors 5 11175926150686 904231382357353 528712
## - view 4 17731994295014 910787450501681 528870
## - waterfront 1 26427496494078 919482952700745 529082
## - bathrooms 29 43292496060430 936347952267097 529419
## - sqft_living 1 45487147406557 938542603613224 529525
## - yr_built 1 95373569503914 988429025710581 530645
## - grade 11 162550866405222 1055606322611889 532046
##
## Step: AIC=528322.9
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors + bedrooms + sqft_living15 + sqft_lot15
##
## Df Sum of Sq RSS AIC
## + condition 4 5693535833180 881897971399072 528192
## + sqft_above 1 1674115123177 885917392109075 528284
## + sqft_basement 1 1674115123177 885917392109075 528284
## + yr_renovated 1 872833192991 886718674039261 528304
## <none> 887591507232252 528323
## + sqft_lot 1 6562631015 887584944601236 528325
## - sqft_lot15 1 5463948974415 893055456206667 528454
## - sqft_living15 1 6124927791935 893716435024187 528470
## - bedrooms 12 7863846923150 895455354155401 528490
## - floors 5 10503842033069 898095349265321 528567
## - view 4 17545576953667 905137084185918 528738
## - waterfront 1 26711327481605 914302834713856 528962
## - bathrooms 29 43308996904499 930900504136751 529295
## - sqft_living 1 48766682014856 936358189247108 529477
## - yr_built 1 90655114375840 978246621608092 530423
## - grade 11 159512651251593 1047104158483844 531873
##
## Step: AIC=528191.8
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition
##
## Df Sum of Sq RSS AIC
## + yr_renovated 1 1759724924523 880138246474549 528151
## + sqft_above 1 1425426988538 880472544410534 528159
## + sqft_basement 1 1425426988538 880472544410534 528159
## <none> 881897971399072 528192
## + sqft_lot 1 2658947534 881895312451538 528194
## - condition 4 5693535833180 887591507232252 528323
## - sqft_lot15 1 5483833824464 887381805223536 528324
## - sqft_living15 1 6470784886722 888368756285794 528348
## - bedrooms 12 8053239481990 889951210881062 528364
## - floors 5 10932027111154 892829998510226 528448
## - view 4 17273526720127 899171498119199 528603
## - waterfront 1 26884817025307 908782788424379 528839
## - bathrooms 29 43043868416491 924941839815563 529164
## - sqft_living 1 47602802198674 929500773597746 529326
## - yr_built 1 73552975079277 955450946478349 529921
## - grade 11 160331977462416 1042229948861488 531780
##
## Step: AIC=528150.7
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition +
## yr_renovated
##
## Df Sum of Sq RSS AIC
## + sqft_above 1 1491801287431 878646445187118 528116
## + sqft_basement 1 1491801287431 878646445187118 528116
## <none> 880138246474549 528151
## + sqft_lot 1 2341940559 880135904533990 528153
## - yr_renovated 1 1759724924523 881897971399072 528192
## - sqft_lot15 1 5661119763880 885799366238429 528287
## - condition 4 6580427564712 886718674039261 528304
## - sqft_living15 1 6756876226458 886895122701008 528314
## - bedrooms 12 7917812194678 888056058669228 528320
## - floors 5 10815947025497 890954193500046 528405
## - view 4 16882158445898 897020404920448 528553
## - waterfront 1 26151874192590 906290120667140 528782
## - bathrooms 29 42352385379146 922490631853695 529108
## - sqft_living 1 47098953443858 927237199918407 529275
## - yr_built 1 58834101961662 938972348436211 529547
## - grade 11 160400928926805 1040539175401354 531747
##
## Step: AIC=528116
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms +
## view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition +
## yr_renovated + sqft_above
##
## Df Sum of Sq RSS AIC
## <none> 878646445187118 528116
## + sqft_lot 1 242760400 878646202426718 528118
## - sqft_above 1 1491801287431 880138246474549 528151
## - yr_renovated 1 1826099223416 880472544410534 528159
## - sqft_lot15 1 5051397707436 883697842894554 528238
## - condition 4 6334737214381 884981182401499 528263
## - bedrooms 12 7498755819374 886145201006492 528276
## - sqft_living15 1 7654660624978 886301105812096 528301
## - floors 5 12077866478589 890724311665707 528401
## - view 4 14855500545413 893501945732531 528470
## - waterfront 1 26483767407463 905130212594581 528756
## - sqft_living 1 38051576797683 916698021984801 529030
## - bathrooms 29 42127789694189 920774234881307 529070
## - yr_built 1 55377219146088 934023664333206 529435
## - grade 11 161434776383310 1040081221570428 531739
##
## Call:
## lm(formula = price ~ grade + sqft_living + yr_built + waterfront +
## bathrooms + view + floors + bedrooms + sqft_living15 + sqft_lot15 +
## condition + yr_renovated + sqft_above, data = KC_Data)
##
## Coefficients:
## (Intercept) grade3 grade4 grade5 grade6
## 5568144.3097 -107186.1901 -182465.1101 -214647.4367 -163015.2988
## grade7 grade8 grade9 grade10 grade11
## -70186.6592 20423.5127 160733.3946 320884.6820 539592.3263
## grade12 grade13 sqft_living yr_built waterfront1
## 919927.1871 1713514.6559 137.2528 -2813.0057 507112.8322
## bathrooms0.5 bathrooms0.75 bathrooms1 bathrooms1.25 bathrooms1.5
## -28719.8866 28011.3156 60561.0428 87591.2845 64491.6913
## bathrooms1.75 bathrooms2 bathrooms2.25 bathrooms2.5 bathrooms2.75
## 67274.1105 72280.2867 86399.9342 64375.3564 78939.6466
## bathrooms3 bathrooms3.25 bathrooms3.5 bathrooms3.75 bathrooms4
## 111367.9475 175320.7335 140169.9891 264015.2334 235633.4912
## bathrooms4.25 bathrooms4.5 bathrooms4.75 bathrooms5 bathrooms5.25
## 335755.4625 271945.5269 586260.9317 436618.0757 564653.3075
## bathrooms5.5 bathrooms5.75 bathrooms6 bathrooms6.25 bathrooms6.5
## 737604.9832 574355.3117 1092566.1753 655461.6713 156319.0105
## bathrooms6.75 bathrooms7.5 bathrooms7.75 bathrooms8 view1
## 411931.1983 109867.0600 3702175.1404 1682607.2764 106758.5586
## view2 view3 view4 floors1.5 floors2
## 44288.8029 87795.0452 216820.0869 22824.9257 30139.7192
## floors2.5 floors3 floors3.5 bedrooms1 bedrooms2
## 147956.3149 139949.6685 284704.1607 35787.6779 51172.3759
## bedrooms3 bedrooms4 bedrooms5 bedrooms6 bedrooms7
## 12984.5021 -13833.0589 -12312.7246 -69766.3604 -159800.1439
## bedrooms8 bedrooms9 bedrooms10 bedrooms11 bedrooms33
## 56420.7378 -136530.6435 -132297.6888 -236177.1113 178757.7085
## sqft_living15 sqft_lot15 condition2 condition3 condition4
## 47.7290 -0.5858 27742.4250 51660.3490 71141.0769
## condition5 yr_renovated sqft_above
## 118374.8914 24.8020 -27.4211
Best_1=lm(formula = price ~ grade + sqft_living + yr_built + waterfront + bathrooms + view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition + yr_renovated + sqft_above, data = KC_Data[z,])
cv.error=fitted.values(Best_1, KC_Data)
mean((cv.error[-z]-price[-z])^2)
## [1] 229193908561
Stepwise method has a MSPE of 245,760,760,973. Tree Regression has a MSPE of 61,612,773,589. If we use only one principle component, the MSPE will be 134,247,905,869. MSPE for two and three principle components will be 134,206,128,747 and 143,132,709,054, respectively. Lasso has a cv of MSPE 44,566,731,235. Ridge has a cv of MSPE 41,645,084,735. The order of MSPE from least to greatest is as follows:
Ridge < Lasso < Tree < PCR(3) < PCR(2) < PCR(1) < Stepwise
Based solely on MSPE, the ridge regression appears to give us the best MSE using cross-validation using the validation set approach.
As we illustrated in the beginning of our project, we have a very large data set containing a vast amount of variables and total number of data values. More specifically, we have 14 variables, and 21613 rows of data, so the MSE from cross-validation is considerably larger than what we have seen in some of example and models we used in class.
In our data set, ridge regression has the smallest MSPE. Because the Regression model uses variable shrinkage and eliminates some of the useless variables, we will choose Ridge Regression as our final model; it also has the lowest MSPE giving us more reason to select this model. Other than ridge regression, the lasso and tree models are also considerable.
We won’t choose stepwise or pcr because all of those models have a much larger MSPE than lasso, tree and ridge models. We did not use logistic regression for this analysis because our dependent variable in this case is a numeric variable.