1. Introduction

This is a final Project for the statistical programming in R (STAT612) course. The dataset fot this project is taken from Kaggle and it is about House Sales in King Country, USA.

The data set was obtained from www.kaggle.com website. It is Kaggle dataset which is about House Sales in King County of Washington State, USA. the link for this dataset is https://www.kaggle.com/gabriellima/house-sales-in-king-county-usa/data

Most of the time, housing prices are subject to different market forces. Sometimes the prices rise, other times the prices falls. The market forces that affect the housing prices may include interest rates, economic factors (such as GDP, employment, manufacturing, prices of goods), import/export and government subsidies. These forces are out of our control and can not be easily predictable. As the result, this paper does not explore the effect of those different external factors on the price of houses. Instead, we will be focusing to explore effect of various internal factors such as number of bedrooms, bathrooms, view, condition, grade, location, square foot, etc. - on the housing prices.

2. Overview and objective of the Study

The data set contains the prices of houses against a various parameters that may or may not affect the house price. The objective of the study is to use statistical analysis in order to find out the dependence of these variables on the price of houses. it is to assess which parameters highly affect the housing prices and which variables have minimal affect on the price of houses. The statistical tools that we will be focusing to use are Correlation, box plot, various scatter plot and bar plots. in addition to that geospatial representation of those house sale price were plotted on ESRI maps to see which prices are higher on which part of the study area using the longitude and latitude points. Over all, important insights between the variables were drawn from boxplots, histogram, scatter, corrgrams and geospatial mapping.

3. The Dataset and Description

The data for these sales comes from the official public records of home sales in the King County area, Washington State, USA. The data sets contains 21613 rows and 21 columns. Each represents a home sold from May 2014 through May 2015. Below is a breakdown of the variables involved:

[, 1] id - Unique ID for each home sold.

[, 2] date - Date of the home sale.

[, 3] price - Price of each home sold.

[, 4] bedrooms - Number of bedrooms.

[, 5] bathrooms - Number of bathrooms, where - 0.5 accounts for a room with a toilet but no shower.

[, 6] sqft_living - Square footage of the apartments interior living space.

[, 7] sqft_lot - Square footage of the land space.

[, 8] floors - Number of floors.

[, 9] waterfront - A variable for whether the apartment was overlooking the waterfront or not.

[, 10] view - An index from 0 to 4 of how good the view of the property was.

[, 11] condition - An index from 1 to 5 on the condition of the apartment.

[, 12] grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

[, 13] sqft_above - The square footage of the interior housing space that is above ground level.

[, 14] sqft_basement - The square footage of the interior housing space that is below ground level.

[, 15] yr_built - The year the house was initially built.

[, 16] yr_renovated - The year of the house’s last renovation.

[, 17] zipcode - What zipcode area the house is in.

[, 18] lat - Lattitude.

[, 19] long - Longitude.

[, 20] sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors.

[, 21] sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors.

4. problem question to be adressed

  1. Which variables seem to affect house sales pirce in the king county?
  2. What role does visualization play to see effect of various variables on housing price?
  3. Does the spatial location of the houses affect the house price?

5. loading the packages that may require for the data assessment

options(scipen = 999) 
options(warn=-1)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(forcats)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(lattice)
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(leaps)
library(tidyr)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
library(dplyr)
library(ggplot2)
library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
## 
##     baseball
## The following object is masked from 'package:lattice':
## 
##     panel.fill
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(date)
library(FactoMineR)
library(tree)
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(rpart)
library(scales)
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
## 
##     ozone
## The following object is masked from 'package:purrr':
## 
##     map
library(rbokeh)
library(stringr)
library(leaflet)

4. Reading data

KC_Data <- read.csv("kc_house_data.csv", sep=",", header=T, stringsAsFactors=F)
KC_Data <-na.omit(KC_Data)
head(KC_Data)
##           id            date   price bedrooms bathrooms sqft_living
## 1 7129300520 20141013T000000  221900        3      1.00        1180
## 2 6414100192 20141209T000000  538000        3      2.25        2570
## 3 5631500400 20150225T000000  180000        2      1.00         770
## 4 2487200875 20141209T000000  604000        4      3.00        1960
## 5 1954400510 20150218T000000  510000        3      2.00        1680
## 6 7237550310 20140512T000000 1230000        4      4.50        5420
##   sqft_lot floors waterfront view condition grade sqft_above sqft_basement
## 1     5650      1          0    0         3     7       1180             0
## 2     7242      2          0    0         3     7       2170           400
## 3    10000      1          0    0         3     6        770             0
## 4     5000      1          0    0         5     7       1050           910
## 5     8080      1          0    0         3     8       1680             0
## 6   101930      1          0    0         3    11       3890          1530
##   yr_built yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1     1955            0   98178 47.5112 -122.257          1340       5650
## 2     1951         1991   98125 47.7210 -122.319          1690       7639
## 3     1933            0   98028 47.7379 -122.233          2720       8062
## 4     1965            0   98136 47.5208 -122.393          1360       5000
## 5     1987            0   98074 47.6168 -122.045          1800       7503
## 6     2001            0   98053 47.6561 -122.005          4760     101930
glimpse(KC_Data)
## Observations: 21,613
## Variables: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, ...
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T...
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1230000,...
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, ...
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0...
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ...
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003,...
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47....
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...
dim(KC_Data)
## [1] 21613    21

We have 21613 observations (rows) and 21 columns (variables) in our data set.

This dataset has 21 variables including price (our target) and 21,613 observation. The goal of this data processing is to predict accurately the price by using available predictor variables.

Transforming data - converting certain dummy variables in to factor

Variables such as bathrooms, bedrooms, floors, condition, waterfront, view and grade should be converted in to factor variables (dummy variables) in our data set. For example, Condition has three maximum factors, while grade has at least five. It is better to convert them in to dummy variables as they are not continous numeric variables.

Converting bathrooms, bedrooms, floors, condition, waterfront, view and grade in to factor variables:

KC_Data$bedrooms <- as.factor(KC_Data$bedrooms)
KC_Data$bathrooms <- as.factor(KC_Data$bathrooms)
KC_Data$waterfront <- as.factor(KC_Data$waterfront)
KC_Data$view <- as.factor(KC_Data$view)
KC_Data$grade <- as.factor(KC_Data$grade)
KC_Data$floors <- as.factor(KC_Data$floors)
KC_Data$condition <- as.factor(KC_Data$condition)

Checking for NA values in entire dataset

table(is.na(KC_Data))
## 
##  FALSE 
## 453873
attach(KC_Data)
KC_Data=data.frame(price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15)
#We dont need variables such as ID, zipcode,date1qAqqqqq  `
attach(KC_Data)
## The following objects are masked from KC_Data (pos = 3):
## 
##     bathrooms, bedrooms, condition, floors, grade, price,
##     sqft_above, sqft_basement, sqft_living, sqft_living15,
##     sqft_lot, sqft_lot15, view, waterfront, yr_built, yr_renovated

Visualization and Plotting

We are including a few plots to help us visualize the breakdown of some of the variables we believe will be significant to housing prices.

### 21.1. assessing Price vs. bedrooms using boxplots and bar chart

Using simple Box plot

## Price vs. bedrooms ->> There is relationship between price and bedrooms (significant relationship exists)

boxplot1=boxplot(price~bedrooms, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. bedrooms", xlab="bedrooms", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = bedrooms, y = price, fill = bedrooms, main = "Price vs. bedrooms" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)

Using bar chart in ggplot

KC_Data %>%
  mutate(bedrooms = as.factor(bedrooms)) %>%
  group_by(bedrooms) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(bedrooms = reorder(bedrooms,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = bedrooms,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = bedrooms, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'bedrooms', 
       y = 'Median Price', 
       title = 'bedrooms and Median Price') +
  coord_flip() + 
  theme_bw()

Price and bedrooms have nice correlation. As number of bedrooms increases price also increases.

21.2 - assessing Price vs. bathrooms using boxplots and bar chart.

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = bathrooms, y = price, fill = bathrooms,main = "Price vs. Bathrooms" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
coord_flip()

Using simple Box plot

## Price vs. Bathrooms ->> Nice correlation, as # of bahtrooms increases [median of bar plot], price increases as well, with one exception when bathroom=7

boxplot2=boxplot(price~bathrooms, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. Bathrooms", xlab="Bathrooms", ylab="Price")

Using bar chart in ggplot

KC_Data %>%
  mutate(bathrooms = as.factor(bathrooms)) %>%
  group_by(bathrooms) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(bathrooms = reorder(bathrooms,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = bathrooms,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = bathrooms, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 3.5, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'bathrooms', 
       y = 'Median Price', 
       title = 'bathrooms and Median Price') +
  coord_flip() + 
  theme_bw()

Price of house and its associated number of bathrooms have nice correlation. As number of bahtrooms increases (median of bar plot), price increases as well

21.3 - Assessing how the Grade affects the price using boxplots, and bar chart.

Using simple Box plot

## Price vs. Grade ->> Nice correlation, grade increases [median of bar plot], price increases as well

boxplot3=boxplot(price~grade, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. Grade", xlab="Grade", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = grade, y = price, fill = grade,main = "Price vs. Grade" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)

Using bar chart in ggplot

KC_Data %>%
  mutate(grade = as.factor(grade)) %>%
  group_by(grade) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(grade = reorder(grade,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = grade,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = grade, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'grade', 
       y = 'Median Price', 
       title = 'grade and Median Price') +
  coord_flip() + 
  theme_bw()

**Price and Grade have also nice correlation. As grade increases (median of bar plot), price also increases.

21.4. Assessing how the number of view affect the price using boxplots and bar chart.

Using simple Box plot

## Price vs. View ->> Nice correlation, view increases [median of bar plot], price increases as well

boxplot4=boxplot(price~view, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. View", xlab="View", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = view, y = price, fill = view, main = "Price vs. View" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)

Using bar chart in ggplot

KC_Data %>%
  mutate(view = as.factor(view)) %>%
  group_by(view) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(view = reorder(view,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = view,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = view, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'view', 
       y = 'Median Price', 
       title = 'view and Median Price') +
  coord_flip() + 
  theme_bw()

Price and View has nice correlation. AS view increases (median of bar plot), the price of house also increases.

21.5 - Assessing how condition affect the price using boxplots and bar chart.

Using simple Box plot

## Price vs. condition ->> This is almost no relationship between price and condition
boxplot5=boxplot(price~condition, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. condition", xlab="condition", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = condition, y = price, fill = condition, main = "Price vs. condition" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)

Using bar chart in ggplot

KC_Data %>%
  mutate(condition = as.factor(condition)) %>%
  group_by(condition) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(condition = reorder(condition,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = condition,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = condition, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'condition', 
       y = 'Median Price', 
       title = 'condition and Median Price') +
  coord_flip() + 
  theme_bw()

** there is almost very little or no relationship between price and condition. the relation ship that we see is almost insignificant.**

21.6 - Assessing how number of floors affect the price using boxplots and bar chart.

Using simple Box plot

## Price vs. floors ->> This is almost no relationship between price and floors (insignificant relationship exists)

boxplot6=boxplot(price~floors, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. floors", xlab="floors", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = floors, y = price, fill = floors, main = "Price vs. floors" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)

Using bar chart in ggplot

KC_Data %>%
  mutate(floors = as.factor(floors)) %>%
  group_by(floors) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(floors = reorder(floors,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = floors,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = floors, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'floors', 
       y = 'Median Price', 
       title = 'floors and Median Price') +
  coord_flip() + 
  theme_bw()

** the relationship that we see between floors and price is almost insignificant. However it shows some sort of positive correlation **

21.7. Assessing how waterfront affect the price using boxplots and bar chart.

Using ggplot Box plot with outliers in red color

# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = waterfront, y = price, fill = waterfront, main = "Price vs. waterfront" )) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2)

Using bar chart in ggplot

KC_Data %>%
  mutate(waterfront = as.factor(waterfront)) %>%
  group_by(waterfront) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(waterfront = reorder(waterfront,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = waterfront,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = waterfront, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'waterfront', 
       y = 'Median Price', 
       title = 'waterfront and Median Price') +
  coord_flip() + 
  theme_bw()

21.8 - Assessing how Year Renovated affect the price using bar chart.

Using bar chart in ggplot

KC_Data %>%
  group_by(yr_renovated) %>%
  dplyr::summarise(Median_Price = median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(yr_renovated = reorder(yr_renovated,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  head(10) %>%
  
  
  ggplot(aes(x = yr_renovated,y = Median_Price)) +
  geom_bar(stat='identity',colour="white",fill = "blue") +
  geom_text(aes(x = yr_renovated, y = 1, label = paste0("(",Median_Price,")",sep="")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'year renovated', 
       y = 'Median Price', 
       title = 'Year renovated and Median Price') +
   coord_flip() +
  theme_bw()

Year renovated doesn’t affect the price of the house.

22 Using scater plots to indicate the relationship of numeric and date variables with house prices

22.1 - Price Plots

We plot the Price Plot , unfortunately the graph does not reveal much.

case 1 Price Plot

KC_Data %>%
  
  ggplot(aes(x = price)) +    
  geom_histogram(alpha = 0.8,fill = "blue") +
  
  labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

case 2 Price Plot

KC_Data %>%
  
  ggplot(aes(x = price)) +    
  geom_histogram(alpha = 0.8,fill = "blue") +
  scale_x_continuous(limits=c(0,2e6)) +
  
  labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

case 3 - Price Plot

KC_Data %>%
  
  ggplot(aes(x = price)) +    
  geom_histogram(alpha = 0.8,fill = "blue") +
  scale_x_continuous(limits=c(0,1e6)) +
  
  labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

22.2. Sqft Living and Price scatter Plot

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_living)) %>% 
 
  ggplot(aes(x=sqft_living,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(x=sqft_living,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Living)")+
  ylab("Price")

Price and Sqft_living have nice correlation. As sqft_living increases price also increases.

22.3 - Assessing how Sqft_living15 affects Price using scatter Plot

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_living)) %>% 
 
  ggplot(aes(x=sqft_living15,y=price))+
  geom_point(color = "blue")+
  
  stat_smooth(aes(x=sqft_living15,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Living15)")+
  ylab("Price")

Price and Sqft_living15 have nice correlation. As sqft_living15 increases price also increases.

22.4 - Assessing whether Sqft Living and Sqft Living15 correlates using scatter Plot

KC_Data %>% 
  filter(!is.na(sqft_living15)) %>% 
  filter(!is.na(sqft_living)) %>% 
 
  ggplot(aes(x=sqft_living15,y=sqft_living))+
  geom_point(color = "blue")+
  
  stat_smooth(aes(x=sqft_living15,y=sqft_living),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Living15)")+
  ylab("sqft_living")

** sqft_living and sqft_living15 has high correlation**

22.5 - Assessing whether Sqft Lot and Price correlates (using scatter Plot)

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_lot)) %>% 
  
  ggplot(aes(x=sqft_lot,y=price))+
  geom_point(color = "blue")+
  
  scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
  stat_smooth(aes(x=sqft_lot,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Lot)")+
  ylab("Price")

** Sqft_lot and Price have very insignificant relationship. But still have silght increment of price with increment of Sqft_lot**

22.6 - Assessing whether Sqft Lot and Price correlates (using scatter Plot)

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_lot15)) %>% 
  
  ggplot(aes(x=sqft_lot15,y=price))+
  geom_point(color = "blue")+
  
  scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot15))) +
  stat_smooth(aes(x=sqft_lot15,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Lot15)")+
  ylab("Price")

** Sqft_lot15 and Price have very insignificant relationship. But still have silght increment of price with increment of Sqft_lot15**

22.7 - Assessing whether sqft_lot and sqft_lot15 correlates (using scatter Plot)

KC_Data %>% 
  filter(!is.na(sqft_lot15)) %>% 
  filter(!is.na(sqft_lot)) %>% 
  
  ggplot(aes(x=sqft_lot,y=sqft_lot15))+
  geom_point(color = "blue")+
  
  scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
  stat_smooth(aes(x=sqft_lot,y=sqft_lot15),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Lot)")+
  ylab("sqft_lot15")

** sqft_lot and sqft_lot15 has high positive correlation**

22.8 - Assessing how lat affets Price of the housing (using scatter Plot)

## Price vs. Lat ->> This is more like a normal dist relationship, price peaks around when lat= 47.64 and declines afterwards, but this can be modeled easily. we would say Lat explains the price as well.

boxplot5=boxplot(price~lat, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. Lat", xlab="Lat", ylab="Price")

or

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(lat)) %>% 
 
  ggplot(aes(x=lat,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(x=lat,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("lat")+
  ylab("Price")

** Price vs. Lat looks to have more likely a normal distribution relationship. The house price peaks around when lat= 47.64 and declines afterwards. Generally, we would say that Lat explains the price well.**

24. Assessing how sqft_basement affets Price of the housing (by using scatter Plot)

## Price vs. sqft_basement 

boxplot6=boxplot(price~sqft_basement, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. sqft_basement", xlab="sqft_basement", ylab="Price")

OR

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_basement)) %>% 
 
  ggplot(aes(x=sqft_basement,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(sqft_basement,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("sqft_basement")+
  ylab("Price")

** Price and sqft_basement have good correlation. We would say sqft_basement explains the price well.**

25. assessing how date affets Price of the housing using scatter Plot

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(date)) %>% 
 
  ggplot(aes(x=date,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(date,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("date")+
  ylab("Price")

Price and date have almost no relationship. Thus, date doesn’t explain house price.

26 - More visualizations

ggplot(data = KC_Data) +
  geom_point(mapping = aes(x = sqft_above, y = price, color = price))

viridis::scale_color_viridis(discrete=TRUE)
## <ggproto object: Class ScaleDiscrete, Scale, gg>
##     aesthetics: colour
##     axis_order: function
##     break_info: function
##     break_positions: function
##     breaks: waiver
##     call: call
##     clone: function
##     dimension: function
##     drop: TRUE
##     expand: waiver
##     get_breaks: function
##     get_breaks_minor: function
##     get_labels: function
##     get_limits: function
##     guide: legend
##     is_discrete: function
##     is_empty: function
##     labels: waiver
##     limits: NULL
##     make_sec_title: function
##     make_title: function
##     map: function
##     map_df: function
##     n.breaks.cache: NULL
##     na.translate: TRUE
##     na.value: NA
##     name: waiver
##     palette: function
##     palette.cache: NULL
##     position: left
##     range: <ggproto object: Class RangeDiscrete, Range, gg>
##         range: NULL
##         reset: function
##         train: function
##         super:  <ggproto object: Class RangeDiscrete, Range, gg>
##     reset: function
##     scale_name: viridis
##     train: function
##     train_df: function
##     transform: function
##     transform_df: function
##     super:  <ggproto object: Class ScaleDiscrete, Scale, gg>
ggplot(data = KC_Data) +
  geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
  geom_smooth(mapping = aes(x = sqft_living, y = price))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(data = KC_Data) +
  geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
  
  geom_smooth(mapping = aes(x = sqft_living, y = price, linetype = waterfront))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

based on above visualization procedures variables such as number of bedrooms, number of bathrooms, grade, view of the houses, condition of the house, whether the has have water front or not, sqft_basement, sqft_living, sqft_living15, lat, sqft_lot and sqft_lot15 affect positively the price of the houses eventhough their degree may differ.

**sqft_living, sqft_living, bathrooms, grade, view, bedrooms, condition and sqft_basement are factors that have effect on the prices of the houses .

27. conducting More correlations to Strengthen our claim

Testing correlation of the variables against price

  • To strengthen our claim, we also computed more correlation between price and variables,

27.1 -Let us run the correlation function.

vctCorr = numeric(0)
for (i in names(KC_Data))
{
    cor.result <- cor(KC_Data$price, as.numeric(KC_Data[,i])) 
    vctCorr <- c(vctCorr, cor.result)
}
KC_DatarCorr <- vctCorr
names(KC_DatarCorr) <- names(KC_Data)
KC_DatarCorr
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##    1.00000000    0.31489546    0.52450698    0.70204372    0.08965521 
##        floors    waterfront          view     condition         grade 
##    0.25678570    0.26633051    0.39734647    0.03639192    0.66755773 
##    sqft_above sqft_basement      yr_built  yr_renovated sqft_living15 
##    0.60556551    0.32383736    0.05398183    0.12644223    0.58537401 
##    sqft_lot15 
##    0.08245555

27.2 Correlation between house variables continued ….

ggcorr(KC_Data, hjust = 0.8, layout.exp = 1) + 
ggtitle("Correlation between house variables")

Modelling

Ridge Regression

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-16
n=length(grade)
z=sample(n,n/2)
x <- model.matrix(price~., data = KC_Data) [,-1]  # Predictor matrix, with intercept column removed
y <- KC_Data$price # y vector with just the actual values of the response
ridge <- glmnet(x[z,], y[z], alpha=0 ) # Object with many Ridge regressions for various values of ???
plot(ridge) # plot of false slopes

set.seed(1) # To get repeatable results
cv.ridge=cv.glmnet(x,y,alpha=0) # 10-Fold is the default
cbind("Lambda"=cv.ridge$lambda, "10-Fold MSE"=cv.ridge$cvm)
##              Lambda  10-Fold MSE
##   [1,] 257898381.71 134930583513
##   [2,] 234987400.30 133862551304
##   [3,] 214111767.33 133722345441
##   [4,] 195090668.05 133603090409
##   [5,] 177759350.81 133472537957
##   [6,] 161967700.02 133329652562
##   [7,] 147578935.97 133173310731
##   [8,] 134468430.05 133002294956
##   [9,] 122522625.35 132815287526
##  [10,] 111638053.01 132610864288
##  [11,] 101720436.07 132387488446
##  [12,]  92683872.89 132143504505
##  [13,]  84450093.08 131877132498
##  [14,]  76947779.57 131586462678
##  [15,]  70111950.92 131269450202
##  [16,]  63883398.44 130923891638
##  [17,]  58208173.40 130547500264
##  [18,]  53037119.71 130137794854
##  [19,]  48325448.17 129692168281
##  [20,]  44032348.55 129207875960
##  [21,]  40120636.07 128682041197
##  [22,]  36556429.34 128111664404
##  [23,]  33308856.91 127493636206
##  [24,]  30349789.87 126824755010
##  [25,]  27653598.19 126101749680
##  [26,]  25196928.74 125321307912
##  [27,]  22958503.04 124480110890
##  [28,]  20918932.91 123574874738
##  [29,]  19060552.57 122602399166
##  [30,]  17367265.61 121559623535
##  [31,]  15824405.60 120443690372
##  [32,]  14418609.02 119252016033
##  [33,]  13137699.54 117982367890
##  [34,]  11970582.52 116632946990
##  [35,]  10907148.96 115202474629
##  [36,]   9938187.91 113690280817
##  [37,]   9055306.69 112096392000
##  [38,]   8250858.20 110421614974
##  [39,]   7517874.70 108667613389
##  [40,]   6850007.43 106836969776
##  [41,]   6241471.65 104933141127
##  [42,]   5686996.51 102960821576
##  [43,]   5181779.42 100925519454
##  [44,]   4721444.42  98833783570
##  [45,]   4302004.31  96693106653
##  [46,]   3919826.10  94511847249
##  [47,]   3571599.55  92299124676
##  [48,]   3254308.49  90064687188
##  [49,]   2965204.70  87818756784
##  [50,]   2701784.08  85571855477
##  [51,]   2461765.03  83334618911
##  [52,]   2243068.61  81117603992
##  [53,]   2043800.58  78931097475
##  [54,]   1862234.97  76784932429
##  [55,]   1696799.15  74688318973
##  [56,]   1546060.19  72649695012
##  [57,]   1408712.46  70676601691
##  [58,]   1283566.33  68775519166
##  [59,]   1169537.83  66949732452
##  [60,]   1065639.32  65208144817
##  [61,]    970970.86  63552206989
##  [62,]    884712.48  61984176011
##  [63,]    806117.06  60505344957
##  [64,]    734503.85  59116032295
##  [65,]    669252.55  57815663548
##  [66,]    609798.00  56602867799
##  [67,]    555625.22  55475346589
##  [68,]    506265.01  54432201009
##  [69,]    461289.82  53469231979
##  [70,]    420310.11  52582891498
##  [71,]    382970.91  51769455613
##  [72,]    348948.83  51026578427
##  [73,]    317949.18  50348212580
##  [74,]    289703.45  49730653178
##  [75,]    263967.00  49169868396
##  [76,]    240516.90  48662005374
##  [77,]    219150.04  48206426001
##  [78,]    199681.35  47793569750
##  [79,]    181942.21  47422267242
##  [80,]    165778.97  47090017826
##  [81,]    151051.62  46794911484
##  [82,]    137632.61  46529621115
##  [83,]    125405.71  46293704972
##  [84,]    114265.01  46084711673
##  [85,]    104114.02  45901863558
##  [86,]     94864.82  45739324821
##  [87,]     86437.29  45596601456
##  [88,]     78758.44  45471311459
##  [89,]     71761.76  45363603472
##  [90,]     65386.64  45270537923
##  [91,]     59577.87  45191239640
##  [92,]     54285.14  45123467498
##  [93,]     49462.60  45068472115
##  [94,]     45068.47  45016620960
##  [95,]     41064.72  44978964685
##  [96,]     37416.64  44944239209
##  [97,]     34092.65  44916366762
##  [98,]     31063.95  44893829664
##  [99,]     28304.32  44876277740
plot(cv.ridge, LW = 4) # Plot all lambdas vs. MSEs

coef(cv.ridge)
## 75 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)   2516095.93258032
## bedrooms1       -5370.75193326
## bedrooms2       18634.97777342
## bedrooms3       -4858.98005490
## bedrooms4       -4971.26990417
## bedrooms5        8818.44222001
## bedrooms6      -16438.54892017
## bedrooms7      -60487.45948929
## bedrooms8       94898.62598990
## bedrooms9      -14755.54075373
## bedrooms10     -49815.80191362
## bedrooms11    -122600.68305411
## bedrooms33     125217.81943486
## bathrooms0.5   -86033.99358139
## bathrooms0.75  -49367.16809985
## bathrooms1     -18725.77913584
## bathrooms1.25   32261.12384956
## bathrooms1.5   -13926.51571203
## bathrooms1.75  -14192.94947690
## bathrooms2     -12179.36726667
## bathrooms2.25     -89.40661514
## bathrooms2.5   -21476.44692110
## bathrooms2.75   -3366.61405324
## bathrooms3      22626.82430663
## bathrooms3.25   92078.49441772
## bathrooms3.5    60361.33952563
## bathrooms3.75  158967.66526206
## bathrooms4     150222.06230055
## bathrooms4.25  231589.67608589
## bathrooms4.5   168770.00383168
## bathrooms4.75  429509.01991359
## bathrooms5     291919.64741588
## bathrooms5.25  388781.12984220
## bathrooms5.5   563595.66869511
## bathrooms5.75  480794.51355362
## bathrooms6     806353.83456227
## bathrooms6.25  571001.62775195
## bathrooms6.5   165800.48938604
## bathrooms6.75  398132.22953792
## bathrooms7.5  -104549.13117987
## bathrooms7.75 2669076.52293248
## bathrooms8    1310690.56326269
## sqft_living        57.00702034
## sqft_lot           -0.03108303
## floors1.5       32656.68796382
## floors2         15607.56794965
## floors2.5      143815.64976705
## floors3         78985.17474939
## floors3.5      192405.14018510
## waterfront1    377249.36400571
## view1           93582.67601218
## view2           58601.93125170
## view3          101361.83723950
## view4          243792.23130311
## condition2     -39211.32973495
## condition3     -16690.35513809
## condition4       4495.17244829
## condition5      47775.41823446
## grade3         -76281.52907214
## grade4         -94432.02925161
## grade5        -114634.75615719
## grade6         -90110.62060560
## grade7         -59421.77389177
## grade8          -8865.23718039
## grade9          72480.51035199
## grade10        176217.15857050
## grade11        317803.97967054
## grade12        582228.43889469
## grade13       1178977.60212396
## sqft_above         51.11521512
## sqft_basement      66.95684611
## yr_built        -1180.55571653
## yr_renovated       36.78039953
## sqft_living15      58.84783004
## sqft_lot15         -0.28925602
best.lambda=cv.ridge$lambda.min
best.lambda
## [1] 28304.32
log(best.lambda) # Best lambda and log(lambda)
## [1] 10.25077
min.mse=min(cv.ridge$cvm) # Lowest 10FCV MSE
cbind("Best Lambda"=best.lambda,"Log(Lambda)"=log(best.lambda),"Best 10FCV MSE"=min.mse)
##      Best Lambda Log(Lambda) Best 10FCV MSE
## [1,]    28304.32    10.25077    44876277740
Yhat=predict(cv.ridge,cv.ridge$lambda.min, newx=x[-z,])
mean((Yhat-y[-z])^2) #mean squared prediction error
## [1] 38372777876

Lasso Regression

lasso <- glmnet(x[z,], y[z], alpha=1 ) # Object with many lass regressions for various values 
plot(lasso, LW = 4)

cv.lasso=cv.glmnet(x[z,],y[z],alpha=1,lambda=seq(0,1000,1))
cv.lasso$lambda.min
## [1] 1000
min(cv.lasso$cvm)
## [1] 48582602001
plot(cv.lasso)

lasso=glmnet(x[z,],y[z],alpha=1,lambda=cv.lasso$lambda.min)
coef(lasso)
## 75 x 1 sparse Matrix of class "dgCMatrix"
##                            s0
## (Intercept)   5592256.6932775
## bedrooms1       29142.4114994
## bedrooms2       44582.8757816
## bedrooms3       11204.3132496
## bedrooms4      -10301.6517977
## bedrooms5       -6735.2429381
## bedrooms6      -75033.3211138
## bedrooms7      -95435.8440753
## bedrooms8       74601.6177672
## bedrooms9           .        
## bedrooms10     -94737.9480153
## bedrooms11    -147463.7580439
## bedrooms33      68135.7069312
## bathrooms0.5   -41289.0096711
## bathrooms0.75       .        
## bathrooms1          .        
## bathrooms1.25       .        
## bathrooms1.5        .        
## bathrooms1.75     -85.6237681
## bathrooms2          .        
## bathrooms2.25   10867.0735554
## bathrooms2.5   -12095.4727618
## bathrooms2.75       .        
## bathrooms3      27063.6946147
## bathrooms3.25   67810.6562627
## bathrooms3.5    56396.3559092
## bathrooms3.75  164292.5845633
## bathrooms4     104344.9221321
## bathrooms4.25  152520.3942531
## bathrooms4.5   168609.4328294
## bathrooms4.75  402272.2539910
## bathrooms5     406815.6575097
## bathrooms5.25  462178.3324392
## bathrooms5.5   789360.3849534
## bathrooms5.75 1191576.2163702
## bathrooms6     424020.4166560
## bathrooms6.25   -8358.2867302
## bathrooms6.5        .        
## bathrooms6.75       .        
## bathrooms7.5        .        
## bathrooms7.75 3847993.5109901
## bathrooms8          .        
## sqft_living       131.7307989
## sqft_lot            .        
## floors1.5       13999.6696300
## floors2         25740.0543535
## floors2.5      150945.6716215
## floors3        142685.0693930
## floors3.5      199078.8337391
## waterfront1    603675.9221005
## view1           72282.4579116
## view2           51826.2177537
## view3           78036.4616273
## view4          205635.8380991
## condition2     -21358.0350066
## condition3          .        
## condition4       9044.0146733
## condition5      63151.8736554
## grade3              .        
## grade4        -142754.6152769
## grade5        -208809.4292845
## grade6        -165938.3223359
## grade7         -82640.9507693
## grade8              .        
## grade9         134214.4769132
## grade10        286778.7924714
## grade11        545253.1393748
## grade12       1024619.5767465
## grade13       1230720.8898316
## sqft_above          .        
## sqft_basement      31.7113766
## yr_built        -2767.4147174
## yr_renovated       23.7637265
## sqft_living15      40.8987616
## sqft_lot15         -0.5876256

Some of the cofficients are zero. When lambda moves from lambda.min to higher value, it deletes more and more variables that are not significant to the model.

Yhat=predict(lasso, cv.lasso$lambda.min,newx=x[-z,])
mean((Yhat-y[-z])^2) #mean squared prediction error
## [1] 40830431353

Principal Components Regression

reg = lm(price~. , data = KC_Data)
x=model.matrix(reg)
pc = princomp(x)
summary(pc)
## Importance of components:
##                               Comp.1        Comp.2          Comp.3
## Standard deviation     46658.0200180 16857.2906824 1298.1327633016
## Proportion of Variance     0.8837189     0.1153552    0.0006840687
## Cumulative Proportion      0.8837189     0.9990740    0.9997581004
##                                Comp.4          Comp.5         Comp.6
## Standard deviation     527.9624730920 410.44466891231 384.7583523302
## Proportion of Variance   0.0001131534   0.00006838656   0.0000600949
## Cumulative Proportion    0.9998712538   0.99993964040   0.9999997353
##                                  Comp.7             Comp.8
## Standard deviation     25.4819441865983 0.6016539457258527
## Proportion of Variance  0.0000002635886 0.0000000001469451
## Cumulative Proportion   0.9999999988850 0.9999999990318967
##                                    Comp.9            Comp.10
## Standard deviation     0.5698229400903827 0.5545971834730218
## Proportion of Variance 0.0000000001318079 0.0000000001248581
## Cumulative Proportion  0.9999999991637046 0.9999999992885628
##                                   Comp.11             Comp.12
## Standard deviation     0.4649557407025032 0.42554282477323729
## Proportion of Variance 0.0000000000877576 0.00000000007351028
## Cumulative Proportion  0.9999999993763203 0.99999999944983065
##                                    Comp.13             Comp.14
## Standard deviation     0.37056826591278252 0.35244691627342339
## Proportion of Variance 0.00000000005574398 0.00000000005042535
## Cumulative Proportion  0.99999999950557461 0.99999999955599994
##                                    Comp.15             Comp.16
## Standard deviation     0.34650007417147627 0.32745887740036950
## Proportion of Variance 0.00000000004873805 0.00000000004352864
## Cumulative Proportion  0.99999999960473795 0.99999999964826669
##                                   Comp.17            Comp.18
## Standard deviation     0.3086504953817114 0.2919804199716121
## Proportion of Variance 0.0000000000386719 0.0000000000346074
## Cumulative Proportion  0.9999999996869385 0.9999999997215460
##                                    Comp.19             Comp.20
## Standard deviation     0.27248803220252882 0.27028235100358827
## Proportion of Variance 0.00000000003014091 0.00000000002965493
## Cumulative Proportion  0.99999999975168685 0.99999999978134180
##                                    Comp.21             Comp.22
## Standard deviation     0.26124689065291201 0.24268515428602053
## Proportion of Variance 0.00000000002770536 0.00000000002390826
## Cumulative Proportion  0.99999999980904719 0.99999999983295540
##                                    Comp.23             Comp.24
## Standard deviation     0.22171521990253179 0.20801372997483372
## Proportion of Variance 0.00000000001995504 0.00000000001756489
## Cumulative Proportion  0.99999999985291044 0.99999999987047528
##                                    Comp.25             Comp.26
## Standard deviation     0.20211465886349297 0.18693097128036817
## Proportion of Variance 0.00000000001658277 0.00000000001418483
## Cumulative Proportion  0.99999999988705812 0.99999999990124289
##                                    Comp.27             Comp.28
## Standard deviation     0.17290917388362265 0.16066106848673875
## Proportion of Variance 0.00000000001213662 0.00000000001047811
## Cumulative Proportion  0.99999999991337951 0.99999999992385769
##                                     Comp.29              Comp.30
## Standard deviation     0.147023293347794987 0.140148582208489580
## Proportion of Variance 0.000000000008774731 0.000000000007973315
## Cumulative Proportion  0.999999999932632333 0.999999999940605733
##                                     Comp.31              Comp.32
## Standard deviation     0.135889778406887790 0.128991346351518071
## Proportion of Variance 0.000000000007496095 0.000000000006754336
## Cumulative Proportion  0.999999999948101737 0.999999999954856111
##                                     Comp.33              Comp.34
## Standard deviation     0.128006737313422048 0.119199617834721161
## Proportion of Variance 0.000000000006651616 0.000000000005767814
## Cumulative Proportion  0.999999999961507791 0.999999999967275510
##                                     Comp.35              Comp.36
## Standard deviation     0.104754398846136965 0.099828500940133430
## Proportion of Variance 0.000000000004454573 0.000000000004045485
## Cumulative Proportion  0.999999999971730169 0.999999999975775600
##                                     Comp.37             Comp.38
## Standard deviation     0.095059916733053390 0.08581102150772275
## Proportion of Variance 0.000000000003668228 0.00000000000298915
## Cumulative Proportion  0.999999999979443888 0.99999999998243294
##                                     Comp.39              Comp.40
## Standard deviation     0.082233772051430454 0.079767919580845184
## Proportion of Variance 0.000000000002745124 0.000000000002582962
## Cumulative Proportion  0.999999999985178079 0.999999999987761123
##                                     Comp.41              Comp.42
## Standard deviation     0.068304580968538420 0.063640596806720004
## Proportion of Variance 0.000000000001893918 0.000000000001644107
## Cumulative Proportion  0.999999999989654942 0.999999999991299071
##                                     Comp.43              Comp.44
## Standard deviation     0.061998692113397445 0.060650152534340362
## Proportion of Variance 0.000000000001560366 0.000000000001493225
## Cumulative Proportion  0.999999999992859490 0.999999999994352740
##                                     Comp.45               Comp.46
## Standard deviation     0.055346835930788860 0.0449130280785412991
## Proportion of Variance 0.000000000001243504 0.0000000000008188535
## Cumulative Proportion  0.999999999995596189 0.9999999999964150899
##                                      Comp.47               Comp.48
## Standard deviation     0.0355800705233606102 0.0332535619482823772
## Proportion of Variance 0.0000000000005138959 0.0000000000004488879
## Cumulative Proportion  0.9999999999969289011 0.9999999999973778753
##                                      Comp.49               Comp.50
## Standard deviation     0.0316380977814122513 0.0266110230503557900
## Proportion of Variance 0.0000000000004063331 0.0000000000002874648
## Cumulative Proportion  0.9999999999977842169 0.9999999999980716536
##                                      Comp.51               Comp.52
## Standard deviation     0.0257761301732649939 0.0250733893359092568
## Proportion of Variance 0.0000000000002697099 0.0000000000002552041
## Cumulative Proportion  0.9999999999983413268 0.9999999999985965671
##                                      Comp.53               Comp.54
## Standard deviation     0.0211825838371817246 0.0203893699829955193
## Proportion of Variance 0.0000000000001821459 0.0000000000001687598
## Cumulative Proportion  0.9999999999987786437 0.9999999999989473976
##                                      Comp.55               Comp.56
## Standard deviation     0.0189685271037664770 0.0184007922202056827
## Proportion of Variance 0.0000000000001460591 0.0000000000001374468
## Cumulative Proportion  0.9999999999990935029 0.9999999999992309485
##                                      Comp.57              Comp.58
## Standard deviation     0.0180061340420249634 0.016434618139512809
## Proportion of Variance 0.0000000000001316141 0.000000000000109643
## Cumulative Proportion  0.9999999999993626210 0.999999999999472200
##                                       Comp.59                Comp.60
## Standard deviation     0.01407740851977356059 0.01333926086919276326
## Proportion of Variance 0.00000000000008044646 0.00000000000007223124
## Cumulative Proportion  0.99999999999955269114 0.99999999999962485564
##                                       Comp.61                Comp.62
## Standard deviation     0.01249874420348607088 0.01231012946445568003
## Proportion of Variance 0.00000000000006341533 0.00000000000006151581
## Cumulative Proportion  0.99999999999968824937 0.99999999999974986675
##                                       Comp.63                Comp.64
## Standard deviation     0.01004509430118860625 0.00958563444475138068
## Proportion of Variance 0.00000000000004096091 0.00000000000003729952
## Cumulative Proportion  0.99999999999979083398 0.99999999999982802645
##                                     Comp.65              Comp.66
## Standard deviation     0.009484161994080772 0.009275901463541736
## Proportion of Variance 0.000000000000036514 0.000000000000034928
## Cumulative Proportion  0.999999999999864553 0.999999999999899525
##                                       Comp.67                Comp.68
## Standard deviation     0.00858172819813234067 0.00679645489967816655
## Proportion of Variance 0.00000000000002989586 0.00000000000001875108
## Cumulative Proportion  0.99999999999992938982 0.99999999999994815258
##                                       Comp.69                Comp.70
## Standard deviation     0.00667004244791527534 0.00616756610960241414
## Proportion of Variance 0.00000000000001806004 0.00000000000001544149
## Cumulative Proportion  0.99999999999996624922 0.99999999999998168132
##                                       Comp.71                 Comp.72
## Standard deviation     0.00593240950454403202 0.002556749582798334853
## Proportion of Variance 0.00000000000001428643 0.000000000000002653615
## Cumulative Proportion  0.99999999999999600320 0.999999999999998556710
##                                       Comp.73                    Comp.74
## Standard deviation     0.00186514536970121074 0.000153556135151463349310
## Proportion of Variance 0.00000000000000141217 0.000000000000000009571849
## Cumulative Proportion  1.00000000000000000000 1.000000000000000000000000
##                                                                  Comp.75
## Standard deviation     0.00000000000000022204460492502100358875455343721
## Proportion of Variance 0.00000000000000000000000000000000000000002001437
## Cumulative Proportion  1.00000000000000000000000000000000000000000000000
screeplot(pc)

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:corrplot':
## 
##     corrplot
## The following object is masked from 'package:stats':
## 
##     loadings
pcreg=pcr(price~., data = KC_Data, scale=TRUE)
summary(pcreg)
## Data:    X dimension: 21613 74 
##  Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 74
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        7.549    11.25    14.05    16.45    18.78    20.94    22.94
## price   37.276    55.05    55.59    55.66    60.53    60.60    60.60
##        8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X        24.91    26.85     28.76     30.63     32.40     34.08     35.72
## price    60.99    61.14     61.14     61.27     61.51     61.69     61.89
##        15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## X         37.34     38.93     40.51     42.07     43.58     45.08
## price     61.92     61.96     61.98     62.33     62.81     63.10
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps
## X         46.57     48.05     49.52     50.96     52.39     53.82
## price     63.52     63.64     63.67     63.69     63.80     63.84
##        27 comps  28 comps  29 comps  30 comps  31 comps  32 comps
## X         55.22     56.62      58.0     59.39     60.77     62.14
## price     63.97     64.04      64.1     64.11     64.12     64.12
##        33 comps  34 comps  35 comps  36 comps  37 comps  38 comps
## X         63.50     64.86     66.22     67.57     68.92     70.26
## price     64.12     64.12     64.12     64.19     64.19     64.24
##        39 comps  40 comps  41 comps  42 comps  43 comps  44 comps
## X         71.60     72.94     74.27     75.60     76.91     78.22
## price     64.25     64.25     64.26     64.27     64.27     64.28
##        45 comps  46 comps  47 comps  48 comps  49 comps  50 comps
## X         79.53     80.80     82.05     83.28     84.49     85.68
## price     64.31     64.31     64.34     64.51     64.73     64.76
##        51 comps  52 comps  53 comps  54 comps  55 comps  56 comps
## X         86.84     87.99     89.12     90.22     91.31     92.32
## price     65.22     65.69     65.71     65.72     65.75     65.75
##        57 comps  58 comps  59 comps  60 comps  61 comps  62 comps
## X         93.29     94.21     95.04     95.82     96.57     97.23
## price     65.79     65.92     66.11     66.52     67.46     67.47
##        63 comps  64 comps  65 comps  66 comps  67 comps  68 comps
## X         97.84     98.35     98.76     99.13     99.46     99.73
## price     67.51     67.67     69.09     69.10     69.61     69.70
##        69 comps  70 comps  71 comps  72 comps  73 comps  74 comps
## X         99.99    100.00    100.00    100.00    100.00    100.00
## price     69.87     69.87     69.87     69.87     69.87     69.87
pcr.fit=pcr(price~., data = KC_Data, validation="CV") 
# 10-Fold CV  # Use validation="LOO" for LOOCV
summary(pcr.fit)
## Data:    X dimension: 21613 74 
##  Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 74
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          367371   365792   365808   264296   258200   257223   256718
## adjCV       367371   365790   365805   264293   258193   257214   256706
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      248872   248760   248453    248044    245099    245087    245088
## adjCV   248859   248746   248440    248029    245083    245070    245072
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV       245056    244850    244575    244464    244442    243345
## adjCV    245040    244832    244555    244444    244421    243282
##        20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## CV       242855    241967    241258    241176    238450    238181
## adjCV    242831    241944    241233    241152    238408    238152
##        26 comps  27 comps  28 comps  29 comps  30 comps  31 comps
## CV       238138    237808    235342    235354    229745    229022
## adjCV    238110    237782    235308    235328    229679    228984
##        32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## CV       227380    224744    222795    221196    220867    220401
## adjCV    227513    224685    222755    221131    220820    220356
##        38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## CV       220193    219973    219764    219435    216609    215980
## adjCV    220160    219928    219713    219408    216357    215825
##        44 comps  45 comps  46 comps  47 comps  48 comps  49 comps
## CV       213113    212486    212515    212492    211504    211340
## adjCV    213016    212414    212441    212515    211347    211271
##        50 comps  51 comps  52 comps  53 comps  54 comps  55 comps
## CV       210079    206451    206433    206372    206543    206518
## adjCV    211166    205616    206255    206209    206415    206399
##        56 comps  57 comps  58 comps  59 comps  60 comps  61 comps
## CV       206679    207151    207174    207003    207014    207178
## adjCV    206546    207086    206968    206796    206881    207079
##        62 comps  63 comps  64 comps  65 comps  66 comps  67 comps
## CV       207301    207277    207252    207299    207817    208450
## adjCV    207070    207046    207050    207097    207533    208150
##        68 comps  69 comps  70 comps  71 comps  72 comps      73 comps
## CV       211462    211444    211939    213362  11041426  718949859129
## adjCV    211122    210891    211309    212606  10474829  682043450957
##             74 comps
## CV     2653260791041
## adjCV  2517118487482
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      88.3719  99.9074    99.98    99.99    99.99   100.00   100.00
## price   0.8676   0.8734    48.27    50.65    51.03    51.25    54.19
##        8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       100.00   100.00    100.00    100.00     100.0    100.00    100.00
## price    54.24    54.36     54.51     55.59      55.6     55.61     55.62
##        15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## X         100.0    100.00    100.00    100.00    100.00    100.00
## price      55.7     55.81     55.86     55.87     56.29     56.45
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     56.78     57.04     57.07     58.04     58.15     58.17
##        27 comps  28 comps  29 comps  30 comps  31 comps  32 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     58.29     59.18     59.19     61.16     61.39     61.97
##        33 comps  34 comps  35 comps  36 comps  37 comps  38 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     62.83     63.46     64.02     64.13     64.28     64.39
##        39 comps  40 comps  41 comps  42 comps  43 comps  44 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     64.51     64.56     64.67     65.73     65.99     66.76
##        45 comps  46 comps  47 comps  48 comps  49 comps  50 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     66.98     66.98     67.04     67.43     67.45     67.57
##        51 comps  52 comps  53 comps  54 comps  55 comps  56 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     69.06     69.12     69.21     69.22     69.22     69.23
##        57 comps  58 comps  59 comps  60 comps  61 comps  62 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     69.24     69.35     69.37     69.38     69.38     69.43
##        63 comps  64 comps  65 comps  66 comps  67 comps  68 comps
## X        100.00    100.00    100.00    100.00     100.0     100.0
## price     69.43     69.43     69.46     69.49      69.5      69.5
##        69 comps  70 comps  71 comps  72 comps  73 comps  74 comps
## X        100.00    100.00    100.00    100.00    100.00    100.00
## price     69.73     69.79     69.85     69.87     69.87     69.88

Reports Square Root of MSE and % Variance Explained

to plot the Square Root of MSE# Use val.type=“MSEP” instead for the MSE

PCA

The approach of the main components is another statistical technique to find the variables of the model that explain the maximum variance of the dependent variable.

fit1 = pcr(price ~., data = KC_Data, validation="CV", ncomp=1)
fit2 = pcr(price ~., data = KC_Data, validation="CV", ncomp=2)
fit3 = pcr(price ~., data = KC_Data, validation="CV", ncomp=3)
summary(fit1)
## Data:    X dimension: 21613 74 
##  Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 1
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps
## CV          367371   365799
## adjCV       367371   365797
## 
## TRAINING: % variance explained
##        1 comps
## X      88.3719
## price   0.8676
summary(fit2)
## Data:    X dimension: 21613 74 
##  Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 2
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps
## CV          367371   365789   365804
## adjCV       367371   365787   365801
## 
## TRAINING: % variance explained
##        1 comps  2 comps
## X      88.3719  99.9074
## price   0.8676   0.8734
summary(fit3)
## Data:    X dimension: 21613 74 
##  Y dimension: 21613 1
## Fit method: svdpc
## Number of components considered: 3
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps
## CV          367371   365817   365880   264358
## adjCV       367371   365813   365873   264351
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps
## X      88.3719  99.9074    99.98
## price   0.8676   0.8734    48.27

The optimal value M of the number of explanatory variables is 27. However, we note that from 24 variables, the RMSE, as well as variance of the model are close to the optimal value. As a result, M = 24 will be used to measure our MSE on the test set.

pcr.fit1 = pcr(price ~ ., data=KC_Data[z,], validation="CV", ncomp= 1)
pcr.pred1 = predict(pcr.fit1,  data=KC_Data[-z,],ncomp= 1)
mean((pcr.pred1[2] - price[-z]) ^2)
## [1] 122998024054
pcr.fit2 = pcr(price ~ ., data=KC_Data[z,], validation="CV", ncomp= 2)
pcr.pred2 = predict(pcr.fit2,  data=KC_Data[-z,],ncomp= 2)
mean((pcr.pred2[2] - price[-z]) ^2)
## [1] 122994791969
pcr.fit3= pcr(price ~ ., data=KC_Data[z,], validation="CV", ncomp= 3)
pcr.pred3 = predict(pcr.fit3,  data=KC_Data[-z,],ncomp= 3)
mean((pcr.pred3[2] - price[-z]) ^2)
## [1] 151170678033

Regression Tree

The last technique used will be regression trees. Let’s start with a single regression tree.

library(tree)
tree.fit = tree(price~., data = KC_Data[z,])
summary(tree.fit)
## 
## Regression tree:
## tree(formula = price ~ ., data = KC_Data[z, ])
## Variables actually used in tree construction:
## [1] "grade"         "sqft_living"   "yr_built"      "sqft_living15"
## [5] "view"          "waterfront"   
## Number of terminal nodes:  12 
## Residual mean deviance:  52360000000 = 565200000000000 / 10790 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2473000  -129200   -27920        0    94260  2376000
plot(tree.fit)
text(tree.fit, pretty=0)

test = KC_Data[-z,]
tree.pred = predict(tree.fit, test)
mean((test$price - tree.pred)^2)
## [1] 55551085715

Using cross validation, it is possible to determine an optimal tree size. R provides a function for this cv.tree. The RMSE is much better than for a single tree, which seems logical. This method gets the best score on the RMSE test.

Stepwise Variable Selection

null = lm( price ~1-id -lat -long -date-zipcode, data=KC_Data )
full = lm( price ~.-id -lat -long -date-zipcode, data=KC_Data )
step( null, scope=list(lower=null, upper=full), direction="both" )
## Start:  AIC=553903.4
## price ~ 1 - id - lat - long - date - zipcode
## 
##                 Df        Sum of Sq              RSS    AIC
## + grade         11 1515998291373849 1400649369462915 538072
## + sqft_living    1 1437514676706840 1479132984129924 539231
## + sqft_above     1 1069562647275666 1847085013561098 544032
## + bathrooms     29 1042325216709221 1874322444127544 544404
## + sqft_living15  1  999426441953220 1917221218883545 544838
## + view           4  490794829832554 2425852831004210 549929
## + bedrooms      12  310713435849958 2605934224986807 551493
## + sqft_basement  1  305870690879498 2610776969957267 551511
## + floors         5  246868011501070 2669779649335694 552002
## + waterfront     1  206883479315148 2709764181521616 552315
## + yr_renovated   1   46630304823592 2870017356013172 553557
## + sqft_lot       1   23444178016610 2893203482820154 553731
## + sqft_lot15     1   19830045706864 2896817615129900 553758
## + condition      4   20054859680333 2896592801156432 553762
## + yr_built       1    8499220511338 2908148440325426 553842
## <none>                              2916647660836764 553903
## 
## Step:  AIC=538072.3
## price ~ grade
## 
##                 Df        Sum of Sq              RSS    AIC
## + sqft_living    1  180694699306282 1219954670156634 535089
## + yr_built       1  153374152309404 1247275217153512 535568
## + view           4  150028504263395 1250620865199520 535632
## + bathrooms     29  133859404646558 1266789964816357 535959
## + sqft_basement  1  126070360309279 1274579009153636 536036
## + waterfront     1  109331367950298 1291318001512617 536318
## + condition      4   53518746069880 1347130623393036 537238
## + sqft_living15  1   48264268501300 1352385100961616 537316
## + bedrooms      12   41194341905811 1359455027557104 537451
## + yr_renovated   1   39028675971590 1361620693491325 537464
## + floors         5   38704216806119 1361945152656796 537477
## + sqft_above     1   31004262115021 1369645107347894 537591
## + sqft_lot15     1     756750136712 1399892619326204 538063
## <none>                              1400649369462915 538072
## + sqft_lot       1      76662760726 1400572706702189 538073
## - grade         11 1515998291373849 2916647660836764 553903
## 
## Step:  AIC=535089.1
## price ~ grade + sqft_living
## 
##                 Df       Sum of Sq              RSS    AIC
## + yr_built       1 151708080689686 1068246589466947 532221
## + view           4 114412915649021 1105541754507613 532969
## + waterfront     1  95833172270098 1124121497886536 533323
## + bathrooms     29  71670494822023 1148284175334610 533839
## + floors         5  41849081220236 1178105588936398 534345
## + condition      4  38625984614258 1181328685542376 534402
## + yr_renovated   1  28355840004071 1191598830152563 534583
## + sqft_above     1  28274019164135 1191680650992499 534584
## + sqft_basement  1  28274019164135 1191680650992498 534584
## + bedrooms      12  19714951308530 1200239718848104 534761
## + sqft_lot15     1   7052732171683 1212901937984951 534966
## + sqft_lot       1   3694294555220 1216260375601413 535026
## + sqft_living15  1    775511043807 1219179159112827 535077
## <none>                             1219954670156634 535089
## - sqft_living    1 180694699306281 1400649369462915 538072
## - grade         11 259178313973290 1479132984129924 539231
## 
## Step:  AIC=532221
## price ~ grade + sqft_living + yr_built
## 
##                 Df       Sum of Sq              RSS    AIC
## + waterfront     1  82623310624817  985623278842130 530483
## + view           4  80129938355435  988116651111512 530544
## + bathrooms     29  51216977927506 1017029611539442 531217
## + floors         5  16647631590724 1051598957876224 531892
## + bedrooms      12  14568301904934 1053678287562014 531948
## + condition      4   5278532170358 1062968057296589 532122
## + sqft_lot15     1   4870046900518 1063376542566429 532124
## + yr_renovated   1   4582325623914 1063664263843033 532130
## + sqft_above     1   3960207429901 1064286382037046 532143
## + sqft_basement  1   3960207429901 1064286382037046 532143
## + sqft_lot       1   2866238469399 1065380350997548 532165
## + sqft_living15  1   1919196714053 1066327392752894 532184
## <none>                             1068246589466947 532221
## - yr_built       1 151708080689686 1219954670156634 535089
## - sqft_living    1 179028627686564 1247275217153512 535568
## - grade         11 317877395776347 1386123985243294 537829
## 
## Step:  AIC=530483.2
## price ~ grade + sqft_living + yr_built + waterfront
## 
##                 Df       Sum of Sq              RSS    AIC
## + bathrooms     29  48547045149738  937076233692392 529449
## + view           4  24633883964556  960989394877575 529944
## + floors         5  15148510618529  970474768223602 530158
## + bedrooms      12  10207511688104  975415767154027 530282
## + sqft_lot15     1   5292296694224  980330982147906 530369
## + condition      4   5141419397231  980481859444899 530378
## + sqft_lot       1   2874249384382  982749029457748 530422
## + sqft_above     1   2819403938373  982803874903757 530423
## + sqft_basement  1   2819403938373  982803874903757 530423
## + yr_renovated   1   2065732697952  983557546144178 530440
## + sqft_living15  1   1666004338394  983957274503736 530449
## <none>                              985623278842130 530483
## - waterfront     1  82623310624817 1068246589466947 532221
## - yr_built       1 138498219044405 1124121497886536 533323
## - sqft_living    1 166559377614045 1152182656456176 533856
## - grade         11 302128454010880 1287751732853010 536240
## 
## Step:  AIC=529449.5
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms
## 
##                 Df       Sum of Sq              RSS    AIC
## + view           4  21681460837136  915394772855256 528952
## + floors         5  10106227005061  926970006687331 529225
## + bedrooms      12   9386236979090  927689996713302 529256
## + sqft_living15  1   5113143746029  931963089946363 529333
## + condition      4   5089508845984  931986724846408 529340
## + sqft_lot15     1   4657386381207  932418847311185 529344
## + sqft_lot       1   2578409722102  934497823970290 529392
## + sqft_above     1   2024780875953  935051452816439 529405
## + sqft_basement  1   2024780875953  935051452816439 529405
## + yr_renovated   1   1182796665413  935893437026979 529424
## <none>                              937076233692392 529449
## - bathrooms     29  48547045149738  985623278842130 530483
## - sqft_living    1  66542023207909 1003618256900301 530930
## - waterfront     1  79953377847050 1017029611539442 531217
## - yr_built       1 120261432140348 1057337665832740 532057
## - grade         11 256039295387065 1193115529079457 534648
## 
## Step:  AIC=528951.6
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view
## 
##                 Df       Sum of Sq              RSS    AIC
## + floors         5   9818828786067  905575944069190 528728
## + bedrooms      12   7826233527895  907568539327361 528790
## + sqft_lot15     1   4790687193391  910604085661865 528840
## + condition      4   4811616528834  910583156326422 528846
## + sqft_living15  1   3228266511306  912166506343950 528877
## + sqft_lot       1   2714879273255  912679893582002 528889
## + yr_renovated   1    941416476222  914453356379034 528931
## + sqft_above     1    405710042621  914989062812636 528944
## + sqft_basement  1    405710042621  914989062812636 528944
## <none>                              915394772855256 528952
## - view           4  21681460837136  937076233692392 529449
## - waterfront     1  27598647185030  942993420040287 529592
## - bathrooms     29  45594622022318  960989394877575 529944
## - sqft_living    1  59996076321357  975390849176614 530322
## - yr_built       1 103996682099375 1019391454954632 531275
## - grade         11 237400307786424 1152795080641680 533913
## 
## Step:  AIC=528728.5
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors
## 
##                 Df       Sum of Sq              RSS    AIC
## + bedrooms      12   7105445557653  898470498511537 528582
## + sqft_living15  1   5387516463004  900188427606186 528602
## + condition      4   5193620490213  900382323578976 528612
## + sqft_lot15     1   4084561901652  901491382167538 528633
## + sqft_lot       1   2273897171343  903302046897847 528676
## + sqft_above     1   1483867502440  904092076566750 528695
## + sqft_basement  1   1483867502440  904092076566750 528695
## + yr_renovated   1    716443965073  904859500104117 528713
## <none>                              905575944069190 528728
## - floors         5   9818828786067  915394772855256 528952
## - view           4  21394062618141  926970006687331 529225
## - waterfront     1  27304408756537  932880352825727 529368
## - bathrooms     29  40784956144341  946360900213531 529623
## - sqft_living    1  65325960345477  970901904414667 530232
## - yr_built       1  92891335146363  998467279215552 530837
## - grade         11 219049590302479 1124625534371669 533389
## 
## Step:  AIC=528582.2
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors + bedrooms
## 
##                 Df       Sum of Sq              RSS    AIC
## + sqft_living15  1   5415042304870  893055456206667 528454
## + condition      4   5344486757922  893126011753615 528461
## + sqft_lot15     1   4754063487350  893716435024187 528470
## + sqft_lot       1   2752248141423  895718250370114 528518
## + sqft_above     1   1194595223121  897275903288415 528555
## + sqft_basement  1   1194595223121  897275903288415 528555
## + yr_renovated   1    614866368578  897855632142958 528569
## <none>                              898470498511537 528582
## - bedrooms      12   7105445557653  905575944069190 528728
## - floors         5   9098040815824  907568539327361 528790
## - view           4  20038851432438  918509349943975 529051
## - waterfront     1  26025087308756  924495585820293 529197
## - bathrooms     29  40517305162424  938987803673961 529478
## - sqft_living    1  67528336180621  965998834692158 530146
## - yr_built       1  93079351195127  991549849706663 530711
## - grade         11 196704257145249 1095174755656786 532839
## 
## Step:  AIC=528453.6
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors + bedrooms + sqft_living15
## 
##                 Df       Sum of Sq              RSS    AIC
## + sqft_lot15     1   5463948974416  887591507232252 528323
## + condition      4   5673650983132  887381805223536 528324
## + sqft_lot       1   2884571790546  890170884416121 528386
## + sqft_above     1   2308867245084  890746588961583 528400
## + sqft_basement  1   2308867245084  890746588961583 528400
## + yr_renovated   1    768008885086  892287447321581 528437
## <none>                              893055456206667 528454
## - sqft_living15  1   5415042304870  898470498511537 528582
## - bedrooms      12   7132971399519  900188427606186 528602
## - floors         5  11175926150686  904231382357353 528712
## - view           4  17731994295014  910787450501681 528870
## - waterfront     1  26427496494078  919482952700745 529082
## - bathrooms     29  43292496060430  936347952267097 529419
## - sqft_living    1  45487147406557  938542603613224 529525
## - yr_built       1  95373569503914  988429025710581 530645
## - grade         11 162550866405222 1055606322611889 532046
## 
## Step:  AIC=528322.9
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors + bedrooms + sqft_living15 + sqft_lot15
## 
##                 Df       Sum of Sq              RSS    AIC
## + condition      4   5693535833180  881897971399072 528192
## + sqft_above     1   1674115123177  885917392109075 528284
## + sqft_basement  1   1674115123177  885917392109075 528284
## + yr_renovated   1    872833192991  886718674039261 528304
## <none>                              887591507232252 528323
## + sqft_lot       1      6562631015  887584944601236 528325
## - sqft_lot15     1   5463948974415  893055456206667 528454
## - sqft_living15  1   6124927791935  893716435024187 528470
## - bedrooms      12   7863846923150  895455354155401 528490
## - floors         5  10503842033069  898095349265321 528567
## - view           4  17545576953667  905137084185918 528738
## - waterfront     1  26711327481605  914302834713856 528962
## - bathrooms     29  43308996904499  930900504136751 529295
## - sqft_living    1  48766682014856  936358189247108 529477
## - yr_built       1  90655114375840  978246621608092 530423
## - grade         11 159512651251593 1047104158483844 531873
## 
## Step:  AIC=528191.8
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition
## 
##                 Df       Sum of Sq              RSS    AIC
## + yr_renovated   1   1759724924523  880138246474549 528151
## + sqft_above     1   1425426988538  880472544410534 528159
## + sqft_basement  1   1425426988538  880472544410534 528159
## <none>                              881897971399072 528192
## + sqft_lot       1      2658947534  881895312451538 528194
## - condition      4   5693535833180  887591507232252 528323
## - sqft_lot15     1   5483833824464  887381805223536 528324
## - sqft_living15  1   6470784886722  888368756285794 528348
## - bedrooms      12   8053239481990  889951210881062 528364
## - floors         5  10932027111154  892829998510226 528448
## - view           4  17273526720127  899171498119199 528603
## - waterfront     1  26884817025307  908782788424379 528839
## - bathrooms     29  43043868416491  924941839815563 529164
## - sqft_living    1  47602802198674  929500773597746 529326
## - yr_built       1  73552975079277  955450946478349 529921
## - grade         11 160331977462416 1042229948861488 531780
## 
## Step:  AIC=528150.7
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition + 
##     yr_renovated
## 
##                 Df       Sum of Sq              RSS    AIC
## + sqft_above     1   1491801287431  878646445187118 528116
## + sqft_basement  1   1491801287431  878646445187118 528116
## <none>                              880138246474549 528151
## + sqft_lot       1      2341940559  880135904533990 528153
## - yr_renovated   1   1759724924523  881897971399072 528192
## - sqft_lot15     1   5661119763880  885799366238429 528287
## - condition      4   6580427564712  886718674039261 528304
## - sqft_living15  1   6756876226458  886895122701008 528314
## - bedrooms      12   7917812194678  888056058669228 528320
## - floors         5  10815947025497  890954193500046 528405
## - view           4  16882158445898  897020404920448 528553
## - waterfront     1  26151874192590  906290120667140 528782
## - bathrooms     29  42352385379146  922490631853695 529108
## - sqft_living    1  47098953443858  927237199918407 529275
## - yr_built       1  58834101961662  938972348436211 529547
## - grade         11 160400928926805 1040539175401354 531747
## 
## Step:  AIC=528116
## price ~ grade + sqft_living + yr_built + waterfront + bathrooms + 
##     view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition + 
##     yr_renovated + sqft_above
## 
##                 Df       Sum of Sq              RSS    AIC
## <none>                              878646445187118 528116
## + sqft_lot       1       242760400  878646202426718 528118
## - sqft_above     1   1491801287431  880138246474549 528151
## - yr_renovated   1   1826099223416  880472544410534 528159
## - sqft_lot15     1   5051397707436  883697842894554 528238
## - condition      4   6334737214381  884981182401499 528263
## - bedrooms      12   7498755819374  886145201006492 528276
## - sqft_living15  1   7654660624978  886301105812096 528301
## - floors         5  12077866478589  890724311665707 528401
## - view           4  14855500545413  893501945732531 528470
## - waterfront     1  26483767407463  905130212594581 528756
## - sqft_living    1  38051576797683  916698021984801 529030
## - bathrooms     29  42127789694189  920774234881307 529070
## - yr_built       1  55377219146088  934023664333206 529435
## - grade         11 161434776383310 1040081221570428 531739
## 
## Call:
## lm(formula = price ~ grade + sqft_living + yr_built + waterfront + 
##     bathrooms + view + floors + bedrooms + sqft_living15 + sqft_lot15 + 
##     condition + yr_renovated + sqft_above, data = KC_Data)
## 
## Coefficients:
##   (Intercept)         grade3         grade4         grade5         grade6  
##  5568144.3097   -107186.1901   -182465.1101   -214647.4367   -163015.2988  
##        grade7         grade8         grade9        grade10        grade11  
##   -70186.6592     20423.5127    160733.3946    320884.6820    539592.3263  
##       grade12        grade13    sqft_living       yr_built    waterfront1  
##   919927.1871   1713514.6559       137.2528     -2813.0057    507112.8322  
##  bathrooms0.5  bathrooms0.75     bathrooms1  bathrooms1.25   bathrooms1.5  
##   -28719.8866     28011.3156     60561.0428     87591.2845     64491.6913  
## bathrooms1.75     bathrooms2  bathrooms2.25   bathrooms2.5  bathrooms2.75  
##    67274.1105     72280.2867     86399.9342     64375.3564     78939.6466  
##    bathrooms3  bathrooms3.25   bathrooms3.5  bathrooms3.75     bathrooms4  
##   111367.9475    175320.7335    140169.9891    264015.2334    235633.4912  
## bathrooms4.25   bathrooms4.5  bathrooms4.75     bathrooms5  bathrooms5.25  
##   335755.4625    271945.5269    586260.9317    436618.0757    564653.3075  
##  bathrooms5.5  bathrooms5.75     bathrooms6  bathrooms6.25   bathrooms6.5  
##   737604.9832    574355.3117   1092566.1753    655461.6713    156319.0105  
## bathrooms6.75   bathrooms7.5  bathrooms7.75     bathrooms8          view1  
##   411931.1983    109867.0600   3702175.1404   1682607.2764    106758.5586  
##         view2          view3          view4      floors1.5        floors2  
##    44288.8029     87795.0452    216820.0869     22824.9257     30139.7192  
##     floors2.5        floors3      floors3.5      bedrooms1      bedrooms2  
##   147956.3149    139949.6685    284704.1607     35787.6779     51172.3759  
##     bedrooms3      bedrooms4      bedrooms5      bedrooms6      bedrooms7  
##    12984.5021    -13833.0589    -12312.7246    -69766.3604   -159800.1439  
##     bedrooms8      bedrooms9     bedrooms10     bedrooms11     bedrooms33  
##    56420.7378   -136530.6435   -132297.6888   -236177.1113    178757.7085  
## sqft_living15     sqft_lot15     condition2     condition3     condition4  
##       47.7290        -0.5858     27742.4250     51660.3490     71141.0769  
##    condition5   yr_renovated     sqft_above  
##   118374.8914        24.8020       -27.4211

here is the final model given by stepwise selection-other models were in between were not included to save space

best model According to the stepwise function:

Best_1=lm(formula = price ~ grade + sqft_living + yr_built + waterfront +  bathrooms + view + floors + bedrooms + sqft_living15 + sqft_lot15 + condition + yr_renovated + sqft_above, data = KC_Data[z,])

cv.error=fitted.values(Best_1, KC_Data)
mean((cv.error[-z]-price[-z])^2)
## [1] 229193908561

Conclusions

Stepwise method has a MSPE of 245,760,760,973. Tree Regression has a MSPE of 61,612,773,589. If we use only one principle component, the MSPE will be 134,247,905,869. MSPE for two and three principle components will be 134,206,128,747 and 143,132,709,054, respectively. Lasso has a cv of MSPE 44,566,731,235. Ridge has a cv of MSPE 41,645,084,735. The order of MSPE from least to greatest is as follows:

Ridge < Lasso < Tree < PCR(3) < PCR(2) < PCR(1) < Stepwise 

Based solely on MSPE, the ridge regression appears to give us the best MSE using cross-validation using the validation set approach.

As we illustrated in the beginning of our project, we have a very large data set containing a vast amount of variables and total number of data values. More specifically, we have 14 variables, and 21613 rows of data, so the MSE from cross-validation is considerably larger than what we have seen in some of example and models we used in class.

In our data set, ridge regression has the smallest MSPE. Because the Regression model uses variable shrinkage and eliminates some of the useless variables, we will choose Ridge Regression as our final model; it also has the lowest MSPE giving us more reason to select this model. Other than ridge regression, the lasso and tree models are also considerable.

We won’t choose stepwise or pcr because all of those models have a much larger MSPE than lasso, tree and ridge models. We did not use logistic regression for this analysis because our dependent variable in this case is a numeric variable.