Multiple Linear Regression for the Prediction of House Prices

Introduction

Overview

House prices fluctuate from time to time, and it has become difficult to estimate them, especially during the pandemic.

 

Regression models are powerful tools for describing relationships between variables: we can observe how a dependent variable changes as independent variables change. Multiple Linear Regression (MLR) estimates the relationship between two or more independent variables and one dependent variable. Analysis of Variance (ANOVA) can be viewed as a special case of MLR; it estimates how a quantitative dependent variable varies based on the levels of categorical independent variables.

 

This project focuses on exploring the house sales dataset, using regression models to examine the relationships between its variables and to predict future house prices.

Problem Statement

House prices vary based on the condition of the house itself and its environment. From the number of bedrooms to the location of the house, any variable might be the key factor that affects the price the most. This project will use ANOVA and MLR to determine how house attributes relate to the sold price and to predict house prices.

Identified Questions

  1. Is there a significant difference in the house sold price based on the house variables?
  2. Can we predict the house price with the given attributes of each record of house sold?

Objectives

  1. To have a basic understanding of the dataset.
  2. To investigate whether the condition rating, renovation status and city affect the house sold price.
  3. To predict future house prices with at least 70% accuracy.

Dataset Information

The table below contains basic information of the dataset:

Title: KC_Housesales_Data
Year: 2018
Source: https://www.kaggle.com/swathiachath/kc-housesales-data?select=kc_house_data.csv
Purpose: Predict house sale prices using Multiple Linear Regression
Dimension: 21 columns x 21,613 rows
Description: The dataset was obtained from Kaggle. It contains house sold prices and house details for King County, an area in Washington state, US, from May 2014 to May 2015.

The table below describes the 21 variables of the raw dataset:

No. | Variable | Description | Datatype
1 | id | - | Numeric
2 | date | Date the house was sold | String
3 | price | Price sold | Numeric
4 | bedrooms | Number of bedrooms | Numeric
5 | bathrooms | Number of bathrooms per bedroom | Numeric
6 | sqft_living | House square footage | Numeric
7 | sqft_lot | Lot square footage | Numeric
8 | floors | Floor level | Numeric
9 | waterfront | Whether the house has a view to a waterfront | Numeric
10 | view | Whether the house has been viewed | Numeric
11 | condition | How good the house condition is: 1=poor; 5=excellent | Numeric
12 | grade | Grade given by King County: 1=poor; 5=excellent | Numeric
13 | sqft_above | Square footage of house apart from basement | Numeric
14 | sqft_basement | Basement square footage | Numeric
15 | yr_built | Year the house was built | Numeric
16 | yr_renovated | Year the house was renovated | Numeric
17 | zipcode | House zip code | Numeric
18 | lat | House latitude coordinate | Numeric
19 | long | House longitude coordinate | Numeric
20 | sqft_living15 | Living room area in 2015 (implies renovation) | Numeric
21 | sqft_lot15 | Lot area in 2015 (implies renovation) | Numeric

Methodology

Three methods are carried out on the dataset to achieve our main objectives.

Exploratory Data Analysis (EDA)

    Exploratory Data Analysis, also known as data exploration, is the first step of a data analysis: we explore the nature of the data through visualisations. EDA is a quick and simple analysis that helps to uncover underlying patterns, detect outliers or anomalies, and understand the relationships between variables, enabling us to extract the important factors for the model we are going to build.

    Plotting the raw data gives an overview of the behaviour and distribution of each variable. For instance, a histogram can be used to visualise the distribution of a numerical variable, whereas a bar chart can be used to visualise the frequency distribution of a categorical variable.

    A box plot can be used to detect outliers in a single numerical variable, whereas a scatter plot can reveal outlying observations between two numerical variables. A scatter plot is also good for visualising the relationship or correlation between two variables. Consequently, we are able to detect the important factors correlated with the dependent variable that we want to predict.

    The R package tidyverse is convenient for data subsetting, data transformation and data visualisation. The ggplot2 package is especially useful for EDA: histograms, bar charts, box plots and so on can easily be produced with the ggplot function.

    Example code for a simple box plot in base R:
rating <- c(5,2,3,2,4,3,10)
boxplot(rating, col="yellow")
title("Boxplot of ratings")

#There is an outlier in the rating data series.
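    For comparison, a minimal sketch of the same box plot drawn with ggplot2 (assuming the ggplot2 package is installed):
library(ggplot2)
ratings <- data.frame(rating = c(5, 2, 3, 2, 4, 3, 10))
ggplot(ratings, aes(y = rating)) +
  geom_boxplot(fill = "yellow") +
  ggtitle("Boxplot of ratings")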

Analysis of Variance (ANOVA)

    Analysis of variance (ANOVA) is a method used to test whether there is a significant difference in the means of two or more independent groups or factor levels. It is categorised into one-way ANOVA and two-way ANOVA - one-way ANOVA conducts a hypothesis test on one independent variable, whereas two-way ANOVA involves two or more independent variables.

    The hypothesis test for ANOVA can be stated generally as:
        \(H_{0}\) (Null Hypothesis): There is no difference in means among the independent groups.
        \(H_{1}\) (Alternative Hypothesis): At least two independent groups have means that are statistically different from each other.

    The null hypothesis can be expressed in symbols:
        \(H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = ... = \mu_{k}\)
              where \(\mu =\) group mean
                      \(k =\) number of independent groups

    The test statistic for ANOVA is the F-statistic, which follows the F-distribution:
          \(F = \frac{\sum n_{j}(\bar{X_{j}}-\bar{X})^{2}/(k-1)}{\sum\sum(X-\bar{X_{j}})^{2}/(N-k)}\)

    with the critical value found in the table of probability values of the F-distribution with degrees of freedom \(df_{1} = k-1\) and \(df_{2} = N-k\), where \(N = n_{1} + n_{2} + ... + n_{k}\). The F-test is also called the omnibus test.
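    For illustration, the critical value can also be obtained in R instead of a table; a minimal sketch with hypothetical values of \(k = 4\) groups and \(N = 40\) observations:
# hypothetical numbers: 5% significance level, k = 4 groups, N = 40 observations
qf(0.95, df1 = 4 - 1, df2 = 40 - 4)  # critical F value, approximately 2.87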

    However, there are some assumptions to check before carrying out the ANOVA test:
            1. The populations from which the samples are drawn are normally distributed.
            2. Homogeneity of variance among the groups of the independent variable - the variance for each group of samples is approximately the same.
            3. All the observations in the samples should be independent of each other.

    If the F-statistic is large enough to reject the null hypothesis, we can conclude that at least two group means differ. However, ANOVA will not tell us which two groups have different means, nor the degree of difference.

    A post-hoc Tukey’s HSD test can be conducted to examine between which specific groups the difference lies. The Tukey HSD test statistic is:
              \(HSD = \frac{M_{i}-M_{j}}{\sqrt\frac{MS_{w}}{n_{h}}}\)
              where
                  \(M_{i}-M_{j}\) is the difference between the means of a pair of groups
                  \(MS_{w}\) is the Mean Square Within and \(n_{h}\) is the number of observations per group
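    A minimal sketch of running both tests in R on made-up data (the actual analysis of the house data appears later in this report):
# toy data: three groups with different true means
set.seed(1)
toy <- data.frame(group = rep(c("A", "B", "C"), each = 10),
                  value = c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 8)))
fit_aov <- aov(value ~ group, data = toy)
summary(fit_aov)   # omnibus F-test
TukeyHSD(fit_aov)  # post-hoc pairwise comparisons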

Multiple Linear Regression Analysis (MLR)

    Multiple Linear Regression Analysis is an inferential analysis that explains the relationship between one continuous dependent variable (y) and two or more independent variables (x). The dependent variable is also known as the response variable or predicted variable, whose value y is computed based on the inputs x, whereas an independent variable is also known as an explanatory variable, regressor, or predictor. The independent variables can be continuous or categorical, where dummy coding is necessary when there are more than 2 categories.

    MLR is a predictive model that quantifies the relationship between the explanatory variables and the dependent variable in order to predict the outcome of the dependent variable. Examples of questions that could be answered by MLR analysis are -
      1. Do height and weight affect blood pressure?
      2. Can mileage and car brand predict the sale price of a used car?

    Formulation of the hypotheses: the null hypothesis and alternative hypothesis could be expressed as:
        \(H_{0}\) (Null Hypothesis): There is no relationship between the explanatory variables and the response variable.
        \(H_{1}\) (Alternative Hypothesis): At least one explanatory variable is linearly associated with the response variable.

    The hypotheses could be written as:
        \(H_{0}: \beta_{1} = \beta_{2} =...= \beta_{n} = 0\)
        \(H_{1}:\) at least one \(\beta_{j} \not= 0\) (for j = 1, 2, …, n)

    Fit a linear model: Find the best-fit regression equation between the explanatory variables and the response variable. The parameter estimates of the explanatory variables can be used to explain how a one-unit change in an explanatory variable changes the response variable. Furthermore, each explanatory variable can be examined for whether it is significantly related to the response variable by checking its p-value. If the p-value is less than 0.05, the explanatory variable is said to have a significant effect on the prediction of the response variable.

        The general equation of MLR:

           \(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{n}X_{n} + \epsilon\)

              where \(Y\) - dependent variable
                     \(X_{n}\) - independent variable
                     \(\beta_{0}\) - intercept of \(Y\) when all \(X\)s are zero
                     \(\beta_{n}\) - parameter estimate for each \(X_{n}\)
                     \(\epsilon\) - independent error term for the model
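    A minimal sketch of fitting such a model in R, using the built-in mtcars data purely for illustration:
# response: fuel consumption (mpg); predictors: weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)  # parameter estimates with their p-values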


    Model accuracy assessment: The coefficient of determination (R-squared) is used to determine how much of the variation in y (the response variable) can be explained by x (the explanatory variables). The higher the value of R-squared, the better the model fits the data. Additionally, the residual standard error (RSE) shows how far the observed values typically are from the values predicted for the response variable - a typical-sized residual or error. The smaller the RSE, the better the model.
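    Continuing the sketch above, both quantities can be read directly off the model summary:
s <- summary(fit)
s$r.squared  # coefficient of determination (R-squared)
s$sigma      # residual standard error (RSE)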

    Model validity: There are several assumptions for carrying out a multiple linear regression analysis.
      1. The residuals should be normally distributed with mean zero, be independent, and have constant variance
      2. The relationship between the explanatory variables and the response variable is linear
      3. The explanatory variables should not be highly correlated with each other - no multicollinearity in the data (see the sketch after this list)
      4. There should be no biasing data in the analysis
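    Multicollinearity (assumption 3) is commonly screened with variance inflation factors; a minimal sketch, assuming the car package is installed and fit is the model from the earlier sketch:
library(car)
vif(fit)  # VIF values well above 5-10 are usually taken as a sign of problematic collinearity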

    The diagnostic plots of the fitted model should be examined. The Residuals vs Fitted plot should show dots scattered around a horizontal line with no distinct pattern, indicating that the relationship between X and Y is linear. A Normal Q-Q plot is used to determine whether the residuals follow a straight line, indicating that they are normally distributed. The Scale-Location plot shows whether the residuals are spread evenly along the range of the predictors; this plot verifies the homoscedasticity (equal variance) of the residuals. If the plot shows a pattern of non-uniform dispersion, the variance of the residuals is not homogeneous (as shown in Case 2 below).
[Figure: example Scale-Location plots (Case 1 and Case 2)]
Lastly, the Residuals vs Leverage plot helps to identify influential cases, i.e. high-leverage points, within the dataset. The dashed lines show the boundaries of Cook's distance; cases located outside the dashed lines are influential to the regression results. If we remove these cases, the regression model might change significantly.
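All four diagnostic plots described here can be produced at once for any fitted lm object, continuing the earlier sketch:
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(fit)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage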


Data Preparation

There are 2 datasets used in this project: the KC_Housesales_Data from Kaggle and the US Zip Codes Database retrieved from simplemaps.
# import 2 datasets to be stored in 2 different dataframes
library(readr)
df_house <- read_csv("kc_house_data.csv")

library(readxl)
df_zipcode <- read_excel("uszips.xlsx")
  The first few rows of each dataset are shown below, as well as the structure of each dataframe.
# KC_Housesales_Data
head(df_house)
## # A tibble: 6 x 21
##        id date   price bedrooms bathrooms sqft_living sqft_lot floors waterfront
##     <dbl> <chr>  <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>
## 1  7.13e9 10/1~ 2.22e5        3      1           1180     5650      1          0
## 2  6.41e9 12/9~ 5.38e5        3      2.25        2570     7242      2          0
## 3  5.63e9 2/25~ 1.8 e5        2      1            770    10000      1          0
## 4  2.49e9 12/9~ 6.04e5        4      3           1960     5000      1          0
## 5  1.95e9 2/18~ 5.1 e5        3      2           1680     8080      1          0
## 6  7.24e9 5/12~ 1.23e6        4      4.5         5420   101930      1          0
## # ... with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
colnames(df_house)
##  [1] "id"            "date"          "price"         "bedrooms"     
##  [5] "bathrooms"     "sqft_living"   "sqft_lot"      "floors"       
##  [9] "waterfront"    "view"          "condition"     "grade"        
## [13] "sqft_above"    "sqft_basement" "yr_built"      "yr_renovated" 
## [17] "zipcode"       "lat"           "long"          "sqft_living15"
## [21] "sqft_lot15"
# US Zip Codes Database
head(df_zipcode)
## # A tibble: 6 x 18
##     zip   lat   lng city      state_id state_name  zcta  parent_zcta population
##   <dbl> <dbl> <dbl> <chr>     <chr>    <chr>       <chr> <lgl>            <dbl>
## 1   601  18.2 -66.8 Adjuntas  PR       Puerto Rico TRUE  NA               17113
## 2   602  18.4 -67.2 Aguada    PR       Puerto Rico TRUE  NA               37751
## 3   603  18.5 -67.1 Aguadilla PR       Puerto Rico TRUE  NA               47081
## 4   606  18.2 -66.9 Maricao   PR       Puerto Rico TRUE  NA                6392
## 5   610  18.3 -67.1 Anasco    PR       Puerto Rico TRUE  NA               26686
## 6   612  18.4 -66.7 Arecibo   PR       Puerto Rico TRUE  NA               59369
## # ... with 9 more variables: density <dbl>, county_fips <dbl>,
## #   county_name <chr>, county_weights <chr>, county_names_all <chr>,
## #   county_fips_all <chr>, imprecise <chr>, military <chr>, timezone <chr>
colnames(df_zipcode)
##  [1] "zip"              "lat"              "lng"              "city"            
##  [5] "state_id"         "state_name"       "zcta"             "parent_zcta"     
##  [9] "population"       "density"          "county_fips"      "county_name"     
## [13] "county_weights"   "county_names_all" "county_fips_all"  "imprecise"       
## [17] "military"         "timezone"

Data Cleaning

    Data cleaning for df_house:
          1. Remove unnecessary columns
          2. Transform some columns to categorical values
          3. Check for missing values
## Removing the unnecessary information

df_house$id <- NULL
df_house$sqft_living <- NULL
df_house$sqft_lot <- NULL
df_house$view <- NULL
df_house$grade <- NULL
df_house$sqft_above <- NULL
df_house$zip <- df_house$zipcode
df_house$zipcode <- NULL
df_house$lat <- NULL
df_house$long <- NULL

## Changing sqft_basement to basement (1 for yes and 0 for no)
df_house$basement <- df_house$sqft_basement
df_house$sqft_basement <- NULL
df_house$basement = ifelse(df_house$basement>0,"1","0")

## Changing yr_built to age
library("stringr")  
df_house$date <- str_sub(df_house$date, - 4, - 1) #Extract the year from date of building sold 
df_house$date <- as.numeric(df_house$date) #changing it to numeric for subtraction operation
df_house$age <- df_house$date - df_house$yr_built #age of building when sold is calculated
df_house$date <- NULL
df_house$yr_built <- NULL

## Changing yr_renovated to renovation (1 for yes and 0 for no)
df_house$renovation <- ifelse(df_house$yr_renovated > 0, "1", "0")
df_house$yr_renovated <- NULL

## Checking for missing values
sum(is.na(df_house))
## [1] 0
## Saving new dataframe in csv
write.csv(df_house,'house_data.csv',row.names=FALSE)

# Detecting Outliers
boxplot(df_house$price, ylab = "price")

boxplot(df_house$bedrooms, ylab = "bedroom")

# reexamine the structure of the dataframe now
str(df_house)
## spec_tbl_df [21,597 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ price        : num [1:21597] 221900 538000 180000 604000 510000 ...
##  $ bedrooms     : num [1:21597] 3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num [1:21597] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ floors       : num [1:21597] 1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : num [1:21597] 0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num [1:21597] 3 3 3 5 3 3 3 3 3 3 ...
##  $ sqft_living15: num [1:21597] 1340 1690 2720 1360 1800 ...
##  $ sqft_lot15   : num [1:21597] 5650 7639 8062 5000 7503 ...
##  $ zip          : num [1:21597] 98178 98125 98028 98136 98074 ...
##  $ basement     : chr [1:21597] "0" "1" "0" "1" ...
##  $ age          : num [1:21597] 59 63 82 49 28 13 19 52 55 12 ...
##  $ renovation   : chr [1:21597] "0" "1" "0" "0" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   date = col_character(),
##   ..   price = col_double(),
##   ..   bedrooms = col_double(),
##   ..   bathrooms = col_double(),
##   ..   sqft_living = col_double(),
##   ..   sqft_lot = col_double(),
##   ..   floors = col_double(),
##   ..   waterfront = col_double(),
##   ..   view = col_double(),
##   ..   condition = col_double(),
##   ..   grade = col_double(),
##   ..   sqft_above = col_double(),
##   ..   sqft_basement = col_double(),
##   ..   yr_built = col_double(),
##   ..   yr_renovated = col_double(),
##   ..   zipcode = col_double(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   sqft_living15 = col_double(),
##   ..   sqft_lot15 = col_double()
##   .. )
      Data cleaning for df_zipcode:
          1. Remove unnecessary columns
          2. Remove duplicate rows using zipcode variable
## Removing the unnecessary information
df_zipcode <- df_zipcode[ -c(2:3,5,7:18) ]

## Removing duplicate rows using the zipcode variable
library(dplyr)
df_zipcode <- df_zipcode %>% distinct(zip, .keep_all = TRUE)  # assign back so the duplicates are actually dropped
df_zipcode
## # A tibble: 33,121 x 3
##      zip city        state_name 
##    <dbl> <chr>       <chr>      
##  1   601 Adjuntas    Puerto Rico
##  2   602 Aguada      Puerto Rico
##  3   603 Aguadilla   Puerto Rico
##  4   606 Maricao     Puerto Rico
##  5   610 Anasco      Puerto Rico
##  6   612 Arecibo     Puerto Rico
##  7   616 Bajadero    Puerto Rico
##  8   617 Barceloneta Puerto Rico
##  9   622 Boqueron    Puerto Rico
## 10   623 Cabo Rojo   Puerto Rico
## # ... with 33,111 more rows
write.csv(df_zipcode,'zipcode.csv',row.names=FALSE)

Data Merging

  After we have cleaned df_house and df_zipcode, we can merge the 2 dataframes, since df_house has no city or state_name columns.
# Merging 2 datasets
df_house <- read_csv("house_data.csv")
df_zipcode <- read_csv("zipcode.csv")
df <- merge(df_house, df_zipcode, by = "zip")

# Examine the frequency table of city and state_name
table(df$city)
## 
##        Auburn      Bellevue Black Diamond       Bothell     Carnation 
##           911          1407           100           195           124 
##        Duvall      Enumclaw     Fall City   Federal Way      Issaquah 
##           190           233            80           779           733 
##       Kenmore          Kent      Kirkland  Maple Valley        Medina 
##           283          1201           977           589            50 
## Mercer Island    North Bend       Redmond        Renton     Sammamish 
##           282           220           977          1597           800 
##       Seattle    Snoqualmie        Vashon   Woodinville 
##          8973           308           117           471
table(df$state_name)
## 
## Washington 
##      21597
# Since every state_name is Washington, we can omit the state_name column
df$state_name <- NULL

# print the first few rows of the merged data
head(df, 10)
##      zip  price bedrooms bathrooms floors waterfront condition sqft_living15
## 1  98001 225000        3      1.00    1.0          0         3          1660
## 2  98001 210000        3      1.00    1.0          0         3          1820
## 3  98001 230000        3      1.00    1.0          0         4          1140
## 4  98001 196000        3      2.25    2.0          0         3          1890
## 5  98001 321500        4      2.50    2.0          0         3          2740
## 6  98001 360000        4      2.00    1.0          0         5          1230
## 7  98001 207000        3      1.00    1.0          0         4          1356
## 8  98001 420000        4      2.50    2.0          0         3          2990
## 9  98001 305000        2      1.00    1.5          0         4          1650
## 10 98001 429000        4      2.50    2.0          0         3          2730
##    sqft_lot15 basement age renovation   city
## 1        7245        1  51          0 Auburn
## 2        7210        1  51          0 Auburn
## 3        9639        0  48          0 Auburn
## 4        7519        0  40          0 Auburn
## 5        5816        0  11          0 Auburn
## 6       15750        0  49          0 Auburn
## 7        9450        1  71          0 Auburn
## 8        9033        0  13          0 Auburn
## 9       19200        0  79          0 Auburn
## 10       8688        0  13          0 Auburn
str(df)
## 'data.frame':    21597 obs. of  13 variables:
##  $ zip          : num  98001 98001 98001 98001 98001 ...
##  $ price        : num  225000 210000 230000 196000 321500 ...
##  $ bedrooms     : num  3 3 3 3 4 4 3 4 2 4 ...
##  $ bathrooms    : num  1 1 1 2.25 2.5 2 1 2.5 1 2.5 ...
##  $ floors       : num  1 1 1 2 2 1 1 2 1.5 2 ...
##  $ waterfront   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num  3 3 4 3 3 5 4 3 4 3 ...
##  $ sqft_living15: num  1660 1820 1140 1890 2740 ...
##  $ sqft_lot15   : num  7245 7210 9639 7519 5816 ...
##  $ basement     : num  1 1 0 0 0 0 1 0 0 0 ...
##  $ age          : num  51 51 48 40 11 49 71 13 79 13 ...
##  $ renovation   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ city         : chr  "Auburn" "Auburn" "Auburn" "Auburn" ...
write.csv(df,'merged.csv',row.names=FALSE)

Back to top

Data Analysis

Exploratory Data Analysis (EDA)

## Rearranging the columns so that the price will be the first variable
df <- df[,c(2:12,1,13)]

## Viewing a Part of the Data
head(df,10)
##     price bedrooms bathrooms floors waterfront condition sqft_living15
## 1  225000        3      1.00    1.0          0         3          1660
## 2  210000        3      1.00    1.0          0         3          1820
## 3  230000        3      1.00    1.0          0         4          1140
## 4  196000        3      2.25    2.0          0         3          1890
## 5  321500        4      2.50    2.0          0         3          2740
## 6  360000        4      2.00    1.0          0         5          1230
## 7  207000        3      1.00    1.0          0         4          1356
## 8  420000        4      2.50    2.0          0         3          2990
## 9  305000        2      1.00    1.5          0         4          1650
## 10 429000        4      2.50    2.0          0         3          2730
##    sqft_lot15 basement age renovation   zip   city
## 1        7245        1  51          0 98001 Auburn
## 2        7210        1  51          0 98001 Auburn
## 3        9639        0  48          0 98001 Auburn
## 4        7519        0  40          0 98001 Auburn
## 5        5816        0  11          0 98001 Auburn
## 6       15750        0  49          0 98001 Auburn
## 7        9450        1  71          0 98001 Auburn
## 8        9033        0  13          0 98001 Auburn
## 9       19200        0  79          0 98001 Auburn
## 10       8688        0  13          0 98001 Auburn
## Viewing the structure and summary statistics of the data
str(df)
## 'data.frame':    21597 obs. of  13 variables:
##  $ price        : num  225000 210000 230000 196000 321500 ...
##  $ bedrooms     : num  3 3 3 3 4 4 3 4 2 4 ...
##  $ bathrooms    : num  1 1 1 2.25 2.5 2 1 2.5 1 2.5 ...
##  $ floors       : num  1 1 1 2 2 1 1 2 1.5 2 ...
##  $ waterfront   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num  3 3 4 3 3 5 4 3 4 3 ...
##  $ sqft_living15: num  1660 1820 1140 1890 2740 ...
##  $ sqft_lot15   : num  7245 7210 9639 7519 5816 ...
##  $ basement     : num  1 1 0 0 0 0 1 0 0 0 ...
##  $ age          : num  51 51 48 40 11 49 71 13 79 13 ...
##  $ renovation   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zip          : num  98001 98001 98001 98001 98001 ...
##  $ city         : chr  "Auburn" "Auburn" "Auburn" "Auburn" ...
summary(df)
##      price            bedrooms        bathrooms         floors     
##  Min.   :  78000   Min.   : 1.000   Min.   :0.500   Min.   :1.000  
##  1st Qu.: 322000   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.:1.000  
##  Median : 450000   Median : 3.000   Median :2.250   Median :1.500  
##  Mean   : 540297   Mean   : 3.373   Mean   :2.116   Mean   :1.494  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.:2.000  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :3.500  
##    waterfront         condition    sqft_living15    sqft_lot15    
##  Min.   :0.000000   Min.   :1.00   Min.   : 399   Min.   :   651  
##  1st Qu.:0.000000   1st Qu.:3.00   1st Qu.:1490   1st Qu.:  5100  
##  Median :0.000000   Median :3.00   Median :1840   Median :  7620  
##  Mean   :0.007547   Mean   :3.41   Mean   :1987   Mean   : 12758  
##  3rd Qu.:0.000000   3rd Qu.:4.00   3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :1.000000   Max.   :5.00   Max.   :6210   Max.   :871200  
##     basement          age           renovation           zip       
##  Min.   :0.000   Min.   : -1.00   Min.   :0.00000   Min.   :98001  
##  1st Qu.:0.000   1st Qu.: 18.00   1st Qu.:0.00000   1st Qu.:98033  
##  Median :0.000   Median : 40.00   Median :0.00000   Median :98065  
##  Mean   :0.393   Mean   : 43.32   Mean   :0.04232   Mean   :98078  
##  3rd Qu.:1.000   3rd Qu.: 63.00   3rd Qu.:0.00000   3rd Qu.:98118  
##  Max.   :1.000   Max.   :115.00   Max.   :1.00000   Max.   :98199  
##      city          
##  Length:21597      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
## Checking for the missing values in the data.
NA_values=data.frame(no_of_na_values=colSums(is.na(df)))
head(NA_values,13)
##               no_of_na_values
## price                       0
## bedrooms                    0
## bathrooms                   0
## floors                      0
## waterfront                  0
## condition                   0
## sqft_living15               0
## sqft_lot15                  0
## basement                    0
## age                         0
## renovation                  0
## zip                         0
## city                        0
# Counting the number of groups with the same zip code and city
library(dplyr)
freq_zipcode <- as.data.frame(table(df$zip))
freq_zipcode <- freq_zipcode %>% rename(zipcode = Var1)
print(freq_zipcode)
##    zipcode Freq
## 1    98001  361
## 2    98002  199
## 3    98003  280
## 4    98004  317
## 5    98005  168
## 6    98006  498
## 7    98007  141
## 8    98008  283
## 9    98010  100
## 10   98011  195
## 11   98014  124
## 12   98019  190
## 13   98022  233
## 14   98023  499
## 15   98024   80
## 16   98027  412
## 17   98028  283
## 18   98029  321
## 19   98030  256
## 20   98031  273
## 21   98032  125
## 22   98033  432
## 23   98034  545
## 24   98038  589
## 25   98039   50
## 26   98040  282
## 27   98042  547
## 28   98045  220
## 29   98052  574
## 30   98053  403
## 31   98055  268
## 32   98056  406
## 33   98058  455
## 34   98059  468
## 35   98065  308
## 36   98070  117
## 37   98072  273
## 38   98074  441
## 39   98075  359
## 40   98077  198
## 41   98092  351
## 42   98102  104
## 43   98103  602
## 44   98105  229
## 45   98106  335
## 46   98107  266
## 47   98108  186
## 48   98109  109
## 49   98112  269
## 50   98115  583
## 51   98116  330
## 52   98117  553
## 53   98118  507
## 54   98119  184
## 55   98122  290
## 56   98125  409
## 57   98126  354
## 58   98133  493
## 59   98136  263
## 60   98144  343
## 61   98146  288
## 62   98148   57
## 63   98155  446
## 64   98166  254
## 65   98168  269
## 66   98177  255
## 67   98178  262
## 68   98188  136
## 69   98198  280
## 70   98199  317
freq_city <- as.data.frame(table(df$city))
freq_city <- freq_city %>% rename(city_name = Var1)
print(freq_city)
##        city_name Freq
## 1         Auburn  911
## 2       Bellevue 1407
## 3  Black Diamond  100
## 4        Bothell  195
## 5      Carnation  124
## 6         Duvall  190
## 7       Enumclaw  233
## 8      Fall City   80
## 9    Federal Way  779
## 10      Issaquah  733
## 11       Kenmore  283
## 12          Kent 1201
## 13      Kirkland  977
## 14  Maple Valley  589
## 15        Medina   50
## 16 Mercer Island  282
## 17    North Bend  220
## 18       Redmond  977
## 19        Renton 1597
## 20     Sammamish  800
## 21       Seattle 8973
## 22    Snoqualmie  308
## 23        Vashon  117
## 24   Woodinville  471
  Dividing the data into train and test sets using a random sampling method: 70% of the data goes into the training set, while the rest forms the test set. The reason for splitting the dataset into two is to judge the performance of the prediction model on the test dataset afterwards.
library(caTools)
#  set seed to ensure you always have same random numbers generated
set.seed(123)

# splits the data in the ratio mentioned in SplitRatio; rows selected for training are marked as logical TRUE and the remaining rows as logical FALSE
sample = sample.split(df$zip,SplitRatio = 0.7)

# creates a training dataset named train_data with rows which are marked as TRUE
train_data =subset(df,sample ==TRUE)
test_data=subset(df, sample==FALSE)

# check the dimensions of train_data and test_data
dim(train_data)
## [1] 15116    13
dim(test_data)
## [1] 6481   13
      Determining the association between the variables.
library(corrplot)
cor_data = data.frame(train_data[,1:12])
cor_data = cor_data[, -c(9,11)]  # drop the two character columns (basement, renovation)
correlation = cor(cor_data)
par(mfrow = c(1, 1))
corrplot(correlation, method = "color")

      According to our corrplot, price is positively correlated with bedrooms, bathrooms, floors and sqft_living15.

      Draw some scatter plots to examine the relationships between the positively correlated variables and price.
library(ggplot2)
library(grid)
library(lattice)
library(gridExtra)
p1=ggplot(data = train_data, aes(x = bedrooms, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Bedrooms and Price", x="Bedrooms",y="Price")
p2=ggplot(data = train_data, aes(x = bathrooms, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Bathrooms and Price", x="Bathrooms",y="Price")
p3=ggplot(data = train_data, aes(x = floors, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Floors and Price", x="Floors",y="Price")
p4=ggplot(data = train_data, aes(x = sqft_living15, y = price)) +
  geom_jitter() +  geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Sqft_living15 and Price", x="Sqft_living15",y="Price")

grid.arrange(p1,p2,p3,p4,nrow=2)

      From these scatter plots, we can say that the relationships between price (the dependent variable) and bedrooms, bathrooms and sqft_living15 (independent variables) are linear and positive. However, the relationship between price and floors is not obviously linear.

      The p-values from correlation tests can be used to identify whether each correlation is significant.
# create an empty data frame to store the p-value of each correlation test
p_value_df <- data.frame("Explanatory_Variable" = character(0), "p_value" = double(0), stringsAsFactors = FALSE)


for (i in 2:ncol(cor_data)) {
  p_value <- cor.test(cor_data[,1], cor_data[,i])$p.value
  new_row <- c(names(cor_data[i]), p_value)
  p_value_df[nrow(p_value_df)+1,] <- new_row
}

print(p_value_df)
##   Explanatory_Variable               p_value
## 1             bedrooms                     0
## 2            bathrooms                     0
## 3               floors  2.7422081672573e-228
## 4           waterfront 3.50284064304626e-201
## 5            condition   0.00100763267925831
## 6        sqft_living15                     0
## 7           sqft_lot15  8.00758800521312e-24
## 8                  age  1.14243642311684e-11
## 9                  zip  1.55392256300633e-11
      Based on the p-values from the correlation tests between price and each explanatory variable, all the relationships appear to be significant.

      For the 4 categorical variables (waterfront, basement, renovation and city), we draw boxplots to understand the relationships.
par(mfrow=c(2, 2))
boxplot(price~waterfront,data=train_data,main="Price vs Waterfront", xlab="waterfront",ylab="price",col="orange",border="brown")
boxplot(price~basement,data=train_data,main="Price vs Basement", xlab="basement",ylab="price",col="orange",border="brown")
boxplot(price~renovation,data=train_data,main="Price vs Renovation", xlab="renovation",ylab="price",col="orange",border="brown")
boxplot(price~city,data=train_data,main="Price vs City", xlab="city",ylab="price",col="orange",border="brown")

      There appear to be relationships between price and these 4 categorical variables.

      Checking for outliers in the dependent variable (price) using a boxplot.
ggplot(data=train_data)+geom_boxplot(aes(x=bedrooms,y=price))

outliers=boxplot(train_data$price,plot=FALSE)$out
outliers_data=train_data[which(train_data$price %in% outliers),]
train_data1= train_data[-which(train_data$price %in% outliers),]

par(mfrow=c(1, 2))
# Plot of original data with outliers.
plot(train_data$bedrooms, train_data$price, main="With Outliers", xlab="bedrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ bedrooms, data=train_data), col="blue", lwd=3, lty=2)

# Plot of data after removing the outliers. Note the change of slope.
plot(train_data1$bedrooms, train_data1$price, main="Outliers removed", xlab="bedrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~bedrooms, data=train_data1), col="blue", lwd=3, lty=2)

      Notice the change in the slope of the best-fit line after removing the outliers.
      If we removed the outliers to train the model, our predictions would be exaggerated (high error) for larger values of price because of the larger slope.
      Therefore we continue with the entire data.


Analysis of Variance (ANOVA)

Description

Identified Questions

  • Is there a significant difference in the house sold price based on the house variables?

 

Objective

  • To investigate whether the condition rating, renovation status and city affect the house sold price.

 

This section will focus on applying one-way ANOVA to find out whether the categorical variables cause a difference in the mean house sold price. The null hypothesis (H0) of the ANOVA is that there is no difference in means, and the alternative hypothesis (H1) is that the means are different from one another. ANOVA assumes that each observation is randomly sampled, independent, normally distributed, and has unknown but equal variances.

 

ANOVA only reveals that group means differ. Hence this section will also apply Tukey’s Honestly Significant Difference (Tukey’s HSD) test to tell which groups are statistically different from one another.

Steps:

 

  1. Analyse the factor levels of the variable.
  2. Perform the ANOVA test.
    • The ANOVA model summary lists the independent variable being tested in the model and the model residuals (the variation not explained by the independent variables is called residual variance).
  3. Perform Tukey’s HSD test.
  4. Boxplot price against the variable to check the distribution.

The table below shows the variables that will be analysed. Not all of the variables from the dataset were used in this analysis.

Variable | Description | Level
condition | How good the house condition is | 1 to 5, where 1=worst; 5=best
renovation | Indicates if the house has been renovated | 1=Yes; 0=No
city | Indicates in which city the house is located | List of city names

Packages & Dataset

Loading the package libraries
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(tidyverse)
library(broom)
library(AICcmodavg)
library(dplyr)
Loading the dataset
# Read the variables that will be analysed later as factors
house <- read.csv('merged.csv', header = TRUE, stringsAsFactors=T,
                  colClasses=c(rep('numeric', 6), 'factor', rep('numeric', 4),rep('factor', 2)))
glimpse(house)
## Rows: 21,597
## Columns: 13
## $ zip           <dbl> 98001, 98001, 98001, 98001, 98001, 98001, 98001, 98001, ~
## $ price         <dbl> 225000, 210000, 230000, 196000, 321500, 360000, 207000, ~
## $ bedrooms      <dbl> 3, 3, 3, 3, 4, 4, 3, 4, 2, 4, 4, 4, 5, 3, 4, 3, 3, 3, 3,~
## $ bathrooms     <dbl> 1.00, 1.00, 1.00, 2.25, 2.50, 2.00, 1.00, 2.50, 1.00, 2.~
## $ floors        <dbl> 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.5, 2.0, 1.0, 2~
## $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ condition     <fct> 3, 3, 4, 3, 3, 5, 4, 3, 4, 3, 4, 3, 4, 4, 3, 3, 3, 4, 4,~
## $ sqft_living15 <dbl> 1660, 1820, 1140, 1890, 2740, 1230, 1356, 2990, 1650, 27~
## $ sqft_lot15    <dbl> 7245, 7210, 9639, 7519, 5816, 15750, 9450, 9033, 19200, ~
## $ basement      <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,~
## $ age           <dbl> 51, 51, 48, 40, 11, 49, 71, 13, 79, 13, 37, 14, 59, 61, ~
## $ renovation    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ city          <fct> Auburn, Auburn, Auburn, Auburn, Auburn, Auburn, Auburn, ~
Price vs Condition

This section examines whether there is a statistically significant difference between the mean house sold prices according to the condition rating given to the house.

 

1. Overview of the condition variable
#Calculate frequency, mean and standard deviation
house %>% 
  group_by(condition) %>% 
  summarise(condition_freq = n(), 
            price_mean = mean(price, na.rm = TRUE), 
            price_sd = sd(price, na.rm = TRUE))
## # A tibble: 5 x 4
##   condition condition_freq price_mean price_sd
##   <fct>              <int>      <dbl>    <dbl>
## 1 1                     29    341067.  273483.
## 2 2                    170    328179.  246987.
## 3 3                  14020    542173.  364650.
## 4 4                   5677    521374.  358796.
## 5 5                   1701    612578.  411318.
2. ANOVA Test

Hypothesis:

  • H0: The mean price is equal for all levels of condition categories.
  • H1: At least one of the condition categories has a mean price that is not the same as the other condition categories.

 

The p-value of the condition variable is low (p < 0.001, indicated by the ’***’), which implies that the condition rating has an impact on the house sold price. Hence H0 is rejected: there is a significant difference in the average price of a house based on its condition.

one_way <- aov(price ~ condition, data = house)
summary(one_way)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## condition       4 1.977e+13 4.942e+12   36.86 <2e-16 ***
## Residuals   21592 2.895e+15 1.341e+11                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3. Tukey test

Based on the test results, there are statistically significant differences (p < 0.05) between condition groups 3-1 (3 and 1), 5-1, 3-2, 4-2, 5-2, 4-3, 5-3 and 5-4. The differences between condition groups 2-1 and 4-1 are not statistically significant.

TukeyHSD(one_way)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ condition, data = house)
## 
## $condition
##          diff         lwr        upr     p adj
## 2-1 -12888.48 -213572.104 187795.138 0.9997895
## 3-1 201105.82   15428.628 386783.004 0.0260297
## 4-1 180307.21   -5651.399 366265.817 0.0625225
## 5-1 271510.50   84450.580 458570.422 0.0007181
## 3-2 213994.30  136921.420 291067.177 0.0000000
## 4-2 193195.69  115447.286 270944.097 0.0000000
## 5-2 284398.98  204052.083 364745.884 0.0000000
## 4-3 -20798.61  -36512.198  -5085.016 0.0028281
## 5-3  70404.69   44758.500  96050.870 0.0000000
## 5-4  91203.29   63593.282 118813.303 0.0000000
4. Boxplot of the distribution of price by condition:
options(scipen=999)
ggboxplot(house, x = "condition", y = "price", ylim=c(78000,7700000))


Price vs Renovation

This section examines whether there is a statistically significant difference in the mean house sold price according to the renovation state of the house.

 

1. Overview of the renovation variable
#Calculate frequency, mean and standard deviation
house %>% 
  group_by(renovation) %>% 
  summarise(renovation_freq = n(), 
            price_mean = mean(price, na.rm = TRUE), 
            price_sd = sd(price, na.rm = TRUE))
## # A tibble: 2 x 4
##   renovation renovation_freq price_mean price_sd
##   <fct>                <int>      <dbl>    <dbl>
## 1 0                    20683    530560.  349805.
## 2 1                      914    760629.  608017.
2. ANOVA Test

Hypothesis:

  • H0: The mean price is equal for all levels of renovation categories.
  • H1: At least one of the renovation categories has a mean price that is not the same as the other renovation categories.

 

The p-value of the renovation variable is low (p < 0.001, indicated by the ’***’), which implies that the renovation state has an impact on the house sold price. Hence H0 is rejected: there is a significant difference in the average price of a house based on its renovation state.

one_way <- aov(price ~ renovation, data = house)
summary(one_way)
##                Df           Sum Sq        Mean Sq F value              Pr(>F)
## renovation      1   46332107051977 46332107051977   348.8 <0.0000000000000002
## Residuals   21595 2868250023356214   132820098326                            
##                
## renovation  ***
## Residuals      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3. Tukey test

Based on the test results, there is a statistically significant difference (p < 0.05) between renovation groups 1 and 0.

TukeyHSD(one_way)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ renovation, data = house)
## 
## $renovation
##         diff      lwr      upr p adj
## 1-0 230068.9 205924.2 254213.5     0
4. Boxplot of the distribution of price by renovation:
options(scipen=999)
ggboxplot(house, x = "renovation", y = "price", ylim=c(78000,7700000))


Price vs City

This section examines whether there is a statistically significant difference between the mean house sold prices according to the city in which the house is located.

 

1. Overview of the city variable
#Calculate frequency, mean and standard deviation
options(dplyr.print_max = 1e9)
house %>% 
  group_by(city) %>% 
  summarise(city_freq = n(), 
            price_mean = mean(price, na.rm = TRUE), 
            price_sd = sd(price, na.rm = TRUE))
## # A tibble: 24 x 4
##    city          city_freq price_mean price_sd
##    <fct>             <int>      <dbl>    <dbl>
##  1 Auburn              911    291648.  108422.
##  2 Bellevue           1407    898466.  559782.
##  3 Black Diamond       100    423666.  195415.
##  4 Bothell             195    490377.  121971.
##  5 Carnation           124    455617.  258603.
##  6 Duvall              190    424815.  130638.
##  7 Enumclaw            233    316742.  122329.
##  8 Fall City            80    586121.  376719.
##  9 Federal Way         779    289391.  108399.
## 10 Issaquah            733    615122.  260451.
## 11 Kenmore             283    462489.  149530.
## 12 Kent               1201    299470.   91647.
## 13 Kirkland            977    646543.  409633.
## 14 Maple Valley        589    367091.  132721.
## 15 Medina               50   2161300  1166904.
## 16 Mercer Island       282   1194874.  607768.
## 17 North Bend          220    440232.  207554.
## 18 Redmond             977    658432.  231136.
## 19 Renton             1597    403468.  200725.
## 20 Sammamish           800    732821.  280951.
## 21 Seattle            8973    535086.  340519.
## 22 Snoqualmie          308    529630.  185254.
## 23 Vashon              117    489382.  201501.
## 24 Woodinville         471    617498.  244298.
2. ANOVA Test

Hypothesis:

  • H0: The mean price is equal for all levels of city categories.
  • H1: At least one of the city categories has a mean price that is not the same as the other city categories.

 

The p-value of the city variable is low (p < 0.001, indicated by the ’***’), which implies that the city has an impact on the house sold price. Hence H0 is rejected: there is a significant difference in the average price of a house based on its location.

one_way <- aov(price ~ city, data = house)
summary(one_way)
##                Df           Sum Sq        Mean Sq F value              Pr(>F)
## city           23  738104329040975 32091492566999   318.1 <0.0000000000000002
## Residuals   21573 2176477801366934   100888972390                            
##                
## city        ***
## Residuals      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3. Tukey test

There are statistically significant differences (p < 0.05) between most pairs of city groups.

TukeyHSD(one_way)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ city, data = house)
## 
## $city
##                                      diff          lwr           upr     p adj
## Bellevue-Auburn               606818.2632   557681.911   655954.6152 0.0000000
## Black Diamond-Auburn          132018.1700    10296.242   253740.0976 0.0167217
## Bothell-Auburn                198729.2877   107558.963   289899.6123 0.0000000
## Carnation-Auburn              163969.2929    53369.943   274568.6432 0.0000187
## Duvall-Auburn                 133167.2432    41014.111   225320.3756 0.0000393
## Enumclaw-Auburn                25094.5706   -59731.294   109920.4352 0.9999950
## Fall City-Auburn              294473.0550   159736.612   429209.4977 0.0000000
## Federal Way-Auburn             -2256.5222   -58642.078    54129.0338 1.0000000
## Issaquah-Auburn               323474.3533   266143.004   380805.7025 0.0000000
## Kenmore-Auburn                170841.0493    92208.530   249473.5683 0.0000000
## Kent-Auburn                     7822.2275   -42943.276    58587.7308 1.0000000
## Kirkland-Auburn               354895.0070   301678.428   408111.5865 0.0000000
## Maple Valley-Auburn            75443.6639    14352.102   136535.2253 0.0017262
## Medina-Auburn                1869652.1800  1701822.049  2037482.3115 0.0000000
## Mercer Island-Auburn          903225.8183   824486.997   981964.6399 0.0000000
## North Bend-Auburn             148584.4937    61785.797   235383.1906 0.0000000
## Redmond-Auburn                366783.9559   313567.376   420000.5353 0.0000000
## Renton-Auburn                 111820.6409    63846.793   159794.4885 0.0000000
## Sammamish-Auburn              441173.1875   385187.985   497158.3902 0.0000000
## Seattle-Auburn                243437.7799   203259.552   283616.0078 0.0000000
## Snoqualmie-Auburn             237981.9625   161823.225   314140.7003 0.0000000
## Vashon-Auburn                 197734.1971    84260.065   311208.3292 0.0000000
## Woodinville-Auburn            325850.0505   260275.275   391424.8258 0.0000000
## Black Diamond-Bellevue       -474800.0932  -594381.046  -355219.1404 0.0000000
## Bothell-Bellevue             -408088.9755  -496380.565  -319797.3864 0.0000000
## Carnation-Bellevue           -442848.9703  -551087.563  -334610.3779 0.0000000
## Duvall-Bellevue              -473651.0200  -562957.103  -384344.9372 0.0000000
## Enumclaw-Bellevue            -581723.6926  -663447.642  -499999.7431 0.0000000
## Fall City-Bellevue           -312345.2082  -445150.651  -179539.7655 0.0000000
## Federal Way-Bellevue         -609074.7853  -660676.215  -557473.3558 0.0000000
## Issaquah-Bellevue            -283343.9099  -335977.171  -230710.6484 0.0000000
## Kenmore-Bellevue             -435977.2139  -511253.038  -360701.3900 0.0000000
## Kent-Bellevue                -598996.0357  -644388.909  -553603.1625 0.0000000
## Kirkland-Bellevue            -251923.2561  -300041.662  -203804.8507 0.0000000
## Maple Valley-Bellevue        -531374.5993  -588080.467  -474668.7311 0.0000000
## Medina-Bellevue              1262833.9168  1096550.034  1429117.7996 0.0000000
## Mercer Island-Bellevue        296407.5551   221020.695   371794.4150 0.0000000
## North Bend-Bellevue          -458233.7695  -542003.635  -374463.9038 0.0000000
## Redmond-Bellevue             -240034.3073  -288152.713  -191915.9019 0.0000000
## Renton-Bellevue              -494997.6223  -537245.305  -452749.9399 0.0000000
## Sammamish-Bellevue           -165645.0757  -216808.730  -114481.4213 0.0000000
## Seattle-Bellevue             -363380.4832  -396511.535  -330249.4313 0.0000000
## Snoqualmie-Bellevue          -368836.3007  -441524.195  -296148.4066 0.0000000
## Vashon-Bellevue              -409084.0661  -520258.502  -297909.6299 0.0000000
## Woodinville-Bellevue         -280968.2127  -342477.787  -219458.6381 0.0000000
## Bothell-Black Diamond          66711.1177   -75405.916   208828.1513 0.9916108
## Carnation-Black Diamond        31951.1229  -123346.783   187249.0284 1.0000000
## Duvall-Black Diamond            1149.0732  -141600.438   143898.5845 1.0000000
## Enumclaw-Black Diamond       -106923.5994  -245056.327    31209.1283 0.4247155
## Fall City-Black Diamond       162454.8850   -10863.124   335772.8938 0.1023094
## Federal Way-Black Diamond    -134274.6922  -257012.437   -11536.9470 0.0146776
## Issaquah-Black Diamond        191456.1833    68281.077   314631.2894 0.0000042
## Kenmore-Black Diamond          38822.8793   -95595.483   173241.2419 0.9999968
## Kent-Black Diamond           -124195.9425  -244455.493    -3936.3920 0.0332820
## Kirkland-Black Diamond        222876.8370   101562.256   344191.4177 0.0000000
## Maple Valley-Black Diamond    -56574.5061  -181544.111    68395.0983 0.9948162
## Medina-Black Diamond         1737634.0100  1537503.612  1937764.4080 0.0000000
## Mercer Island-Black Diamond   771207.6483   636727.073   905688.2238 0.0000000
## North Bend-Black Diamond       16566.3236  -122786.599   155919.2459 1.0000000
## Redmond-Black Diamond         234765.7858   113451.205   356080.3665 0.0000000
## Renton-Black Diamond          -20197.5291  -139305.519    98910.4606 1.0000000
## Sammamish-Black Diamond       309155.0175   186600.678   431709.3568 0.0000000
## Seattle-Black Diamond         111419.6099    -4767.795   227607.0153 0.0806822
## Snoqualmie-Black Diamond      105963.7925   -27022.585   238950.1696 0.3631798
## Vashon-Black Diamond           65716.0271   -91642.169   223074.2230 0.9983475
## Woodinville-Black Diamond     193831.8805    66610.521   321053.2400 0.0000080
## Carnation-Bothell             -34759.9948  -167474.877    97954.8871 0.9999995
## Duvall-Bothell                -65562.0445  -183346.693    52222.6040 0.9399590
## Enumclaw-Bothell             -173634.7171  -285779.494   -61489.9406 0.0000048
## Fall City-Bothell              95743.7673   -57667.225   249154.7597 0.8331508
## Federal Way-Bothell          -200985.8099  -293507.994  -108463.6259 0.0000000
## Issaquah-Bothell              124745.0656    31643.469   217846.6618 0.0002839
## Kenmore-Bothell               -27888.2384  -135424.728    79648.2515 0.9999996
## Kent-Bothell                 -190907.0602  -280115.580  -101698.5408 0.0000000
## Kirkland-Bothell              156165.7193    65539.962   246791.4769 0.0000000
## Maple Valley-Bothell         -123285.6238  -218748.719   -27822.5289 0.0006671
## Medina-Bothell               1670922.8923  1487761.825  1854083.9598 0.0000000
## Mercer Island-Bothell         704496.5306   596882.286   812110.7752 0.0000000
## North Bend-Bothell            -50144.7941  -163789.140    63499.5520 0.9963581
## Redmond-Bothell               168054.6682    77428.911   258680.4257 0.0000000
## Renton-Bothell                -86908.6468  -174558.596      741.3028 0.0553451
## Sammamish-Bothell             242443.8998   150165.156   334722.6434 0.0000000
## Seattle-Bothell                44708.4922   -38929.499   128346.4836 0.9605835
## Snoqualmie-Bothell             39252.6748   -66488.410   144993.7593 0.9997348
## Vashon-Bothell                  -995.0906  -136115.046   134124.8645 1.0000000
## Woodinville-Bothell           127120.7628    28728.310   225513.2156 0.0006610
## Duvall-Carnation              -30802.0497  -164193.997   102589.8976 1.0000000
## Enumclaw-Carnation           -138874.7223  -267313.965   -10435.4795 0.0175068
## Fall City-Carnation           130503.7621   -35192.106   296199.6304 0.3875654
## Federal Way-Carnation        -166225.8151  -277942.164   -54509.4657 0.0000167
## Issaquah-Carnation            159505.0604    47308.379   271701.7421 0.0000619
## Kenmore-Carnation               6871.7564  -117564.110   131307.6228 1.0000000
## Kent-Carnation               -156147.0654  -265134.900   -47159.2307 0.0000500
## Kirkland-Carnation            190925.7141    80774.835   301076.5930 0.0000000
## Maple Valley-Carnation        -88525.6290  -202689.505    25638.2474 0.4209220
## Medina-Carnation             1705682.8871  1512115.935  1899249.8389 0.0000000
## Mercer Island-Carnation       739256.5254   614753.458   863759.5929 0.0000000
## North Bend-Carnation          -15384.7993  -145135.428   114365.8292 1.0000000
## Redmond-Carnation             202814.6629    92663.784   312965.5418 0.0000000
## Renton-Carnation              -52148.6520  -159864.490    55567.1864 0.9875638
## Sammamish-Carnation           277203.8946   165689.076   388718.7129 0.0000000
## Seattle-Carnation              79468.4870   -25008.792   183945.7658 0.4627436
## Snoqualmie-Carnation           74012.6696   -48874.942   196900.2814 0.8752603
## Vashon-Carnation               33764.9042  -115156.520   182686.3279 1.0000000
## Woodinville-Carnation         161880.7576    45256.305   278505.2102 0.0001172
## Enumclaw-Duvall              -108072.6726  -221017.892     4872.5471 0.0826059
## Fall City-Duvall              161305.8118     7308.720   315302.9038 0.0275181
## Federal Way-Duvall           -135423.7653  -228914.547   -41932.9837 0.0000368
## Issaquah-Duvall               190307.1101    96242.882   284371.3379 0.0000000
## Kenmore-Duvall                 37673.8061   -70697.170   146044.7819 0.9999111
## Kent-Duvall                  -125345.0157  -215557.718   -35132.3133 0.0001142
## Kirkland-Duvall               221727.7639   130113.356   313342.1714 0.0000000
## Maple Valley-Duvall           -57723.5793  -154125.728    38678.5691 0.8813423
## Medina-Duvall                1736484.9368  1552832.688  1920137.1854 0.0000000
## Mercer Island-Duvall          770058.5751   661610.443   878506.7073 0.0000000
## North Bend-Duvall              15417.2505   -99017.050   129851.5510 1.0000000
## Redmond-Duvall                233616.7127   142002.305   325231.1202 0.0000000
## Renton-Duvall                 -21346.6023  -110018.387    67325.1824 0.9999999
## Sammamish-Duvall              308005.9443   214756.074   401255.8143 0.0000000
## Seattle-Duvall                110270.5368    25562.300   194978.7733 0.0005577
## Snoqualmie-Duvall             104814.7193    -1774.908   211404.3468 0.0609801
## Vashon-Duvall                  64566.9539   -71218.075   200351.9825 0.9900914
## Woodinville-Duvall            192682.8073    93378.998   291986.6164 0.0000000
## Fall City-Enumclaw            269378.4844   119650.959   419106.0097 0.0000000
## Federal Way-Enumclaw          -27351.0927  -113628.284    58926.0986 0.9999823
## Issaquah-Enumclaw             298379.7827   211481.529   385278.0360 0.0000000
## Kenmore-Enumclaw              145746.4787    43533.486   247959.4714 0.0000571
## Kent-Enumclaw                 -17272.3431   -99986.061    65441.3750 1.0000000
## Kirkland-Enumclaw             329800.4365   245560.142   414040.7306 0.0000000
## Maple Valley-Enumclaw          50349.0933   -39074.627   139772.8133 0.9327362
## Medina-Enumclaw              1844557.6094  1664470.475  2024644.7436 0.0000000
## Mercer Island-Enumclaw        878131.2477   775836.454   980426.0416 0.0000000
## North Bend-Enumclaw           123489.9231    14869.314   232110.5317 0.0079356
## Redmond-Enumclaw              341689.3853   257449.091   425929.6794 0.0000000
## Renton-Enumclaw                86726.0703     5695.750   167756.3911 0.0203207
## Sammamish-Enumclaw            416078.6169   330062.539   502094.6953 0.0000000
## Seattle-Enumclaw              218343.2094   141670.445   295015.9739 0.0000000
## Snoqualmie-Enumclaw           212887.3919   112565.031   313209.7532 0.0000000
## Vashon-Enumclaw               172639.6265    41716.743   303562.5102 0.0004148
## Woodinville-Enumclaw          300755.4799   208211.037   393299.9232 0.0000000
## Federal Way-Fall City        -296729.5772  -432384.417  -161074.7379 0.0000000
## Issaquah-Fall City             29001.2983  -107049.384   165051.9804 1.0000000
## Kenmore-Fall City            -123632.0057  -269939.824    22675.8121 0.2488284
## Kent-Fall City               -286650.8275  -420067.621  -153234.0338 0.0000000
## Kirkland-Fall City             60421.9520   -73946.604   194790.5080 0.9952697
## Maple Valley-Fall City       -219029.3911  -356706.853   -81351.9297 0.0000018
## Medina-Fall City             1575179.1250  1366876.802  1783481.4475 0.0000000
## Mercer Island-Fall City       608752.7633   462387.786   755117.7405 0.0000000
## North Bend-Fall City         -145888.5614  -296742.525     4965.4021 0.0736057
## Redmond-Fall City              72310.9008   -62057.655   206679.4568 0.9576414
## Renton-Fall City             -182652.4141  -315032.150   -50272.6779 0.0001368
## Sammamish-Fall City           146700.1325    11211.213   282189.0525 0.0171507
## Seattle-Fall City             -51035.2751  -180793.491    78722.9405 0.9993344
## Snoqualmie-Fall City          -56491.0925  -201484.395    88502.2103 0.9994257
## Vashon-Fall City              -96738.8579  -264367.266    70889.5501 0.9148126
## Woodinville-Fall City          31376.9955  -108347.575   171101.5662 1.0000000
## Issaquah-Federal Way          325730.8754   266273.243   385188.5078 0.0000000
## Kenmore-Federal Way           173097.5714    92901.565   253293.5781 0.0000000
## Kent-Federal Way               10078.7496   -43076.333    63233.8319 1.0000000
## Kirkland-Federal Way          357151.5292   301650.802   412652.2568 0.0000000
## Maple Valley-Federal Way       77700.1861    14608.940   140791.4320 0.0018241
## Medina-Federal Way           1871908.7022  1703340.379  2040477.0251 0.0000000
## Mercer Island-Federal Way     905482.3405   825182.101   985782.5800 0.0000000
## North Bend-Federal Way        150841.0158    62623.443   239058.5890 0.0000000
## Redmond-Federal Way           369040.4780   313539.750   424541.2056 0.0000000
## Renton-Federal Way            114077.1630    63581.455   164572.8707 0.0000000
## Sammamish-Federal Way         443429.7097   385268.990   501590.4297 0.0000000
## Seattle-Federal Way           245694.3021   202536.262   288852.3423 0.0000000
## Snoqualmie-Federal Way        240238.4846   162466.512   318010.4576 0.0000000
## Vashon-Federal Way            199990.7193    85427.614   314553.8246 0.0000000
## Woodinville-Federal Way       328106.5727   260664.912   395548.2334 0.0000000
## Kenmore-Issaquah             -152633.3040  -233497.092   -71769.5162 0.0000000
## Kent-Issaquah                -315652.1258  -369809.447  -261494.8045 0.0000000
## Kirkland-Issaquah              31420.6538   -25040.691    87881.9985 0.9400971
## Maple Valley-Issaquah        -248030.6894  -311968.612  -184092.7663 0.0000000
## Medina-Issaquah              1546177.8267  1377290.787  1715064.8664 0.0000000
## Mercer Island-Issaquah        579751.4650   498784.304   660718.6260 0.0000000
## North Bend-Issaquah          -174889.8596  -263714.929   -86064.7905 0.0000000
## Redmond-Issaquah               43309.6026   -13151.742    99770.9474 0.4443948
## Renton-Issaquah              -211653.7124  -263203.389  -160104.0356 0.0000000
## Sammamish-Issaquah            117698.8342    58620.733   176776.9354 0.0000000
## Seattle-Issaquah              -80036.5733  -124423.161   -35649.9862 0.0000000
## Snoqualmie-Issaquah           -85492.3908  -163952.779    -7032.0029 0.0155976
## Vashon-Issaquah              -125740.1562  -240771.707   -10708.6049 0.0148637
## Woodinville-Issaquah            2375.6972   -65858.680    70610.0748 1.0000000
## Kent-Kenmore                 -163018.8218  -239368.052   -86669.5919 0.0000000
## Kirkland-Kenmore              184053.9578   106053.490   262054.4251 0.0000000
## Maple Valley-Kenmore          -95397.3854  -178969.196   -11825.5752 0.0074136
## Medina-Kenmore               1698811.1307  1521557.015  1876065.2467 0.0000000
## Mercer Island-Kenmore         732384.7690   635164.038   829605.5004 0.0000000
## North Bend-Kenmore            -22256.5556  -126112.621    81599.5100 1.0000000
## Redmond-Kenmore               195942.9066   117942.439   273943.3739 0.0000000
## Renton-Kenmore                -59020.4084  -133542.611    15501.7939 0.3758719
## Sammamish-Kenmore             270332.1382   190417.112   350247.1649 0.0000000
## Seattle-Kenmore                72596.7307     2837.440   142356.0209 0.0300539
## Snoqualmie-Kenmore             67140.9132   -28002.252   162284.0786 0.6234261
## Vashon-Kenmore                 26893.1478  -100104.676   153890.9715 1.0000000
## Woodinville-Kenmore           155009.0012    68106.068   241911.9340 0.0000000
## Kirkland-Kent                 347072.7796   297291.898   396853.6612 0.0000000
## Maple Valley-Kent              67621.4364     9498.199   125744.6738 0.0053375
## Medina-Kent                  1861829.9525  1695057.398  2028602.5069 0.0000000
## Mercer Island-Kent            895403.5908   818944.884   971862.2978 0.0000000
## North Bend-Kent               140762.2662    56026.526   225498.0062 0.0000003
## Redmond-Kent                  358961.7284   309180.847   408742.6100 0.0000000
## Renton-Kent                   103998.4134    59866.544   148130.2829 0.0000000
## Sammamish-Kent                433350.9600   380620.753   486081.1674 0.0000000
## Seattle-Kent                  235615.5525   200113.132   271117.9727 0.0000000
## Snoqualmie-Kent               230159.7350   156360.784   303958.6862 0.0000000
## Vashon-Kent                   189911.9696    78007.946   301815.9933 0.0000001
## Woodinville-Kent              318027.8230   255209.171   380846.4748 0.0000000
## Maple Valley-Kirkland        -279451.3432  -339727.198  -219175.4879 0.0000000
## Medina-Kirkland              1514757.1730  1347222.243  1682292.1033 0.0000000
## Mercer Island-Kirkland        548330.8113   470223.181   626438.4414 0.0000000
## North Bend-Kirkland          -206310.5134  -292537.039  -120083.9883 0.0000000
## Redmond-Kirkland               11888.9488   -40389.194    64167.0913 1.0000000
## Renton-Kirkland              -243074.3662  -290005.059  -196143.6736 0.0000000
## Sammamish-Kirkland             86278.1805    31184.236   141372.1254 0.0000032
## Seattle-Kirkland             -111457.2271  -150383.951   -72530.5031 0.0000000
## Snoqualmie-Kirkland          -116913.0446  -192419.026   -41407.0634 0.0000047
## Vashon-Kirkland              -157160.8099  -270197.877   -44123.7430 0.0001123
## Woodinville-Kirkland          -29044.9565   -93860.472    35770.5586 0.9954891
## Medina-Maple Valley          1794208.5161  1624008.259  1964408.7730 0.0000000
## Mercer Island-Maple Valley    827782.1544   744110.317   911453.9921 0.0000000
## North Bend-Maple Valley        73140.8298   -18156.422   164438.0816 0.3519570
## Redmond-Maple Valley          291340.2920   231064.437   351616.1472 0.0000000
## Renton-Maple Valley            36376.9770   -19324.589    92078.5430 0.7667756
## Sammamish-Maple Valley        365729.5236   302995.822   428463.2252 0.0000000
## Seattle-Maple Valley          167994.1160   118846.742   217141.4903 0.0000000
## Snoqualmie-Maple Valley       162538.2986    81289.743   243786.8539 0.0000000
## Vashon-Maple Valley           122290.5332     5339.462   239241.6044 0.0281733
## Woodinville-Maple Valley      250406.3866   178983.522   321829.2510 0.0000000
## Mercer Island-Medina         -966426.3617 -1143727.661  -789125.0628 0.0000000
## North Bend-Medina           -1721067.6864 -1902092.443 -1540042.9302 0.0000000
## Redmond-Medina              -1502868.2242 -1670403.155 -1335333.2938 0.0000000
## Renton-Medina               -1757831.5391 -1923775.622 -1591887.4563 0.0000000
## Sammamish-Medina            -1428478.9925 -1596913.821 -1260044.1639 0.0000000
## Seattle-Medina              -1626214.4001 -1790074.824 -1462353.9761 0.0000000
## Snoqualmie-Medina           -1631670.2175 -1807840.879 -1455499.5562 0.0000000
## Vashon-Medina               -1671917.9829 -1867141.770 -1476694.1954 0.0000000
## Woodinville-Medina          -1543802.1295 -1715662.537 -1371941.7218 0.0000000
## North Bend-Mercer Island     -754641.3247  -858577.898  -650704.7510 0.0000000
## Redmond-Mercer Island        -536441.8625  -614549.493  -458334.2324 0.0000000
## Renton-Mercer Island         -791405.1774  -866039.537  -716770.8180 0.0000000
## Sammamish-Mercer Island      -462052.6308  -542072.256  -382033.0053 0.0000000
## Seattle-Mercer Island        -659788.0384  -729667.131  -589908.9461 0.0000000
## Snoqualmie-Mercer Island     -665243.8558  -760474.895  -570012.8163 0.0000000
## Vashon-Mercer Island         -705491.6212  -832555.291  -578427.9513 0.0000000
## Woodinville-Mercer Island    -577375.7678  -664374.898  -490376.6373 0.0000000
## Redmond-North Bend            218199.4622   131972.937   304425.9873 0.0000000
## Renton-North Bend             -36763.8528  -119857.170    46329.4644 0.9962179
## Sammamish-North Bend          292588.6939   204626.473   380550.9143 0.0000000
## Seattle-North Bend             94853.2863    16003.433   173703.1392 0.0028893
## Snoqualmie-North Bend          89397.4688   -12598.426   191393.3641 0.1881561
## Vashon-North Bend              49149.7035   -83059.933   181359.3399 0.9997285
## Woodinville-North Bend        177265.5569    82909.526   271621.5872 0.0000000
## Renton-Redmond               -254963.3150  -301894.008  -208032.6224 0.0000000
## Sammamish-Redmond              74389.2317    19295.287   129483.1765 0.0002355
## Seattle-Redmond              -123346.1759  -162272.900   -84419.4519 0.0000000
## Snoqualmie-Redmond           -128801.9934  -204307.975   -53296.0122 0.0000000
## Vashon-Redmond               -169049.7588  -282086.826   -56012.6918 0.0000143
## Woodinville-Redmond           -40933.9054  -105749.421    23881.6098 0.8171775
## Sammamish-Renton              329352.5466   279304.285   379400.8082 0.0000000
## Seattle-Renton                131617.1390   100236.005   162998.2731 0.0000000
## Snoqualmie-Renton             126161.3216    54254.167   198068.4765 0.0000000
## Vashon-Renton                  85913.5562   -24751.995   196579.1075 0.4183959
## Woodinville-Renton            214029.4096   153444.455   274614.3643 0.0000000
## Seattle-Sammamish            -197735.4076  -240369.061  -155101.7538 0.0000000
## Snoqualmie-Sammamish         -203191.2250  -280673.428  -125709.0220 0.0000000
## Vashon-Sammamish             -243438.9904  -357805.581  -129072.3996 0.0000000
## Woodinville-Sammamish        -115323.1370  -182430.436   -48215.8379 0.0000000
## Snoqualmie-Seattle             -5455.8174   -72414.304    61502.6687 1.0000000
## Vashon-Seattle                -45703.5828  -153219.462    61812.2959 0.9978584
## Woodinville-Seattle            82412.2706    27792.341   137032.1999 0.0000109
## Vashon-Snoqualmie             -40247.7654  -165728.949    85233.4185 0.9999781
## Woodinville-Snoqualmie         87868.0880     3196.959   172539.2166 0.0312061
## Woodinville-Vashon            128115.8534     8761.652   247470.0545 0.0194816
4. Boxplot of the distribution of price by city:

Note that the y-axis represents city and the x-axis represents price.
library(ggpubr)     # provides ggboxplot()
options(scipen=999) # avoid scientific notation on the price axis
p <- ggboxplot(house, x = "city", y = "price", ylim=c(78000,7700000))
p + coord_flip()    # flip the axes so the city labels are readable


Multiple Linear Regression (MLR)

Description

Identified Questions

  • Can we predict the house price with the given attributes of each house sold?

 

Objective

  • To predict the future house price with at least 70% accuracy.

 

Multiple linear regression will be conducted to build a prediction model for the house price based on the selected variables {bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+sqft_lot15+basement+age+renovation}. We will apply a machine learning workflow, fitting the multiple linear regression on a training set and validating it on a held-out test set; a sketch of the underlying train/test split follows the steps below.

Steps:

  1. Construct a linear model
  2. Influential points detection
    • Remove outliers but keep the influential points for further analysis
  3. Regression diagnostics - check the validity of the model assumptions
  4. Prediction of house price - check the accuracy of the model
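For context, below is a minimal sketch of how the train_data/test_data split used in this section could be produced with caret. The actual split was created earlier in the analysis; the roughly 70/30 proportion is inferred from the reported training size of 15,116 out of 21,613 rows, and the seed is an assumption.

library(caret)
set.seed(123)  # assumed seed, for reproducibility
# stratified split on price: ~70% train, ~30% test (proportion inferred)
idx <- createDataPartition(house$price, p = 0.7, list = FALSE)
train_data <- house[idx, ]
test_data  <- house[-idx, ]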
Model Fitting
library(tidyverse)
library(caret)
library(leaps)
library(MASS)
# Fit the full model
full.model <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+sqft_lot15+basement+age+renovation, 
             data=train_data)

summary(full.model)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     condition + sqft_living15 + sqft_lot15 + basement + age + 
##     renovation, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1368535  -138479   -19711    95065  5359381 
## 
## Coefficients:
##                    Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)   -618913.05670   16392.28963 -37.756 < 0.0000000000000002 ***
## bedrooms       -16207.19033    2800.96325  -5.786        0.00000000734 ***
## bathrooms      162750.13651    4511.18320  36.077 < 0.0000000000000002 ***
## floors          87845.32218    5069.17522  17.329 < 0.0000000000000002 ***
## waterfront     714137.93994   24169.48956  29.547 < 0.0000000000000002 ***
## condition       20575.69979    3507.33209   5.866        0.00000000454 ***
## sqft_living15     241.66378       3.81982  63.266 < 0.0000000000000002 ***
## sqft_lot15         -0.15926       0.08052  -1.978                0.048 *  
## basement        64369.36922    4784.86699  13.453 < 0.0000000000000002 ***
## age              3585.34355      98.54000  36.385 < 0.0000000000000002 ***
## renovation      64859.42790   10974.10387   5.910        0.00000000349 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 255000 on 15105 degrees of freedom
## Multiple R-squared:  0.5136, Adjusted R-squared:  0.5132 
## F-statistic:  1595 on 10 and 15105 DF,  p-value: < 0.00000000000000022
# stepwise regression
fit <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+sqft_lot15+basement+age+renovation, 
             data=train_data)
step <- stepAIC(fit, direction="both")
## Start:  AIC=376374.2
## price ~ bedrooms + bathrooms + floors + waterfront + condition + 
##     sqft_living15 + sqft_lot15 + basement + age + renovation
## 
##                 Df       Sum of Sq              RSS    AIC
## <none>                              982495104508443 376374
## - sqft_lot15     1    254430333018  982749534841461 376376
## - bedrooms       1   2177761472192  984672865980634 376406
## - condition      1   2238541329829  984733645838272 376407
## - renovation     1   2272049345704  984767153854146 376407
## - basement       1  11771408551175  994266513059617 376552
## - floors         1  19533152418298 1002028256926740 376670
## - waterfront     1  56785691806884 1039280796315327 377222
## - bathrooms      1  84658673487397 1067153777995840 377622
## - age            1  86108511845013 1068603616353456 377642
## - sqft_living15  1 260343083778339 1242838188286782 379925
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## price ~ bedrooms + bathrooms + floors + waterfront + condition + 
##     sqft_living15 + sqft_lot15 + basement + age + renovation
## 
## Final Model:
## price ~ bedrooms + bathrooms + floors + waterfront + condition + 
##     sqft_living15 + sqft_lot15 + basement + age + renovation
## 
## 
##   Step Df Deviance Resid. Df      Resid. Dev      AIC
## 1                      15105 982495104508443 376374.2
      The full multiple linear regression model is significant, with an R-squared value of 0.5136. The relationships between these variables and price are considered strong, as all the p-values are significant. Stepwise regression also shows that all variables should be retained in the model. Nevertheless, we may try removing sqft_lot15 from the model, as its p-value of 0.048 is relatively high compared to the other variables.
lm1 <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation, 
             data=train_data)

summary(lm1)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     condition + sqft_living15 + basement + age + renovation, 
##     data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1371691  -138588   -19819    95116  5361376 
## 
## Coefficients:
##                  Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)   -620294.448   16378.981 -37.871 < 0.0000000000000002 ***
## bedrooms       -15984.071    2798.960  -5.711        0.00000001146 ***
## bathrooms      162635.627    4511.246  36.051 < 0.0000000000000002 ***
## floors          88748.241    5049.062  17.577 < 0.0000000000000002 ***
## waterfront     713133.505   24166.482  29.509 < 0.0000000000000002 ***
## condition       20466.362    3507.234   5.835        0.00000000547 ***
## sqft_living15     240.318       3.759  63.930 < 0.0000000000000002 ***
## basement        65000.082    4774.688  13.613 < 0.0000000000000002 ***
## age              3592.837      98.477  36.484 < 0.0000000000000002 ***
## renovation      64493.689   10973.603   5.877        0.00000000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 255100 on 15106 degrees of freedom
## Multiple R-squared:  0.5134, Adjusted R-squared:  0.5132 
## F-statistic:  1771 on 9 and 15106 DF,  p-value: < 0.00000000000000022
      Since the adjusted R-squared stays the same, we may continue with the model without sqft_lot15. Now, we train the model using the caret package.
# train the model
train.model <- train(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation, 
             data=train_data, method="lm")

summary(train.model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1371691  -138588   -19819    95116  5361376 
## 
## Coefficients:
##                  Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)   -620294.448   16378.981 -37.871 < 0.0000000000000002 ***
## bedrooms       -15984.071    2798.960  -5.711        0.00000001146 ***
## bathrooms      162635.627    4511.246  36.051 < 0.0000000000000002 ***
## floors          88748.241    5049.062  17.577 < 0.0000000000000002 ***
## waterfront     713133.505   24166.482  29.509 < 0.0000000000000002 ***
## condition       20466.362    3507.234   5.835        0.00000000547 ***
## sqft_living15     240.318       3.759  63.930 < 0.0000000000000002 ***
## basement        65000.082    4774.688  13.613 < 0.0000000000000002 ***
## age              3592.837      98.477  36.484 < 0.0000000000000002 ***
## renovation      64493.689   10973.603   5.877        0.00000000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 255100 on 15106 degrees of freedom
## Multiple R-squared:  0.5134, Adjusted R-squared:  0.5132 
## F-statistic:  1771 on 9 and 15106 DF,  p-value: < 0.00000000000000022
# store the apparent R-squared value of the final model
train.model_rsq <- summary(train.model$finalModel)$r.squared

# show the bootstrap results
train.model
## Linear Regression 
## 
## 15116 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 15116, 15116, 15116, 15116, 15116, 15116, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   255794.3  0.516118  162309.8
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# store the results of bootstrap in a dataframe
train.model_results <- as.data.frame(train.model$results)
      The apparent R-squared value on the training data is 0.5134. By default, the caret package performs bootstrap resampling with 25 repetitions, drawing the samples with replacement. The resampled R-squared after bootstrapping is 0.5161, slightly higher than the apparent value, which suggests the fit is stable under resampling.
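The resampling scheme is configurable through caret's trainControl(). As a sketch (the 10-fold choice is an assumption, not part of the original analysis), the default bootstrap can be swapped for cross-validation:

# 10-fold cross-validation instead of the default 25-rep bootstrap
ctrl <- trainControl(method = "cv", number = 10)
cv.model <- train(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation,
             data=train_data, method="lm", trControl=ctrl)
cv.model$results  # cross-validated RMSE, Rsquared and MAE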

 


Influential Points Detection
      Cook’s distance can be used to detect the influential points in the model.
cooksd <- cooks.distance(lm1)
cat("The mean of Cook's distance is ", mean(cooksd))
## The mean of Cook's distance is  0.0002049408
# visualize the Cook's distance in a plot
par(mfrow=c(1, 1))
plot(cooksd, main="Influential Obs by Cooks distance")
# threshold line at 4 times the mean Cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="green")
# label the observations that exceed the threshold
text(x=1:length(cooksd)+1, y=cooksd,
     labels=ifelse(cooksd>4*mean(cooksd,na.rm=T), names(cooksd), ""), col="red")


      Identify the position of influential points in the data.
# any point that have Cook's distance greater than 4 times of mean will be considered as influential
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  

# influential row numbers
head(train_data[influential, ])
##        price bedrooms bathrooms floors waterfront condition sqft_living15
## 533   200000        3      2.25    1.5          0         5          1260
## 802   225900        3      1.00    1.0          0         4          1290
## 1009  625000        3      2.00    1.0          0         4          2470
## 1189  790000        5      3.25    1.0          0         4          2590
## 1213 1000000        5      2.50    2.0          0         4          3600
## 1222  596000        3      2.50    2.0          0         3          1730
##      sqft_lot15 basement age renovation   zip        city
## 533        7556        0 101          0 98002      Auburn
## 802        8470        1  52          0 98003 Federal Way
## 1009      13594        0  28          0 98004    Bellevue
## 1189      12160        1  47          0 98005    Bellevue
## 1213      48787        0  46          0 98005    Bellevue
## 1222       2751        0  14          0 98005    Bellevue
influential_data <- train_data[influential, ]
# inner join outliers_data and influential_data
influential_outliers <- inner_join(outliers_data,influential_data)

# remove outliers_data from train_data but keep the influential_data
train_data2 <- rbind(train_data1,influential_outliers)

# relabel the index of train_data2
row.names(train_data2) <- NULL
      There are 13 observations in the train data that are both outliers and influential points. The outliers_data is removed, but the influential_outliers should be kept. We refit the model using train_data2.
lm2 <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation, 
             data=train_data2)

summary(lm2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront + 
##     condition + sqft_living15 + basement + age + renovation, 
##     data = train_data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627995 -105915  -11026   90600 1287030 
## 
## Coefficients:
##                  Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)   -318110.996   10613.765 -29.972 < 0.0000000000000002 ***
## bedrooms        -4680.359    1754.009  -2.668              0.00763 ** 
## bathrooms       72914.349    2961.479  24.621 < 0.0000000000000002 ***
## floors          93313.267    3194.663  29.209 < 0.0000000000000002 ***
## waterfront     145556.462   22601.758   6.440       0.000000000123 ***
## condition       19991.399    2182.831   9.158 < 0.0000000000000002 ***
## sqft_living15     171.916       2.522  68.173 < 0.0000000000000002 ***
## basement        61381.203    2986.837  20.551 < 0.0000000000000002 ***
## age              2311.441      63.029  36.673 < 0.0000000000000002 ***
## renovation      29232.066    7206.310   4.056       0.000050087351 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154100 on 14323 degrees of freedom
## Multiple R-squared:  0.4693, Adjusted R-squared:  0.469 
## F-statistic:  1407 on 9 and 14323 DF,  p-value: < 0.00000000000000022
      The Residual Standard Error decreased compared to the model fitted with outliers_data included. The model on the revised data is significant, with all the variables strongly related to price.

 


Regression Diagnostics
      The assumptions of the model need to be checked.
# residual diagnostics with the four standard plots
par(mfrow = c(2, 2))
plot(lm2)

      The Residuals vs Fitted plot shows values scattered around zero, indicating that the relationship between price and the predictors is linear. The Normal Q-Q plot follows a straight line, indicating that the residuals are normally distributed. The Scale-Location plot shows an approximately uniform dispersion across the fitted values, so the homoscedasticity (equal variance) of the residuals holds. The Residuals vs Leverage plot shows no leverage points outside the boundaries.
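The visual homoscedasticity check can be complemented with a formal test. For example, ncvTest() from the car package scores the hypothesis of constant error variance (a supplementary sketch, not part of the original analysis):

library(car)
# score test for non-constant error variance;
# a small p-value would indicate heteroscedasticity
ncvTest(lm2)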

Multicollinearity Test
      Multicollinearity can be checked in multiple ways; one method is to inspect the Variance Inflation Factor (VIF). A VIF value greater than 4 warrants investigation, while a VIF above 10 indicates serious multicollinearity between the predictors.
library(car)
vif(lm2)
##      bedrooms     bathrooms        floors    waterfront     condition 
##      1.445350      2.721895      1.762581      1.007715      1.203717 
## sqft_living15      basement           age    renovation 
##      1.480062      1.272886      2.029259      1.136379
      All VIF values are well below 4, so there is no multicollinearity in the model.
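The rule of thumb above can also be applied programmatically (a quick sketch, not in the original analysis):

v <- vif(lm2)
v[v > 4]     # predictors that would need investigation (none here)
any(v > 10)  # serious multicollinearity? FALSE for this model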

Accuracy of the Model on Prediction

Prediction on Train Data

# fitted values of train data
pred=lm2$fitted.values

# store the actual price and fitted price in a data frame
actual_fitted=data.frame(actual=train_data2$price, predicted=pred)

# find the mean of absolute difference % between fitted values and actual values
abs_diff = mean(abs(actual_fitted$actual-actual_fitted$predicted)/actual_fitted$actual)

# accuracy of the model
accuracy=1-abs_diff
accuracy
## [1] 0.7045188
      The accuracy of the model on the train data is 70.45%, where accuracy is defined as 1 minus the mean absolute percentage error (MAPE).

Prediction on Test Data

# use the model to predict price for test data
pred_test=predict(newdata=test_data, lm2)

# store the actual price and predicted price in a data frame
actual_fitted_test=data.frame(actual=test_data$price, predicted=pred_test)

# find the mean of absolute difference % between predicted values and actual values
abs_diff_test = mean(abs(actual_fitted_test$actual-actual_fitted_test$predicted)/actual_fitted_test$actual)

# accuracy of the prediction
accuracy=1-abs_diff_test
accuracy
## [1] 0.6929814
      The accuracy of the prediction on the test data is 69.3%, just short of the 70% target set in the objectives.
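MAPE-based accuracy is scale-free but is sensitive to low-priced houses. As a complementary check (a sketch using caret's built-in helpers, not part of the original analysis), the RMSE and MAE on the test set can be computed in dollars:

# error metrics on the test set, in dollars
RMSE(pred_test, test_data$price)  # root mean squared error
MAE(pred_test, test_data$price)   # mean absolute error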

Conclusion


      The analysis was conducted using the house price dataset from Kaggle, which contains data on houses located in the state of Washington. The results of the one-way ANOVA test show that there is a significant difference in the average house price among houses with different condition ratings. Besides, a renovated house tends to have a different mean price compared to a non-renovated house. The dataset contains house price records from 24 cities in the state of Washington, and the one-way ANOVA test shows a significant difference in the average house price across cities. Hence, these 3 variables are significant factors in deciding the selling price of a house.

      The multiple linear regression model shows that bedrooms, bathrooms, floors, waterfront, condition, sqft_living15, basement, age and renovation are significant factors affecting the house price. The prediction model is able to predict the house price in Washington state from these 9 variables with approximately 70% accuracy.

      Further analysis can be done:
            1. perform non-linear regression
            2. perform robust regression (a starting sketch follows)
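As a starting point for item 2, below is a minimal sketch of a robust fit using rlm() from the MASS package. Huber M-estimation is an assumed choice here; it downweights large residuals instead of removing them, which suits a price distribution with heavy tails.

library(MASS)
# robust linear model with Huber weights (assumed psi function)
rlm_fit <- rlm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation,
             data=train_data2, psi=psi.huber)
summary(rlm_fit)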