Introduction
House price floating every now and then. It became difficult to estimate the house price, especially during the pandemic.Regression models are powerful models to describe relationships between variables. We can observe the changes of dependent variables when independent variables change. Multiple Linear Regression (MLR) estimates the relationship between two or more independent variables and one dependent variable. Analysis of Variance (ANOVA) is a subset of MLR. ANOVA estimates how a quantitative dependent variable varies base on the levels of categorical independent variables.
This project focuses on exploring the house sales dataset, using regression models to examine the relationship between variables of the dataset and estimate the future house price.
Problem Statement
House price varies based on the condition of itself and the environment. From the number of bedrooms to the location of the house, any variable might be the key that affects the house price the most. This project will use ANOVA and MLR to determine the relation of house situations with sold price and predict the house price.
Idendified Questions
Objectives
The table below contains basic information of the dataset:
| Title | KC_Housesales_Data |
| Year | 2018 |
| Source | https://www.kaggle.com/swathiachath/kc-housesales-data?select=kc_house_data.csv |
| Purpose | Predict House sale prices using Multi-Linear Regression |
| Dimension | 21 columns x 21613 rows |
| Description | The dataset was obtained from Kaggle. The dataset contains house sold prices and house details in King County, an area in the Washington state of the US, from May 2014 to May 2015. |
The table below describes the 21 variables of the raw dataset:
| No. | Variable | Description | Datatype |
|---|---|---|---|
| 1 | id | - | Numeric |
| 2 | date | Date of house was sold | String |
| 3 | price | Price sold | Numeric |
| 4 | bedrooms | Number of bedroom | Numeric |
| 5 | bathrooms | Number of bathroom per bedroom | Numeric |
| 6 | sqft_living | House square footage | Numeric |
| 7 | sqft_lot | Lot square footage | Numeric |
| 8 | floors | Floor level | Numeric |
| 9 | waterfront | House has a view to a waterfront | Numeric |
| 10 | view | House has been viewed | Numeric |
| 11 | condition | How good the house condition is: 1=poor; 5=excellent | Numeric |
| 12 | grade | Grade given by King County: 1=poor; 5=excellent | Numeric |
| 13 | sqft_above | Square footage of house apart from basement | Numeric |
| 14 | sqrt_basement | Basement square footage | Numeric |
| 15 | yr_built | House built year | Numeric |
| 16 | yr_renovated | House renovated year | Numeric |
| 17 | zipcode | House zip code | Numeric |
| 18 | lat | House latitude coordinate | Numeric |
| 19 | long | House longitude coordinate | Numeric |
| 20 | sqft_living15 | Living room area in 2015 (implies renovated) | Numeric |
| 21 | sqft_loft15 | Lot area in 2015 (implies renovated) | Numeric |
rating <- c(5,2,3,2,4,3,10)
boxplot(rating, col="yellow")
title("Boxplot of ratings")
#There is an outlier in the rating data series.
Scale-Location Plot
# import 2 datasets to be stored in 2 different dataframes
library(readr)
df_house <- read_csv("kc_house_data.csv")
library(readxl)
df_zipcode <- read_excel("uszips.xlsx")
# KC_Housesales_Data
head(df_house)
## # A tibble: 6 x 21
## id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.13e9 10/1~ 2.22e5 3 1 1180 5650 1 0
## 2 6.41e9 12/9~ 5.38e5 3 2.25 2570 7242 2 0
## 3 5.63e9 2/25~ 1.8 e5 2 1 770 10000 1 0
## 4 2.49e9 12/9~ 6.04e5 4 3 1960 5000 1 0
## 5 1.95e9 2/18~ 5.1 e5 3 2 1680 8080 1 0
## 6 7.24e9 5/12~ 1.23e6 4 4.5 5420 101930 1 0
## # ... with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## # sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## # zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
colnames(df_house)
## [1] "id" "date" "price" "bedrooms"
## [5] "bathrooms" "sqft_living" "sqft_lot" "floors"
## [9] "waterfront" "view" "condition" "grade"
## [13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
## [17] "zipcode" "lat" "long" "sqft_living15"
## [21] "sqft_lot15"
# US Zip Codes Database
head(df_zipcode)
## # A tibble: 6 x 18
## zip lat lng city state_id state_name zcta parent_zcta population
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <lgl> <dbl>
## 1 601 18.2 -66.8 Adjuntas PR Puerto Rico TRUE NA 17113
## 2 602 18.4 -67.2 Aguada PR Puerto Rico TRUE NA 37751
## 3 603 18.5 -67.1 Aguadilla PR Puerto Rico TRUE NA 47081
## 4 606 18.2 -66.9 Maricao PR Puerto Rico TRUE NA 6392
## 5 610 18.3 -67.1 Anasco PR Puerto Rico TRUE NA 26686
## 6 612 18.4 -66.7 Arecibo PR Puerto Rico TRUE NA 59369
## # ... with 9 more variables: density <dbl>, county_fips <dbl>,
## # county_name <chr>, county_weights <chr>, county_names_all <chr>,
## # county_fips_all <chr>, imprecise <chr>, military <chr>, timezone <chr>
colnames(df_zipcode)
## [1] "zip" "lat" "lng" "city"
## [5] "state_id" "state_name" "zcta" "parent_zcta"
## [9] "population" "density" "county_fips" "county_name"
## [13] "county_weights" "county_names_all" "county_fips_all" "imprecise"
## [17] "military" "timezone"
## Removing the unnecessary information
df_house$id <- NULL
df_house$sqft_living <- NULL
df_house$sqft_lot <- NULL
df_house$view <- NULL
df_house$grade <- NULL
df_house$view <- NULL
df_house$sqft_above <- NULL
df_house$zip <- df_house$zipcode
df_house$zipcode <- NULL
df_house$lat <- NULL
df_house$long <- NULL
## Changing sqft_basement to basement (1 for yes and 0 for no)
df_house$basement <- df_house$sqft_basement
df_house$sqft_basement <- NULL
df_house$basement = ifelse(df_house$basement>0,"1","0")
## Changing yr_built to age
library("stringr")
df_house$date <- str_sub(df_house$date, - 4, - 1) #Extract the year from date of building sold
df_house$date <- as.numeric(df_house$date) #changing it to numeric for subtraction operation
df_house$age <- df_house$date - df_house$yr_built #age of building when sold is calculated
df_house$date <- NULL
df_house$yr_built <- NULL
## Changing yr_renovated to renovation (1 for yes and 0 for no)
df_house$renovation = ifelse(df_house$"yr_renovated">0,"1","0")
df_house$yr_renovated <- NULL
## Checking for missing values
sum(is.na(df_house))
## [1] 0
## Saving new dataframe in csv
write.csv(df_house,'house_data.csv',row.names=FALSE)
# Detecting Outliers
boxplot(df_house$price, ylab = "price")
boxplot(df_house$bedrooms, ylab = "bedroom")
# reexamine the structure of the dataframe now
str(df_house)
## spec_tbl_df [21,597 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ price : num [1:21597] 221900 538000 180000 604000 510000 ...
## $ bedrooms : num [1:21597] 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num [1:21597] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ floors : num [1:21597] 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : num [1:21597] 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : num [1:21597] 3 3 3 5 3 3 3 3 3 3 ...
## $ sqft_living15: num [1:21597] 1340 1690 2720 1360 1800 ...
## $ sqft_lot15 : num [1:21597] 5650 7639 8062 5000 7503 ...
## $ zip : num [1:21597] 98178 98125 98028 98136 98074 ...
## $ basement : chr [1:21597] "0" "1" "0" "1" ...
## $ age : num [1:21597] 59 63 82 49 28 13 19 52 55 12 ...
## $ renovation : chr [1:21597] "0" "1" "0" "0" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. date = col_character(),
## .. price = col_double(),
## .. bedrooms = col_double(),
## .. bathrooms = col_double(),
## .. sqft_living = col_double(),
## .. sqft_lot = col_double(),
## .. floors = col_double(),
## .. waterfront = col_double(),
## .. view = col_double(),
## .. condition = col_double(),
## .. grade = col_double(),
## .. sqft_above = col_double(),
## .. sqft_basement = col_double(),
## .. yr_built = col_double(),
## .. yr_renovated = col_double(),
## .. zipcode = col_double(),
## .. lat = col_double(),
## .. long = col_double(),
## .. sqft_living15 = col_double(),
## .. sqft_lot15 = col_double()
## .. )
## Removing the unnecessary information
df_zipcode <- df_zipcode[ -c(2:3,5,7:18) ]
## Removing duplicate rows using the zipcode variable
library(dplyr)
df_zipcode %>% distinct(zip, .keep_all= TRUE)
## # A tibble: 33,121 x 3
## zip city state_name
## <dbl> <chr> <chr>
## 1 601 Adjuntas Puerto Rico
## 2 602 Aguada Puerto Rico
## 3 603 Aguadilla Puerto Rico
## 4 606 Maricao Puerto Rico
## 5 610 Anasco Puerto Rico
## 6 612 Arecibo Puerto Rico
## 7 616 Bajadero Puerto Rico
## 8 617 Barceloneta Puerto Rico
## 9 622 Boqueron Puerto Rico
## 10 623 Cabo Rojo Puerto Rico
## # ... with 33,111 more rows
write.csv(df_zipcode,'zipcode.csv',row.names=FALSE)
# Merging 2 datasets
df_house <- read_csv("house_data.csv")
df_zip <- read_csv("zipcode.csv")
df <- merge(df_house,df_zipcode,by="zip")
# Examine the frequency table of city and state_name
table(df$city)
##
## Auburn Bellevue Black Diamond Bothell Carnation
## 911 1407 100 195 124
## Duvall Enumclaw Fall City Federal Way Issaquah
## 190 233 80 779 733
## Kenmore Kent Kirkland Maple Valley Medina
## 283 1201 977 589 50
## Mercer Island North Bend Redmond Renton Sammamish
## 282 220 977 1597 800
## Seattle Snoqualmie Vashon Woodinville
## 8973 308 117 471
table(df$state_name)
##
## Washington
## 21597
# Since all state name is Washington, we can omit the state_name column
df$state_name <- NULL
# print the first few rows of the merged data
head(df, 10)
## zip price bedrooms bathrooms floors waterfront condition sqft_living15
## 1 98001 225000 3 1.00 1.0 0 3 1660
## 2 98001 210000 3 1.00 1.0 0 3 1820
## 3 98001 230000 3 1.00 1.0 0 4 1140
## 4 98001 196000 3 2.25 2.0 0 3 1890
## 5 98001 321500 4 2.50 2.0 0 3 2740
## 6 98001 360000 4 2.00 1.0 0 5 1230
## 7 98001 207000 3 1.00 1.0 0 4 1356
## 8 98001 420000 4 2.50 2.0 0 3 2990
## 9 98001 305000 2 1.00 1.5 0 4 1650
## 10 98001 429000 4 2.50 2.0 0 3 2730
## sqft_lot15 basement age renovation city
## 1 7245 1 51 0 Auburn
## 2 7210 1 51 0 Auburn
## 3 9639 0 48 0 Auburn
## 4 7519 0 40 0 Auburn
## 5 5816 0 11 0 Auburn
## 6 15750 0 49 0 Auburn
## 7 9450 1 71 0 Auburn
## 8 9033 0 13 0 Auburn
## 9 19200 0 79 0 Auburn
## 10 8688 0 13 0 Auburn
str(df)
## 'data.frame': 21597 obs. of 13 variables:
## $ zip : num 98001 98001 98001 98001 98001 ...
## $ price : num 225000 210000 230000 196000 321500 ...
## $ bedrooms : num 3 3 3 3 4 4 3 4 2 4 ...
## $ bathrooms : num 1 1 1 2.25 2.5 2 1 2.5 1 2.5 ...
## $ floors : num 1 1 1 2 2 1 1 2 1.5 2 ...
## $ waterfront : num 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : num 3 3 4 3 3 5 4 3 4 3 ...
## $ sqft_living15: num 1660 1820 1140 1890 2740 ...
## $ sqft_lot15 : num 7245 7210 9639 7519 5816 ...
## $ basement : num 1 1 0 0 0 0 1 0 0 0 ...
## $ age : num 51 51 48 40 11 49 71 13 79 13 ...
## $ renovation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ city : chr "Auburn" "Auburn" "Auburn" "Auburn" ...
write.csv(df,'merged.csv',row.names=FALSE)
## Rearranging the columns so that the price will be the first variable
df <- df[,c(2:12,1,13)]
## Viewing a Part of the Data
head(df,10)
## price bedrooms bathrooms floors waterfront condition sqft_living15
## 1 225000 3 1.00 1.0 0 3 1660
## 2 210000 3 1.00 1.0 0 3 1820
## 3 230000 3 1.00 1.0 0 4 1140
## 4 196000 3 2.25 2.0 0 3 1890
## 5 321500 4 2.50 2.0 0 3 2740
## 6 360000 4 2.00 1.0 0 5 1230
## 7 207000 3 1.00 1.0 0 4 1356
## 8 420000 4 2.50 2.0 0 3 2990
## 9 305000 2 1.00 1.5 0 4 1650
## 10 429000 4 2.50 2.0 0 3 2730
## sqft_lot15 basement age renovation zip city
## 1 7245 1 51 0 98001 Auburn
## 2 7210 1 51 0 98001 Auburn
## 3 9639 0 48 0 98001 Auburn
## 4 7519 0 40 0 98001 Auburn
## 5 5816 0 11 0 98001 Auburn
## 6 15750 0 49 0 98001 Auburn
## 7 9450 1 71 0 98001 Auburn
## 8 9033 0 13 0 98001 Auburn
## 9 19200 0 79 0 98001 Auburn
## 10 8688 0 13 0 98001 Auburn
## Viewing the structure and summary statistics of the data
str(df)
## 'data.frame': 21597 obs. of 13 variables:
## $ price : num 225000 210000 230000 196000 321500 ...
## $ bedrooms : num 3 3 3 3 4 4 3 4 2 4 ...
## $ bathrooms : num 1 1 1 2.25 2.5 2 1 2.5 1 2.5 ...
## $ floors : num 1 1 1 2 2 1 1 2 1.5 2 ...
## $ waterfront : num 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : num 3 3 4 3 3 5 4 3 4 3 ...
## $ sqft_living15: num 1660 1820 1140 1890 2740 ...
## $ sqft_lot15 : num 7245 7210 9639 7519 5816 ...
## $ basement : num 1 1 0 0 0 0 1 0 0 0 ...
## $ age : num 51 51 48 40 11 49 71 13 79 13 ...
## $ renovation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ zip : num 98001 98001 98001 98001 98001 ...
## $ city : chr "Auburn" "Auburn" "Auburn" "Auburn" ...
summary(df)
## price bedrooms bathrooms floors
## Min. : 78000 Min. : 1.000 Min. :0.500 Min. :1.000
## 1st Qu.: 322000 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.:1.000
## Median : 450000 Median : 3.000 Median :2.250 Median :1.500
## Mean : 540297 Mean : 3.373 Mean :2.116 Mean :1.494
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.:2.000
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :3.500
## waterfront condition sqft_living15 sqft_lot15
## Min. :0.000000 Min. :1.00 Min. : 399 Min. : 651
## 1st Qu.:0.000000 1st Qu.:3.00 1st Qu.:1490 1st Qu.: 5100
## Median :0.000000 Median :3.00 Median :1840 Median : 7620
## Mean :0.007547 Mean :3.41 Mean :1987 Mean : 12758
## 3rd Qu.:0.000000 3rd Qu.:4.00 3rd Qu.:2360 3rd Qu.: 10083
## Max. :1.000000 Max. :5.00 Max. :6210 Max. :871200
## basement age renovation zip
## Min. :0.000 Min. : -1.00 Min. :0.00000 Min. :98001
## 1st Qu.:0.000 1st Qu.: 18.00 1st Qu.:0.00000 1st Qu.:98033
## Median :0.000 Median : 40.00 Median :0.00000 Median :98065
## Mean :0.393 Mean : 43.32 Mean :0.04232 Mean :98078
## 3rd Qu.:1.000 3rd Qu.: 63.00 3rd Qu.:0.00000 3rd Qu.:98118
## Max. :1.000 Max. :115.00 Max. :1.00000 Max. :98199
## city
## Length:21597
## Class :character
## Mode :character
##
##
##
## Checking for the missing values in the data.
NA_values=data.frame(no_of_na_values=colSums(is.na(df)))
head(NA_values,13)
## no_of_na_values
## price 0
## bedrooms 0
## bathrooms 0
## floors 0
## waterfront 0
## condition 0
## sqft_living15 0
## sqft_lot15 0
## basement 0
## age 0
## renovation 0
## zip 0
## city 0
# Counting the number of groups with the same zip code and city
library(dplyr)
freq_zipcode <- as.data.frame(table(df$zip))
freq_zipcode <- freq_zipcode %>% rename(zipcode = Var1)
print(freq_zipcode)
## zipcode Freq
## 1 98001 361
## 2 98002 199
## 3 98003 280
## 4 98004 317
## 5 98005 168
## 6 98006 498
## 7 98007 141
## 8 98008 283
## 9 98010 100
## 10 98011 195
## 11 98014 124
## 12 98019 190
## 13 98022 233
## 14 98023 499
## 15 98024 80
## 16 98027 412
## 17 98028 283
## 18 98029 321
## 19 98030 256
## 20 98031 273
## 21 98032 125
## 22 98033 432
## 23 98034 545
## 24 98038 589
## 25 98039 50
## 26 98040 282
## 27 98042 547
## 28 98045 220
## 29 98052 574
## 30 98053 403
## 31 98055 268
## 32 98056 406
## 33 98058 455
## 34 98059 468
## 35 98065 308
## 36 98070 117
## 37 98072 273
## 38 98074 441
## 39 98075 359
## 40 98077 198
## 41 98092 351
## 42 98102 104
## 43 98103 602
## 44 98105 229
## 45 98106 335
## 46 98107 266
## 47 98108 186
## 48 98109 109
## 49 98112 269
## 50 98115 583
## 51 98116 330
## 52 98117 553
## 53 98118 507
## 54 98119 184
## 55 98122 290
## 56 98125 409
## 57 98126 354
## 58 98133 493
## 59 98136 263
## 60 98144 343
## 61 98146 288
## 62 98148 57
## 63 98155 446
## 64 98166 254
## 65 98168 269
## 66 98177 255
## 67 98178 262
## 68 98188 136
## 69 98198 280
## 70 98199 317
freq_city <- as.data.frame(table(df$city))
freq_city <- freq_city %>% rename(city_name = Var1)
print(freq_city)
## city_name Freq
## 1 Auburn 911
## 2 Bellevue 1407
## 3 Black Diamond 100
## 4 Bothell 195
## 5 Carnation 124
## 6 Duvall 190
## 7 Enumclaw 233
## 8 Fall City 80
## 9 Federal Way 779
## 10 Issaquah 733
## 11 Kenmore 283
## 12 Kent 1201
## 13 Kirkland 977
## 14 Maple Valley 589
## 15 Medina 50
## 16 Mercer Island 282
## 17 North Bend 220
## 18 Redmond 977
## 19 Renton 1597
## 20 Sammamish 800
## 21 Seattle 8973
## 22 Snoqualmie 308
## 23 Vashon 117
## 24 Woodinville 471
library(caTools)
# set seed to ensure you always have same random numbers generated
set.seed(123)
# splits the data in the ratio mentioned in SplitRatio. After splitting marks these rows as logical TRUE and the the remaining are marked as logical FALSE
sample = sample.split(df$zip,SplitRatio = 0.7)
# creates a training dataset named train_data with rows which are marked as TRUE
train_data =subset(df,sample ==TRUE)
test_data=subset(df, sample==FALSE)
# check the dimensions of train_data and test_data
dim(train_data)
## [1] 15116 13
dim(test_data)
## [1] 6481 13
library(corrplot)
cor_data=data.frame(train_data[,1:12])
cor_data = cor_data[, -c(9,11)]
correlation=cor(cor_data)
par(mfrow=c(1, 1))
corrplot(correlation,method="color")
library(ggplot2)
library(grid)
library(lattice)
library(gridExtra)
p1=ggplot(data = train_data, aes(x = bedrooms, y = price)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Bedrooms and Price", x="Bedrooms",y="Price")
p2=ggplot(data = train_data, aes(x = bathrooms, y = price)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Bathrooms and Price", x="Bathrooms",y="Price")
p3=ggplot(data = train_data, aes(x = floors, y = price)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Floors and Price", x="Floors",y="Price")
p4=ggplot(data = train_data, aes(x = sqft_living15, y = price)) +
geom_jitter() + geom_smooth(method = "lm", se = FALSE)+labs(title="Scatter plot of Sqft_living15 and Price", x="Sqft_living15",y="Price")
grid.arrange(p1,p2,p3,p4,nrow=2)
# create an empty list to store p-value of correlation test
p_value_df <- data.frame("Explanatory_Variable" = character(0), "p_value" = double(0), stringsAsFactors = FALSE)
for (i in 2:ncol(cor_data)) {
p_value <- cor.test(cor_data[,1], cor_data[,i])$p.value
new_row <- c(names(cor_data[i]), p_value)
p_value_df[nrow(p_value_df)+1,] <- new_row
}
print(p_value_df)
## Explanatory_Variable p_value
## 1 bedrooms 0
## 2 bathrooms 0
## 3 floors 2.7422081672573e-228
## 4 waterfront 3.50284064304626e-201
## 5 condition 0.00100763267925831
## 6 sqft_living15 0
## 7 sqft_lot15 8.00758800521312e-24
## 8 age 1.14243642311684e-11
## 9 zip 1.55392256300633e-11
par(mfrow=c(2, 2))
boxplot(price~waterfront,data=train_data,main="Price vs Waterfront", xlab="waterfront",ylab="price",col="orange",border="brown")
boxplot(price~basement,data=train_data,main="Price vs Basement", xlab="basement",ylab="price",col="orange",border="brown")
boxplot(price~renovation,data=train_data,main="Price vs Renovation", xlab="renovation",ylab="price",col="orange",border="brown")
boxplot(price~city,data=train_data,main="Price vs City", xlab="city",ylab="price",col="orange",border="brown")
ggplot(data=train_data)+geom_boxplot(aes(x=bedrooms,y=price))
outliers=boxplot(train_data$price,plot=FALSE)$out
outliers_data=train_data[which(train_data$price %in% outliers),]
train_data1= train_data[-which(train_data$price %in% outliers),]
par(mfrow=c(1, 2))
# Plot of original data without outliers.
plot(train_data$bedrooms, train_data$price, main="With Outliers", xlab="bedrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ bedrooms, data=train_data), col="blue", lwd=3, lty=2)
# Plot of data after removed outliers. Note the change of slope.
plot(train_data1$bedrooms, train_data1$price, main="Outliers removed", xlab="bedrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~bedrooms, data=train_data1), col="blue", lwd=3, lty=2)
Idendified Questions
Objective
In this section will focus on apply One-way ANOVA to find out if the categorical variables caused a difference in the mean of house sold price. The null hypothesis (H0) of the ANOVA is no difference in means, and the alternate hypothesis (H1) is that the means are different from one another. ANOVA assumes that each observation is randomly sampled, independent, normally distributed, and with unknown but equal variances.
ANOVA only reveals that are different between group means. Hence this section will also apply Tukey’s Honestly Significant Difference (Tukey’s HSD) test to tell which groups are statistically different from another.
Table below shows the variable that will be analysis. Not all of the variables from dataset were used in this analysis.
| Variable | Description | Level |
|---|---|---|
| condition | How good the house condition is | 1 to 5 where 1=worst; 5=best |
| renovation | Indicate if the house has been renovated | 1=Yes; 0=NO |
| city | Indicate which city the house located | List of city name |
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(tidyverse)
library(broom)
library(AICcmodavg)
library(dplyr)
# Read the variables that will be analyse later as factor
house <- read.csv('merged.csv', header = TRUE, stringsAsFactors=T,
colClasses=c(rep('numeric', 6), 'factor', rep('numeric', 4),rep('factor', 2)))
glimpse(house)
## Rows: 21,597
## Columns: 13
## $ zip <dbl> 98001, 98001, 98001, 98001, 98001, 98001, 98001, 98001, ~
## $ price <dbl> 225000, 210000, 230000, 196000, 321500, 360000, 207000, ~
## $ bedrooms <dbl> 3, 3, 3, 3, 4, 4, 3, 4, 2, 4, 4, 4, 5, 3, 4, 3, 3, 3, 3,~
## $ bathrooms <dbl> 1.00, 1.00, 1.00, 2.25, 2.50, 2.00, 1.00, 2.50, 1.00, 2.~
## $ floors <dbl> 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.5, 2.0, 1.0, 2~
## $ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ condition <fct> 3, 3, 4, 3, 3, 5, 4, 3, 4, 3, 4, 3, 4, 4, 3, 3, 3, 4, 4,~
## $ sqft_living15 <dbl> 1660, 1820, 1140, 1890, 2740, 1230, 1356, 2990, 1650, 27~
## $ sqft_lot15 <dbl> 7245, 7210, 9639, 7519, 5816, 15750, 9450, 9033, 19200, ~
## $ basement <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,~
## $ age <dbl> 51, 51, 48, 40, 11, 49, 71, 13, 79, 13, 37, 14, 59, 61, ~
## $ renovation <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ city <fct> Auburn, Auburn, Auburn, Auburn, Auburn, Auburn, Auburn, ~
This section examine if there is a statistical difference between the mean of the house sold price according to the rate of condition given to the house.
1.Overview of the
condition variable
#Calculate frequency, mean and standard deviation
house %>%
group_by(condition) %>%
summarise(condition_freq = n(),
price_mean = mean(price, na.rm = TRUE),
price_sd = sd(price, na.rm = TRUE))
## # A tibble: 5 x 4
## condition condition_freq price_mean price_sd
## <fct> <int> <dbl> <dbl>
## 1 1 29 341067. 273483.
## 2 2 170 328179. 246987.
## 3 3 14020 542173. 364650.
## 4 4 5677 521374. 358796.
## 5 5 1701 612578. 411318.
Hypothesis:
price is equal for all levels of condition categories.condition categories has a mean price that is not the same as the other condition categories.
The p-value of the condition variable is low (p < 0.001, indicated by the ’***’), which implies that rate of condition gives impact on the house sold price. Hence the H0 is rejected. There is a significant difference in the average price of house based on the condition of house.
one_way <- aov(price ~ condition, data = house)
summary(one_way)
## Df Sum Sq Mean Sq F value Pr(>F)
## condition 4 1.977e+13 4.942e+12 36.86 <2e-16 ***
## Residuals 21592 2.895e+15 1.341e+11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Base on the test results, there are statistically significant differences (p < 0.05) between condition groups 3-1 (3 and 1), 5-1, 3-2, 4-2, 5-2, 4-3, 5-3, 5-4. The difference between condition groups 2-1 and 4-1 are not statistically significant.
TukeyHSD(one_way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ condition, data = house)
##
## $condition
## diff lwr upr p adj
## 2-1 -12888.48 -213572.104 187795.138 0.9997895
## 3-1 201105.82 15428.628 386783.004 0.0260297
## 4-1 180307.21 -5651.399 366265.817 0.0625225
## 5-1 271510.50 84450.580 458570.422 0.0007181
## 3-2 213994.30 136921.420 291067.177 0.0000000
## 4-2 193195.69 115447.286 270944.097 0.0000000
## 5-2 284398.98 204052.083 364745.884 0.0000000
## 4-3 -20798.61 -36512.198 -5085.016 0.0028281
## 5-3 70404.69 44758.500 96050.870 0.0000000
## 5-4 91203.29 63593.282 118813.303 0.0000000
price by condition:
options(scipen=999)
ggboxplot(house, x = "condition", y = "price", ylim=c(78000,7700000))
This section examine if there is a statistical difference between the mean of the house sold price and renovation state of the house.
1.Overview of the
renovation variable
#Calculate frequency, mean and standard deviation
house %>%
group_by(renovation) %>%
summarise(renovation_freq = n(),
price_mean = mean(price, na.rm = TRUE),
price_sd = sd(price, na.rm = TRUE))
## # A tibble: 2 x 4
## renovation renovation_freq price_mean price_sd
## <fct> <int> <dbl> <dbl>
## 1 0 20683 530560. 349805.
## 2 1 914 760629. 608017.
Hypothesis:
price is equal for all levels of renovation categories.renovation categories has a mean price that is not the same as the other renovation categories.
The p-value of the renovation variable is low (p < 0.001, indicated by the ’***’), which implies that state of renovation gives impact on the house sold price. Hence the H0 is rejected. There is a significant difference in the average price of house based on the renovation state of the house.
one_way <- aov(price ~ renovation, data = house)
summary(one_way)
## Df Sum Sq Mean Sq F value Pr(>F)
## renovation 1 46332107051977 46332107051977 348.8 <0.0000000000000002
## Residuals 21595 2868250023356214 132820098326
##
## renovation ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Base on the test results, there are statistically significant differences (p < 0.05) between renovation group 1 and 0.
TukeyHSD(one_way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ renovation, data = house)
##
## $renovation
## diff lwr upr p adj
## 1-0 230068.9 205924.2 254213.5 0
price by renovation:
options(scipen=999)
ggboxplot(house, x = "renovation", y = "price", ylim=c(78000,7700000))
This section examine if there is a statistical difference between the mean of the house sold price according to the location of city of the house.
1.Overview of the
city variable
#Calculate frequency, mean and standard deviation
options(dplyr.print_max = 1e9)
house %>%
group_by(city) %>%
summarise(city_freq = n(),
price_mean = mean(price, na.rm = TRUE),
price_sd = sd(price, na.rm = TRUE))
## # A tibble: 24 x 4
## city city_freq price_mean price_sd
## <fct> <int> <dbl> <dbl>
## 1 Auburn 911 291648. 108422.
## 2 Bellevue 1407 898466. 559782.
## 3 Black Diamond 100 423666. 195415.
## 4 Bothell 195 490377. 121971.
## 5 Carnation 124 455617. 258603.
## 6 Duvall 190 424815. 130638.
## 7 Enumclaw 233 316742. 122329.
## 8 Fall City 80 586121. 376719.
## 9 Federal Way 779 289391. 108399.
## 10 Issaquah 733 615122. 260451.
## 11 Kenmore 283 462489. 149530.
## 12 Kent 1201 299470. 91647.
## 13 Kirkland 977 646543. 409633.
## 14 Maple Valley 589 367091. 132721.
## 15 Medina 50 2161300 1166904.
## 16 Mercer Island 282 1194874. 607768.
## 17 North Bend 220 440232. 207554.
## 18 Redmond 977 658432. 231136.
## 19 Renton 1597 403468. 200725.
## 20 Sammamish 800 732821. 280951.
## 21 Seattle 8973 535086. 340519.
## 22 Snoqualmie 308 529630. 185254.
## 23 Vashon 117 489382. 201501.
## 24 Woodinville 471 617498. 244298.
Hypothesis:
price is equal for all levels of city categories.city categories has a mean price that is not the same as the other city categories.
The p-value of the city variable is low (p < 0.001, indicated by the ’***’), which implies that location of city gives impact on the house sold price. Hence the H0 is rejected. There is a significant difference in the average price of house based on the location of house.
one_way <- aov(price ~ city, data = house)
summary(one_way)
## Df Sum Sq Mean Sq F value Pr(>F)
## city 23 738104329040975 32091492566999 318.1 <0.0000000000000002
## Residuals 21573 2176477801366934 100888972390
##
## city ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are statistically significant differences between city groups with p < 0.05.
TukeyHSD(one_way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ city, data = house)
##
## $city
## diff lwr upr p adj
## Bellevue-Auburn 606818.2632 557681.911 655954.6152 0.0000000
## Black Diamond-Auburn 132018.1700 10296.242 253740.0976 0.0167217
## Bothell-Auburn 198729.2877 107558.963 289899.6123 0.0000000
## Carnation-Auburn 163969.2929 53369.943 274568.6432 0.0000187
## Duvall-Auburn 133167.2432 41014.111 225320.3756 0.0000393
## Enumclaw-Auburn 25094.5706 -59731.294 109920.4352 0.9999950
## Fall City-Auburn 294473.0550 159736.612 429209.4977 0.0000000
## Federal Way-Auburn -2256.5222 -58642.078 54129.0338 1.0000000
## Issaquah-Auburn 323474.3533 266143.004 380805.7025 0.0000000
## Kenmore-Auburn 170841.0493 92208.530 249473.5683 0.0000000
## Kent-Auburn 7822.2275 -42943.276 58587.7308 1.0000000
## Kirkland-Auburn 354895.0070 301678.428 408111.5865 0.0000000
## Maple Valley-Auburn 75443.6639 14352.102 136535.2253 0.0017262
## Medina-Auburn 1869652.1800 1701822.049 2037482.3115 0.0000000
## Mercer Island-Auburn 903225.8183 824486.997 981964.6399 0.0000000
## North Bend-Auburn 148584.4937 61785.797 235383.1906 0.0000000
## Redmond-Auburn 366783.9559 313567.376 420000.5353 0.0000000
## Renton-Auburn 111820.6409 63846.793 159794.4885 0.0000000
## Sammamish-Auburn 441173.1875 385187.985 497158.3902 0.0000000
## Seattle-Auburn 243437.7799 203259.552 283616.0078 0.0000000
## Snoqualmie-Auburn 237981.9625 161823.225 314140.7003 0.0000000
## Vashon-Auburn 197734.1971 84260.065 311208.3292 0.0000000
## Woodinville-Auburn 325850.0505 260275.275 391424.8258 0.0000000
## Black Diamond-Bellevue -474800.0932 -594381.046 -355219.1404 0.0000000
## Bothell-Bellevue -408088.9755 -496380.565 -319797.3864 0.0000000
## Carnation-Bellevue -442848.9703 -551087.563 -334610.3779 0.0000000
## Duvall-Bellevue -473651.0200 -562957.103 -384344.9372 0.0000000
## Enumclaw-Bellevue -581723.6926 -663447.642 -499999.7431 0.0000000
## Fall City-Bellevue -312345.2082 -445150.651 -179539.7655 0.0000000
## Federal Way-Bellevue -609074.7853 -660676.215 -557473.3558 0.0000000
## Issaquah-Bellevue -283343.9099 -335977.171 -230710.6484 0.0000000
## Kenmore-Bellevue -435977.2139 -511253.038 -360701.3900 0.0000000
## Kent-Bellevue -598996.0357 -644388.909 -553603.1625 0.0000000
## Kirkland-Bellevue -251923.2561 -300041.662 -203804.8507 0.0000000
## Maple Valley-Bellevue -531374.5993 -588080.467 -474668.7311 0.0000000
## Medina-Bellevue 1262833.9168 1096550.034 1429117.7996 0.0000000
## Mercer Island-Bellevue 296407.5551 221020.695 371794.4150 0.0000000
## North Bend-Bellevue -458233.7695 -542003.635 -374463.9038 0.0000000
## Redmond-Bellevue -240034.3073 -288152.713 -191915.9019 0.0000000
## Renton-Bellevue -494997.6223 -537245.305 -452749.9399 0.0000000
## Sammamish-Bellevue -165645.0757 -216808.730 -114481.4213 0.0000000
## Seattle-Bellevue -363380.4832 -396511.535 -330249.4313 0.0000000
## Snoqualmie-Bellevue -368836.3007 -441524.195 -296148.4066 0.0000000
## Vashon-Bellevue -409084.0661 -520258.502 -297909.6299 0.0000000
## Woodinville-Bellevue -280968.2127 -342477.787 -219458.6381 0.0000000
## Bothell-Black Diamond 66711.1177 -75405.916 208828.1513 0.9916108
## Carnation-Black Diamond 31951.1229 -123346.783 187249.0284 1.0000000
## Duvall-Black Diamond 1149.0732 -141600.438 143898.5845 1.0000000
## Enumclaw-Black Diamond -106923.5994 -245056.327 31209.1283 0.4247155
## Fall City-Black Diamond 162454.8850 -10863.124 335772.8938 0.1023094
## Federal Way-Black Diamond -134274.6922 -257012.437 -11536.9470 0.0146776
## Issaquah-Black Diamond 191456.1833 68281.077 314631.2894 0.0000042
## Kenmore-Black Diamond 38822.8793 -95595.483 173241.2419 0.9999968
## Kent-Black Diamond -124195.9425 -244455.493 -3936.3920 0.0332820
## Kirkland-Black Diamond 222876.8370 101562.256 344191.4177 0.0000000
## Maple Valley-Black Diamond -56574.5061 -181544.111 68395.0983 0.9948162
## Medina-Black Diamond 1737634.0100 1537503.612 1937764.4080 0.0000000
## Mercer Island-Black Diamond 771207.6483 636727.073 905688.2238 0.0000000
## North Bend-Black Diamond 16566.3236 -122786.599 155919.2459 1.0000000
## Redmond-Black Diamond 234765.7858 113451.205 356080.3665 0.0000000
## Renton-Black Diamond -20197.5291 -139305.519 98910.4606 1.0000000
## Sammamish-Black Diamond 309155.0175 186600.678 431709.3568 0.0000000
## Seattle-Black Diamond 111419.6099 -4767.795 227607.0153 0.0806822
## Snoqualmie-Black Diamond 105963.7925 -27022.585 238950.1696 0.3631798
## Vashon-Black Diamond 65716.0271 -91642.169 223074.2230 0.9983475
## Woodinville-Black Diamond 193831.8805 66610.521 321053.2400 0.0000080
## Carnation-Bothell -34759.9948 -167474.877 97954.8871 0.9999995
## Duvall-Bothell -65562.0445 -183346.693 52222.6040 0.9399590
## Enumclaw-Bothell -173634.7171 -285779.494 -61489.9406 0.0000048
## Fall City-Bothell 95743.7673 -57667.225 249154.7597 0.8331508
## Federal Way-Bothell -200985.8099 -293507.994 -108463.6259 0.0000000
## Issaquah-Bothell 124745.0656 31643.469 217846.6618 0.0002839
## Kenmore-Bothell -27888.2384 -135424.728 79648.2515 0.9999996
## Kent-Bothell -190907.0602 -280115.580 -101698.5408 0.0000000
## Kirkland-Bothell 156165.7193 65539.962 246791.4769 0.0000000
## Maple Valley-Bothell -123285.6238 -218748.719 -27822.5289 0.0006671
## Medina-Bothell 1670922.8923 1487761.825 1854083.9598 0.0000000
## Mercer Island-Bothell 704496.5306 596882.286 812110.7752 0.0000000
## North Bend-Bothell -50144.7941 -163789.140 63499.5520 0.9963581
## Redmond-Bothell 168054.6682 77428.911 258680.4257 0.0000000
## Renton-Bothell -86908.6468 -174558.596 741.3028 0.0553451
## Sammamish-Bothell 242443.8998 150165.156 334722.6434 0.0000000
## Seattle-Bothell 44708.4922 -38929.499 128346.4836 0.9605835
## Snoqualmie-Bothell 39252.6748 -66488.410 144993.7593 0.9997348
## Vashon-Bothell -995.0906 -136115.046 134124.8645 1.0000000
## Woodinville-Bothell 127120.7628 28728.310 225513.2156 0.0006610
## Duvall-Carnation -30802.0497 -164193.997 102589.8976 1.0000000
## Enumclaw-Carnation -138874.7223 -267313.965 -10435.4795 0.0175068
## Fall City-Carnation 130503.7621 -35192.106 296199.6304 0.3875654
## Federal Way-Carnation -166225.8151 -277942.164 -54509.4657 0.0000167
## Issaquah-Carnation 159505.0604 47308.379 271701.7421 0.0000619
## Kenmore-Carnation 6871.7564 -117564.110 131307.6228 1.0000000
## Kent-Carnation -156147.0654 -265134.900 -47159.2307 0.0000500
## Kirkland-Carnation 190925.7141 80774.835 301076.5930 0.0000000
## Maple Valley-Carnation -88525.6290 -202689.505 25638.2474 0.4209220
## Medina-Carnation 1705682.8871 1512115.935 1899249.8389 0.0000000
## Mercer Island-Carnation 739256.5254 614753.458 863759.5929 0.0000000
## North Bend-Carnation -15384.7993 -145135.428 114365.8292 1.0000000
## Redmond-Carnation 202814.6629 92663.784 312965.5418 0.0000000
## Renton-Carnation -52148.6520 -159864.490 55567.1864 0.9875638
## Sammamish-Carnation 277203.8946 165689.076 388718.7129 0.0000000
## Seattle-Carnation 79468.4870 -25008.792 183945.7658 0.4627436
## Snoqualmie-Carnation 74012.6696 -48874.942 196900.2814 0.8752603
## Vashon-Carnation 33764.9042 -115156.520 182686.3279 1.0000000
## Woodinville-Carnation 161880.7576 45256.305 278505.2102 0.0001172
## Enumclaw-Duvall -108072.6726 -221017.892 4872.5471 0.0826059
## Fall City-Duvall 161305.8118 7308.720 315302.9038 0.0275181
## Federal Way-Duvall -135423.7653 -228914.547 -41932.9837 0.0000368
## Issaquah-Duvall 190307.1101 96242.882 284371.3379 0.0000000
## Kenmore-Duvall 37673.8061 -70697.170 146044.7819 0.9999111
## Kent-Duvall -125345.0157 -215557.718 -35132.3133 0.0001142
## Kirkland-Duvall 221727.7639 130113.356 313342.1714 0.0000000
## Maple Valley-Duvall -57723.5793 -154125.728 38678.5691 0.8813423
## Medina-Duvall 1736484.9368 1552832.688 1920137.1854 0.0000000
## Mercer Island-Duvall 770058.5751 661610.443 878506.7073 0.0000000
## North Bend-Duvall 15417.2505 -99017.050 129851.5510 1.0000000
## Redmond-Duvall 233616.7127 142002.305 325231.1202 0.0000000
## Renton-Duvall -21346.6023 -110018.387 67325.1824 0.9999999
## Sammamish-Duvall 308005.9443 214756.074 401255.8143 0.0000000
## Seattle-Duvall 110270.5368 25562.300 194978.7733 0.0005577
## Snoqualmie-Duvall 104814.7193 -1774.908 211404.3468 0.0609801
## Vashon-Duvall 64566.9539 -71218.075 200351.9825 0.9900914
## Woodinville-Duvall 192682.8073 93378.998 291986.6164 0.0000000
## Fall City-Enumclaw 269378.4844 119650.959 419106.0097 0.0000000
## Federal Way-Enumclaw -27351.0927 -113628.284 58926.0986 0.9999823
## Issaquah-Enumclaw 298379.7827 211481.529 385278.0360 0.0000000
## Kenmore-Enumclaw 145746.4787 43533.486 247959.4714 0.0000571
## Kent-Enumclaw -17272.3431 -99986.061 65441.3750 1.0000000
## Kirkland-Enumclaw 329800.4365 245560.142 414040.7306 0.0000000
## Maple Valley-Enumclaw 50349.0933 -39074.627 139772.8133 0.9327362
## Medina-Enumclaw 1844557.6094 1664470.475 2024644.7436 0.0000000
## Mercer Island-Enumclaw 878131.2477 775836.454 980426.0416 0.0000000
## North Bend-Enumclaw 123489.9231 14869.314 232110.5317 0.0079356
## Redmond-Enumclaw 341689.3853 257449.091 425929.6794 0.0000000
## Renton-Enumclaw 86726.0703 5695.750 167756.3911 0.0203207
## Sammamish-Enumclaw 416078.6169 330062.539 502094.6953 0.0000000
## Seattle-Enumclaw 218343.2094 141670.445 295015.9739 0.0000000
## Snoqualmie-Enumclaw 212887.3919 112565.031 313209.7532 0.0000000
## Vashon-Enumclaw 172639.6265 41716.743 303562.5102 0.0004148
## Woodinville-Enumclaw 300755.4799 208211.037 393299.9232 0.0000000
## Federal Way-Fall City -296729.5772 -432384.417 -161074.7379 0.0000000
## Issaquah-Fall City 29001.2983 -107049.384 165051.9804 1.0000000
## Kenmore-Fall City -123632.0057 -269939.824 22675.8121 0.2488284
## Kent-Fall City -286650.8275 -420067.621 -153234.0338 0.0000000
## Kirkland-Fall City 60421.9520 -73946.604 194790.5080 0.9952697
## Maple Valley-Fall City -219029.3911 -356706.853 -81351.9297 0.0000018
## Medina-Fall City 1575179.1250 1366876.802 1783481.4475 0.0000000
## Mercer Island-Fall City 608752.7633 462387.786 755117.7405 0.0000000
## North Bend-Fall City -145888.5614 -296742.525 4965.4021 0.0736057
## Redmond-Fall City 72310.9008 -62057.655 206679.4568 0.9576414
## Renton-Fall City -182652.4141 -315032.150 -50272.6779 0.0001368
## Sammamish-Fall City 146700.1325 11211.213 282189.0525 0.0171507
## Seattle-Fall City -51035.2751 -180793.491 78722.9405 0.9993344
## Snoqualmie-Fall City -56491.0925 -201484.395 88502.2103 0.9994257
## Vashon-Fall City -96738.8579 -264367.266 70889.5501 0.9148126
## Woodinville-Fall City 31376.9955 -108347.575 171101.5662 1.0000000
## Issaquah-Federal Way 325730.8754 266273.243 385188.5078 0.0000000
## Kenmore-Federal Way 173097.5714 92901.565 253293.5781 0.0000000
## Kent-Federal Way 10078.7496 -43076.333 63233.8319 1.0000000
## Kirkland-Federal Way 357151.5292 301650.802 412652.2568 0.0000000
## Maple Valley-Federal Way 77700.1861 14608.940 140791.4320 0.0018241
## Medina-Federal Way 1871908.7022 1703340.379 2040477.0251 0.0000000
## Mercer Island-Federal Way 905482.3405 825182.101 985782.5800 0.0000000
## North Bend-Federal Way 150841.0158 62623.443 239058.5890 0.0000000
## Redmond-Federal Way 369040.4780 313539.750 424541.2056 0.0000000
## Renton-Federal Way 114077.1630 63581.455 164572.8707 0.0000000
## Sammamish-Federal Way 443429.7097 385268.990 501590.4297 0.0000000
## Seattle-Federal Way 245694.3021 202536.262 288852.3423 0.0000000
## Snoqualmie-Federal Way 240238.4846 162466.512 318010.4576 0.0000000
## Vashon-Federal Way 199990.7193 85427.614 314553.8246 0.0000000
## Woodinville-Federal Way 328106.5727 260664.912 395548.2334 0.0000000
## Kenmore-Issaquah -152633.3040 -233497.092 -71769.5162 0.0000000
## Kent-Issaquah -315652.1258 -369809.447 -261494.8045 0.0000000
## Kirkland-Issaquah 31420.6538 -25040.691 87881.9985 0.9400971
## Maple Valley-Issaquah -248030.6894 -311968.612 -184092.7663 0.0000000
## Medina-Issaquah 1546177.8267 1377290.787 1715064.8664 0.0000000
## Mercer Island-Issaquah 579751.4650 498784.304 660718.6260 0.0000000
## North Bend-Issaquah -174889.8596 -263714.929 -86064.7905 0.0000000
## Redmond-Issaquah 43309.6026 -13151.742 99770.9474 0.4443948
## Renton-Issaquah -211653.7124 -263203.389 -160104.0356 0.0000000
## Sammamish-Issaquah 117698.8342 58620.733 176776.9354 0.0000000
## Seattle-Issaquah -80036.5733 -124423.161 -35649.9862 0.0000000
## Snoqualmie-Issaquah -85492.3908 -163952.779 -7032.0029 0.0155976
## Vashon-Issaquah -125740.1562 -240771.707 -10708.6049 0.0148637
## Woodinville-Issaquah 2375.6972 -65858.680 70610.0748 1.0000000
## Kent-Kenmore -163018.8218 -239368.052 -86669.5919 0.0000000
## Kirkland-Kenmore 184053.9578 106053.490 262054.4251 0.0000000
## Maple Valley-Kenmore -95397.3854 -178969.196 -11825.5752 0.0074136
## Medina-Kenmore 1698811.1307 1521557.015 1876065.2467 0.0000000
## Mercer Island-Kenmore 732384.7690 635164.038 829605.5004 0.0000000
## North Bend-Kenmore -22256.5556 -126112.621 81599.5100 1.0000000
## Redmond-Kenmore 195942.9066 117942.439 273943.3739 0.0000000
## Renton-Kenmore -59020.4084 -133542.611 15501.7939 0.3758719
## Sammamish-Kenmore 270332.1382 190417.112 350247.1649 0.0000000
## Seattle-Kenmore 72596.7307 2837.440 142356.0209 0.0300539
## Snoqualmie-Kenmore 67140.9132 -28002.252 162284.0786 0.6234261
## Vashon-Kenmore 26893.1478 -100104.676 153890.9715 1.0000000
## Woodinville-Kenmore 155009.0012 68106.068 241911.9340 0.0000000
## Kirkland-Kent 347072.7796 297291.898 396853.6612 0.0000000
## Maple Valley-Kent 67621.4364 9498.199 125744.6738 0.0053375
## Medina-Kent 1861829.9525 1695057.398 2028602.5069 0.0000000
## Mercer Island-Kent 895403.5908 818944.884 971862.2978 0.0000000
## North Bend-Kent 140762.2662 56026.526 225498.0062 0.0000003
## Redmond-Kent 358961.7284 309180.847 408742.6100 0.0000000
## Renton-Kent 103998.4134 59866.544 148130.2829 0.0000000
## Sammamish-Kent 433350.9600 380620.753 486081.1674 0.0000000
## Seattle-Kent 235615.5525 200113.132 271117.9727 0.0000000
## Snoqualmie-Kent 230159.7350 156360.784 303958.6862 0.0000000
## Vashon-Kent 189911.9696 78007.946 301815.9933 0.0000001
## Woodinville-Kent 318027.8230 255209.171 380846.4748 0.0000000
## Maple Valley-Kirkland -279451.3432 -339727.198 -219175.4879 0.0000000
## Medina-Kirkland 1514757.1730 1347222.243 1682292.1033 0.0000000
## Mercer Island-Kirkland 548330.8113 470223.181 626438.4414 0.0000000
## North Bend-Kirkland -206310.5134 -292537.039 -120083.9883 0.0000000
## Redmond-Kirkland 11888.9488 -40389.194 64167.0913 1.0000000
## Renton-Kirkland -243074.3662 -290005.059 -196143.6736 0.0000000
## Sammamish-Kirkland 86278.1805 31184.236 141372.1254 0.0000032
## Seattle-Kirkland -111457.2271 -150383.951 -72530.5031 0.0000000
## Snoqualmie-Kirkland -116913.0446 -192419.026 -41407.0634 0.0000047
## Vashon-Kirkland -157160.8099 -270197.877 -44123.7430 0.0001123
## Woodinville-Kirkland -29044.9565 -93860.472 35770.5586 0.9954891
## Medina-Maple Valley 1794208.5161 1624008.259 1964408.7730 0.0000000
## Mercer Island-Maple Valley 827782.1544 744110.317 911453.9921 0.0000000
## North Bend-Maple Valley 73140.8298 -18156.422 164438.0816 0.3519570
## Redmond-Maple Valley 291340.2920 231064.437 351616.1472 0.0000000
## Renton-Maple Valley 36376.9770 -19324.589 92078.5430 0.7667756
## Sammamish-Maple Valley 365729.5236 302995.822 428463.2252 0.0000000
## Seattle-Maple Valley 167994.1160 118846.742 217141.4903 0.0000000
## Snoqualmie-Maple Valley 162538.2986 81289.743 243786.8539 0.0000000
## Vashon-Maple Valley 122290.5332 5339.462 239241.6044 0.0281733
## Woodinville-Maple Valley 250406.3866 178983.522 321829.2510 0.0000000
## Mercer Island-Medina -966426.3617 -1143727.661 -789125.0628 0.0000000
## North Bend-Medina -1721067.6864 -1902092.443 -1540042.9302 0.0000000
## Redmond-Medina -1502868.2242 -1670403.155 -1335333.2938 0.0000000
## Renton-Medina -1757831.5391 -1923775.622 -1591887.4563 0.0000000
## Sammamish-Medina -1428478.9925 -1596913.821 -1260044.1639 0.0000000
## Seattle-Medina -1626214.4001 -1790074.824 -1462353.9761 0.0000000
## Snoqualmie-Medina -1631670.2175 -1807840.879 -1455499.5562 0.0000000
## Vashon-Medina -1671917.9829 -1867141.770 -1476694.1954 0.0000000
## Woodinville-Medina -1543802.1295 -1715662.537 -1371941.7218 0.0000000
## North Bend-Mercer Island -754641.3247 -858577.898 -650704.7510 0.0000000
## Redmond-Mercer Island -536441.8625 -614549.493 -458334.2324 0.0000000
## Renton-Mercer Island -791405.1774 -866039.537 -716770.8180 0.0000000
## Sammamish-Mercer Island -462052.6308 -542072.256 -382033.0053 0.0000000
## Seattle-Mercer Island -659788.0384 -729667.131 -589908.9461 0.0000000
## Snoqualmie-Mercer Island -665243.8558 -760474.895 -570012.8163 0.0000000
## Vashon-Mercer Island -705491.6212 -832555.291 -578427.9513 0.0000000
## Woodinville-Mercer Island -577375.7678 -664374.898 -490376.6373 0.0000000
## Redmond-North Bend 218199.4622 131972.937 304425.9873 0.0000000
## Renton-North Bend -36763.8528 -119857.170 46329.4644 0.9962179
## Sammamish-North Bend 292588.6939 204626.473 380550.9143 0.0000000
## Seattle-North Bend 94853.2863 16003.433 173703.1392 0.0028893
## Snoqualmie-North Bend 89397.4688 -12598.426 191393.3641 0.1881561
## Vashon-North Bend 49149.7035 -83059.933 181359.3399 0.9997285
## Woodinville-North Bend 177265.5569 82909.526 271621.5872 0.0000000
## Renton-Redmond -254963.3150 -301894.008 -208032.6224 0.0000000
## Sammamish-Redmond 74389.2317 19295.287 129483.1765 0.0002355
## Seattle-Redmond -123346.1759 -162272.900 -84419.4519 0.0000000
## Snoqualmie-Redmond -128801.9934 -204307.975 -53296.0122 0.0000000
## Vashon-Redmond -169049.7588 -282086.826 -56012.6918 0.0000143
## Woodinville-Redmond -40933.9054 -105749.421 23881.6098 0.8171775
## Sammamish-Renton 329352.5466 279304.285 379400.8082 0.0000000
## Seattle-Renton 131617.1390 100236.005 162998.2731 0.0000000
## Snoqualmie-Renton 126161.3216 54254.167 198068.4765 0.0000000
## Vashon-Renton 85913.5562 -24751.995 196579.1075 0.4183959
## Woodinville-Renton 214029.4096 153444.455 274614.3643 0.0000000
## Seattle-Sammamish -197735.4076 -240369.061 -155101.7538 0.0000000
## Snoqualmie-Sammamish -203191.2250 -280673.428 -125709.0220 0.0000000
## Vashon-Sammamish -243438.9904 -357805.581 -129072.3996 0.0000000
## Woodinville-Sammamish -115323.1370 -182430.436 -48215.8379 0.0000000
## Snoqualmie-Seattle -5455.8174 -72414.304 61502.6687 1.0000000
## Vashon-Seattle -45703.5828 -153219.462 61812.2959 0.9978584
## Woodinville-Seattle 82412.2706 27792.341 137032.1999 0.0000109
## Vashon-Snoqualmie -40247.7654 -165728.949 85233.4185 0.9999781
## Woodinville-Snoqualmie 87868.0880 3196.959 172539.2166 0.0312061
## Woodinville-Vashon 128115.8534 8761.652 247470.0545 0.0194816
price by city:
Note that y-axis represents
city and x-axis represent price.
options(scipen=999)
p <- ggboxplot(house, x = "city", y = "price", ylim=c(78000,7700000))
p + coord_flip()
Idendified Questions
Objective
Multiple linear regression will be conducted to get a prediction model on the house price based on the selected variables {bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+sqft_lot15+basement+age+renovation}. We will apply the machine learning algorithm on the multiple linear regression analysis.
library(tidyverse)
library(caret)
library(leaps)
library(MASS)
# Fit the full model
full.model <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+sqft_lot15+basement+age+renovation,
data=train_data)
summary(full.model)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## condition + sqft_living15 + sqft_lot15 + basement + age +
## renovation, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1368535 -138479 -19711 95065 5359381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -618913.05670 16392.28963 -37.756 < 0.0000000000000002 ***
## bedrooms -16207.19033 2800.96325 -5.786 0.00000000734 ***
## bathrooms 162750.13651 4511.18320 36.077 < 0.0000000000000002 ***
## floors 87845.32218 5069.17522 17.329 < 0.0000000000000002 ***
## waterfront 714137.93994 24169.48956 29.547 < 0.0000000000000002 ***
## condition 20575.69979 3507.33209 5.866 0.00000000454 ***
## sqft_living15 241.66378 3.81982 63.266 < 0.0000000000000002 ***
## sqft_lot15 -0.15926 0.08052 -1.978 0.048 *
## basement 64369.36922 4784.86699 13.453 < 0.0000000000000002 ***
## age 3585.34355 98.54000 36.385 < 0.0000000000000002 ***
## renovation 64859.42790 10974.10387 5.910 0.00000000349 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 255000 on 15105 degrees of freedom
## Multiple R-squared: 0.5136, Adjusted R-squared: 0.5132
## F-statistic: 1595 on 10 and 15105 DF, p-value: < 0.00000000000000022
# stepwise regression
fit <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+sqft_lot15+basement+age+renovation,
data=train_data)
step <- stepAIC(fit, direction="both")
## Start: AIC=376374.2
## price ~ bedrooms + bathrooms + floors + waterfront + condition +
## sqft_living15 + sqft_lot15 + basement + age + renovation
##
## Df Sum of Sq RSS AIC
## <none> 982495104508443 376374
## - sqft_lot15 1 254430333018 982749534841461 376376
## - bedrooms 1 2177761472192 984672865980634 376406
## - condition 1 2238541329829 984733645838272 376407
## - renovation 1 2272049345704 984767153854146 376407
## - basement 1 11771408551175 994266513059617 376552
## - floors 1 19533152418298 1002028256926740 376670
## - waterfront 1 56785691806884 1039280796315327 377222
## - bathrooms 1 84658673487397 1067153777995840 377622
## - age 1 86108511845013 1068603616353456 377642
## - sqft_living15 1 260343083778339 1242838188286782 379925
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## price ~ bedrooms + bathrooms + floors + waterfront + condition +
## sqft_living15 + sqft_lot15 + basement + age + renovation
##
## Final Model:
## price ~ bedrooms + bathrooms + floors + waterfront + condition +
## sqft_living15 + sqft_lot15 + basement + age + renovation
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 15105 982495104508443 376374.2
lm1 <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation,
data=train_data)
summary(lm1)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## condition + sqft_living15 + basement + age + renovation,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1371691 -138588 -19819 95116 5361376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -620294.448 16378.981 -37.871 < 0.0000000000000002 ***
## bedrooms -15984.071 2798.960 -5.711 0.00000001146 ***
## bathrooms 162635.627 4511.246 36.051 < 0.0000000000000002 ***
## floors 88748.241 5049.062 17.577 < 0.0000000000000002 ***
## waterfront 713133.505 24166.482 29.509 < 0.0000000000000002 ***
## condition 20466.362 3507.234 5.835 0.00000000547 ***
## sqft_living15 240.318 3.759 63.930 < 0.0000000000000002 ***
## basement 65000.082 4774.688 13.613 < 0.0000000000000002 ***
## age 3592.837 98.477 36.484 < 0.0000000000000002 ***
## renovation 64493.689 10973.603 5.877 0.00000000426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 255100 on 15106 degrees of freedom
## Multiple R-squared: 0.5134, Adjusted R-squared: 0.5132
## F-statistic: 1771 on 9 and 15106 DF, p-value: < 0.00000000000000022
caret package.# train the model
train.model <- train(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation,
data=train_data, method="lm")
summary(train.model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1371691 -138588 -19819 95116 5361376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -620294.448 16378.981 -37.871 < 0.0000000000000002 ***
## bedrooms -15984.071 2798.960 -5.711 0.00000001146 ***
## bathrooms 162635.627 4511.246 36.051 < 0.0000000000000002 ***
## floors 88748.241 5049.062 17.577 < 0.0000000000000002 ***
## waterfront 713133.505 24166.482 29.509 < 0.0000000000000002 ***
## condition 20466.362 3507.234 5.835 0.00000000547 ***
## sqft_living15 240.318 3.759 63.930 < 0.0000000000000002 ***
## basement 65000.082 4774.688 13.613 < 0.0000000000000002 ***
## age 3592.837 98.477 36.484 < 0.0000000000000002 ***
## renovation 64493.689 10973.603 5.877 0.00000000426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 255100 on 15106 degrees of freedom
## Multiple R-squared: 0.5134, Adjusted R-squared: 0.5132
## F-statistic: 1771 on 9 and 15106 DF, p-value: < 0.00000000000000022
# display the R-squared value
train.model_rsq <- summary(train.model$finalModel)$r.squared
# show the bootstrap results
train.model
## Linear Regression
##
## 15116 samples
## 9 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 15116, 15116, 15116, 15116, 15116, 15116, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 255794.3 0.516118 162309.8
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# store the results of bootstrap in a dataframe
train.model_results <- as.data.frame(train.model$results)
cooksd <- cooks.distance(lm1)
cat("The mean of Cook's distance is ", mean(cooksd))
## The mean of Cook's distance is 0.0002049408
# visualize the Cook's distance in a plot
# par(mfrow=c(1, 1))
# plot(cooksd, main="Influential Obs by Cooks distance",xlim=c(0,25000),ylim=c(0,0.1))
# axis(1, at=seq(0, 25000, 5000))
# axis(2, at=seq(0, 0.1, 0.0001))
# abline(h = 4*mean(cooksd, na.rm=T), col="green")
# text(x=1:length(cooksd)+1,y=cooksd,labels=ifelse(cooksd>4*mean(cooksd,na.rm=T),names(cooksd),""), col="red")
par(mfrow=c(1, 1))
plot(cooksd, main="Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=T), col="green")
text(x=1:length(cooksd)+1,y=cooksd,labels=ifelse(cooksd>4*mean(cooksd,na.rm=T),names(cooksd),""), col="red")
# any point that have Cook's distance greater than 4 times of mean will be considered as influential
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])
# influential row numbers
head(train_data[influential, ])
## price bedrooms bathrooms floors waterfront condition sqft_living15
## 533 200000 3 2.25 1.5 0 5 1260
## 802 225900 3 1.00 1.0 0 4 1290
## 1009 625000 3 2.00 1.0 0 4 2470
## 1189 790000 5 3.25 1.0 0 4 2590
## 1213 1000000 5 2.50 2.0 0 4 3600
## 1222 596000 3 2.50 2.0 0 3 1730
## sqft_lot15 basement age renovation zip city
## 533 7556 0 101 0 98002 Auburn
## 802 8470 1 52 0 98003 Federal Way
## 1009 13594 0 28 0 98004 Bellevue
## 1189 12160 1 47 0 98005 Bellevue
## 1213 48787 0 46 0 98005 Bellevue
## 1222 2751 0 14 0 98005 Bellevue
influential_data <- train_data[influential, ]
# inner join outliers_data and influential_data
influential_outliers <- inner_join(outliers_data,influential_data)
# remove outliers_data from train_data but keep the influential_data
train_data2 <- rbind(train_data1,influential_outliers)
# relabel the index of train_data2
row.names(train_data2) <- NULL
train_data2.lm2 <- lm(price~bedrooms+bathrooms+floors+waterfront+condition+sqft_living15+basement+age+renovation,
data=train_data2)
summary(lm2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## condition + sqft_living15 + basement + age + renovation,
## data = train_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627995 -105915 -11026 90600 1287030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -318110.996 10613.765 -29.972 < 0.0000000000000002 ***
## bedrooms -4680.359 1754.009 -2.668 0.00763 **
## bathrooms 72914.349 2961.479 24.621 < 0.0000000000000002 ***
## floors 93313.267 3194.663 29.209 < 0.0000000000000002 ***
## waterfront 145556.462 22601.758 6.440 0.000000000123 ***
## condition 19991.399 2182.831 9.158 < 0.0000000000000002 ***
## sqft_living15 171.916 2.522 68.173 < 0.0000000000000002 ***
## basement 61381.203 2986.837 20.551 < 0.0000000000000002 ***
## age 2311.441 63.029 36.673 < 0.0000000000000002 ***
## renovation 29232.066 7206.310 4.056 0.000050087351 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154100 on 14323 degrees of freedom
## Multiple R-squared: 0.4693, Adjusted R-squared: 0.469
## F-statistic: 1407 on 9 and 14323 DF, p-value: < 0.00000000000000022
# Residual diagnostics with the 4 plot
par(mfrow = c(2, 2))
plot(lm2)
library(car)
vif(lm2)
## bedrooms bathrooms floors waterfront condition
## 1.445350 2.721895 1.762581 1.007715 1.203717
## sqft_living15 basement age renovation
## 1.480062 1.272886 2.029259 1.136379
Prediction on Train Data
# fitted values of train data
pred=lm2$fitted.values
# store the actual price and fitted price in a data frame
actual_fitted=data.frame(actual=train_data2$price, predicted=pred)
# find the mean of absolute difference % between fitted values and actual values
abs_diff = mean(abs(actual_fitted$actual-actual_fitted$predicted)/actual_fitted$actual)
# accuracy of the model
accuracy=1-abs_diff
accuracy
## [1] 0.7045188
Prediction on Test Data
# use the model to predict price for test data
pred_test=predict(newdata=test_data, lm2)
# store the actual price and predicted price in a data frame
actual_fitted_test=data.frame(actual=test_data$price, predicted=pred_test)
# find the mean of absolute difference % between predicted values and actual values
abs_diff_test = mean(abs(actual_fitted_test$actual-actual_fitted_test$predicted)/actual_fitted_test$actual)
# accuracy of the prediction
accuracy=1-abs_diff_test
accuracy
## [1] 0.6929814
bedrooms, bathrooms, floors, waterfront, condition, sqft_living15, basement, age, renovation are significant factors that affecting the house price. The prediction model able to predict the house price in Washington state by taking into account the 9 variables above at approximately 70% of accuracy.