1. Introduction

This is the final project for the Statistical Programming in R (STAT612) course. The dataset, taken from Kaggle (www.kaggle.com), covers house sales in King County, Washington State, USA. The link for this dataset is https://www.kaggle.com/gabriellima/house-sales-in-king-county-usa/data

Housing prices are subject to many market forces: at some times prices rise, at others they fall. These forces may include interest rates, economic factors (such as GDP, employment, manufacturing, and prices of goods), imports and exports, and government subsidies. They are outside our control and cannot easily be predicted, so this paper does not explore their effect on house prices. Instead, we focus on the effect of internal factors, such as the number of bedrooms and bathrooms, view, condition, grade, location, and square footage, on housing prices.

2. Overview and Objective of the Study

The data set records house prices alongside various parameters that may or may not affect them. The objective of the study is to use statistical analysis to find out how the price of a house depends on these variables: which parameters strongly affect housing prices and which have minimal effect. The statistical tools we focus on are correlation, box plots, scatter plots and bar plots. In addition, the sale prices were plotted on ESRI maps, using the longitude and latitude of each house, to see which parts of the study area have higher prices. Overall, important insights about the variables were drawn from box plots, histograms, scatter plots, corrgrams and geospatial mapping.

3. The Dataset and Description

The data for these sales comes from the official public records of home sales in the King County area, Washington State, USA. The data set contains 21613 rows and 21 columns. Each row represents a home sold from May 2014 through May 2015. Below is a breakdown of the variables involved:

[, 1] id - Unique ID for each home sold.

[, 2] date - Date of the home sale.

[, 3] price - Price of each home sold.

[, 4] bedrooms - Number of bedrooms.

[, 5] bathrooms - Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower.

[, 6] sqft_living - Square footage of the apartment's interior living space.

[, 7] sqft_lot - Square footage of the land space.

[, 8] floors - Number of floors.

[, 9] waterfront - A variable for whether the apartment was overlooking the waterfront or not.

[, 10] view - An index from 0 to 4 of how good the view of the property was.

[, 11] condition - An index from 1 to 5 on the condition of the apartment.

[, 12] grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

[, 13] sqft_above - The square footage of the interior housing space that is above ground level.

[, 14] sqft_basement - The square footage of the interior housing space that is below ground level.

[, 15] yr_built - The year the house was initially built.

[, 16] yr_renovated - The year of the house’s last renovation.

[, 17] zipcode - What zipcode area the house is in.

[, 18] lat - Latitude.

[, 19] long - Longitude.

[, 20] sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors.

[, 21] sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors.

4. Problem questions to be addressed

  1. Which variables seem to affect house sale prices in King County?
  2. What role does visualization play in showing the effect of various variables on housing price?
  3. Does the spatial location of a house affect its price?

5. Loading the packages required for the data assessment

options(scipen = 999) 
options(warn=-1)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(forcats)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(lattice)
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(leaps)
library(tidyr)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
library(dplyr)
library(ggplot2)
library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
## 
##     baseball
## The following object is masked from 'package:lattice':
## 
##     panel.fill
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(date)
library(FactoMineR)
library(tree)
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(rpart)
library(scales)
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
## 
##     ozone
## The following object is masked from 'package:purrr':
## 
##     map
library(rbokeh)
library(stringr)
library(leaflet)

6. Reading the dataset into R

We first tried to read the data set directly from the website link, but this was extremely slow. We therefore downloaded the data set to a local computer and read it from the project folder.

# Read the data from the local CSV file

KC_Data <- read.csv("kc_house_data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)

Now let's see what our data looks like by using the head() function:

head(KC_Data)
##           id            date   price bedrooms bathrooms sqft_living
## 1 7129300520 20141013T000000  221900        3      1.00        1180
## 2 6414100192 20141209T000000  538000        3      2.25        2570
## 3 5631500400 20150225T000000  180000        2      1.00         770
## 4 2487200875 20141209T000000  604000        4      3.00        1960
## 5 1954400510 20150218T000000  510000        3      2.00        1680
## 6 7237550310 20140512T000000 1230000        4      4.50        5420
##   sqft_lot floors waterfront view condition grade sqft_above sqft_basement
## 1     5650      1          0    0         3     7       1180             0
## 2     7242      2          0    0         3     7       2170           400
## 3    10000      1          0    0         3     6        770             0
## 4     5000      1          0    0         5     7       1050           910
## 5     8080      1          0    0         3     8       1680             0
## 6   101930      1          0    0         3    11       3890          1530
##   yr_built yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1     1955            0   98178 47.5112 -122.257          1340       5650
## 2     1951         1991   98125 47.7210 -122.319          1690       7639
## 3     1933            0   98028 47.7379 -122.233          2720       8062
## 4     1965            0   98136 47.5208 -122.393          1360       5000
## 5     1987            0   98074 47.6168 -122.045          1800       7503
## 6     2001            0   98053 47.6561 -122.005          4760     101930
glimpse(KC_Data)
## Observations: 21,613
## Variables: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, ...
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T...
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1230000,...
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, ...
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0...
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ...
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003,...
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47....
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...

7. Assessing how many rows and columns the data set has.

# Number of observations
nrow(KC_Data)
## [1] 21613
# Number of variables
ncol(KC_Data)
## [1] 21

Or

dim(KC_Data)
## [1] 21613    21

We have 21613 observations (rows) and 21 columns (variables) in our data set.

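8. Formatting the date

The date column is stored as a string such as "20141013T000000". We convert it to the Date class. The "%Y%m%d" format reads the first eight characters, and as.Date() ignores the trailing time part.

KC_Data$date <- as.Date(as.character(KC_Data$date), "%Y%m%d")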

head(KC_Data)
##           id       date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 2014-10-13  221900        3      1.00        1180     5650
## 2 6414100192 2014-12-09  538000        3      2.25        2570     7242
## 3 5631500400 2015-02-25  180000        2      1.00         770    10000
## 4 2487200875 2014-12-09  604000        4      3.00        1960     5000
## 5 1954400510 2015-02-18  510000        3      2.00        1680     8080
## 6 7237550310 2014-05-12 1230000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930
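
The date conversion can be checked in isolation on a few strings. The following is a minimal sketch, assuming made-up sample values in the same layout as the raw date column; the names raw_dates, sale_dates and sale_years are illustrative and are not part of the project code. It cuts off the time part explicitly with substr() rather than relying on as.Date() to ignore it.

```r
# Made-up sample in the same "YYYYMMDDT000000" layout as the raw date column
raw_dates <- c("20141013T000000", "20141209T000000", "20150225T000000")

# Keep only the eight date characters, then parse them as year-month-day
sale_dates <- as.Date(substr(raw_dates, 1, 8), format = "%Y%m%d")

# Components such as the sale year can then be extracted as numbers
sale_years <- as.numeric(format(sale_dates, "%Y"))

print(sale_dates)
print(sale_years)
```

Extracting the year this way is the same operation used later in the report to derive the age of each house.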

9. Transforming data - converting discrete variables into factors

Variables such as bathrooms, bedrooms, floors, condition, waterfront, view and grade should be converted into factor variables because they take only a small number of distinct values in our data set. For example, condition has five levels and grade has twelve. They are therefore better treated as categories than as continuous numeric variables. The zipcode column is converted to character for a similar reason: it is a label, not a quantity.

Converting bathrooms, bedrooms, floors, condition, waterfront, view and grade into factor variables:

KC_Data$bedrooms <- as.factor(KC_Data$bedrooms)
KC_Data$bathrooms <- as.factor(KC_Data$bathrooms)
KC_Data$waterfront <- as.factor(KC_Data$waterfront)
KC_Data$view <- as.factor(KC_Data$view)
KC_Data$grade <- as.factor(KC_Data$grade)
KC_Data$floors <- as.factor(KC_Data$floors)
KC_Data$condition <- as.factor(KC_Data$condition)
KC_Data$zipcode <- as.character(KC_Data$zipcode)
head(KC_Data)
##           id       date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 2014-10-13  221900        3         1        1180     5650
## 2 6414100192 2014-12-09  538000        3      2.25        2570     7242
## 3 5631500400 2015-02-25  180000        2         1         770    10000
## 4 2487200875 2014-12-09  604000        4         3        1960     5000
## 5 1954400510 2015-02-18  510000        3         2        1680     8080
## 6 7237550310 2014-05-12 1230000        4       4.5        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930
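
One consequence of this conversion is worth keeping in mind when a factor needs to be used as a number again. The sketch below uses made-up values (the vectors bathrooms and condition here are standalone toy objects, not the project columns) to show that as.numeric() on a factor returns level codes, and that an ordered factor is one possible way to keep the ranking of an index such as condition.

```r
# Made-up bathroom counts, stored as a factor as in the step above
bathrooms <- factor(c(1, 2.25, 1, 3, 4.5))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(bathrooms)

# Going through character recovers the original numbers
values <- as.numeric(as.character(bathrooms))

# An ordered factor keeps the ranking of an index such as condition (1 to 5)
condition <- factor(c(3, 5, 3, 4), levels = 1:5, ordered = TRUE)

print(codes)
print(values)
print(condition[2] > condition[1])
```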

10. Checking for NA values in entire dataset

table(is.na(KC_Data))
## 
##  FALSE 
## 453873
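
The single FALSE count above is the total number of cells in the data frame. As a rough sketch on a hypothetical toy data frame (toy and its values are invented for illustration), missing values can also be counted per column, which is more informative once NAs exist:

```r
# Hypothetical toy data frame with two missing values
toy <- data.frame(
  price        = c(221900, 538000, NA),
  bedrooms     = c(3, 3, 2),
  yr_renovated = c(NA, 1991, 2003)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Cell count of the full data set: 21613 rows times 21 columns,
# which matches the 453873 FALSE values reported by table(is.na(KC_Data))
full_cells <- 21613 * 21
print(full_cells)
```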

11. Creating a variable called ‘age’

The year built (yr_built) is hard to interpret on its own. What matters more is the age of the house at the time of sale, that is, the sale year minus the year built.

KC_Data$age <- as.numeric(format(KC_Data$date, "%Y"))-(KC_Data$yr_built) 

12. Creating a variable called ‘renage’.

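renage means renovation age: the number of years between the last renovation and the sale. A yr_renovated value of 0 marks a house with no recorded renovation, so it is first recoded as NA.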

KC_Data$yr_renovated[KC_Data$yr_renovated == 0] <- NA
KC_Data$renage <- as.numeric(format(KC_Data$date, "%Y")) - (KC_Data$yr_renovated)
head(KC_Data)
##           id       date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 2014-10-13  221900        3         1        1180     5650
## 2 6414100192 2014-12-09  538000        3      2.25        2570     7242
## 3 5631500400 2015-02-25  180000        2         1         770    10000
## 4 2487200875 2014-12-09  604000        4         3        1960     5000
## 5 1954400510 2015-02-18  510000        3         2        1680     8080
## 6 7237550310 2014-05-12 1230000        4       4.5        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15 age
## 1           NA   98178 47.5112 -122.257          1340       5650  59
## 2         1991   98125 47.7210 -122.319          1690       7639  63
## 3           NA   98028 47.7379 -122.233          2720       8062  82
## 4           NA   98136 47.5208 -122.393          1360       5000  49
## 5           NA   98074 47.6168 -122.045          1800       7503  28
## 6           NA   98053 47.6561 -122.005          4760     101930  13
##   renage
## 1     NA
## 2     23
## 3     NA
## 4     NA
## 5     NA
## 6     NA
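
The two derived columns can be reproduced by hand for the first three rows shown above. This is a small sketch using values copied from the head() output; the data frame name toy and the helper sale_year are illustrative only.

```r
# First three rows of date, yr_built and yr_renovated from the head() output
toy <- data.frame(
  date         = as.Date(c("2014-10-13", "2014-12-09", "2015-02-25")),
  yr_built     = c(1955, 1951, 1933),
  yr_renovated = c(0, 1991, 0)
)

sale_year <- as.numeric(format(toy$date, "%Y"))

# Age of the house at the time of sale
toy$age <- sale_year - toy$yr_built

# Recode "never renovated" (0) as NA, then compute years since renovation
toy$yr_renovated[toy$yr_renovated == 0] <- NA
toy$renage <- sale_year - toy$yr_renovated

print(toy)
```

The resulting age values (59, 63, 82) and renage values (NA, 23, NA) agree with the output printed above.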

12. Looking in depth into the factor variables

Here, the factor variables are examined thoroughly to see what they look like, using bar plots and tables.

table(is.na(KC_Data$renage))   # only about 4% of houses were renovated
## 
## FALSE  TRUE 
##   914 20699

Only about 4% of the houses (914 of 21,613) were renovated.

12.1. variable waterfront visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(waterfront = waterfront %>% fct_infreq()) %>% 
  ggplot(aes(waterfront)) +
  geom_bar() 

using table command

table(KC_Data$waterfront)      # fewer than 1% are on the waterfront
## 
##     0     1 
## 21450   163

Fewer than 1% of the houses (163 of 21,613, about 0.75%) are on the waterfront.
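
Shares like this one can be computed directly from the counts instead of being estimated by eye. A minimal sketch, using the two counts from the waterfront table (the vector name waterfront_counts is illustrative):

```r
# Counts copied from the waterfront table above
waterfront_counts <- c("0" = 21450, "1" = 163)

# prop.table() divides each count by the total; scale to percentages
waterfront_pct <- round(100 * prop.table(waterfront_counts), 2)

print(waterfront_pct)
```

On the full data frame the same idea would be prop.table(table(KC_Data$waterfront)), and it applies equally to the other factor variables in this section.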

12.2 - variable view visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(view = view %>% fct_infreq()) %>% 
  ggplot(aes(view)) +
  geom_bar() 

using table command

table(KC_Data$view)            # about 10% have a non-zero view rating (1, 2, 3 or 4)
## 
##     0     1     2     3     4 
## 19489   332   963   510   319

Approximately 10% of the houses (2,124 of 21,613) have a non-zero view rating of 1, 2, 3 or 4. The remaining 90% have a view rating of 0.

12.3 variable bedrooms visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(bedrooms = bedrooms %>% fct_infreq()) %>% 
  ggplot(aes(bedrooms)) +
  geom_bar() 

using table command

table(KC_Data$bedrooms)        # most houses have between 1 and 6 bedrooms
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   33 
##   13  199 2760 9824 6882 1601  272   38   13    6    3    1    1

Most houses have between 1 and 6 bedrooms, although the count goes up to 11. The single house listed with 33 bedrooms is an outlier and should be removed.
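
The report does not show a dedicated step for dropping the 33-bedroom record. One possible way to do it, sketched here on made-up data (toy, cleaned and the prices are invented for illustration), is to filter the row out and then drop the now-unused factor level so it no longer appears in tables and plots:

```r
# Made-up bedroom counts including the implausible value 33
toy <- data.frame(
  price    = c(221900, 538000, 180000, 640000),
  bedrooms = factor(c(3, 4, 2, 33))
)

# Remove the row, then drop the empty "33" level from the factor
cleaned <- toy[toy$bedrooms != "33", ]
cleaned$bedrooms <- droplevels(cleaned$bedrooms)

print(levels(cleaned$bedrooms))
```

Without droplevels(), the level "33" would remain in the factor with a count of zero.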

12.4 - variable bathrooms visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(bathrooms = bathrooms %>% fct_infreq()) %>% 
  ggplot(aes(bathrooms)) +
  geom_bar() +
  coord_flip()

using table command

table(KC_Data$bathrooms)      # most common: 1, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.5; 30 distinct values
## 
##    0  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75 
##   10    4   72 3852    9 1446 3048 1930 2047 5380 1185  753  589  731  155 
##    4 4.25  4.5 4.75    5 5.25  5.5 5.75    6 6.25  6.5 6.75  7.5 7.75    8 
##  136   79  100   23   21   13   10    4    6    2    2    2    1    1    2

The majority of houses have 1, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3 or 3.5 bathrooms. There are 30 distinct bathroom counts in our data set, ranging from 0 to 8.

12.5 - variable condition visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(condition = condition %>% fct_infreq()) %>% 
  ggplot(aes(condition)) +
  geom_bar() 

using table command

table(KC_Data$condition)      # most common is 3, then 4, 5, 2 and 1
## 
##     1     2     3     4     5 
##    30   172 14031  5679  1701

Condition 3 is the most common, followed by 4, then 5, then 2 and lastly 1.

12.6 - variable Grade visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(grade = grade %>% fct_infreq()) %>% 
  ggplot(aes(grade)) +
  geom_bar() 

using table command

table(KC_Data$grade)          # mostly 6-10 on the 1-13 scale
## 
##    1    3    4    5    6    7    8    9   10   11   12   13 
##    1    3   29  242 2038 8981 6068 2615 1134  399   90   13

Grades 6 to 10 are the most common on the 1-13 scale, with grade 7 (average) the most frequent. According to our data set, the level of construction and design of the housing is therefore mostly in the average range.

12.7 - variable floors visualization using bar plot and table

using bar plot

KC_Data %>% 
  mutate(floors = floors %>% fct_infreq()) %>% 
  ggplot(aes(floors)) +
  geom_bar() 

using table command

table(KC_Data$floors)         # mostly 1 and 2, then 1.5
## 
##     1   1.5     2   2.5     3   3.5 
## 10680  1910  8241   161   613     8

The majority of the houses in our data set have 1 or 2 floors, followed by 1.5. Overall, the houses have 1, 1.5, 2, 2.5, 3 or 3.5 floors.

13. New variable ‘rate’

A column ‘rate’ is created, holding the selling price per square foot of living space, to help in the assessment.

KC_Data$rate <- KC_Data$price/KC_Data$sqft_living
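
As a quick worked check of this ratio, the first three rows of the data can be computed by hand. The values below are copied from the head() output earlier in the report; the standalone vectors are for illustration only.

```r
# price and sqft_living for the first three rows of the data set
price <- c(221900, 538000, 180000)
sqft_living <- c(1180, 2570, 770)

# Selling price per square foot of living space
rate <- price / sqft_living

print(round(rate))
```

Rounded to whole dollars this gives 188, 209 and 234 per square foot.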

14. Checking the updated structure of the dataset

It is important to check the new structure of the data set to confirm that all the new variables were created correctly and that the factor and date conversions took effect.

str(KC_Data)
## 'data.frame':    21613 obs. of  24 variables:
##  $ id           : num  7129300520 6414100192 5631500400 2487200875 1954400510 ...
##  $ date         : Date, format: "2014-10-13" "2014-12-09" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : Factor w/ 13 levels "0","1","2","3",..: 4 4 3 5 4 5 4 4 4 4 ...
##  $ bathrooms    : Factor w/ 30 levels "0","0.5","0.75",..: 4 9 4 12 8 18 9 6 4 10 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : Factor w/ 6 levels "1","1.5","2",..: 1 3 1 1 1 1 3 1 1 3 ...
##  $ waterfront   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ view         : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ condition    : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : Factor w/ 12 levels "1","3","4","5",..: 6 6 5 6 7 10 6 6 6 6 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  NA 1991 NA NA NA NA NA NA NA NA ...
##  $ zipcode      : chr  "98178" "98125" "98028" "98136" ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
##  $ age          : num  59 63 82 49 28 13 19 52 55 12 ...
##  $ renage       : num  NA 23 NA NA NA NA NA NA NA NA ...
##  $ rate         : num  188 209 234 308 304 ...

15 - Using tidyr to analyze the relation of every variable with price

This shows how every variable is related to house price, using the tidyr gather() function to reshape the data into long format.

This gives a general insight into how each variable affects price. Later in this report, each variable is examined in detail.

KC_DatarGraph <- gather(KC_Data, variable, value, -price)
ggplot(KC_DatarGraph) +
    geom_jitter(aes(value,price, colour=variable)) + 
    geom_smooth(aes(value,price, colour=variable), method=lm, se=FALSE) +
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relationship Of Price With Other variables")
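
To make the long format used for these faceted plots concrete, here is a small base-R stand-in for the reshaping step. It is only a sketch on an invented three-row data frame (toy, predictors and long are illustrative names) and does not call tidyr itself; it mimics the result of gathering every column except price into a variable/value pair.

```r
# Invented toy data: one response (price) and two numeric predictors
toy <- data.frame(
  price       = c(221900, 538000, 180000),
  sqft_living = c(1180, 2570, 770),
  age         = c(59, 63, 82)
)

predictors <- setdiff(names(toy), "price")

# One row per (house, predictor) pair: price is repeated for each predictor
long <- data.frame(
  price    = rep(toy$price, times = length(predictors)),
  variable = rep(predictors, each = nrow(toy)),
  value    = unlist(toy[predictors], use.names = FALSE)
)

print(long)
```

Because the value column can hold only one type, gathering numeric, factor, date and character columns together forces them to a common type, which is one reason the later sections reshape numeric and factor variables separately.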

16. Using tidyr to see the relation of every numeric variable with price

  • Plotting all the numeric variables against price.
KC_Data %>%
  select(-id, -yr_renovated, -yr_built) %>%
  keep(is.numeric) %>% 
  gather(key,value,-price) %>%
  ggplot(aes(x=value,y=price)) +
  geom_jitter(color = 'blue',alpha = .6) +
  geom_smooth(method = 'gam', color= 'red', fill = 'grey', alpha = .2) +
  facet_wrap(~key, scales = 'free') + 
  theme_bw()

17. Using tidyr to see the relationship of every factor variable with price

  • Factor variables plotted against price.
KC_Data %>%
  select(waterfront,bedrooms,bathrooms,view,price,zipcode,grade,condition) %>%
  gather(key,value,-price) %>%
  ggplot(aes(x=value,y=price)) +
  geom_point(color= 'blue', fill = 'grey', alpha = .5) +
  facet_wrap(~key, scales = 'free') + 
  theme_bw() +
  coord_flip()

Bar graphs of the factor variables, selected with keep() from purrr and reshaped with gather() from tidyr

KC_Data %>%
  keep(is.factor) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_bar(fill = 'blue') + 
  theme_bw()

18 - Plotting all the numeric variables using facet_wrap

  • All numeric variables are plotted using facet_wrap.
KC_Data %>%
  keep(is.numeric) %>% 
  select(-id, -yr_built, -yr_renovated) %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_density(fill= 'blue') +
  theme_bw()

KC_Data %>%
  keep(is.numeric) %>% 
  select(-id, -lat, -long, -yr_renovated, -yr_built) %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(fill= 'blue') +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

19. Looking at facilities of house (some numeric variables with price)

scatterplotMatrix(~ price + sqft_living + sqft_above + sqft_lot + sqft_basement,
    data = KC_Data,
    main = "Price vs size of house")

Seven variables seem to affect housing prices noticeably: sqft_living, bathrooms, bedrooms, grade, view, lat and sqft_basement. Others may affect the price of the house too.

Let's verify this using box plots and scatter plots; the prices are mapped later.

20. Removing outliers

  • Before plotting the variables, let us first clean outliers from our data.

Removing outliers improves the quality and generalization of models by reducing the variance of the model. In our data we do find very expensive houses, whose prices differ greatly from the rest.

# Return the row indices of values lying more than 4 * IQR outside the quartiles.
# Despite its name, this function only locates outliers; the loop below drops them.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 4 * IQR(x, na.rm = na.rm)
  which(x < (qnt[1] - H) | x > (qnt[2] + H))
}
num <- 0
for (name in names(KC_Data)) {
  if (grepl("sqft", name) || name == "price") {
    outliers <- remove_outliers(KC_Data[, name])
    num <- num + length(outliers)
    # Guard: indexing with -integer(0) would drop every row
    if (length(outliers) > 0) {
      KC_Data <- KC_Data[-outliers, ]
    }
  }
}
# Number of data removed
print(num)
## [1] 2030
# Number of data still available
print(nrow(KC_Data))
## [1] 19583

Thus, 2030 observations were removed as outliers and 19583 observations remain.

Removing outlying observations would reduce the RSS and MSE computed from our data if we went on to build models, and should help such models generalize better. However, the objective of this project is not modeling; it focuses on visualization and on drawing conclusions from it. Modeling is therefore left out for now.
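
The rule used in this section flags a value when it lies more than 4 times the interquartile range below the first quartile or above the third. The sketch below applies that rule to a small invented vector (x and the other names are illustrative) and also shows why removing rows with a negative index needs care when no outlier is found.

```r
# Invented sample with one extreme value
x <- c(10, 12, 11, 13, 12, 14, 11, 200)

# Fences: Q1 - 4 * IQR and Q3 + 4 * IQR
qnt <- quantile(x, probs = c(0.25, 0.75))
H <- 4 * IQR(x)
lower <- qnt[[1]] - H
upper <- qnt[[2]] + H

# Positions of the values outside the fences
outlier_idx <- which(x < lower | x > upper)

# Drop them only if there is something to drop
kept <- if (length(outlier_idx) > 0) x[-outlier_idx] else x

# Pitfall: a negative index built from an empty vector selects nothing,
# so the result is empty rather than unchanged
empty_drop <- x[-integer(0)]

print(outlier_idx)
print(kept)
print(length(empty_drop))
```

Here the quartiles are 11 and 13.25, so the fences are 2 and 22.25 and only the value 200 (position 8) is flagged.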

21. Visualization - Exploratory Data Analysis and Data Checking using box plots, bar charts and scatter diagrams

  • Using box plots to show the relationship of each factor variable with house sale price

### 21.1. Assessing price vs. bedrooms using box plots and a bar chart

Using simple Box plot

## Price vs. bedrooms ->> There is relationship between price and bedrooms (significant relationship exists)

boxplot1=boxplot(price~bedrooms, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. bedrooms", xlab="bedrooms", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a box plot and change the outliers' colour in a ggplot box plot
# (the title is set with ggtitle(); aes() has no "main" aesthetic)
ggplot(KC_Data, aes(x = bedrooms, y = price, fill = bedrooms)) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  ggtitle("Price vs. bedrooms")

Using bar chart in ggplot

KC_Data %>%
  mutate(bedrooms = as.factor(bedrooms)) %>%
  group_by(bedrooms) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(bedrooms = reorder(bedrooms,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = bedrooms,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = bedrooms, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'bedrooms', 
       y = 'Median Price', 
       title = 'bedrooms and Median Price') +
  coord_flip() + 
  theme_bw()

Price and the number of bedrooms are positively related: as the number of bedrooms increases, the median price also tends to increase.
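
The quantity behind the bar chart is simply the median price within each bedroom group. As a minimal base-R sketch on invented data (toy and median_price are illustrative names, and the prices are made up), the same summary can be computed and sorted like this:

```r
# Invented toy data: six sales across three bedroom counts
toy <- data.frame(
  bedrooms = factor(c(2, 2, 3, 3, 3, 4)),
  price    = c(180000, 220000, 300000, 350000, 400000, 600000)
)

# Median price per bedroom group
median_price <- aggregate(price ~ bedrooms, data = toy, FUN = median)

# Sort from the most to the least expensive group
median_price <- median_price[order(-median_price$price), ]

print(median_price)
```

In this toy example the medians are 600000, 350000 and 200000 for 4, 3 and 2 bedrooms respectively.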

21.2 - Assessing price vs. bathrooms using box plots and a bar chart

  • To examine how the number of bathrooms affects the price.

Using ggplot Box plot with outliers in red color

# Create a box plot and change the outliers' colour in a ggplot box plot
# (the title is set with ggtitle(); aes() has no "main" aesthetic)
ggplot(KC_Data, aes(x = bathrooms, y = price, fill = bathrooms)) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  coord_flip() +
  ggtitle("Price vs. Bathrooms")

Using simple Box plot

## Price vs. Bathrooms ->> Nice correlation, as # of bathrooms increases [median of bar plot], price increases as well, with one exception when bathroom=7

boxplot2=boxplot(price~bathrooms, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. Bathrooms", xlab="Bathrooms", ylab="Price")

Using bar chart in ggplot

KC_Data %>%
  mutate(bathrooms = as.factor(bathrooms)) %>%
  group_by(bathrooms) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(bathrooms = reorder(bathrooms,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = bathrooms,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = bathrooms, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 3.5, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'bathrooms', 
       y = 'Median Price', 
       title = 'bathrooms and Median Price') +
  coord_flip() + 
  theme_bw()

The price of a house and its number of bathrooms are positively related: as the number of bathrooms increases, the median price increases as well.

21.3 - Assessing how the Grade affects the price using boxplots, and bar chart.

Using simple Box plot

## Price vs. Grade ->> Nice correlation, grade increases [median of bar plot], price increases as well

boxplot3=boxplot(price~grade, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. Grade", xlab="Grade", ylab="Price")

Using ggplot Box plot with outliers in red color

# Create a box plot and change the outliers' colour in a ggplot box plot
# (the title is set with ggtitle(); aes() has no "main" aesthetic)
ggplot(KC_Data, aes(x = grade, y = price, fill = grade)) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  ggtitle("Price vs. Grade")

Using bar chart in ggplot

KC_Data %>%
  mutate(grade = as.factor(grade)) %>%
  group_by(grade) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(grade = reorder(grade,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = grade,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = grade, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'grade', 
       y = 'Median Price', 
       title = 'grade and Median Price') +
  coord_flip() + 
  theme_bw()

**Price and grade are also positively related. As grade increases, the median price also increases.**

21.4. Assessing how view affects the price using box plots and a bar chart.

Using simple Box plot

## Price vs. View ->> Clear positive relationship: as view increases, the median price (centre line of each box) increases as well

boxplot4=boxplot(price~view, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. View", xlab="View", ylab="Price")

Using ggplot Box plot with outliers in red color

# Box plot of price per view level, with outliers drawn in red (title set via labs(), not aes())
ggplot(KC_Data, aes(x = factor(view), y = price, fill = factor(view))) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "View", y = "Price", fill = "View", title = "Price vs. View")

Using bar chart in ggplot

KC_Data %>%
  mutate(view = as.factor(view)) %>%
  group_by(view) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(view = reorder(view,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = view,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = view, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'view', 
       y = 'Median Price', 
       title = 'view and Median Price') +
  coord_flip() + 
  theme_bw()

Price and view are positively related. As view increases, the median price of a house also increases.

21.5 - Assessing how condition affects the price using box plots and a bar chart.

Using simple Box plot

## Price vs. condition ->> There is almost no relationship between price and condition
boxplot5=boxplot(price~condition, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. condition", xlab="condition", ylab="Price")

Using ggplot Box plot with outliers in red color

# Box plot of price per condition level, with outliers drawn in red (title set via labs(), not aes())
ggplot(KC_Data, aes(x = factor(condition), y = price, fill = factor(condition))) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "Condition", y = "Price", fill = "Condition", title = "Price vs. condition")

Using bar chart in ggplot

KC_Data %>%
  mutate(condition = as.factor(condition)) %>%
  group_by(condition) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(condition = reorder(condition,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = condition,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = condition, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'condition', 
       y = 'Median Price', 
       title = 'condition and Median Price') +
  coord_flip() + 
  theme_bw()

**There is very little or no relationship between price and condition. The relationship that we see is almost insignificant.**

21.6 - Assessing how the number of floors affects the price using box plots and a bar chart.

Using simple Box plot

## Price vs. floors ->> There is almost no relationship between price and floors (only an insignificant relationship exists)

boxplot6=boxplot(price~floors, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. floors", xlab="floors", ylab="Price")

Using ggplot Box plot with outliers in red color

# Box plot of price per number of floors, with outliers drawn in red (title set via labs(), not aes())
ggplot(KC_Data, aes(x = factor(floors), y = price, fill = factor(floors))) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "Floors", y = "Price", fill = "Floors", title = "Price vs. floors")

Using bar chart in ggplot

KC_Data %>%
  mutate(floors = as.factor(floors)) %>%
  group_by(floors) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(floors = reorder(floors,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = floors,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = floors, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'floors', 
       y = 'Median Price', 
       title = 'floors and Median Price') +
  coord_flip() + 
  theme_bw()

**The relationship that we see between floors and price is almost insignificant. However, it shows a slight positive correlation.**

21.7. Assessing how waterfront affects the price using box plots and a bar chart.

Using ggplot Box plot with outliers in red color

# Box plot of price by waterfront status, with outliers drawn in red (title set via labs(), not aes())
ggplot(KC_Data, aes(x = factor(waterfront), y = price, fill = factor(waterfront))) + 
  geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "Waterfront", y = "Price", fill = "Waterfront", title = "Price vs. waterfront")

Using bar chart in ggplot

KC_Data %>%
  mutate(waterfront = as.factor(waterfront)) %>%
  group_by(waterfront) %>%
  dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(waterfront = reorder(waterfront,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  
  ggplot(aes(x = waterfront,y = Median_Price)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = waterfront, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'waterfront', 
       y = 'Median Price', 
       title = 'waterfront and Median Price') +
  coord_flip() + 
  theme_bw()

21.8 - Assessing how Year Renovated affects the price using a bar chart.

Using bar chart in ggplot

KC_Data %>%
  group_by(yr_renovated) %>%
  dplyr::summarise(Median_Price = median(price, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(yr_renovated = reorder(yr_renovated,Median_Price)) %>%
  arrange(desc(Median_Price)) %>%
  head(10) %>%
  
  
  ggplot(aes(x = yr_renovated,y = Median_Price)) +
  geom_bar(stat='identity',colour="white",fill = "blue") +
  geom_text(aes(x = yr_renovated, y = 1, label = paste0("(", Median_Price, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'yellow',
            fontface = 'bold') +
  labs(x = 'year renovated', 
       y = 'Median Price', 
       title = 'Year renovated and Median Price') +
   coord_flip() +
  theme_bw()

Year renovated doesn’t affect the price of the house.

22 Using scatter plots to indicate the relationship of numeric and date variables with house prices

22.1 - Price Plots

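We first plot the distribution of price. Unfortunately, the unrestricted histogram (case 1) does not reveal much, so cases 2 and 3 restrict the x-axis to 2 million and 1 million respectively.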

case 1 Price Plot

KC_Data %>%
  
  ggplot(aes(x = price)) +    
  geom_histogram(alpha = 0.8,fill = "blue") +
  
  labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

case 2 Price Plot

KC_Data %>%
  
  ggplot(aes(x = price)) +    
  geom_histogram(alpha = 0.8,fill = "blue") +
  scale_x_continuous(limits=c(0,2e6)) +
  
  labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

case 3 - Price Plot

KC_Data %>%
  
  ggplot(aes(x = price)) +    
  geom_histogram(alpha = 0.8,fill = "blue") +
  scale_x_continuous(limits=c(0,1e6)) +
  
  labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
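
The three histograms above rely on the default of 30 bins, which is what the `stat_bin()` message points out, and the long right tail of price squeezes most houses into a few bars. A minimal sketch of two common remedies, an explicit bin width and a log10 scale, is given below. It uses base R and synthetic prices; all numbers are invented and are not taken from KC_Data.

```r
set.seed(1)
# Synthetic right-skewed prices (log-normal); a stand-in for KC_Data$price
price <- round(exp(rnorm(1000, mean = log(450e3), sd = 0.5)))

# Explicit bin edges every 100,000 instead of a default bin count
breaks <- seq(0, ceiling(max(price) / 1e5) * 1e5, by = 1e5)
h_raw <- hist(price, breaks = breaks, plot = FALSE)

# On a log10 scale the long right tail is compressed
h_log <- hist(log10(price), breaks = 30, plot = FALSE)

print(h_raw$counts)
print(h_log$counts)
```

In the ggplot2 code of this report, the corresponding changes would be `geom_histogram(binwidth = 1e5)` and `scale_x_log10()`.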

22.2. Sqft Living and Price scatter Plot

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_living)) %>% 
 
  ggplot(aes(x=sqft_living,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(x=sqft_living,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Living)")+
  ylab("Price")

Price and sqft_living are clearly positively correlated. As sqft_living increases, price also increases.
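
The red line drawn by `stat_smooth(method = "lm")` is an ordinary least-squares fit, so its slope and goodness of fit can be read off directly with `lm()`. The sketch below uses synthetic data (the coefficients 50e3 and 250 and the noise level are invented assumptions, not estimates from KC_Data).

```r
set.seed(42)
# Synthetic data: price grows roughly linearly with living area (invented numbers)
sqft_living <- runif(200, min = 500, max = 5000)
price <- 50e3 + 250 * sqft_living + rnorm(200, sd = 100e3)

fit <- lm(price ~ sqft_living)
slope <- unname(coef(fit)["sqft_living"])  # estimated price change per extra square foot
r2 <- summary(fit)$r.squared               # share of price variance explained by the line

print(c(slope = slope, r_squared = r2))
```

For a single predictor, the R-squared equals the squared Pearson correlation, which ties this fit to the correlation values reported later in section 27.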

22.3 - Assessing how Sqft_living15 affects Price using scatter Plot

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_living15)) %>% 
 
  ggplot(aes(x=sqft_living15,y=price))+
  geom_point(color = "blue")+
  
  stat_smooth(aes(x=sqft_living15,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Living15)")+
  ylab("Price")

Price and sqft_living15 are clearly positively correlated. As sqft_living15 increases, price also increases.

22.4 - Assessing whether Sqft Living and Sqft Living15 correlate using scatter Plot

KC_Data %>% 
  filter(!is.na(sqft_living15)) %>% 
  filter(!is.na(sqft_living)) %>% 
 
  ggplot(aes(x=sqft_living15,y=sqft_living))+
  geom_point(color = "blue")+
  
  stat_smooth(aes(x=sqft_living15,y=sqft_living),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Living15)")+
  ylab("sqft_living")

**sqft_living and sqft_living15 are highly correlated.**

22.5 - Assessing whether Sqft Lot and Price correlate (using scatter Plot)

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_lot)) %>% 
  
  ggplot(aes(x=sqft_lot,y=price))+
  geom_point(color = "blue")+
  
  scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
  stat_smooth(aes(x=sqft_lot,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Lot)")+
  ylab("Price")

**sqft_lot and price have only a very weak relationship, although price still increases slightly as sqft_lot increases.**

22.6 - Assessing whether Sqft Lot15 and Price correlate (using scatter Plot)

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_lot15)) %>% 
  
  ggplot(aes(x=sqft_lot15,y=price))+
  geom_point(color = "blue")+
  
  scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot15))) +
  stat_smooth(aes(x=sqft_lot15,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Lot15)")+
  ylab("Price")

**sqft_lot15 and price have only a very weak relationship, although price still increases slightly as sqft_lot15 increases.**

22.7 - Assessing whether sqft_lot and sqft_lot15 correlate (using scatter Plot)

KC_Data %>% 
  filter(!is.na(sqft_lot15)) %>% 
  filter(!is.na(sqft_lot)) %>% 
  
  ggplot(aes(x=sqft_lot,y=sqft_lot15))+
  geom_point(color = "blue")+
  
  scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
  stat_smooth(aes(x=sqft_lot,y=sqft_lot15),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("(Sqft Lot)")+
  ylab("sqft_lot15")

**sqft_lot and sqft_lot15 have a high positive correlation.**

22.8 - Assessing how lat affects the price of a house (using scatter Plot)

## Price vs. Lat ->> The relationship is roughly bell-shaped: price peaks around lat = 47.64 and declines afterwards. This can be modeled easily, so we would say lat helps explain the price as well.

boxplot5=boxplot(price~lat, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. Lat", xlab="Lat", ylab="Price")

or

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(lat)) %>% 
 
  ggplot(aes(x=lat,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(x=lat,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("lat")+
  ylab("Price")

**Price vs. lat shows a roughly bell-shaped relationship. The house price peaks around lat = 47.64 and declines afterwards. Generally, we would say that lat explains the price well.**
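
A straight line cannot follow a relationship that rises to a peak and then falls, so a quadratic term is one simple way to model it. The sketch below is an illustration on synthetic data constructed to peak at lat = 47.64 (the peak height, curvature and noise are invented assumptions). Latitude is centred at 47.5 before squaring to keep the fit numerically stable.

```r
set.seed(7)
# Synthetic data: price peaks near lat = 47.64 and falls off on both sides
lat <- runif(500, min = 47.2, max = 47.8)
price <- 800e3 - 4e6 * (lat - 47.64)^2 + rnorm(500, sd = 50e3)

lat_c <- lat - 47.5   # centred latitude

fit_linear <- lm(price ~ lat_c)
fit_quad   <- lm(price ~ lat_c + I(lat_c^2))

# Latitude at which the fitted parabola peaks: centre - b / (2a)
b <- coef(fit_quad)
peak_lat <- unname(47.5 - b["lat_c"] / (2 * b["I(lat_c^2)"]))

r2_linear <- summary(fit_linear)$r.squared
r2_quad   <- summary(fit_quad)$r.squared
print(c(r2_linear = r2_linear, r2_quad = r2_quad, peak_lat = peak_lat))
```

Comparing the two R-squared values shows how much of the pattern the straight line misses. On the real data the peak location would come from the fitted coefficients rather than being known in advance.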

22.9 - Assessing how the age of a house affects its price using scatter Plot

## Price vs. age ->> There is almost no relationship between price and age (only an insignificant relationship exists)
boxplot12=boxplot(price~age, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. age", xlab="age", ylab="Price")

OR

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(age)) %>% 
 
  ggplot(aes(x=age,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(age,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("age")+
  ylab("Price")

Price and age have almost no relationship, and age is insignificant in explaining the housing price.

23. Assessing how renage affects the price of a house (using scatter Plot)

## Price vs. renage ->> There is almost no relationship between price and renage (only an insignificant relationship exists)

boxplot13=boxplot(price~renage , data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. renage ", xlab="renage ", ylab="Price")

OR

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(renage)) %>% 
 
  ggplot(aes(x=renage,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(renage,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("renage")+
  ylab("Price")

Price and renage have almost no relationship, and renage is insignificant in explaining the housing price.

24. Assessing how sqft_basement affects the price of a house (using scatter Plot)

## Price vs. sqft_basement 

boxplot6=boxplot(price~sqft_basement, data=KC_Data, 
  col=(c("gold","darkgreen")),
  main="Price vs. sqft_basement", xlab="sqft_basement", ylab="Price")

OR

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(sqft_basement)) %>% 
 
  ggplot(aes(x=sqft_basement,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(sqft_basement,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("sqft_basement")+
  ylab("Price")

**Price and sqft_basement have a good correlation. We would say sqft_basement explains the price well.**

25. Assessing how date affects the price of a house using scatter Plot

KC_Data %>% 
  filter(!is.na(price)) %>% 
  filter(!is.na(date)) %>% 
 
  ggplot(aes(x=date,y=price))+
  geom_point(color = "blue")+
  stat_smooth(aes(date,y=price),method="lm", color="red")+
  theme_bw()+
  theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  xlab("date")+
  ylab("Price")

Price and date have almost no relationship. Thus, date doesn’t explain house price.
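
A single straight line over all sale dates can hide month-to-month variation, so a complementary check is the median price per sale month. The sketch below is a hedged illustration on toy rows. It assumes the raw `date` strings follow a "YYYYMMDDT000000" pattern, which is an assumption about the raw file; the parsing step can be skipped if `date` is already a Date.

```r
# Toy rows, invented for illustration (not taken from KC_Data)
toy <- data.frame(
  date  = c("20140502T000000", "20140520T000000", "20141013T000000", "20150115T000000"),
  price = c(300e3, 500e3, 450e3, 420e3)
)

# Keep the first 8 characters (YYYYMMDD) and convert them to Date
toy$date  <- as.Date(substr(toy$date, 1, 8), format = "%Y%m%d")
toy$month <- format(toy$date, "%Y-%m")

# Median price per sale month
monthly <- aggregate(price ~ month, data = toy, FUN = median)
print(monthly)
```

If the monthly medians stay roughly flat, that supports the reading that the sale date does not explain the price.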

26 - More visualizations

ggplot(data = KC_Data) +
  geom_point(mapping = aes(x = sqft_above, y = price, color = price)) +
  # price is continuous, so the continuous viridis scale is added to the plot itself
  viridis::scale_color_viridis()

ggplot(data = KC_Data) +
  geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
  geom_smooth(mapping = aes(x = sqft_living, y = price))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(data = KC_Data) +
  geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
  
  geom_smooth(mapping = aes(x = sqft_living, y = price, linetype = factor(waterfront)))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Based on the visualizations above, variables such as number of bedrooms, number of bathrooms, grade, view, condition of the house, whether the house has a waterfront or not, sqft_basement, sqft_living, sqft_living15, lat, sqft_lot and sqft_lot15 affect the price of the houses positively, even though the degree differs.

**sqft_living, sqft_living15, bathrooms, grade, view, bedrooms, condition and sqft_basement are the factors that have an effect on the prices of the houses.**

27. Conducting more correlations to strengthen our claim

Testing correlation of the variables against price

  • To strengthen our claim, we also computed the correlation between price and each of the other variables.

27.1 - Let us run the correlation function.

vctCorr = numeric(0)
for (i in names(KC_Data))
{
    cor.result <- cor(KC_Data$price, as.numeric(KC_Data[,i])) 
    vctCorr <- c(vctCorr, cor.result)
}
KC_DatarCorr <- vctCorr
names(KC_DatarCorr) <- names(KC_Data)
KC_DatarCorr
##            id          date         price      bedrooms     bathrooms 
##   0.006811293  -0.006670429   1.000000000   0.312944125   0.482055213 
##   sqft_living      sqft_lot        floors    waterfront          view 
##   0.660267550   0.112404847   0.266922130   0.145903977   0.360685672 
##     condition         grade    sqft_above sqft_basement      yr_built 
##   0.052516561   0.663308821   0.550784781   0.303967093   0.029900519 
##  yr_renovated       zipcode           lat          long sqft_living15 
##            NA  -0.024979857   0.377864416   0.007129642   0.582457178 
##    sqft_lot15           age        renage          rate 
##   0.114853357  -0.029848287            NA   0.545482836
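
The NA entries for yr_renovated and renage in the output above are what `cor()` returns by default when a column contains missing values. Assuming that missing values are indeed the cause here, the `use` argument lets `cor()` drop the incomplete pairs. A minimal sketch on toy data (values invented for illustration):

```r
# Toy data with one missing value (invented for illustration)
toy <- data.frame(
  price        = c(200e3, 350e3, 500e3, 650e3, 800e3),
  yr_renovated = c(1990, NA, 2005, 2010, 2014)
)

# Default behaviour: any missing value makes the result NA
r_default <- cor(toy$price, toy$yr_renovated)

# Drop incomplete pairs before computing the correlation
r_complete <- cor(toy$price, toy$yr_renovated, use = "complete.obs")

print(c(default = r_default, complete = r_complete))
```

With `use = "complete.obs"` the coefficient is computed on fewer rows, so it is worth noting how many rows remain when interpreting it.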

27.2 Correlation between house variables continued ….

ggcorr(KC_Data, hjust = 0.8, layout.exp = 1) + 
ggtitle("Correlation between house variables")

27.3 A correlation matrix to see how predictors relate.

KC_Data %>%
  select (-date, -id, -yr_built, -yr_renovated, -renage) %>%
  apply(2,as.character) %>%
  apply(2,as.numeric) %>%
  cor(use='everything',method='pearson') %>%
  corrplot(type='lower', diag = F)
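
A correlation plot is easy to scan but hard to rank by eye. As a complement, the price column of the correlation matrix can be sorted by absolute value. The sketch below uses a synthetic stand-in for KC_Data: the column names mirror the report, but the values and the relationship between them are invented.

```r
set.seed(3)
n <- 300
# Synthetic stand-in for KC_Data; values are invented
toy <- data.frame(sqft_living = runif(n, 500, 5000),
                  condition   = sample(1:5, n, replace = TRUE),
                  long        = runif(n, -122.5, -121.3))
toy$price <- 100e3 + 250 * toy$sqft_living + rnorm(n, sd = 150e3)

cor_mat <- cor(toy, method = "pearson")

# Correlation of each predictor with price, strongest first
cor_price <- cor_mat[colnames(cor_mat) != "price", "price"]
ranked <- cor_price[order(abs(cor_price), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strongly negative predictors near the top as well, which a sort on the raw values would push to the bottom.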

28. More visualizations - Maps of Houses

Are houses located near the coast costlier?

  • To answer this question, the houses are mapped by price range.

KC_Data$Price_Bin<-cut(KC_Data$price, c(0,250e3,500e3,750e3,1e6,999e6))

center_lon = median(KC_Data$long,na.rm = TRUE)
center_lat = median(KC_Data$lat,na.rm = TRUE)

factpal <- colorFactor(c("black","blue","yellow", "orange", "red"), 
                       KC_Data$Price_Bin)



leaflet(KC_Data) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircles(lng = ~long, lat = ~lat, 
             color = ~factpal(Price_Bin))  %>%
  # controls
  setView(lng=center_lon, lat=center_lat,zoom = 12) %>%
  
  addLegend("bottomright", pal = factpal, values = ~Price_Bin,
            title = "House Price Distribution",
            opacity = 1)
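
The price bins above are created with `cut()`, whose intervals are right-closed by default. A house priced at exactly 250,000 therefore belongs to the first bin, not the second. The sketch below shows this with the same break points and a few invented example prices.

```r
breaks <- c(0, 250e3, 500e3, 750e3, 1e6, 999e6)       # same break points as above
price  <- c(200e3, 250e3, 250001, 600e3, 1e6, 1.2e6)  # invented example prices

# Intervals are right-closed: 250000 falls in the first bin, 250001 in the second
bin <- cut(price, breaks)
print(data.frame(price, bin))

# Number of houses per bin
print(table(bin))
```

Because the first interval is open on the left, a price of exactly 0 would be assigned NA. Passing `include.lowest = TRUE` to `cut()` would keep such a value in the first bin.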

28.1 - Price Bins Count

Most of the houses are in the range 250 thousand to 500 thousand. The next highest categories are

500 to 750 thousand

0 to 250 thousand

750 thousand to 1 million

1 million and above

KC_Data %>%
  mutate(Price_Bin = as.factor(Price_Bin)) %>%
  group_by(Price_Bin) %>%
  dplyr::summarise(Count = n()) %>%
  ungroup() %>%
  mutate(Price_Bin = reorder(Price_Bin,Count)) %>%
  arrange(desc(Count)) %>%
  
  ggplot(aes(x = Price_Bin,y = Count)) +
  geom_bar(stat='identity',colour="white", fill = "blue") +
  geom_text(aes(x = Price_Bin, y = 1, label = paste0("(", Count, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'red',
            fontface = 'bold') +
  labs(x = 'Price_Bin', 
       y = 'Count', 
       title = 'Price_Bin and Count') +
  coord_flip() + 
  theme_bw()

29. Price Bins and Maps

PriceBinGrouping = function(limit1, limit2)
{
  return(
    
    KC_Data %>%
      filter(price > limit1) %>%
      filter(price <= limit2)
  )
}

PriceGroup1 = PriceBinGrouping(0,250e3)

PriceGroup2 = PriceBinGrouping(250e3,500e3)

PriceGroup3 = PriceBinGrouping(500e3,750e3)

PriceGroup4 = PriceBinGrouping(750e3,1e6)

PriceGroup5 = PriceBinGrouping(1e6,999e6)
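
As an alternative to calling PriceBinGrouping once per range, the same five groups can be produced in one step by splitting the data on the binned price. The sketch below is an illustration on a toy data frame (values invented, and the name `groups` is hypothetical), using the same break points as the calls above.

```r
# Toy stand-in for KC_Data (invented values)
toy <- data.frame(price = c(180e3, 300e3, 420e3, 610e3, 900e3, 1.5e6),
                  lat   = c(47.30, 47.45, 47.50, 47.60, 47.62, 47.63),
                  long  = c(-122.20, -122.30, -122.10, -122.30, -122.20, -122.25))

toy$Price_Bin <- cut(toy$price, c(0, 250e3, 500e3, 750e3, 1e6, 999e6))

# One data frame per price bin, in bin order
groups <- split(toy, toy$Price_Bin)
print(sapply(groups, nrow))
```

Each element of `groups` can then be passed to a mapping function in the same way as PriceGroup1 to PriceGroup5. Both approaches use intervals that exclude the lower limit and include the upper limit, so they assign houses to the same groups.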

29.1 The map showing houses in the price range from 0 to 250 thousand.

# Draw the houses of one price group on an Esri base map, centred on the group's median location
MapPriceGroups = function(PriceGroupName,color)
{
  center_lon = median(PriceGroupName$long,na.rm = TRUE)
  center_lat = median(PriceGroupName$lat,na.rm = TRUE)

  # Use the data frame passed in, not a hard-coded price group
  leaflet(PriceGroupName) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
    addCircles(lng = ~long, lat = ~lat, 
               color = color)  %>%
    # controls
    setView(lng=center_lon, lat=center_lat,zoom = 12)
}

MapPriceGroups(PriceGroup1,"black")

**As can be seen from the map above, houses in the price range from 0 to 250 thousand (black points) are scattered everywhere across the area. They do not show a clear trend.**

29.2 The map showing houses in the price range from 250 to 500 thousand.

MapPriceGroups(PriceGroup2,"blue")

**The blue points indicate the houses in the price range from 250 to 500 thousand. These houses are still mostly located inland and are not concentrated in coastal areas. We do not see a noticeable trend.**

29.3 The map showing houses in the price range from 500 to 750 thousand.

MapPriceGroups(PriceGroup3,"orange")

**The orange points indicate the houses in the price range from 500 to 750 thousand. These houses are located both inland and in coastal areas. We do not see a noticeable trend.**

29.4 The map showing houses in the price range from 750 thousand to 1 million.

MapPriceGroups(PriceGroup4,"fuchsia")

**The fuchsia points indicate the houses in the price range from 750 thousand to 1 million. These houses are located both inland and in coastal areas, but they are noticeably more concentrated on the coastal side than inland.**

29.5 The map showing houses in the price range from 1 million and above.

MapPriceGroups(PriceGroup5,"red")

**The red points indicate the houses priced at 1 million and above. These houses are concentrated much more in coastal areas than inland.**

30 - Conclusion from our map visualization

  • Houses near the coast area are much costlier.

31. Summary of findings

  • Moving from inland towards the coast, houses become more and more expensive.

  • The variables that affect house prices positively are:

From categorical variables: number of bedrooms, number of bathrooms, grade, view, condition, and waterfront. The number of floors affects price only slightly and can be ignored.

From numerical variables:

**sqft_living, sqft_living15, lat, and sqft_basement. sqft_lot and sqft_lot15 affect prices only slightly.**

  • We observe that the columns sqft_living15 and sqft_lot15 are strongly correlated with sqft_living and sqft_lot respectively. Thus, sqft_living15 and sqft_lot15 can be dropped in further studies when we proceed to modeling.

  • **Longitude, age, renage, and date sold do not affect the price of the house.**
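
The suggestion above to drop sqft_living15 before modeling can be checked with a quick comparison: if two predictors are nearly duplicates, adding the second one to a model should barely change the adjusted R-squared. The sketch below illustrates the idea on synthetic data in which sqft_living15 is constructed to track sqft_living closely (all coefficients and noise levels are invented assumptions).

```r
set.seed(11)
n <- 400
# Synthetic data: sqft_living15 is built to track sqft_living closely (invented numbers)
sqft_living   <- runif(n, 500, 5000)
sqft_living15 <- 0.8 * sqft_living + rnorm(n, sd = 300)
price <- 100e3 + 250 * sqft_living + rnorm(n, sd = 150e3)

# Correlation between the two predictors
r_pred <- cor(sqft_living, sqft_living15)

# Adjusted R-squared with and without the near-duplicate predictor
adj_one  <- summary(lm(price ~ sqft_living))$adj.r.squared
adj_both <- summary(lm(price ~ sqft_living + sqft_living15))$adj.r.squared

print(c(r_pred = r_pred, adj_one = adj_one, adj_both = adj_both))
```

On the real data the same comparison would use the KC_Data columns. A noticeable gain in adjusted R-squared would be a reason to keep both variables after all.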