This is the final project for the Statistical Programming in R (STAT612) course. The dataset for this project is taken from Kaggle and covers house sales in King County, USA.
The data set was obtained from www.kaggle.com and covers house sales in King County, Washington State, USA. It is available at https://www.kaggle.com/gabriellima/house-sales-in-king-county-usa/data
Housing prices are subject to market forces: sometimes prices rise, at other times they fall. These forces may include interest rates, economic factors (such as GDP, employment, manufacturing, and prices of goods), imports and exports, and government subsidies. They are outside our control and cannot be easily predicted. As a result, this paper does not explore the effect of those external factors on house prices. Instead, we focus on the effect of internal factors, such as number of bedrooms, bathrooms, view, condition, grade, location, and square footage, on housing prices.
The data set contains house prices together with various parameters that may or may not affect them. The objective of the study is to use statistical analysis to determine how price depends on these variables: which parameters strongly affect housing prices and which have minimal effect. The statistical tools we focus on are correlation, box plots, scatter plots, and bar plots. In addition, sale prices are plotted on ESRI maps using the longitude and latitude points to see which parts of the study area have higher prices. Overall, insights into the relationships between the variables are drawn from box plots, histograms, scatter plots, corrgrams, and geospatial mapping.
The data for these sales come from the official public records of home sales in the King County area, Washington State, USA. The data set contains 21613 rows and 21 columns. Each row represents a home sold from May 2014 through May 2015. Below is a breakdown of the variables involved:
[, 1] id - Unique ID for each home sold.
[, 2] date - Date of the home sale.
[, 3] price - Price of each home sold.
[, 4] bedrooms - Number of bedrooms.
[, 5] bathrooms - Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower.
[, 6] sqft_living - Square footage of the apartment's interior living space.
[, 7] sqft_lot - Square footage of the land space.
[, 8] floors - Number of floors.
[, 9] waterfront - A variable for whether the apartment was overlooking the waterfront or not.
[, 10] view - An index from 0 to 4 of how good the view of the property was.
[, 11] condition - An index from 1 to 5 on the condition of the apartment.
[, 12] grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
[, 13] sqft_above - The square footage of the interior housing space that is above ground level.
[, 14] sqft_basement - The square footage of the interior housing space that is below ground level.
[, 15] yr_built - The year the house was initially built.
[, 16] yr_renovated - The year of the house’s last renovation.
[, 17] zipcode - What zipcode area the house is in.
[, 18] lat - Latitude.
[, 19] long - Longitude.
[, 20] sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors.
[, 21] sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors.
options(scipen = 999)
options(warn=-1)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(forcats)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(lattice)
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(leaps)
library(tidyr)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(dplyr)
library(ggplot2)
library(corrgram)
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
##
## baseball
## The following object is masked from 'package:lattice':
##
## panel.fill
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
##
## here
## The following object is masked from 'package:base':
##
## date
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(date)
library(FactoMineR)
library(tree)
library(corrplot)
## corrplot 0.84 loaded
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(rpart)
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(corrplot)
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
##
## ozone
## The following object is masked from 'package:purrr':
##
## map
library(rbokeh)
library(stringr)
library(leaflet)
library(corrplot)
We tried to read the data set directly from the website link, but it took too long. We therefore downloaded the data set and read it from a local folder.
### 4. Reading data
KC_Data <- read.csv("kc_house_data.csv", sep=",", header=T, stringsAsFactors=F)
head(KC_Data)
## id date price bedrooms bathrooms sqft_living
## 1 7129300520 20141013T000000 221900 3 1.00 1180
## 2 6414100192 20141209T000000 538000 3 2.25 2570
## 3 5631500400 20150225T000000 180000 2 1.00 770
## 4 2487200875 20141209T000000 604000 4 3.00 1960
## 5 1954400510 20150218T000000 510000 3 2.00 1680
## 6 7237550310 20140512T000000 1230000 4 4.50 5420
## sqft_lot floors waterfront view condition grade sqft_above sqft_basement
## 1 5650 1 0 0 3 7 1180 0
## 2 7242 2 0 0 3 7 2170 400
## 3 10000 1 0 0 3 6 770 0
## 4 5000 1 0 0 5 7 1050 910
## 5 8080 1 0 0 3 8 1680 0
## 6 101930 1 0 0 3 11 3890 1530
## yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 1955 0 98178 47.5112 -122.257 1340 5650
## 2 1951 1991 98125 47.7210 -122.319 1690 7639
## 3 1933 0 98028 47.7379 -122.233 2720 8062
## 4 1965 0 98136 47.5208 -122.393 1360 5000
## 5 1987 0 98074 47.6168 -122.045 1800 7503
## 6 2001 0 98053 47.6561 -122.005 4760 101930
glimpse(KC_Data)
## Observations: 21,613
## Variables: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, ...
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T...
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1230000,...
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, ...
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0...
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ...
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003,...
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47....
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...
# Number of observations
nrow(KC_Data)
## [1] 21613
# Number of variables
ncol(KC_Data)
## [1] 21
Or
dim(KC_Data)
## [1] 21613 21
We have 21613 observations (rows) and 21 columns (variables) in our data set.
KC_Data$date <- as.Date(KC_Data$date, format = "%Y%m%d")
head(KC_Data)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 2014-10-13 221900 3 1.00 1180 5650
## 2 6414100192 2014-12-09 538000 3 2.25 2570 7242
## 3 5631500400 2015-02-25 180000 2 1.00 770 10000
## 4 2487200875 2014-12-09 604000 4 3.00 1960 5000
## 5 1954400510 2015-02-18 510000 3 2.00 1680 8080
## 6 7237550310 2014-05-12 1230000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
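The date conversion above can be checked on a single value. The following is a minimal sketch in base R, using the raw string from the first row of the data; the variable names `raw_date` and `sale_date` are introduced here for illustration only.

```r
# Raw value copied from the first row of the head() output
raw_date <- "20141013T000000"

# as.Date() reads only as far as the format requires, so the trailing
# "T000000" part is ignored
sale_date <- as.Date(raw_date, format = "%Y%m%d")

format(sale_date)                     # ISO form of the parsed date
as.numeric(format(sale_date, "%Y"))   # sale year as a number
```

The same `format(..., "%Y")` idiom is what the later age calculations rely on.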
Variables such as bathrooms, bedrooms, floors, condition, waterfront, view and grade should be converted into factor variables, as they behave like categorical (dummy) variables in our data set. For example, condition takes only five distinct values and grade only twelve. They are therefore better treated as factors than as continuous numeric variables.
Converting bathrooms, bedrooms, floors, condition, waterfront, view and grade into factor variables:
KC_Data$bedrooms <- as.factor(KC_Data$bedrooms)
KC_Data$bathrooms <- as.factor(KC_Data$bathrooms)
KC_Data$waterfront <- as.factor(KC_Data$waterfront)
KC_Data$view <- as.factor(KC_Data$view)
KC_Data$grade <- as.factor(KC_Data$grade)
KC_Data$floors <- as.factor(KC_Data$floors)
KC_Data$condition <- as.factor(KC_Data$condition)
KC_Data$zipcode <- as.character(KC_Data$zipcode)
head(KC_Data)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 2014-10-13 221900 3 1 1180 5650
## 2 6414100192 2014-12-09 538000 3 2.25 2570 7242
## 3 5631500400 2015-02-25 180000 2 1 770 10000
## 4 2487200875 2014-12-09 604000 4 3 1960 5000
## 5 1954400510 2015-02-18 510000 3 2 1680 8080
## 6 7237550310 2014-05-12 1230000 4 4.5 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
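One consequence of the factor conversion is worth keeping in mind: calling `as.numeric()` on a factor returns the internal level codes, not the original values. The sketch below illustrates this in base R with the first six bathroom values from the data; `bath_f` and `bath_back` are names introduced here for illustration.

```r
# Values copied from the first rows of the bathrooms column
bathrooms <- c(1.00, 2.25, 1.00, 3.00, 2.00, 4.50)
bath_f <- as.factor(bathrooms)

levels(bath_f)        # distinct values, stored as text labels
as.numeric(bath_f)    # level codes, NOT the original values

# Going through as.character() recovers the original numbers
bath_back <- as.numeric(as.character(bath_f))
bath_back
```

If a factor column is needed as a number again later (for example to compute a correlation), the `as.numeric(as.character(...))` route is the safer one.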
table(is.na(KC_Data))
##
## FALSE
## 453873
The year built (yr_built) is not very informative on its own. The age of the home at the time of sale matters more.
KC_Data$age <- as.numeric(format(KC_Data$date, "%Y"))-(KC_Data$yr_built)
renage is the renovation age: the number of years between the last renovation and the sale.
KC_Data$yr_renovated[KC_Data$yr_renovated == 0] <- NA
KC_Data$renage <- as.numeric(format(KC_Data$date, "%Y")) - (KC_Data$yr_renovated)
head(KC_Data)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 2014-10-13 221900 3 1 1180 5650
## 2 6414100192 2014-12-09 538000 3 2.25 2570 7242
## 3 5631500400 2015-02-25 180000 2 1 770 10000
## 4 2487200875 2014-12-09 604000 4 3 1960 5000
## 5 1954400510 2015-02-18 510000 3 2 1680 8080
## 6 7237550310 2014-05-12 1230000 4 4.5 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15 age
## 1 NA 98178 47.5112 -122.257 1340 5650 59
## 2 1991 98125 47.7210 -122.319 1690 7639 63
## 3 NA 98028 47.7379 -122.233 2720 8062 82
## 4 NA 98136 47.5208 -122.393 1360 5000 49
## 5 NA 98074 47.6168 -122.045 1800 7503 28
## 6 NA 98053 47.6561 -122.005 4760 101930 13
## renage
## 1 NA
## 2 23
## 3 NA
## 4 NA
## 5 NA
## 6 NA
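The two derived columns can be reproduced on a small example. This is a minimal sketch in base R that uses the dates and years of the first three rows shown above; the data frame name `sales` is hypothetical.

```r
# First three sales from the data (date, yr_built, yr_renovated)
sales <- data.frame(
  date         = as.Date(c("2014-10-13", "2014-12-09", "2015-02-25")),
  yr_built     = c(1955, 1951, 1933),
  yr_renovated = c(0, 1991, 0)
)

sale_year <- as.numeric(format(sales$date, "%Y"))

# Age of the home at the time of sale
sales$age <- sale_year - sales$yr_built

# A renovation year of 0 means "never renovated", so mark it as missing
sales$yr_renovated[sales$yr_renovated == 0] <- NA
sales$renage <- sale_year - sales$yr_renovated

sales
```

Homes that were never renovated end up with `NA` in `renage`, which is why that column is mostly missing.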
Here, a thorough scan of the factor variables is done to see what they look like. The assessment uses bar charts and tables.
table(is.na(KC_Data$renage)) # only about 4% of houses have been renovated
##
## FALSE TRUE
## 914 20699
Only about 4% of the houses have been renovated.
using bar plot
KC_Data %>%
mutate(waterfront = waterfront %>% fct_infreq()) %>%
ggplot(aes(waterfront)) +
geom_bar()
using table command
table(KC_Data$waterfront) # fewer than 1% are on the waterfront
##
## 0 1
## 21450 163
Fewer than 1% of the houses are on the waterfront.
using bar plot
KC_Data %>%
mutate(view = view %>% fct_infreq()) %>%
ggplot(aes(view)) +
geom_bar()
using table command
table(KC_Data$view) # approx. 10% have a nonzero view rating (1, 2, 3 or 4)
##
## 0 1 2 3 4
## 19489 332 963 510 319
Approximately 10% of the houses have a nonzero view rating (1, 2, 3 or 4). The remaining 90% or so have no view.
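The percentages quoted for renovation, waterfront and view can be derived directly from the counts in the tables above. The sketch below does the arithmetic in base R; the vector names are introduced here for illustration.

```r
# Counts copied from the table() outputs above
n_total <- 21613
counts <- c(
  renovated  = 914,
  waterfront = 163,
  any_view   = 332 + 963 + 510 + 319
)

# Share of all sales, in percent
share_pct <- round(100 * counts / n_total, 2)
share_pct
```

On the full data frame, `prop.table(table(KC_Data$waterfront))` gives the same kind of share without typing the counts by hand.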
using bar plot
KC_Data %>%
mutate(bedrooms = bedrooms %>% fct_infreq()) %>%
ggplot(aes(bedrooms)) +
geom_bar()
using table command
table(KC_Data$bedrooms) # most homes have between 1 and 6 bedrooms
##
## 0 1 2 3 4 5 6 7 8 9 10 11 33
## 13 199 2760 9824 6882 1601 272 38 13 6 3 1 1
Most homes have between 1 and 6 bedrooms, although the count goes up to 11. The value 33 is an outlier and should be removed.
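A record such as the 33-bedroom home can be dropped with a simple filter. Because bedrooms is a factor at this point, the comparison is made against the level label. The following is a sketch on a toy data frame (`homes` and its prices are hypothetical stand-ins for KC_Data), not the removal step used later in this report.

```r
# Toy stand-in for KC_Data; values are hypothetical apart from the 33
homes <- data.frame(
  bedrooms = factor(c(3, 4, 33, 2, 3)),
  price    = c(221900, 604000, 640000, 180000, 510000)
)

# Compare against the label, then drop the now-empty factor level
homes_clean <- droplevels(homes[homes$bedrooms != "33", ])

table(homes_clean$bedrooms)
```

Without `droplevels()`, the level "33" would remain in the factor with a count of zero and still appear on plot axes.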
using bar plot
KC_Data %>%
mutate(bathrooms = bathrooms %>% fct_infreq()) %>%
ggplot(aes(bathrooms)) +
geom_bar() +
coord_flip()
using table command
table(KC_Data$bathrooms) # mostly 1, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3 and 3.5; 30 distinct values in total
##
## 0 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75
## 10 4 72 3852 9 1446 3048 1930 2047 5380 1185 753 589 731 155
## 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5 6.75 7.5 7.75 8
## 136 79 100 23 21 13 10 4 6 2 2 2 1 1 2
The majority of homes have 1, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3 or 3.5 bathrooms. There are 30 distinct bathroom values in our data set, ranging from 0 to 8.
using bar plot
KC_Data %>%
mutate(condition = condition %>% fct_infreq()) %>%
ggplot(aes(condition)) +
geom_bar()
using table command
table(KC_Data$condition) # mostly 3, then 4, then 5, then 2, then 1
##
## 1 2 3 4 5
## 30 172 14031 5679 1701
Condition 3 is the most common, followed by 4, then 5, then 2 and lastly 1.
using bar plot
KC_Data %>%
mutate(grade = grade %>% fct_infreq()) %>%
ggplot(aes(grade)) +
geom_bar()
using table command
table(KC_Data$grade) # mostly 5-10 on a scale of 1-13
##
## 1 3 4 5 6 7 8 9 10 11 12 13
## 1 3 29 242 2038 8981 6068 2615 1134 399 90 13
Grades 5 to 10 are the most common on a scale that runs from 1 to 13. According to our data set, the level of construction and design of most houses is therefore around the average range.
using bar plot
KC_Data %>%
mutate(floors = floors %>% fct_infreq()) %>%
ggplot(aes(floors)) +
geom_bar()
using table command
table(KC_Data$floors) # mostly 1 and 2, then 1.5
##
## 1 1.5 2 2.5 3 3.5
## 10680 1910 8241 161 613 8
The majority of the houses in our data set have 1 or 2 floors, followed by 1.5. Overall, the houses have 1, 1.5, 2, 2.5, 3 or 3.5 floors.
A column 'rate' is created, holding the selling price per square foot of living space, to help in the assessment.
KC_Data$rate <- KC_Data$price/KC_Data$sqft_living
It is important to check the new structure of the data set to confirm that all the needed variables were created correctly and to see the factor variables and the converted date variable.
str(KC_Data)
## 'data.frame': 21613 obs. of 24 variables:
## $ id : num 7129300520 6414100192 5631500400 2487200875 1954400510 ...
## $ date : Date, format: "2014-10-13" "2014-12-09" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : Factor w/ 13 levels "0","1","2","3",..: 4 4 3 5 4 5 4 4 4 4 ...
## $ bathrooms : Factor w/ 30 levels "0","0.5","0.75",..: 4 9 4 12 8 18 9 6 4 10 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : Factor w/ 6 levels "1","1.5","2",..: 1 3 1 1 1 1 3 1 1 3 ...
## $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ view : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ condition : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : Factor w/ 12 levels "1","3","4","5",..: 6 6 5 6 7 10 6 6 6 6 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int NA 1991 NA NA NA NA NA NA NA NA ...
## $ zipcode : chr "98178" "98125" "98028" "98136" ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
## $ age : num 59 63 82 49 28 13 19 52 55 12 ...
## $ renage : num NA 23 NA NA NA NA NA NA NA NA ...
## $ rate : num 188 209 234 308 304 ...
The following shows how every variable is related to house price, using the tidyr function gather().
This gives a general insight into how each variable affects price. Later in this report, each variable is examined in detail.
KC_DatarGraph <- gather(KC_Data, variable, value, -price)
ggplot(KC_DatarGraph) +
geom_jitter(aes(value,price, colour=variable)) +
geom_smooth(aes(value,price, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relationship Of Price With Other variables")
KC_Data %>%
select(-id, -yr_renovated, -yr_built) %>%
keep(is.numeric) %>%
gather(key,value,-price) %>%
ggplot(aes(x=value,y=price)) +
geom_jitter(color = 'blue',alpha = .6) +
geom_smooth(method = 'gam', color= 'red', fill = 'grey', alpha = .2) +
facet_wrap(~key, scales = 'free') +
theme_bw()
KC_Data %>%
select(waterfront,bedrooms,bathrooms,view,price,zipcode,grade,condition) %>%
gather(key,value,-price) %>%
ggplot(aes(x=value,y=price)) +
geom_point(color= 'blue', fill = 'grey', alpha = .5) +
facet_wrap(~key, scales = 'free') +
theme_bw() +
coord_flip()
Bar graphs of the factor variables, using purrr's keep() and tidyr's gather()
KC_Data %>%
keep(is.factor) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_bar(fill = 'blue') +
theme_bw()
KC_Data %>%
keep(is.numeric) %>%
select(-id, -yr_built, -yr_renovated) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density(fill= 'blue') +
theme_bw()
KC_Data %>%
keep(is.numeric) %>%
select(-id, -lat, -long, -yr_renovated, -yr_built) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill= 'blue') +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
scatterplotMatrix(~price+ sqft_living + sqft_above + sqft_lot+sqft_basement, data = KC_Data,
main="Price vs size of house")
**Seven variables seem to affect housing prices strongly: sqft_living, bathrooms, bedrooms, grade, view, lat and sqft_basement.** Others may affect the price of the house too.
Let's verify this using box plots and scatter plots; the prices are mapped later.
Removing outliers can improve the quality and generalization of models by reducing their variance. In our data we do find very expensive houses, whose prices differ greatly from the rest and which are therefore outliers.
# Return the indices of values lying more than 4 * IQR outside the quartiles
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 4 * IQR(x, na.rm = na.rm)
  c(which(x < (qnt[1] - H)), which(x > (qnt[2] + H)))
}
num <- 0
for (name in names(KC_Data)) {
  if (grepl("sqft", name) || name == "price") {
    outliers <- remove_outliers(KC_Data[, name])
    num <- num + length(outliers)
    # Guard: indexing with an empty negative vector would drop every row
    if (length(outliers) > 0) {
      KC_Data <- KC_Data[-outliers, ]
    }
  }
}
# Number of data removed
print(num)
## [1] 2030
# Number of data still available
print(nrow(KC_Data))
## [1] 19583
Thus, 2030 data points were removed as outliers and 19583 observations remain.
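The 4 * IQR rule used in the loop above can be traced on a small vector. The sketch below uses hypothetical values with one extreme point, and also shows a detail worth checking in any loop of this kind: negative indexing with an empty index vector selects zero rows rather than all of them.

```r
# Hypothetical values with one extreme point
x <- c(10, 12, 11, 13, 12, 11, 500)

qnt <- quantile(x, probs = c(.25, .75))
H <- 4 * IQR(x)

# Indices of values beyond the fences Q1 - H and Q3 + H
outliers <- which(x < qnt[1] - H | x > qnt[2] + H)
outliers

df <- data.frame(x = x)

# Pitfall: an empty negative index returns a data frame with zero rows
nrow(df[-integer(0), , drop = FALSE])

# Only subset when something was actually flagged
kept <- if (length(outliers) > 0) df[-outliers, , drop = FALSE] else df
nrow(kept)
```

A multiplier of 4 is more permissive than the 1.5 used by default in box plots, so only very extreme values are flagged.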
Removing aberrant data (outliers) will reduce the RSS and MSE computed from our data if we proceed to modeling, which should help the models converge towards a solution that generalizes better. However, the objective of this project is not modeling; it focuses on visualization and drawing conclusions, so modeling is left out for now.
### 21.1. assessing Price vs. bedrooms using boxplots and bar chart
Using simple Box plot
## Price vs. bedrooms ->> There is relationship between price and bedrooms (significant relationship exists)
boxplot1=boxplot(price~bedrooms, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. bedrooms", xlab="bedrooms", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = bedrooms, y = price, fill = bedrooms)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. bedrooms")
Using bar chart in ggplot
KC_Data %>%
mutate(bedrooms = as.factor(bedrooms)) %>%
group_by(bedrooms) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(bedrooms = reorder(bedrooms,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = bedrooms,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = bedrooms, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'bedrooms',
y = 'Median Price',
title = 'bedrooms and Median Price') +
coord_flip() +
theme_bw()
Price and bedrooms are positively related: as the number of bedrooms increases, the median price generally increases as well.
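A visual impression like this one can be backed by a number. One option, shown here only as a sketch on hypothetical data, is the Spearman rank correlation between the bedroom count and price, together with the per-group medians that the bar chart displays. The data frame `homes` and its values are made up for illustration.

```r
# Hypothetical bedroom counts (as a factor) and prices
homes <- data.frame(
  bedrooms = factor(c(1, 2, 2, 3, 3, 4, 4, 5)),
  price    = c(150, 210, 230, 300, 280, 390, 420, 500) * 1000
)

# Convert the factor labels back to numbers before correlating
bedrooms_num <- as.numeric(as.character(homes$bedrooms))

# Rank correlation: suited to ordered categories and skewed prices
rho <- cor(bedrooms_num, homes$price, method = "spearman")
rho

# Median price per bedroom count, as in the bar chart
med <- tapply(homes$price, homes$bedrooms, median)
med
```

The same two calls could be applied to the other ordered factors (bathrooms, grade, view) to compare them on a common scale.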
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = bathrooms, y = price, fill = bathrooms)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. Bathrooms") +
coord_flip()
Using simple Box plot
## Price vs. Bathrooms ->> Nice correlation, as # of bathrooms increases [median of bar plot], price increases as well, with one exception when bathroom=7
boxplot2=boxplot(price~bathrooms, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. Bathrooms", xlab="Bathrooms", ylab="Price")
Using bar chart in ggplot
KC_Data %>%
mutate(bathrooms = as.factor(bathrooms)) %>%
group_by(bathrooms) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(bathrooms = reorder(bathrooms,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = bathrooms,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = bathrooms, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 3.5, colour = 'yellow',
fontface = 'bold') +
labs(x = 'bathrooms',
y = 'Median Price',
title = 'bathrooms and Median Price') +
coord_flip() +
theme_bw()
The price of a house is positively related to its number of bathrooms. As the number of bathrooms increases, the median price (shown in the bar plot) increases as well.
Using simple Box plot
## Price vs. Grade ->> Nice correlation, grade increases [median of bar plot], price increases as well
boxplot3=boxplot(price~grade, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. Grade", xlab="Grade", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = grade, y = price, fill = grade)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. Grade")
Using bar chart in ggplot
KC_Data %>%
mutate(grade = as.factor(grade)) %>%
group_by(grade) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(grade = reorder(grade,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = grade,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = grade, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'grade',
y = 'Median Price',
title = 'grade and Median Price') +
coord_flip() +
theme_bw()
**Price and grade are also positively related. As grade increases, the median price (shown in the bar plot) also increases.**
Using simple Box plot
## Price vs. View ->> Nice correlation, view increases [median of bar plot], price increases as well
boxplot4=boxplot(price~view, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. View", xlab="View", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = view, y = price, fill = view)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. View")
Using bar chart in ggplot
KC_Data %>%
mutate(view = as.factor(view)) %>%
group_by(view) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(view = reorder(view,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = view,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = view, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'view',
y = 'Median Price',
title = 'view and Median Price') +
coord_flip() +
theme_bw()
Price and view are positively related. As the view rating increases, the median price (shown in the bar plot) also increases.
Using simple Box plot
## Price vs. condition ->> There is almost no relationship between price and condition
boxplot5=boxplot(price~condition, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. condition", xlab="condition", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = condition, y = price, fill = condition)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. condition")
Using bar chart in ggplot
KC_Data %>%
mutate(condition = as.factor(condition)) %>%
group_by(condition) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(condition = reorder(condition,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = condition,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = condition, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'condition',
y = 'Median Price',
title = 'condition and Median Price') +
coord_flip() +
theme_bw()
**There is little or no relationship between price and condition. The relationship that we see is almost negligible.**
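Whether a difference between groups is negligible can also be checked with a test rather than by eye. One possibility, offered here as a sketch and not as part of the original analysis, is the Kruskal-Wallis rank sum test from base R, which compares the price distributions across the levels of a factor without assuming normality. The data below are simulated; `homes` is a hypothetical stand-in for KC_Data.

```r
set.seed(1)

# Simulated data: five condition levels, prices drawn from one distribution
homes <- data.frame(
  condition = factor(rep(1:5, each = 20)),
  price     = round(rlnorm(100, meanlog = 13, sdlog = 0.4))
)

# Rank-based test of whether price differs across condition levels
kw <- kruskal.test(price ~ condition, data = homes)

kw$statistic   # test statistic
kw$p.value     # a large p-value gives no evidence of a difference
```

With more than twenty thousand rows, even small differences can come out statistically significant, so the size of the median differences still matters.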
Using simple Box plot
## Price vs. floors ->> There is almost no relationship between price and floors (insignificant relationship exists)
boxplot6=boxplot(price~floors, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. floors", xlab="floors", ylab="Price")
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = floors, y = price, fill = floors)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. floors")
Using bar chart in ggplot
KC_Data %>%
mutate(floors = as.factor(floors)) %>%
group_by(floors) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(floors = reorder(floors,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = floors,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = floors, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'floors',
y = 'Median Price',
title = 'floors and Median Price') +
coord_flip() +
theme_bw()
**The relationship that we see between floors and price is weak, although it does show some positive correlation.**
Using ggplot Box plot with outliers in red color
# Create a Boxplot and Change Outliers' color in a R ggplot boxplot
ggplot(KC_Data, aes(x = waterfront, y = price, fill = waterfront)) +
geom_boxplot(outlier.color = "red", outlier.shape = 8, outlier.size = 2) +
ggtitle("Price vs. waterfront")
Using bar chart in ggplot
KC_Data %>%
mutate(waterfront = as.factor(waterfront)) %>%
group_by(waterfront) %>%
dplyr::summarise(Median_Price= median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(waterfront = reorder(waterfront,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
ggplot(aes(x = waterfront,y = Median_Price)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = waterfront, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'waterfront',
y = 'Median Price',
title = 'waterfront and Median Price') +
coord_flip() +
theme_bw()
Using bar chart in ggplot
KC_Data %>%
group_by(yr_renovated) %>%
dplyr::summarise(Median_Price = median(price, na.rm = TRUE)) %>%
ungroup() %>%
mutate(yr_renovated = reorder(yr_renovated,Median_Price)) %>%
arrange(desc(Median_Price)) %>%
head(10) %>%
ggplot(aes(x = yr_renovated,y = Median_Price)) +
geom_bar(stat='identity',colour="white",fill = "blue") +
geom_text(aes(x = yr_renovated, y = 1, label = paste0("(",Median_Price,")")),
hjust=0, vjust=.5, size = 4, colour = 'yellow',
fontface = 'bold') +
labs(x = 'year renovated',
y = 'Median Price',
title = 'Year renovated and Median Price') +
coord_flip() +
theme_bw()
The year of renovation does not appear to affect the price of the house.
We plot the distribution of price. Unfortunately, the graph does not reveal much.
case 1 Price Plot
KC_Data %>%
ggplot(aes(x = price)) +
geom_histogram(alpha = 0.8,fill = "blue") +
labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Case 2 - Price Plot
KC_Data %>%
ggplot(aes(x = price)) +
geom_histogram(alpha = 0.8,fill = "blue") +
scale_x_continuous(limits=c(0,2e6)) +
labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Case 3 - Price Plot
KC_Data %>%
ggplot(aes(x = price)) +
geom_histogram(alpha = 0.8,fill = "blue") +
scale_x_continuous(limits=c(0,1e6)) +
labs(x= 'Price',y = 'Count', title = paste("Distribution of", ' Price ')) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
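The `stat_bin()` message above asks for an explicit `binwidth`. One common choice, shown here only as a sketch and not as the method used in this report, is the Freedman-Diaconis rule. The data frame `toy` and its simulated prices are assumptions for illustration; with the real data, `KC_Data$price` would take their place.

```r
set.seed(42)

# Synthetic right-skewed prices; made-up values, not the King County data.
toy <- data.frame(price = rlnorm(1000, meanlog = 13, sdlog = 0.5))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
bw <- 2 * IQR(toy$price) / nrow(toy)^(1/3)
print(bw)

# Draw the histogram only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = price)) +
    geom_histogram(binwidth = bw, alpha = 0.8, fill = "blue") +
    labs(x = "Price", y = "Count", title = "Distribution of Price") +
    theme_bw()
  print(p)
}
```

Passing `binwidth` explicitly also silences the `stat_bin()` message.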
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_living)) %>%
ggplot(aes(x=sqft_living,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(x=sqft_living,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Living)")+
ylab("Price")
Price and sqft_living are clearly positively correlated (about 0.66 in the correlation table below): as sqft_living increases, price also increases.
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_living15)) %>%
ggplot(aes(x=sqft_living15,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(x=sqft_living15,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Living15)")+
ylab("Price")
Price and sqft_living15 are also positively correlated (about 0.58 in the correlation table below): as sqft_living15 increases, price also increases.
KC_Data %>%
filter(!is.na(sqft_living15)) %>%
filter(!is.na(sqft_living)) %>%
ggplot(aes(x=sqft_living15,y=sqft_living))+
geom_point(color = "blue")+
stat_smooth(aes(x=sqft_living15,y=sqft_living),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Living15)")+
ylab("sqft_living")
**sqft_living and sqft_living15 are highly correlated.**
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_lot)) %>%
ggplot(aes(x=sqft_lot,y=price))+
geom_point(color = "blue")+
scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
stat_smooth(aes(x=sqft_lot,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Lot)")+
ylab("Price")
**sqft_lot and price have a very weak relationship, but price still rises slightly as sqft_lot increases.**
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_lot15)) %>%
ggplot(aes(x=sqft_lot15,y=price))+
geom_point(color = "blue")+
scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot15))) +
stat_smooth(aes(x=sqft_lot15,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Lot15)")+
ylab("Price")
**sqft_lot15 and price have a very weak relationship, but price still rises slightly as sqft_lot15 increases.**
KC_Data %>%
filter(!is.na(sqft_lot15)) %>%
filter(!is.na(sqft_lot)) %>%
ggplot(aes(x=sqft_lot,y=sqft_lot15))+
geom_point(color = "blue")+
scale_x_continuous(limits=c(0,max(KC_Data$sqft_lot))) +
stat_smooth(aes(x=sqft_lot,y=sqft_lot15),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("(Sqft Lot)")+
ylab("sqft_lot15")
**sqft_lot and sqft_lot15 have a high positive correlation.**
## Price vs. Lat ->> The relationship looks roughly bell-shaped: price peaks around lat = 47.64 and declines afterwards, so it can be modeled easily. We would say lat explains the price as well.
boxplot5=boxplot(price~lat, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. Lat", xlab="Lat", ylab="Price")
OR
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(lat)) %>%
ggplot(aes(x=lat,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(x=lat,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("lat")+
ylab("Price")
**Price vs. lat looks roughly bell-shaped rather than linear. The house price peaks around lat = 47.64 and declines afterwards. Generally, we would say that lat explains the price well.**
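Because the pattern has a peak, a straight-line fit such as the `method="lm"` smoother above cannot capture it. A minimal sketch of one alternative, a quadratic term in `lat`, is given below on synthetic data. The simulated prices, the coefficients, and the object names `fit_lin`, `fit_quad` and `peak_lat` are all assumptions for illustration; no result here comes from the King County data.

```r
set.seed(7)
n   <- 500
lat <- runif(n, 47.2, 47.8)

# Synthetic prices that peak at lat = 47.64 (made-up coefficients).
price <- 6e5 - 4e6 * (lat - 47.64)^2 + rnorm(n, sd = 5e4)

# Centre lat so the linear and squared terms are not nearly collinear.
lat_c <- lat - mean(lat)

fit_lin  <- lm(price ~ lat_c)
fit_quad <- lm(price ~ lat_c + I(lat_c^2))

r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared

# Vertex of the fitted parabola, shifted back to the original lat scale.
b <- coef(fit_quad)
peak_lat <- mean(lat) - b[["lat_c"]] / (2 * b[["I(lat_c^2)"]])

print(c(r2_lin = r2_lin, r2_quad = r2_quad, peak_lat = peak_lat))
```

On data shaped like this, the quadratic model has the higher R-squared and its vertex estimates where the price peaks.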
## Price vs. age ->> There is almost no relationship between price and age (only an insignificant relationship exists)
boxplot12=boxplot(price~age, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. age", xlab="age", ylab="Price")
OR
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(age)) %>%
ggplot(aes(x=age,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(age,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("age")+
ylab("Price")
Price and age have almost no relationship, so age does little to explain the housing price.
## Price vs. renage ->> There is almost no relationship between price and renage (only an insignificant relationship exists)
boxplot13=boxplot(price~renage , data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. renage ", xlab="renage ", ylab="Price")
OR
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(renage)) %>%
ggplot(aes(x=renage,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(renage,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("renage")+
ylab("Price")
Price and renage have almost no relationship, so renage does little to explain the housing price.
## Price vs. sqft_basement
boxplot6=boxplot(price~sqft_basement, data=KC_Data,
col=(c("gold","darkgreen")),
main="Price vs. sqft_basement", xlab="sqft_basement", ylab="Price")
OR
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(sqft_basement)) %>%
ggplot(aes(x=sqft_basement,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(sqft_basement,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("sqft_basement")+
ylab("Price")
**Price and sqft_basement have a positive correlation (about 0.30 in the correlation table below). We would say sqft_basement helps explain the price.**
KC_Data %>%
filter(!is.na(price)) %>%
filter(!is.na(date)) %>%
ggplot(aes(x=date,y=price))+
geom_point(color = "blue")+
stat_smooth(aes(date,y=price),method="lm", color="red")+
theme_bw()+
theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
xlab("date")+
ylab("Price")
Price and date have almost no relationship. Thus, date doesn’t explain house price.
ggplot(data = KC_Data) +
geom_point(mapping = aes(x = sqft_above, y = price, color = price)) +
# price is continuous, so use the continuous viridis scale and attach it with `+`
viridis::scale_color_viridis()
ggplot(data = KC_Data) +
geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
geom_smooth(mapping = aes(x = sqft_living, y = price))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(data = KC_Data) +
geom_point(mapping = aes(x = sqft_living, y = price, color = price))+
geom_smooth(mapping = aes(x = sqft_living, y = price, linetype = waterfront))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Based on the visualizations above, variables such as number of bedrooms, number of bathrooms, grade, view, condition, whether the house has a waterfront or not, sqft_basement, sqft_living, sqft_living15, lat, sqft_lot and sqft_lot15 affect the price of the houses positively, even though the degree of their effect differs.
**sqft_living, sqft_living15, bathrooms, grade, view, bedrooms, condition and sqft_basement are the factors that have an effect on the prices of the houses.**
Testing correlation of the variables against price
vctCorr = numeric(0)
for (i in names(KC_Data))
{
cor.result <- cor(KC_Data$price, as.numeric(KC_Data[,i]))
vctCorr <- c(vctCorr, cor.result)
}
KC_DatarCorr <- vctCorr
names(KC_DatarCorr) <- names(KC_Data)
KC_DatarCorr
##            id          date         price      bedrooms     bathrooms
##   0.006811293  -0.006670429   1.000000000   0.312944125   0.482055213
##   sqft_living      sqft_lot        floors    waterfront          view
##   0.660267550   0.112404847   0.266922130   0.145903977   0.360685672
##     condition         grade    sqft_above sqft_basement      yr_built
##   0.052516561   0.663308821   0.550784781   0.303967093   0.029900519
##  yr_renovated       zipcode           lat          long sqft_living15
##            NA  -0.024979857   0.377864416   0.007129642   0.582457178
##    sqft_lot15           age        renage          rate
##   0.114853357  -0.029848287            NA   0.545482836
ggcorr(KC_Data, hjust = 0.8, layout.exp = 1) +
ggtitle("Correlation between house variables")
KC_Data %>%
select (-date, -id, -yr_built, -yr_renovated, -renage) %>%
apply(2,as.character) %>%
apply(2,as.numeric) %>%
cor(use='everything',method='pearson') %>%
corrplot(type='lower', diag = F)
Are houses located near the coast costlier?
KC_Data$Price_Bin<-cut(KC_Data$price, c(0,250e3,500e3,750e3,1e6,999e6))
center_lon = median(KC_Data$long,na.rm = TRUE)
center_lat = median(KC_Data$lat,na.rm = TRUE)
factpal <- colorFactor(c("black","blue","yellow", "orange", "red"),
KC_Data$Price_Bin)
leaflet(KC_Data) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircles(lng = ~long, lat = ~lat,
color = ~factpal(Price_Bin)) %>%
# controls
setView(lng=center_lon, lat=center_lat,zoom = 12) %>%
addLegend("bottomright", pal = factpal, values = ~Price_Bin,
title = "House Price Distribution",
opacity = 1)
Most of the houses fall in the 250 thousand to 500 thousand range. The next most common categories are:
- 500 to 750 thousand
- 0 to 250 thousand
- 750 thousand to 1 million
- 1 million and above
KC_Data %>%
mutate(Price_Bin = as.factor(Price_Bin)) %>%
group_by(Price_Bin) %>%
dplyr::summarise(Count = n()) %>%
ungroup() %>%
mutate(Price_Bin = reorder(Price_Bin,Count)) %>%
arrange(desc(Count)) %>%
ggplot(aes(x = Price_Bin,y = Count)) +
geom_bar(stat='identity',colour="white", fill = "blue") +
geom_text(aes(x = Price_Bin, y = 1, label = paste0("(",Count,")")),
hjust=0, vjust=.5, size = 4, colour = 'red',
fontface = 'bold') +
labs(x = 'Price_Bin',
y = 'Count',
title = 'Price_Bin and Count') +
coord_flip() +
theme_bw()
PriceBinGrouping = function(limit1, limit2)
{
return(
KC_Data %>%
filter(price > limit1) %>%
filter(price <= limit2)
)
}
PriceGroup1 = PriceBinGrouping(0,250e3)
PriceGroup2 = PriceBinGrouping(250e3,500e3)
PriceGroup3 = PriceBinGrouping(500e3,750e3)
PriceGroup4 = PriceBinGrouping(750e3,1e6)
PriceGroup5 = PriceBinGrouping(1e6,999e6)
MapPriceGroups = function(PriceGroupName,color)
{
center_lon = median(PriceGroupName$long,na.rm = TRUE)
center_lat = median(PriceGroupName$lat,na.rm = TRUE)
leaflet(PriceGroupName) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircles(lng = ~long, lat = ~lat,
color = color) %>%
# controls
setView(lng=center_lon, lat=center_lat,zoom = 12)
}
MapPriceGroups(PriceGroup1,"black")
**As the map above shows, houses in the 0 to 250 thousand price range (black points) are scattered everywhere across the terrain and do not show a clear trend.**
MapPriceGroups(PriceGroup2,"blue")
**The blue points indicate houses in the 250 to 500 thousand price range. These houses are still located inland and are not concentrated in coastal areas. We don't see a noticeable trend.**
MapPriceGroups(PriceGroup3,"orange")
**The orange points indicate houses in the 500 to 750 thousand price range. These houses are located in both inland and coastal areas. We don't see a noticeable trend.**
MapPriceGroups(PriceGroup4,"fuchsia")
**The fuchsia points indicate houses in the 750 thousand to 1 million price range. These houses are located in both inland and coastal areas, but they are concentrated much more on the coastal side than inland.**
MapPriceGroups(PriceGroup5,"red")
**The red points indicate houses priced at 1 million and above. These houses are concentrated much more in coastal areas than inland.**
As we move from inland towards the coast, houses become more and more expensive.
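The map-based impression can be cross-checked with a number, for example the share of waterfront homes in each price bin. The sketch below uses a made-up data frame `toy` and assumes `waterfront` is coded as a 0/1 indicator. Waterfront status is only a rough proxy for proximity to the coast, so this is an illustration rather than a test of the claim.

```r
# Synthetic stand-in for KC_Data; values are made up for illustration only.
toy <- data.frame(
  price      = c(200e3, 300e3, 450e3, 600e3, 700e3, 900e3, 1.2e6, 2.5e6),
  waterfront = c(0,     0,     0,     0,     1,     0,     1,     1)
)

# Same price bins as used for the maps.
toy$Price_Bin <- cut(toy$price, c(0, 250e3, 500e3, 750e3, 1e6, 999e6))

# Share of waterfront homes within each bin (the mean of a 0/1 indicator).
waterfront_share <- tapply(toy$waterfront, toy$Price_Bin, mean)
print(waterfront_share)
```

A share that rises with the price bin would be consistent with the pattern seen on the maps.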
The variables that affect house prices positively are:
From categorical variables
- Number of bedrooms, number of bathrooms, grade, view, condition, and waterfront.
- Number of floors affects price very slightly and can be ignored.
From numerical variables
- sqft_living, sqft_living15, lat, and sqft_basement.
- sqft_lot and sqft_lot15 affect prices very slightly.
We observe that sqft_living15 and sqft_lot15 are strongly correlated with sqft_living and sqft_lot, respectively. Thus, sqft_living15 and sqft_lot15 can be dropped from the analysis in further studies when we proceed to modeling.
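Pairs of predictors like these can also be found programmatically by scanning the correlation matrix for large off-diagonal entries. The sketch below uses synthetic data. The data frame `toy`, the 0.8 `cutoff` and the object `high_pairs` are assumptions for illustration, not values taken from this analysis.

```r
set.seed(1)
n <- 200
sqft_living <- rnorm(n, 2000, 500)

# Synthetic predictors; sqft_living15 is a near-duplicate by construction.
toy <- data.frame(
  sqft_living   = sqft_living,
  sqft_living15 = sqft_living + rnorm(n, 0, 100),
  lat           = rnorm(n, 47.5, 0.1)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag those above a chosen cutoff.
cutoff <- 0.8  # hypothetical threshold
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

One variable from each flagged pair is then a candidate to drop before modeling.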
**Longitude, age, renage, and date sold do not affect the price of the house.**