We examine the data set 'nanaimo.csv' and choose three
explanatory variables to compare their effect on listing prices in the
Nanaimo housing market. The functions used in the analysis are
geom_point(), geom_smooth(), and lm().
The variables considered in assessing the pricing impact on the Nanaimo housing market are:
Price of the property ('price'), the dependent variable
Area of the property ('area')
Bedrooms in the property ('bed')
Parking availability ('parking'), used as the dummy variable
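library(dplyr)    # mutate() and the %>% pipe
library(ggplot2)  # plots used later in the analysis
url <- "http://latul.be/mbaa_531/data/nanaimo.csv"
nanaimo <- read.csv(url, header = TRUE)
# Recode missing numeric values as 0 so they can be filtered or kept explicitly
nanaimo <- nanaimo %>% mutate(price = ifelse(is.na(price), 0, price))
nanaimo <- nanaimo %>% mutate(area = ifelse(is.na(area), 0, area))
nanaimo <- nanaimo %>% mutate(bed = ifelse(is.na(bed), 0, bed))
# Binary dummy: "No" when parking is missing, "Yes" otherwise
nanaimo <- nanaimo %>% mutate(parking = ifelse(is.na(parking), "No", "Yes"))
# Keep only the columns used in this exercise
rel_col <- c("address", "price", "area", "bed", "parking")
nanaimo <- nanaimo[rel_col]
# Drop listings without a price or an area
index <- nanaimo$price > 0
nanaimo <- nanaimo[index, ]
index <- nanaimo$area > 0
nanaimo <- nanaimo[index, ]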
#str(nanaimo)
#summary(nanaimo)
#writexl::write_xlsx(nanaimo, "file_output.xlsx")
The original data set 'nanaimo.csv' consisted of 549
observations and 40 variables.
The following manipulations were done to the data set:
replaced NA values in 'price' with 0 and excluded every
listing with 'price' = 0 (19 listings); a listing without a
price carries no information about housing prices
replaced NA values in 'area' with 0 and excluded every
listing with 'area' = 0 (157 listings); a listing with no
built area is treated as bare land, whose price is not comparable like
for like with that of a constructed house
replaced NA values in 'bed' with 0 and kept these
listings, since a listing with no bedrooms could be a studio apartment
recoded 'parking' as a dummy variable: NA became "No"
(taken to mean that no parking is available) and any other value became "Yes"
dropped the columns that were not relevant to this limited exercise
The updated data set has 373 relevant observations and 5 variables.
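The cleaning rules listed above can be sketched on a small toy data frame. This is only an illustration: the rows below are hypothetical values, not listings from 'nanaimo.csv', and the sketch uses base R subsetting rather than the dplyr calls in the report.

```r
# Toy listings (hypothetical values, NOT taken from nanaimo.csv)
toy <- data.frame(
  address = c("A", "B", "C", "D", "E"),
  price   = c(350000, NA, 420000, 510000, 275000),
  area    = c(1400, 1800, NA, 2100, 900),
  bed     = c(3, 4, 2, NA, 1),
  parking = c("Garage", NA, "Driveway", NA, "Carport"),
  stringsAsFactors = FALSE
)

# Drop listings with a missing or zero price or area
# (same effect as recoding NA to 0 and then keeping values > 0)
keep  <- !is.na(toy$price) & toy$price > 0 &
         !is.na(toy$area)  & toy$area  > 0
clean <- toy[keep, ]

# A missing bedroom count is kept and treated as 0 (e.g. a studio)
clean$bed[is.na(clean$bed)] <- 0

# Binary dummy: "No" when parking is missing, "Yes" otherwise
clean$parking <- ifelse(is.na(clean$parking), "No", "Yes")

clean
```

In this toy example, listings B (no price) and C (no area) are dropped, while D is kept with 'bed' set to 0 and 'parking' set to "No".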
The scatter plot below shows 'price' against the
independent variable 'area'. At a glance, price tends to
rise with the square footage of the house; the regression that follows
explores this further.
base <- ggplot( data = nanaimo, aes( x = area, y = price) )
base +
geom_point() +
labs(title = "Nanaimo Housing Prices",
subtitle = "by Area",
x = "Area (in sq. ft.)",
y = "Price (in CAD)") +
theme_classic() +
theme(axis.title = element_text())
When a linear regression of 'price' on
'area' is fitted, the intercept of the linear model is
106,124.05 and the slope is 138.95. The model therefore predicts the
price of a house in Nanaimo from its size: each additional square foot
is associated with a price increase of about 138.95 CAD. The intercept
is the model's predicted price at zero area; we read it as a market
premium factor that a customer can expect to pay for buying a property
in Nanaimo, although it lies outside the range of the observed data.
The fitted line is shown in the graph below.
mhousing <- lm( formula = price ~ area, data = nanaimo )
mhousing$coefficients
## (Intercept)        area
## 106124.0503    138.9489
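As a worked illustration of how these two coefficients turn into a prediction, the sketch below plugs them into the fitted equation by hand. The helper name `predict_price` and the 2,000 sq. ft. example are assumptions made for illustration; with the fitted model object, `predict()` would do the same job.

```r
# Coefficients as reported by lm(price ~ area) in the output above
intercept <- 106124.0503
slope     <- 138.9489

# Hypothetical helper: predicted price (CAD) for a given area (sq. ft.)
predict_price <- function(area) intercept + slope * area

# Predicted price of a 2,000 sq. ft. home
p2000 <- predict_price(2000)

# Effect of 100 extra sq. ft.: exactly 100 times the slope
extra <- predict_price(2100) - predict_price(2000)

p2000
extra
```

For a 2,000 sq. ft. home this gives roughly 384,022 CAD, and 100 additional square feet add about 13,895 CAD. Such predictions are only reasonable within the range of areas present in the data.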
The regression line summarises the relationship between the plotted
variables, 'area' and 'price'. This line of
best fit can also be used to predict the price for a given area,
assuming the relationship is linear.
base +
geom_smooth( method = 'lm',
formula = y ~ x,
se = FALSE) +
geom_point()+
labs(title = "Nanaimo Housing Prices",
subtitle = "by Area",
x = "Area (in sq. ft.)",
y = "Price (in CAD)") +
theme_classic()+
theme(axis.title = element_text())
The scatter plot below shows 'price' against the
independent variable 'bed'. At a glance, price tends to
rise with the number of bedrooms; the regression that follows explores
this further.
base <- ggplot( data = nanaimo, aes( x = bed, y = price ) )
base +
geom_point() +
labs(title = "Nanaimo Housing Prices",
subtitle = "by Bedrooms",
x = "Bedrooms (in nos.)",
y = "Price (in CAD)") +
theme_classic()+
theme(axis.title = element_text())
When a linear regression of 'price' on
'bed' is fitted, the intercept of the linear model is
204,771 and the slope is 54,984. The model therefore predicts the price
of a house in Nanaimo from its number of bedrooms: each additional
bedroom is associated with a price increase of about 54,984 CAD. As
before, we read the intercept as a market premium factor that a
customer can expect to pay for buying a property in Nanaimo. This
intercept differs from the one in the 'area' model because each model
extrapolates to zero of a different predictor, and because bedroom
count is only loosely tied to house size: bedrooms are not built to a
standard proportion of the floor area. The fitted line is shown in the
graph below.
mhousing <- lm( formula = price ~ bed, data = nanaimo )
mhousing$coefficients
## (Intercept)         bed
##   204770.47    54984.11
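The regression line summarises the relationship between the plotted
variables, 'bed' and 'price'. This line of
best fit can also be used to predict the price for a given number of
bedrooms, assuming the relationship is linear. On the plot the line for
bedrooms looks less steep than the one for 'area', but the
two slopes are in different units (CAD per bedroom versus CAD per
square foot), so steepness alone does not show that the correlation is
weaker. Comparing the R-squared of the two models would be needed to
judge which predictor explains more of the variation in price.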
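A slope is expressed in the units of its predictor, so its size says little about the strength of a relationship. The minimal sketch below uses toy numbers (hypothetical values, not the Nanaimo listings) to show that rescaling a predictor changes the slope while leaving R-squared unchanged. The column name `area_k` is introduced here only for illustration.

```r
# Toy data (hypothetical values, NOT from nanaimo.csv)
toy <- data.frame(
  area  = c(800, 1200, 1500, 2000, 2600),
  price = c(220000, 260000, 330000, 380000, 470000)
)

# Same predictor expressed in thousands of square feet
toy$area_k <- toy$area / 1000

fit_sqft <- lm(price ~ area,   data = toy)
fit_k    <- lm(price ~ area_k, data = toy)

# The slope scales with the unit of the predictor ...
slope_sqft <- coef(fit_sqft)[["area"]]
slope_k    <- coef(fit_k)[["area_k"]]

# ... but the share of variance explained does not change
r2_sqft <- summary(fit_sqft)$r.squared
r2_k    <- summary(fit_k)$r.squared

c(slope_sqft = slope_sqft, slope_k = slope_k)
c(r2_sqft = r2_sqft, r2_k = r2_k)
```

Applied to the report's data, the same idea suggests comparing `summary(lm(price ~ area, data = nanaimo))$r.squared` with the value for `price ~ bed` rather than comparing the two slopes.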
base +
geom_smooth( method = 'lm',
formula = y ~ x,
se = FALSE) +
geom_point()+
labs(title = "Nanaimo Housing Prices",
subtitle = "by Bedrooms",
x = "Bedrooms (in nos.)",
y = "Price (in CAD)") +
theme_classic()+
theme(axis.title = element_text())
We created a binary dummy variable at the data-set level using the
is.na() function: the field 'parking' holds one of two
values, Yes or No.
We now fit a multiple regression of 'price' on
'bed' and 'parking', including their interaction.
model <- lm(price ~ parking + bed + parking*bed, data = nanaimo)
summary(model)$coefficients
##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     280195.34   46386.19  6.040491 3.753508e-09
## parkingYes     -123293.66   59677.32 -2.066005 3.952549e-02
## bed              28452.12   15936.88  1.785300 7.503392e-02
## parkingYes:bed   41668.17   19794.09  2.105081 3.595978e-02
ggplot( data = nanaimo, mapping = aes( x = bed, y = price, colour = parking ) ) +
geom_point( alpha = .6 ) +
geom_smooth( method = 'lm', se = FALSE ) +
labs(title = "Nanaimo Housing Prices",
subtitle = "by bedroom and parking availability",
x = "Bedrooms (in nos.) and Parking availability",
y = "Price (in CAD)") +
theme_classic()+
theme(axis.title = element_text())
## `geom_smooth()` using formula 'y ~ x'
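We can infer that, for a property without parking, an additional bedroom is associated with a price increase of 28,452 (an estimate that is not statistically significant at the 5% level, p = 0.075), and that if the additional bedroom is in a property with parking, the increase is larger by 41,668, i.e. 70,120 per bedroom. The negative 'parkingYes' coefficient (-123,294) lowers the intercept for properties with parking, so the model predicts a higher price for them only from about three bedrooms upward.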
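The per-bedroom figures follow directly from the coefficient table. As a worked sketch, let P be an indicator introduced here for illustration, equal to 1 when 'parking' is Yes and 0 otherwise; the fitted model then splits into one line per group:

```latex
\begin{align*}
\widehat{\text{price}} &= 280195.34 - 123293.66\,P + 28452.12\,\text{bed} + 41668.17\,(P \cdot \text{bed}) \\
P = 0:\quad \widehat{\text{price}} &= 280195.34 + 28452.12\,\text{bed} \\
P = 1:\quad \widehat{\text{price}} &= (280195.34 - 123293.66) + (28452.12 + 41668.17)\,\text{bed} \\
&= 156901.68 + 70120.29\,\text{bed}
\end{align*}
```

The two lines cross where 123293.66 = 41668.17 bed, i.e. at about 2.96 bedrooms, which is why the predicted price with parking exceeds the predicted price without it only for properties with roughly three or more bedrooms.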