Introduction

We examine the data set 'nanaimo.csv' and choose three variables from it to compare their impact on prices in the Nanaimo housing market. The functions used in the analysis are geom_point(), geom_smooth(), and lm().

The three explanatory variables, together with the response variable 'price', are:

Area of the property ('area')
Bedrooms in the property ('bed')
Price of the property ('price'), the response variable
Parking availability ('parking'), used as the dummy variable

Dataset Characteristics

library(dplyr)    # mutate() and %>%
library(ggplot2)  # plots in the sections below

url <- "http://latul.be/mbaa_531/data/nanaimo.csv"
nanaimo <- read.csv(url, header = TRUE)

# Replace missing values: 0 for numeric fields, "No"/"Yes" flag for parking
nanaimo <- nanaimo %>%
  mutate(price   = ifelse(is.na(price), 0, price),
         area    = ifelse(is.na(area), 0, area),
         bed     = ifelse(is.na(bed), 0, bed),
         parking = ifelse(is.na(parking), "No", "Yes"))

# Keep only the columns used in this analysis
rel_col <- c("address", "price", "area", "bed", "parking")
nanaimo <- nanaimo[rel_col]

# Drop listings with no price or no area
nanaimo <- nanaimo[nanaimo$price > 0, ]
nanaimo <- nanaimo[nanaimo$area > 0, ]
#str(nanaimo)
#summary(nanaimo)
#writexl::write_xlsx(nanaimo, "file_output.xlsx")

The original data set 'nanaimo.csv' consisted of 549 observations and 40 variables.

The following manipulations were done to the data set:

Replaced NA values in 'price' with 0 and excluded every listing with 'price' = 0 (19 listings). The rationale is that such records are junk data and not relevant for comparing effects on housing prices.
Replaced NA values in 'area' with 0 and excluded every listing with 'area' = 0 (157 listings). The rationale is that, since we are evaluating housing prices, freehold land without a house is not a like-for-like comparison with a constructed house.
Replaced NA values in 'bed' with 0 and kept these listings, as they could be studio apartments.
Recoded 'parking' as "No" where the value was NA and "Yes" otherwise, taking NA to signify that no parking is available.
Trimmed the data set to the columns relevant to this exercise: c("address", "price", "area", "bed", "parking").

The updated data set had 373 relevant observations and 5 variables.
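
The cleaning rules listed above can be illustrated on a few made-up rows. This is only a sketch: the data frame 'toy' and all of its values are hypothetical and are not taken from 'nanaimo.csv', and it filters on NA directly instead of first replacing NA with 0, which should give the same rows for 'price' and 'area'.

```r
# Toy listings (hypothetical values, not taken from nanaimo.csv)
toy <- data.frame(
  address = c("A St", "B St", "C St", "D St"),
  price   = c(450000, NA, 300000, 520000),
  area    = c(1800, 1200, NA, 2100),
  bed     = c(3, 2, NA, NA),
  parking = c("Garage", NA, "Carport", NA),
  stringsAsFactors = FALSE
)

# Parking flag: NA means "No", any recorded value means "Yes"
toy$parking <- ifelse(is.na(toy$parking), "No", "Yes")

# A missing bedroom count is treated as 0 (for example, a studio)
toy$bed[is.na(toy$bed)] <- 0

# Keep only listings with a known, positive price and area
keep <- !is.na(toy$price) & toy$price > 0 & !is.na(toy$area) & toy$area > 0
toy_clean <- toy[keep, ]

toy_clean
```

Only the first and last toy rows survive, because the other two lack either a price or an area.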

Evaluation of Variables

Area

The scatter plot below shows 'price' against the independent variable 'area'. At a glance, price tends to rise with the square footage of the house; the regression that follows explores this further.

base <- ggplot( data = nanaimo, aes( x = area, y = price) )
base + 
  geom_point() + 
  labs(title = "Nanaimo Housing Prices",
       subtitle = "by Area",
       x = "Area (in sq. ft.)",
       y = "Price (in CAD)") +
  theme_classic() +
  theme(axis.title = element_text())

A linear regression of 'price' on 'area' gives an intercept of 106,124.05 and a slope of 138.95. The model therefore predicts the price of a house in Nanaimo from its size: each additional square foot adds about 138.95 CAD to the predicted price. The intercept, which is the fitted price at zero area, can be read as a base market premium that a buyer can expect to pay for a property in Nanaimo. The regression line in the graph further below illustrates this.

mhousing <- lm( formula = price ~ area, data = nanaimo )
mhousing$coefficients

## (Intercept)        area 
## 106124.0503    138.9489

The regression line summarizes the relationship between the plotted 'area' and 'price' data points. This line of best fit can also be used to predict the price for any given area under the linear model.

base +
  geom_smooth( method = 'lm',
        formula = y ~ x,
        se = FALSE) +
  geom_point()+
  labs(title = "Nanaimo Housing Prices",
       subtitle = "by Area",
       x = "Area (in sq. ft.)",
       y = "Price (in CAD)") +
  theme_classic()+
  theme(axis.title = element_text())
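
As a worked example of predicting from the fitted 'area' line, the sketch below plugs the coefficients printed by lm() above into the equation price = intercept + slope * area. The helper name predict_price_by_area and the example areas are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the lm() output for price ~ area
coefs <- c(intercept = 106124.0503, area = 138.9489)

# Hypothetical helper: predicted price (CAD) for a given area in sq. ft.
predict_price_by_area <- function(area_sqft) {
  unname(coefs["intercept"] + coefs["area"] * area_sqft)
}

predict_price_by_area(1500)            # roughly 314,547 CAD
predict_price_by_area(c(1000, 2000))   # works on a vector of areas too
```

With the original model object available, predict(mhousing, newdata = data.frame(area = 1500)) should give the same figure, provided mhousing still holds the 'area' model.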

Bedrooms

The scatter plot below shows 'price' against the independent variable 'bed'. At a glance, price tends to rise with the number of bedrooms; the regression that follows explores this further.

base <- ggplot( data = nanaimo, aes( x = bed, y = price ) )
base + 
  geom_point() + 
  labs(title = "Nanaimo Housing Prices",
       subtitle = "by Bedrooms",
       x = "Bedrooms (in nos.)",
       y = "Price (in CAD)") +
  theme_classic()+
  theme(axis.title = element_text())

A linear regression of 'price' on 'bed' gives an intercept of 204,771 and a slope of 54,984. The model predicts the price of a house in Nanaimo from the number of bedrooms: each additional bedroom adds about 54,984 CAD to the predicted price, while the intercept can again be read as a base market premium for a property in Nanaimo. This premium differs from the one in the 'area' model because of the distribution of the data and the lack of a standard relationship between the number of bedrooms and the size of a house. The regression line in the graph further below illustrates this.

mhousing <- lm( formula = price ~ bed, data = nanaimo )
mhousing$coefficients

## (Intercept)         bed 
##   204770.47    54984.11

The regression line summarizes the relationship between the plotted 'bed' and 'price' data points. This line of best fit can also be used to predict the price for any given number of bedrooms under the linear model. Note that the slopes of the 'bed' and 'area' models are in different units (CAD per bedroom versus CAD per square foot), so how steep each line looks does not by itself show which relationship is stronger; a unit-free measure such as the correlation coefficient or R-squared is needed for that comparison.

base +
  geom_smooth( method = 'lm',
        formula = y ~ x,
        se = FALSE) +
  geom_point()+
  labs(title = "Nanaimo Housing Prices",
       subtitle = "by Bedrooms",
       x = "Bedrooms (in nos.)",
       y = "Price (in CAD)") +
  theme_classic()+
  theme(axis.title = element_text())
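
Because 'area' and 'bed' are measured in different units, one way to compare how well each explains 'price' is the R-squared of each model. The sketch below shows the mechanics on simulated data only: the data frame sim and every number in it are assumptions made for illustration, so the resulting values say nothing about the Nanaimo listings.

```r
set.seed(1)  # reproducible simulated data (not the Nanaimo listings)
n     <- 200
area  <- runif(n, 600, 3500)                     # sq. ft.
bed   <- pmax(0, round(area / 700 + rnorm(n)))   # bedrooms loosely tied to area
price <- 100000 + 140 * area + rnorm(n, sd = 60000)
sim   <- data.frame(price, area, bed)

fit_area <- lm(price ~ area, data = sim)
fit_bed  <- lm(price ~ bed,  data = sim)

# R-squared is unit-free, so the two models can be compared directly
r2 <- c(area = summary(fit_area)$r.squared,
        bed  = summary(fit_bed)$r.squared)
r2
```

On the real data, the same two summary() calls could be applied to the 'price ~ area' and 'price ~ bed' models fitted earlier.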

Dummy Variable

We created a binary dummy variable at the data-preparation stage using is.na(): the 'parking' field holds "Yes" where a parking value was recorded and "No" where it was missing.
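
To see how a Yes/No field becomes a 0/1 regressor, the sketch below recodes a few hypothetical raw parking entries and inspects the design matrix with model.matrix(). The raw values are invented for illustration; only the is.na() recoding mirrors the step used on the data set.

```r
# Hypothetical raw values: any recorded entry counts as parking, NA as none
parking_raw <- c("Garage", NA, "Carport", NA)
parking <- ifelse(is.na(parking_raw), "No", "Yes")

# model.matrix() shows the 0/1 column that lm() builds from the Yes/No labels
X <- model.matrix(~ parking, data = data.frame(parking = parking))
X
```

"No" is the reference level because it sorts first alphabetically, so the generated column is named parkingYes and equals 1 only for listings with parking.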

Next, we fit a multiple regression of 'price' on 'bed' and 'parking', including their interaction.

# parking * bed expands to parking + bed + parking:bed
model <- lm(price ~ parking * bed, data = nanaimo)
summary(model)$coefficients

##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     280195.34   46386.19  6.040491 3.753508e-09
## parkingYes     -123293.66   59677.32 -2.066005 3.952549e-02
## bed              28452.12   15936.88  1.785300 7.503392e-02
## parkingYes:bed   41668.17   19794.09  2.105081 3.595978e-02

ggplot( data = nanaimo, mapping = aes( x = bed, y = price, colour =  parking  ) ) +
  geom_point( alpha = .6 ) +
  geom_smooth( method = 'lm', se = FALSE ) +
  labs(title = "Nanaimo Housing Prices",
       subtitle = "by bedroom and parking availability",
       x = "Bedrooms (in nos.) and Parking availability",
       y = "Price (in CAD)") +
  theme_classic()+
  theme(axis.title = element_text())

## `geom_smooth()` using formula 'y ~ x'

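We can infer that, for a property without parking, an additional bedroom adds about 28,452 to the price, and that if the additional bedroom is in a property with parking, the increase is larger by 41,668, for a total of about 70,120. Properties with parking also start from a lower baseline, since the 'parkingYes' coefficient is -123,294, and the 'bed' coefficient on its own is not significant at the 5% level (p = 0.075).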
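
The arithmetic behind this reading of the interaction model can be checked with a short sketch that uses the estimates from the coefficient table. The helper predict_price and the bedroom counts passed to it are hypothetical and serve only as an illustration.

```r
# Estimates copied from summary(model)$coefficients
b <- c(intercept = 280195.34, parkingYes = -123293.66,
       bed = 28452.12, parkingYes_bed = 41668.17)

# Hypothetical helper: predicted price for a bedroom count and parking flag
predict_price <- function(bed, parking = c("No", "Yes")) {
  parking <- match.arg(parking)
  has_parking <- as.numeric(parking == "Yes")
  unname(b["intercept"] + b["parkingYes"] * has_parking +
           (b["bed"] + b["parkingYes_bed"] * has_parking) * bed)
}

# Price of one extra bedroom, without and with parking
predict_price(4, "No")  - predict_price(3, "No")    # about 28,452
predict_price(4, "Yes") - predict_price(3, "Yes")   # about 70,120
```

The second difference is the sum of the 'bed' and 'parkingYes:bed' estimates, which is where the 70,120 figure comes from.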

Final Test

Daisy Li, 658167887

Dickson Pallikunnath, 658770763