Dataset - Nanaimo Housing Dataset

Step 1: Loading data from local directory

# Loading required libraries
library(DT)
library(stringr)

# Load csv data
df <- read.csv("C:/Users/adity/Downloads/nanaimo.csv")

# Display top 8 columns from dataset
datatable(df[, c("X","address","price","area","bath","taxes", "taxyear","stratafee")])

Step 2: Feature Selection

Correlation provides us the strength of the relationship between 2 variables.

A negative correlation implies that, if one variable increases the other decreases and vice versa.
On the other hand, a positive correlation implies that if one variable increases, then the other one also increases and if one decreases the other one decreases as well.
For categorical data, we will not be able to test for correlation between string(character) and numeric data. Hence, we will encode the unique values (Eg: Powell River: 0 , Parksville: 1 …)

Step 2.1 : Identifying all numeric columns

# To store numeric cols for finding correlation to target variable
numeric_cols <- c()

# All columns present in the dataframe
df_columns <- names(df)

for (col_name in df_columns){
    if (is.numeric(df[,col_name]) == TRUE){
      numeric_cols <- append(numeric_cols, col_name) 
    }
}

numeric_cols

##  [1] "X"         "price"     "mls"       "lat"       "lng"       "bed"      
##  [7] "area"      "bath"      "landarea"  "lotwidth"  "lotdepth"  "age"      
## [13] "taxes"     "taxyear"   "stratafee"

# Store correlation values of respective Columns
corr_values = c()

# Iterate through each column, and append calculated correlation score w.r.t to 'price' column
for (num_col in numeric_cols){
  calc_corr <- cor(df[,num_col], df[,"price"], use='complete.obs')
  corr_values <- append(corr_values, calc_corr)
}

# Create new dataframe, combining Column name and Correlation values
corr_df <- data.frame(numeric_cols, corr_values)
corr_df_sorted <- corr_df[order(-corr_values),]
corr_df_sorted

##    numeric_cols corr_values
## 2         price  1.00000000
## 1             X  0.94017538
## 7          area  0.67467852
## 13        taxes  0.67407152
## 8          bath  0.52809544
## 6           bed  0.43519572
## 11     lotdepth  0.30078896
## 9      landarea  0.30044462
## 3           mls  0.29631017
## 10     lotwidth  0.23141636
## 12          age  0.22475732
## 5           lng  0.13626407
## 14      taxyear  0.09430585
## 4           lat  0.02795866
## 15    stratafee -0.33306197

From the above table, we can observe that features such as area, taxes, are highly correlated to the feature ‘price’.

Step 3: Selecting 3 features from dataset (which best explain House prices)

Feature 1: Using ‘area’ column from the dataset

# Correlation of 'area' with 'price'
area_price_corr <- cor(df$area, df$price, use="complete.obs")
cat("Correlation of Area with Price:",area_price_corr)

## Correlation of Area with Price: 0.6746785

Using Ggplot - geom_point, geom_smooth, lm functions

library(ggplot2)
plot_1 <- ggplot(data=df,mapping=aes(x=area, y=price)) + geom_point()  + ggtitle("Area w.r.t Price")
plot_1

Figure 1: Plotting Area values wrt to Price

plot_1 <- ggplot(data=df,mapping=aes(x=area, y=price)) + geom_point() + geom_smooth() + ggtitle("Area w.r.t Price - Regression line")
plot_1

Figure 2: Adding a best fit line - Area values wrt to Price

plot_1 <- ggplot(data=df,mapping=aes(x=area, y=price)) + geom_point() + geom_smooth(method="lm",se=FALSE) + ggtitle("Area w.r.t Price - Regression Line (Using Linear model)")
plot_1

Figure 3 : Adding a regression line (using lm) - Area values wrt to Price

# Using lm function to create a regression model
price_area_summ <- lm(formula=price~area, data=df)
summary(price_area_summ)

## 
## Call:
## lm(formula = price ~ area, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -354270  -98622  -17701   63728  572017 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.259e+05  1.294e+04   9.728   <2e-16 ***
## area        1.305e+02  6.782e+00  19.239   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156800 on 443 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.4552, Adjusted R-squared:  0.454 
## F-statistic: 370.1 on 1 and 443 DF,  p-value: < 2.2e-16

Summary: Feature 1 (Area)

From Figure 3 we can see that the feature ‘price’ and area are highly correlated, i.e as area tends to increase, the price of the house also increases.
Thus, we can say that there is a positive correlation among the 2 features.
Also, from the summary generated using the lm function, we can observe that if the area is approx 130 sq.feet, the price of the house is roughly $125,900.

Feature 2: Using the ‘taxes’ column from the dataset

# Correlation of 'taxes' with 'price'
taxes_price_corr <- cor(df$taxes, df$price, use="complete.obs")
cat("Correlation of taxes with Price:", taxes_price_corr)

## Correlation of taxes with Price: 0.6740715

tax_price_plot <- ggplot(data=df,mapping=aes(x=taxes, y=price)) + geom_point()  + ggtitle("Tax w.r.t Price")
tax_price_plot

Figure 4 : Plotting Taxes values wrt to Price

tax_price_plot_smooth <- ggplot(data=df,mapping=aes(x=taxes, y=price)) + geom_point() + geom_smooth() + ggtitle("Tax w.r.t Price - Regression line")
tax_price_plot_smooth

Figure 5 : Best fit line - Area values wrt to Price

tax_price_plot_smooth_lm <- ggplot(data=df,mapping=aes(x=taxes, y=price)) + geom_point() + geom_smooth(method="lm",se=FALSE) + ggtitle("Tax w.r.t Price - Regression Line (Using Linear model)")

tax_price_plot_smooth_lm

Figure 6 : Regression line - Area values wrt to Price

# Using lm function to create a regression model
tax_price_summ <- lm(price~taxes, data=df)
summary(tax_price_summ)

## 
## Call:
## lm(formula = price ~ taxes, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462915  -95522  -27534   62873  809417 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58266.946  14743.118   3.952 8.91e-05 ***
## taxes         112.242      5.614  19.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156900 on 480 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.4544, Adjusted R-squared:  0.4532 
## F-statistic: 399.7 on 1 and 480 DF,  p-value: < 2.2e-16

Summary: Feature 2 (Taxes)

From the above correlation value, we can say that ‘area’ and ‘tax’ feature has a high correlation to ‘price’, i.e As the Area & taxes of the house increases, the Price of the house also increases.
Also from Figure 6, we can observe that there is a positive correlation among the two variables.
We can also observe that, if the taxes value is close to $112, the price is roughly $58,266.

Step 4: Feature Engineering : Creating new features from existing columns

New feature : Creating new feature ‘location’ from ‘address’ column

# Splitting address into new column -> 'Location'
split_addr_by_comma <- function(x) str_trim(unlist(strsplit(x,","))[2])
df$Location <- mapply(split_addr_by_comma, df$address)


head(data.frame(Original_Address=df$address, Location_derived=df$Location))

##                         Original_Address Location_derived
## 1    122 - 1175 Resort Drive, Parksville       Parksville
## 2   26 - 4837 Roger Street, Port Alberni     Port Alberni
## 3 106c - 181 Beachside Drive, Parksville       Parksville
## 4      Lot 15 Fernwood Ave, Powell River     Powell River
## 5    A11 - 2157 Regent Road, Black Creek      Black Creek
## 6   38 - 5455 Borden Place, Powell River     Powell River

# Check all Unique values in new column 'Location'
unique(df$Location)

##  [1] "Parksville"      "Port Alberni"    "Powell River"    "Black Creek"    
##  [5] "Nanoose Bay"     "Ucluelet"        "Gabriola Island" "Savary Island"  
##  [9] "Errington"       "Nanaimo"         "Coombs"          "Ladysmith"      
## [13] "Duncan"          "Campbell River"  "Courtenay"       "Denman Island"  
## [17] "Texada Island"   "Hornby Island"   "Bamfield"        "Youbou"         
## [21] "Lasqueti Island" "Cumberland"      "Saltair"         "Ruxton Island"  
## [25] "Comox"           "French Creek"    "Tofino"          "47"             
## [29] "Qualicum Beach"  "Bowser"          "Chemainus"       "Union Bay"      
## [33] "Merville"        "Lantzville"

# Select row with outlier where Location value is "47" 

datatable(df[which(df$Location == "47"),c("address", "Location")])

# Replacing original value of location by considering last occurence of "," in given Address 
comma_indexes <- which(strsplit(df[which(df$Location == "47"),"address"],"")[[1]] == ",")

# Iterate through all indices, where "," is present in address value
# Eg: Address: Lots - 46,47,48 Marine Ave, Powell River  -> Split on last occurence of "," -> location: Powell River
for (idx in comma_indexes){
  
  # Check index of last occurence of "," and include substring which occurs after the "," character 
  if (idx == comma_indexes[length(comma_indexes)]){
    df$Location[302] <- str_trim(str_sub(df$address[302],start=idx+1))
  }
}

# Print new Location value
df$Location[302]

## [1] "Powell River"

Step 2.1: Correlation test of new column ‘Location’ with ‘Price’ column

# Label encoding
df$Location_encoded <- as.numeric(factor(df$Location))
head(df[,c("address","Location", "Location_encoded")])

##                                  address     Location Location_encoded
## 1    122 - 1175 Resort Drive, Parksville   Parksville               22
## 2   26 - 4837 Roger Street, Port Alberni Port Alberni               23
## 3 106c - 181 Beachside Drive, Parksville   Parksville               22
## 4      Lot 15 Fernwood Ave, Powell River Powell River               24
## 5    A11 - 2157 Regent Road, Black Creek  Black Creek                2
## 6   38 - 5455 Borden Place, Powell River Powell River               24

Step 2.2: Test for correlation : Location (encoded) with Price

loc_price_corr <- cor(df$Location_encoded, df$price, use="complete.obs")
cat("Correlation of Location wrt Price:", loc_price_corr)

## Correlation of Location wrt Price: -0.1620253

Since the correlation of the newly created feature “Location” is low w.r.t the target variable i.e price, we will continue to test combinations of other columns which will generate a greater correlation score.

New feature: Taxes + Strata Fee

# New feature wrt to Price - Using Area, age, Total extra charges (Tax + strata fee)

#cor(df$area, df$price, use="complete.obs")

df$Combined_taxes_strata_fee <- df$taxes + df$stratafee
taxes_plus_strata_corr <- cor(df$Combined_taxes_strata_fee, df$price, use="complete.obs")
taxes_plus_strata_corr

## [1] 0.8162398

combined_taxes_strata_fee_lm = lm(price~Combined_taxes_strata_fee, data=df)
summary(combined_taxes_strata_fee_lm)

## 
## Call:
## lm(formula = price ~ Combined_taxes_strata_fee, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -332070  -41836   -1010   24121  562603 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -36898.84   29542.23  -1.249    0.216    
## Combined_taxes_strata_fee    123.02      10.05  12.236   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122500 on 75 degrees of freedom
##   (472 observations deleted due to missingness)
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6618 
## F-statistic: 149.7 on 1 and 75 DF,  p-value: < 2.2e-16

We can immediately notice that the combination of the two columns viz. taxes and strata fee produces a higher correlation score, as opposed to treating each variable independently.

Using Ggplot - geom_point, geom_smooth, lm functions

library(ggplot2)
scatter_plot_taxes <- ggplot(data=df,mapping=aes(x=Combined_taxes_strata_fee, y=price)) + geom_point()  + ggtitle("Total Price as a function of Taxes + Strata fee")
scatter_plot_taxes

Figure 7 : Plotting New column(Combined_taxes_strata_fee) values wrt to Price

plot_1 <- ggplot(data=df,mapping=aes(x=Combined_taxes_strata_fee, y=price, color=Location)) + geom_point() + geom_smooth() + ggtitle("Total Price as a function of Taxes + Strata fee")
plot_1

Figure 8 : Best fit line : New column(Combined_taxes_strata_fee) values wrt to Price

plot_1 <- ggplot(data=df,mapping=aes(x=Combined_taxes_strata_fee, y=price)) + geom_point() + geom_smooth(method="lm",se=FALSE) + ggtitle("Total Price as a function of Taxes + Strata fee - Regression Line (Least squares line)")
plot_1

Figure 9 : Regression Line: New column(Combined_taxes_strata_fee) values wrt to Price

Summary : New Feature (Combined_taxes_strata_fee - Taxes + Stratafee)

From Figure 9, we can observe that there is a positive correlation among the newly created feature (Taxes + strata fee) with the price feature.
Also, the correlation of the newly created feature tends to be higher, as opposed to treating each feature independently.

Final Test

Vinita, 658177761

Aditya Nair, 658268891

10/22/2021

Dataset - Nanaimo Housing Dataset

Step 1: Loading data from local directory

Step 2: Feature Selection

Step 2.1 : Identifying all numeric columns

Step 3: Selecting 3 features from dataset (which best explain House prices)

Feature 1: Using ‘area’ column from the dataset

Using Ggplot - geom_point, geom_smooth, lm functions

Summary: Feature 1 (Area)

Feature 2: Using the ‘taxes’ column from the dataset

Summary: Feature 2 (Taxes)

Step 4: Feature Engineering : Creating new features from existing columns

New feature : Creating new feature ‘location’ from ‘address’ column

Step 2.1: Correlation test of new column ‘Location’ with ‘Price’ column

Step 2.2: Test for correlation : Location (encoded) with Price

New feature: Taxes + Strata Fee

Using Ggplot - geom_point, geom_smooth, lm functions

Summary : New Feature (Combined_taxes_strata_fee - Taxes + Stratafee)