Introduction

For this exercise, I used a real estate price dataset obtained from Kaggle. The dataset includes 6 predictor variables, and the goal is to predict the house price per unit area.

Assignment:
- Using R, build a multiple regression model for data that interests you.
- Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.
- Interpret all coefficients.
- Conduct residual analysis.
- Was the linear model appropriate?
- Why or why not?

Setup environment

We begin by loading the packages we need, importing the data into a data frame, and cleaning up the column names.

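library(dplyr)        # %>%, select(), filter(), mutate(), case_when()
library(tibble)       # tibble()
library(fastDummies)  # dummy_cols()
df <- read.csv(path)  # `path` points to the downloaded Kaggle CSV
df <- tibble(df)
df <- janitor::clean_names(df)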

Viewing Original Data and Pre-Processing

We begin by viewing our dependent variable and checking it for outliers, which we will remove if any are present.

hist(df$y_house_price_of_unit_area)

boxplot(df$y_house_price_of_unit_area)

summary(df)
##        no        x1_transaction_date  x2_house_age   
##  Min.   :  1.0   Min.   :2013        Min.   : 0.000  
##  1st Qu.:104.2   1st Qu.:2013        1st Qu.: 9.025  
##  Median :207.5   Median :2013        Median :16.100  
##  Mean   :207.5   Mean   :2013        Mean   :17.713  
##  3rd Qu.:310.8   3rd Qu.:2013        3rd Qu.:28.150  
##  Max.   :414.0   Max.   :2014        Max.   :43.800  
##  x3_distance_to_the_nearest_mrt_station x4_number_of_convenience_stores
##  Min.   :  23.38                        Min.   : 0.000                 
##  1st Qu.: 289.32                        1st Qu.: 1.000                 
##  Median : 492.23                        Median : 4.000                 
##  Mean   :1083.89                        Mean   : 4.094                 
##  3rd Qu.:1454.28                        3rd Qu.: 6.000                 
##  Max.   :6488.02                        Max.   :10.000                 
##   x5_latitude     x6_longitude   y_house_price_of_unit_area
##  Min.   :24.93   Min.   :121.5   Min.   :  7.60            
##  1st Qu.:24.96   1st Qu.:121.5   1st Qu.: 27.70            
##  Median :24.97   Median :121.5   Median : 38.45            
##  Mean   :24.97   Mean   :121.5   Mean   : 37.98            
##  3rd Qu.:24.98   3rd Qu.:121.5   3rd Qu.: 46.60            
##  Max.   :25.01   Max.   :121.6   Max.   :117.50

We see from the histogram that our output variable is roughly normal. However, from the boxplot we see that there are at least 2 values that are outliers. Based on this, we use the IQR rule to define the lower and upper bounds for outliers, and filter out any values of the output variable that fall outside them.
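
As a quick check of the IQR rule, the sketch below plugs in the quartiles printed by summary(df) for the response. The variable names here are illustrative, and because the printed quartiles are rounded, the bounds are approximate.

```r
# Quartiles as printed (rounded) by summary(df) for y_house_price_of_unit_area
q1 <- 27.70
q3 <- 46.60

iqr <- q3 - q1                  # interquartile range
lower_bound <- q1 - 1.5 * iqr   # values below this are flagged
upper_bound <- q3 + 1.5 * iqr   # values above this are flagged

round(c(lower = lower_bound, upper = upper_bound), 2)
```

Since the smallest price in the summary is 7.60, the lower bound flags nothing here. Only the upper bound removes rows, because the maximum price of 117.50 lies well above it.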

Additionally, we note that the first column is simply an index column that labels the rows of the data. We also would not expect the x1_transaction_date column to have any predictive value, since it records when the transaction was completed and is therefore an “after the fact” variable. Both of these columns will be removed from our data.

An additional observation worth noting is that x3_distance_to_the_nearest_mrt_station is highly skewed to the right, and therefore we may want to apply some form of transformation to that feature. Finally, the narrow ranges of the x5_latitude and x6_longitude values suggest that the prices are for houses in a single neighborhood, and given the variability in the data, it’s highly likely that the output depends on qualitative variables not included in the data.

As a next step, we do the following:
- filter the data frame to remove outlier values in our output variable
- remove the “no” and “x1_transaction_date” columns

Q3 <- quantile(df$y_house_price_of_unit_area, .75)
Q1 <- quantile(df$y_house_price_of_unit_area, .25)

# Tukey fences: flag values more than 1.5 * IQR outside the quartiles
iqr <- Q3[[1]] - Q1[[1]]
lower_bound <- Q1[[1]] - 1.5 * iqr
upper_bound <- Q3[[1]] + 1.5 * iqr

mod0 <- df %>%
  select(-c("no", "x1_transaction_date"))

mod1 <- df %>%
  filter(between(y_house_price_of_unit_area,lower_bound, upper_bound)) %>%
  select(-c("no", "x1_transaction_date"))

cor(mod1)
##                                        x2_house_age
## x2_house_age                             1.00000000
## x3_distance_to_the_nearest_mrt_station   0.03016725
## x4_number_of_convenience_stores          0.03538514
## x5_latitude                              0.05228492
## x6_longitude                            -0.05352651
## y_house_price_of_unit_area              -0.24285150
##                                        x3_distance_to_the_nearest_mrt_station
## x2_house_age                                                       0.03016725
## x3_distance_to_the_nearest_mrt_station                             1.00000000
## x4_number_of_convenience_stores                                   -0.60471041
## x5_latitude                                                       -0.59042600
## x6_longitude                                                      -0.80676824
## y_house_price_of_unit_area                                        -0.70134918
##                                        x4_number_of_convenience_stores
## x2_house_age                                                0.03538514
## x3_distance_to_the_nearest_mrt_station                     -0.60471041
## x4_number_of_convenience_stores                             1.00000000
## x5_latitude                                                 0.44607875
## x6_longitude                                                0.44821067
## y_house_price_of_unit_area                                  0.60585298
##                                        x5_latitude x6_longitude
## x2_house_age                            0.05228492  -0.05352651
## x3_distance_to_the_nearest_mrt_station -0.59042600  -0.80676824
## x4_number_of_convenience_stores         0.44607875   0.44821067
## x5_latitude                             1.00000000   0.41265667
## x6_longitude                            0.41265667   1.00000000
## y_house_price_of_unit_area              0.57184891   0.55458517
##                                        y_house_price_of_unit_area
## x2_house_age                                           -0.2428515
## x3_distance_to_the_nearest_mrt_station                 -0.7013492
## x4_number_of_convenience_stores                         0.6058530
## x5_latitude                                             0.5718489
## x6_longitude                                            0.5545852
## y_house_price_of_unit_area                              1.0000000

Observing the correlations with our output variable, we see that x2_house_age has only a weak negative correlation with price (about -0.24). This may be a good candidate for a variable that would benefit from being dichotomized. We also see fairly strong collinearity between x6_longitude and x3_distance_to_the_nearest_mrt_station (correlation of about -0.81).
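
One way to quantify collinearity beyond pairwise correlations is the variance inflation factor (VIF). The sketch below is not part of the original analysis. It computes VIFs by hand in base R on synthetic data, and the helper name vif_manual and the toy variables are made up for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # built to be strongly correlated with x1
x3 <- rnorm(n)                 # unrelated to the others
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the rest
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    fit <- lm(reformulate(setdiff(names(data), v), response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(toy)
round(vifs, 2)
```

Applied to the predictor columns of mod1, the same helper would show how much the variance of each coefficient is inflated by overlap with the other predictors. Values well above roughly 5 to 10 are commonly treated as a warning sign.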

Building the model

Given that there are only a few explanatory variables in the dataset, and given what we already know about the correlations in the data, we will begin by building a model that contains all of our predictor variables and comparing it with a model containing only the feature most strongly correlated with the response, in this case x3_distance_to_the_nearest_mrt_station.

# Model with all terms + including outliers
lm0 <- lm(y_house_price_of_unit_area~., data=mod0)
summary(lm0)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.546  -5.267  -1.600   4.247  76.372 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -4.946e+03  6.211e+03  -0.796    0.426
## x2_house_age                           -2.689e-01  3.900e-02  -6.896 2.04e-11
## x3_distance_to_the_nearest_mrt_station -4.259e-03  7.233e-04  -5.888 8.17e-09
## x4_number_of_convenience_stores         1.163e+00  1.902e-01   6.114 2.27e-09
## x5_latitude                             2.378e+02  4.495e+01   5.290 2.00e-07
## x6_longitude                           -7.805e+00  4.915e+01  -0.159    0.874
##                                           
## (Intercept)                               
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station ***
## x4_number_of_convenience_stores        ***
## x5_latitude                            ***
## x6_longitude                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.965 on 408 degrees of freedom
## Multiple R-squared:  0.5712, Adjusted R-squared:  0.5659 
## F-statistic: 108.7 on 5 and 408 DF,  p-value: < 2.2e-16
# Model with all terms + excluding outliers
lm.all <- lm(y_house_price_of_unit_area~., data=mod1)
summary(lm.all)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.066  -5.048  -1.241   4.258  31.178 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -8.028e+03  5.391e+03  -1.489    0.137
## x2_house_age                           -2.840e-01  3.399e-02  -8.354 1.06e-15
## x3_distance_to_the_nearest_mrt_station -3.756e-03  6.292e-04  -5.969 5.21e-09
## x4_number_of_convenience_stores         1.203e+00  1.664e-01   7.232 2.40e-12
## x5_latitude                             2.397e+02  3.895e+01   6.153 1.83e-09
## x6_longitude                            1.716e+01  4.267e+01   0.402    0.688
##                                           
## (Intercept)                               
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station ***
## x4_number_of_convenience_stores        ***
## x5_latitude                            ***
## x6_longitude                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.765 on 405 degrees of freedom
## Multiple R-squared:  0.6347, Adjusted R-squared:  0.6302 
## F-statistic: 140.7 on 5 and 405 DF,  p-value: < 2.2e-16
# Model with the explanatory term with highest correlation to response variable
lm1 <- lm(y_house_price_of_unit_area~x3_distance_to_the_nearest_mrt_station, data=mod1)
summary(lm1)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ x3_distance_to_the_nearest_mrt_station, 
##     data = mod1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.925  -5.828  -0.938   4.963  30.365 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            45.3093268  0.5937247   76.31   <2e-16
## x3_distance_to_the_nearest_mrt_station -0.0070811  0.0003559  -19.90   <2e-16
##                                           
## (Intercept)                            ***
## x3_distance_to_the_nearest_mrt_station ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.113 on 409 degrees of freedom
## Multiple R-squared:  0.4919, Adjusted R-squared:  0.4906 
## F-statistic: 395.9 on 1 and 409 DF,  p-value: < 2.2e-16

Looking at the adjusted R-squared values for the two models fitted without outliers, we see that the model with all variables has an adjusted R-squared of 0.6302, whereas the model with only the x3 variable has an adjusted R-squared of 0.4906. Given this difference, we will move forward with the model containing all of our variables and work to improve it. One other note is that the model that included all variables before excluding the outliers in our response variable had a lower adjusted R-squared of 0.5659 (multiple R-squared 0.5712).
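
The adjusted figures can be recovered from the multiple R-squared values in the summaries, which is a useful check when comparing models with different numbers of predictors. The sketch below uses the standard adjustment formula. The helper name adj_r2 is illustrative, and the inputs are the rounded values printed above, so results agree only up to rounding.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read from the summaries above: n = residual df + number of coefficients
adj_all      <- adj_r2(r2 = 0.6347, n = 411, p = 5)  # lm.all (outliers removed)
adj_outliers <- adj_r2(r2 = 0.5712, n = 414, p = 5)  # lm0 (outliers kept)

round(c(lm.all = adj_all, lm0 = adj_outliers), 4)
```

Because the adjustment penalizes each additional predictor, adjusted R-squared is the fairer figure to track as terms are added in the sections that follow.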

boxplot(mod1$x3_distance_to_the_nearest_mrt_station)

hist(mod1$x3_distance_to_the_nearest_mrt_station)

The charts for the x3 variable show heavy right skew, so we will apply a log transform to attempt to correct for this.

hist(log(mod1$x3_distance_to_the_nearest_mrt_station))

This drastically improves the distribution of this variable, so we will create a term in our dataset with this new feature. From there we will re-evaluate the model to determine how it impacts our R-Squared.
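
The visual improvement can also be expressed as a number. The sketch below is an illustration on synthetic log-normal data, not the actual MRT distances. It computes a simple moment-based skewness before and after the log transform, and the skewness helper is defined here because base R does not provide one.

```r
set.seed(42)
# Synthetic right-skewed "distances" (log-normal), standing in for the MRT distance column
dist_m <- rlnorm(500, meanlog = 6, sdlog = 1)

# Moment-based sample skewness: mean of cubed z-scores
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(dist_m)
skew_log <- skewness(log(dist_m))

round(c(raw = skew_raw, logged = skew_log), 2)
```

A skewness near zero after the transform indicates a roughly symmetric distribution. Running the same helper on mod1$x3_distance_to_the_nearest_mrt_station and its log would give the corresponding figures for the real data.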

mod2 <- mod1 %>% mutate(
  log_mrt_distance = log(x3_distance_to_the_nearest_mrt_station)
)

lm2 <- lm(y_house_price_of_unit_area~., data=mod2)
summary(lm2)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.5300  -3.8833  -0.3455   3.1042  30.8001 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -1.793e+04  4.988e+03  -3.594 0.000365
## x2_house_age                           -2.461e-01  3.101e-02  -7.934 2.11e-14
## x3_distance_to_the_nearest_mrt_station  1.507e-03  7.936e-04   1.899 0.058342
## x4_number_of_convenience_stores         4.935e-01  1.680e-01   2.937 0.003507
## x5_latitude                             3.180e+02  3.620e+01   8.786  < 2e-16
## x6_longitude                            8.285e+01  3.922e+01   2.112 0.035282
## log_mrt_distance                       -6.744e+00  7.085e-01  -9.519  < 2e-16
##                                           
## (Intercept)                            ***
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station .  
## x4_number_of_convenience_stores        ** 
## x5_latitude                            ***
## x6_longitude                           *  
## log_mrt_distance                       ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.026 on 404 degrees of freedom
## Multiple R-squared:  0.7016, Adjusted R-squared:  0.6972 
## F-statistic: 158.3 on 6 and 404 DF,  p-value: < 2.2e-16

The inclusion of our new transformed term improves our adjusted R-squared to 0.6972.

Dichotomous Variables

We will now create two different variables based on the age of the house. The first, home_age_bin, groups the data into age bins at 5-year intervals. Because case_when() uses the first condition that matches, a boundary age such as exactly 5 falls into the lower bin, and any age above 40 falls into the last bin. The second, is_new, is a dichotomous variable that takes a value of 1 if the house is 10 years old or less, and 0 if it is older. For the home_age_bin variable, we convert the bins into dichotomous variables using the dummy_cols function from the fastDummies library. Additionally, we add a dichotomous variable, convenience_store_nearby, that indicates whether there are 5 or more convenience stores nearby.

Once complete, we re-run our model to evaluate the impact of our new dichotomous variables.

mod3 <- mod2 %>% mutate(home_age_bin = case_when(
  between(x2_house_age,0,5) ~ '0_5_years',
  between(x2_house_age,5,10) ~ '6_10_years',
  between(x2_house_age,10,15) ~ '11_15_years',
  between(x2_house_age,15,20) ~ '16_20_years',
  between(x2_house_age,20,25) ~ '21_25_years',
  between(x2_house_age,25,30) ~ '26_30_years',
  between(x2_house_age,30,35) ~ '31_35_years',
  between(x2_house_age,35,40) ~ '36_40_years', 
  TRUE ~ '41_50_years' 
),
  is_new = ifelse(x2_house_age <=10,1,0),
  convenience_store_nearby = ifelse(x4_number_of_convenience_stores >= 5,1,0)
)


mod3 <- dummy_cols(mod3, select_columns='home_age_bin', remove_first_dummy = TRUE) %>%
  select(-home_age_bin)


lm3 <- lm(y_house_price_of_unit_area~., data=mod3)
summary(lm3)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.4418  -4.1331  -0.0487   3.6100  30.0701 
## 
## Coefficients: (1 not defined because of singularities)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -1.377e+04  4.861e+03  -2.833 0.004849
## x2_house_age                           -1.065e+00  2.390e-01  -4.455 1.09e-05
## x3_distance_to_the_nearest_mrt_station  8.684e-04  7.725e-04   1.124 0.261605
## x4_number_of_convenience_stores        -8.351e-02  2.378e-01  -0.351 0.725653
## x5_latitude                             3.401e+02  3.450e+01   9.859  < 2e-16
## x6_longitude                            4.438e+01  3.852e+01   1.152 0.250074
## log_mrt_distance                       -5.152e+00  7.232e-01  -7.123 5.03e-12
## is_new                                 -3.789e+01  9.724e+00  -3.896 0.000115
## convenience_store_nearby                5.031e+00  1.425e+00   3.531 0.000463
## home_age_bin_6_10_years                 3.019e+00  1.765e+00   1.710 0.088017
## home_age_bin_11_15_years               -3.405e+01  7.292e+00  -4.670 4.14e-06
## home_age_bin_16_20_years               -2.916e+01  6.367e+00  -4.580 6.25e-06
## home_age_bin_21_25_years               -2.420e+01  5.550e+00  -4.362 1.65e-05
## home_age_bin_26_30_years               -2.209e+01  4.350e+00  -5.078 5.89e-07
## home_age_bin_31_35_years               -1.642e+01  3.332e+00  -4.929 1.22e-06
## home_age_bin_36_40_years               -6.938e+00  2.897e+00  -2.395 0.017078
## home_age_bin_41_50_years                       NA         NA      NA       NA
##                                           
## (Intercept)                            ** 
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station    
## x4_number_of_convenience_stores           
## x5_latitude                            ***
## x6_longitude                              
## log_mrt_distance                       ***
## is_new                                 ***
## convenience_store_nearby               ***
## home_age_bin_6_10_years                .  
## home_age_bin_11_15_years               ***
## home_age_bin_16_20_years               ***
## home_age_bin_21_25_years               ***
## home_age_bin_26_30_years               ***
## home_age_bin_31_35_years               ***
## home_age_bin_36_40_years               *  
## home_age_bin_41_50_years                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.582 on 395 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.7343 
## F-statistic: 76.55 on 15 and 395 DF,  p-value: < 2.2e-16

The adjusted R-squared for our model has now improved to 0.7343.
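
The summary above reports one coefficient as "not defined because of singularities". This happens when one column is an exact linear combination of others: here is_new carries the same information as the two youngest age bins, so together with the remaining dummies and the intercept the last bin is redundant. The sketch below reproduces the effect on a small synthetic example with made-up variable names.

```r
set.seed(1)
age   <- runif(120, 0, 30)
price <- 50 - 0.5 * age + rnorm(120)

# Three non-overlapping age bins
bin <- cut(age, breaks = c(0, 10, 20, 30),
           labels = c("0_10", "11_20", "21_30"), include.lowest = TRUE)

toy <- data.frame(
  price,
  is_new    = as.numeric(bin == "0_10"),   # same information as the first bin
  bin_11_20 = as.numeric(bin == "11_20"),
  bin_21_30 = as.numeric(bin == "21_30")
)

# is_new + bin_11_20 + bin_21_30 equals 1 in every row, i.e. the intercept column
fit <- lm(price ~ ., data = toy)
coef(fit)  # the last redundant dummy is reported as NA
```

lm() keeps the fit valid by dropping the redundant column, but the NA is a signal that one of the overlapping indicators can be removed without losing information.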

Quadratic Terms

Next we add the square of our log_mrt_distance variable and rerun the model to evaluate its performance.

mod4 <- mod3 %>% mutate(
  square_mrt_dist = (log_mrt_distance)^2
)

lm4 <- lm(y_house_price_of_unit_area~., data=mod4)
summary(lm4)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.097  -4.222  -0.065   3.629  30.094 
## 
## Coefficients: (1 not defined because of singularities)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -1.399e+04  4.807e+03  -2.911 0.003807
## x2_house_age                           -9.944e-01  2.373e-01  -4.190 3.44e-05
## x3_distance_to_the_nearest_mrt_station  4.860e-03  1.473e-03   3.299 0.001058
## x4_number_of_convenience_stores         1.152e-01  2.434e-01   0.474 0.636115
## x5_latitude                             3.157e+02  3.497e+01   9.029  < 2e-16
## x6_longitude                            5.077e+01  3.814e+01   1.331 0.184001
## log_mrt_distance                        1.456e+01  6.260e+00   2.325 0.020572
## is_new                                 -3.494e+01  9.660e+00  -3.618 0.000336
## convenience_store_nearby                3.022e+00  1.545e+00   1.956 0.051201
## home_age_bin_6_10_years                 3.060e+00  1.745e+00   1.753 0.080344
## home_age_bin_11_15_years               -3.207e+01  7.237e+00  -4.432 1.21e-05
## home_age_bin_16_20_years               -2.758e+01  6.315e+00  -4.367 1.61e-05
## home_age_bin_21_25_years               -2.298e+01  5.501e+00  -4.177 3.64e-05
## home_age_bin_26_30_years               -2.124e+01  4.310e+00  -4.929 1.22e-06
## home_age_bin_31_35_years               -1.614e+01  3.296e+00  -4.899 1.41e-06
## home_age_bin_36_40_years               -7.305e+00  2.867e+00  -2.548 0.011198
## home_age_bin_41_50_years                       NA         NA      NA       NA
## square_mrt_dist                        -1.861e+00  5.874e-01  -3.169 0.001650
##                                           
## (Intercept)                            ** 
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station ** 
## x4_number_of_convenience_stores           
## x5_latitude                            ***
## x6_longitude                              
## log_mrt_distance                       *  
## is_new                                 ***
## convenience_store_nearby               .  
## home_age_bin_6_10_years                .  
## home_age_bin_11_15_years               ***
## home_age_bin_16_20_years               ***
## home_age_bin_21_25_years               ***
## home_age_bin_26_30_years               ***
## home_age_bin_31_35_years               ***
## home_age_bin_36_40_years               *  
## home_age_bin_41_50_years                  
## square_mrt_dist                        ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.508 on 394 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7403 
## F-statistic: 74.03 on 16 and 394 DF,  p-value: < 2.2e-16

This new term improved the adjusted R-squared of our model to 0.7403.
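
As an alternative to adding a squared column with mutate(), the quadratic term can be written directly in the model formula. The sketch below uses synthetic data and illustrative names to show that the two approaches give identical coefficients.

```r
set.seed(2)
x <- runif(150, 3, 9)                      # stands in for a logged distance
y <- 40 + 12 * x - 1.5 * x^2 + rnorm(150)  # curved relationship plus noise
toy <- data.frame(x, y)

# Approach used in the text: add the squared column, then fit
toy$x_sq   <- toy$x^2
fit_column <- lm(y ~ x + x_sq, data = toy)

# Equivalent: square inside the formula; I() protects ^ from formula syntax
fit_inline <- lm(y ~ x + I(x^2), data = toy)

cbind(column = coef(fit_column), inline = coef(fit_inline))
```

A negative coefficient on the squared term, as in the model above, means the fitted curve bends downward as the logged distance grows. The inline form also lets predict() work on new data that contains only the original variable.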

Interaction Terms

We next build interaction terms, each the product of one of our dichotomous terms and one of our quantitative variables. After seeing how strongly the MRT distance variable correlates with price, I chose to explore two interactions: convenience_store_nearby with log_mrt_distance, and is_new with square_mrt_dist.

mod5 <- mod4 %>% mutate(
  interaction_term1 = (convenience_store_nearby)*(log_mrt_distance),
  interaction_term2 = (is_new)*(square_mrt_dist)
)

lm5 <- lm(y_house_price_of_unit_area~., data=mod5)
summary(lm5)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.8602  -4.0292  -0.3665   3.5033  30.5259 
## 
## Coefficients: (1 not defined because of singularities)
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -1.292e+04  4.761e+03  -2.713 0.006955
## x2_house_age                           -8.952e-01  2.372e-01  -3.774 0.000186
## x3_distance_to_the_nearest_mrt_station  6.172e-03  1.507e-03   4.095 5.13e-05
## x4_number_of_convenience_stores        -5.592e-02  2.477e-01  -0.226 0.821542
## x5_latitude                             3.315e+02  3.457e+01   9.588  < 2e-16
## x6_longitude                            3.786e+01  3.789e+01   0.999 0.318324
## log_mrt_distance                        4.274e+01  9.727e+00   4.394 1.43e-05
## is_new                                 -2.557e+01  1.020e+01  -2.507 0.012572
## convenience_store_nearby                4.238e+01  1.184e+01   3.580 0.000386
## home_age_bin_6_10_years                 1.802e+00  1.743e+00   1.034 0.301873
## home_age_bin_11_15_years               -2.946e+01  7.153e+00  -4.119 4.65e-05
## home_age_bin_16_20_years               -2.586e+01  6.228e+00  -4.151 4.06e-05
## home_age_bin_21_25_years               -2.170e+01  5.414e+00  -4.009 7.30e-05
## home_age_bin_26_30_years               -1.981e+01  4.250e+00  -4.661 4.32e-06
## home_age_bin_31_35_years               -1.563e+01  3.240e+00  -4.826 2.00e-06
## home_age_bin_36_40_years               -7.140e+00  2.847e+00  -2.508 0.012540
## home_age_bin_41_50_years                       NA         NA      NA       NA
## square_mrt_dist                        -3.966e+00  8.137e-01  -4.874 1.59e-06
## interaction_term1                      -6.434e+00  1.911e+00  -3.368 0.000834
## interaction_term2                      -1.407e-01  6.285e-02  -2.239 0.025709
##                                           
## (Intercept)                            ** 
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station ***
## x4_number_of_convenience_stores           
## x5_latitude                            ***
## x6_longitude                              
## log_mrt_distance                       ***
## is_new                                 *  
## convenience_store_nearby               ***
## home_age_bin_6_10_years                   
## home_age_bin_11_15_years               ***
## home_age_bin_16_20_years               ***
## home_age_bin_21_25_years               ***
## home_age_bin_26_30_years               ***
## home_age_bin_31_35_years               ***
## home_age_bin_36_40_years               *  
## home_age_bin_41_50_years                  
## square_mrt_dist                        ***
## interaction_term1                      ***
## interaction_term2                      *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.392 on 392 degrees of freedom
## Multiple R-squared:  0.7604, Adjusted R-squared:  0.7494 
## F-statistic: 69.13 on 18 and 392 DF,  p-value: < 2.2e-16

The inclusion of these terms improved our adjusted R-squared to 0.7494.
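
To make the meaning of a dichotomous-by-quantitative interaction concrete, the sketch below fits one on synthetic data, both as a hand-built product column (as above) and with R's formula operator `:`. All names and the simulated coefficients are assumptions for illustration only.

```r
set.seed(3)
n <- 300
x <- runif(n, 3, 9)            # quantitative predictor
d <- rbinom(n, 1, 0.5)         # dichotomous predictor (0/1)
y <- 1 + 2 * x + 3 * d - 1.5 * d * x + rnorm(n, sd = 0.5)
toy <- data.frame(x, d, y)

# Manual product column, as in the text
toy$dx     <- toy$d * toy$x
fit_manual <- lm(y ~ x + d + dx, data = toy)

# Formula interaction: x:d adds only the product term
fit_formula <- lm(y ~ x + d + x:d, data = toy)

b <- coef(fit_formula)
slope_d0 <- b[["x"]]               # slope of x when d = 0
slope_d1 <- b[["x"]] + b[["x:d"]]  # slope of x when d = 1

c(slope_d0 = slope_d0, slope_d1 = slope_d1)
```

The interaction coefficient is the difference in slope between the two groups. In the model above, the negative interaction_term1 estimate therefore means that price falls faster with logged MRT distance when convenience_store_nearby is 1.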

Interpreting the Coefficients

Our model currently estimates 18 predictor terms. In addition, home_age_bin_41_50_years is listed as NA in the model output because it is perfectly collinear with other terms: is_new together with the remaining age-bin dummies already determines it. Of the estimated terms, home_age_bin_6_10_years, x6_longitude and x4_number_of_convenience_stores have high p-values (all above 0.3), suggesting that they are not statistically significant. To potentially improve our model, we remove the NA term along with x6_longitude and x4_number_of_convenience_stores and see what impact this has on our adjusted R-squared.

mod6 <- mod5 %>% 
  select(-c("home_age_bin_41_50_years",
               "x4_number_of_convenience_stores",
               "x6_longitude"))


lm6 <- lm(y_house_price_of_unit_area~., data=mod6)
summary(lm6)
## 
## Call:
## lm(formula = y_house_price_of_unit_area ~ ., data = mod6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.1317  -4.0328  -0.3845   3.4647  30.6297 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -8.117e+03  8.115e+02 -10.002  < 2e-16
## x2_house_age                           -9.093e-01  2.365e-01  -3.845 0.000140
## x3_distance_to_the_nearest_mrt_station  5.757e-03  1.414e-03   4.071 5.67e-05
## x5_latitude                             3.235e+02  3.255e+01   9.936  < 2e-16
## log_mrt_distance                        4.351e+01  9.684e+00   4.493 9.22e-06
## is_new                                 -2.623e+01  1.015e+01  -2.584 0.010138
## convenience_store_nearby                4.274e+01  1.103e+01   3.874 0.000125
## home_age_bin_6_10_years                 2.038e+00  1.728e+00   1.180 0.238835
## home_age_bin_11_15_years               -2.980e+01  7.132e+00  -4.178 3.63e-05
## home_age_bin_16_20_years               -2.622e+01  6.209e+00  -4.222 3.01e-05
## home_age_bin_21_25_years               -2.207e+01  5.398e+00  -4.089 5.26e-05
## home_age_bin_26_30_years               -2.013e+01  4.234e+00  -4.754 2.81e-06
## home_age_bin_31_35_years               -1.584e+01  3.231e+00  -4.902 1.39e-06
## home_age_bin_36_40_years               -7.212e+00  2.842e+00  -2.538 0.011541
## square_mrt_dist                        -4.016e+00  8.115e-01  -4.949 1.10e-06
## interaction_term1                      -6.507e+00  1.839e+00  -3.539 0.000449
## interaction_term2                      -1.389e-01  6.256e-02  -2.220 0.027015
##                                           
## (Intercept)                            ***
## x2_house_age                           ***
## x3_distance_to_the_nearest_mrt_station ***
## x5_latitude                            ***
## log_mrt_distance                       ***
## is_new                                 *  
## convenience_store_nearby               ***
## home_age_bin_6_10_years                   
## home_age_bin_11_15_years               ***
## home_age_bin_16_20_years               ***
## home_age_bin_21_25_years               ***
## home_age_bin_26_30_years               ***
## home_age_bin_31_35_years               ***
## home_age_bin_36_40_years               *  
## square_mrt_dist                        ***
## interaction_term1                      ***
## interaction_term2                      *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.385 on 394 degrees of freedom
## Multiple R-squared:  0.7597, Adjusted R-squared:   0.75 
## F-statistic: 77.86 on 16 and 394 DF,  p-value: < 2.2e-16

By removing those variables, we have slightly improved our adjusted R-squared, from 0.7494 to 0.75. Our new model includes 16 predictor terms. Given our final model, with coefficients rounded as in the output above, the equation for our linear regression model is:

\[
\begin{aligned}
\hat{y} ={} & -8117 - 0.9093\,(\text{x2\_house\_age}) + 0.005757\,(\text{x3\_distance\_to\_the\_nearest\_mrt\_station}) + 323.5\,(\text{x5\_latitude}) \\
& + 43.51\,\ln(\text{x3\_distance\_to\_the\_nearest\_mrt\_station}) - 26.23\,(\text{is\_new}) + 42.74\,(\text{convenience\_store\_nearby}) \\
& + 2.038\,(\text{home\_age\_bin\_6\_10\_years}) - 29.80\,(\text{home\_age\_bin\_11\_15\_years}) - 26.22\,(\text{home\_age\_bin\_16\_20\_years}) \\
& - 22.07\,(\text{home\_age\_bin\_21\_25\_years}) - 20.13\,(\text{home\_age\_bin\_26\_30\_years}) - 15.84\,(\text{home\_age\_bin\_31\_35\_years}) \\
& - 7.212\,(\text{home\_age\_bin\_36\_40\_years}) - 4.016\,[\ln(\text{x3\_distance\_to\_the\_nearest\_mrt\_station})]^2 \\
& - 6.507\,(\text{convenience\_store\_nearby})\,\ln(\text{x3\_distance\_to\_the\_nearest\_mrt\_station}) \\
& - 0.1389\,(\text{is\_new})\,[\ln(\text{x3\_distance\_to\_the\_nearest\_mrt\_station})]^2
\end{aligned}
\]
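
Because convenience_store_nearby appears both on its own and in interaction_term1, its effect on the predicted price depends on the MRT distance. The sketch below works this out from the rounded lm6 coefficients, holding everything else fixed. The helper name store_effect and the example distances of 100 and 1000 are illustrative choices.

```r
# Coefficients as printed (rounded) in summary(lm6)
b_store       <- 42.74    # convenience_store_nearby
b_store_x_log <- -6.507   # interaction_term1 = convenience_store_nearby * log_mrt_distance

# Change in predicted price when convenience_store_nearby goes from 0 to 1,
# at a given MRT distance in the dataset's units, all else held fixed
store_effect <- function(distance) {
  b_store + b_store_x_log * log(distance)
}

effect_100  <- store_effect(100)
effect_1000 <- store_effect(1000)
break_even  <- exp(-b_store / b_store_x_log)  # distance where the effect is zero

round(c(at_100 = effect_100, at_1000 = effect_1000, break_even = break_even), 2)
```

Under these estimates, having 5 or more convenience stores nearby is associated with a higher predicted price close to an MRT station, and the association shrinks and eventually changes sign as the distance grows. The same reasoning applies to is_new, whose effect combines its own coefficient, the age-bin dummies, and interaction_term2.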

Model Evaluation

The following plots allow us to examine the residuals to determine whether our model is adequate. Looking at the Q-Q plot, we see that the residuals follow the normal line out to about two standard deviations. Beyond that, however, we can detect the presence of outliers. Additionally, there are no glaring patterns in the plot of residuals against fitted values that would cause us to reject the model, although this chart also shows some clear outliers in the residuals.

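# Residuals vs. fitted values for the final model
plot(fitted(lm6), resid(lm6))
abline(h=0, lty="dashed")

# Normal Q-Q plot of the residuals
qqnorm(resid(lm6))
qqline(resid(lm6))

# Standard lm diagnostic plots in a 2 x 2 grid
par(mfrow=c(2,2))
plot(lm6)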
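
The visual checks can be complemented with a numeric one. The sketch below is run on synthetic data with one injected outlier, not on the housing model. It flags observations with large standardized residuals and applies a Shapiro-Wilk normality test, both from base R. The cutoff of 3 is a common rule of thumb rather than a fixed standard.

```r
set.seed(7)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[10] <- y[10] + 15          # inject one gross outlier at row 10
fit <- lm(y ~ x)

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 3)   # a common rule-of-thumb cutoff
flagged

# Shapiro-Wilk test of residual normality (small p-value suggests non-normal residuals)
sw <- shapiro.test(resid(fit))
sw$p.value
```

Applying rstandard() and shapiro.test() to the fitted housing model would identify which rows produce the outlying residuals seen in the plots, and whether the departure from normality in the tails is statistically detectable.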

Conclusion

I believe that the linear model is relatively appropriate for this dataset. The residual plots suggest that the residuals are approximately normally distributed, apart from a few outliers in the tails, and show no obvious patterns. Additionally, the model has an adjusted R-squared of 0.75, which is relatively strong, and the F-statistic for the model is statistically significant. Finally, with the exception of the home_age_bin_6_10_years variable, all remaining variables were significant at an alpha level of 0.05.