Background to the Housing Data Assignment

In 1964, economist William Alonso wrote a book on the bid-rent function of housing value. The central proposition went something like this: households optimise the size of their land ownership against the cost of commuting to work (i.e. to pay for their land ownership). Some households demand more land and therefore are prepared to have higher commuting costs and other households do not require as much land and will live closer to the city centre. This, of course, is a gross oversimplification of how house prices are determined, as the quality of the house (e.g. materials, size, aesthetics and features) and the quality of the neighbourhood (e.g. access to commercial activities, open space, types of housing offered) have a bearing on the implicit value of the lot.

In this assignment, we will not try to solve all the issues related to estimating the price of a house, but we will see if we can determine the values of housing features and whether bigger lots are further from the city, as well as if there is a diminishing value (after controlling for lot size and housing features) of property the further we move away from the city. Our case study uses property sales data from Perth in 2013, which has been supplied as HousingDataForAssignment.csv.

The Assignment

Complete the steps below to complete the assignment.

Firstly, set the appropriate working directory for this project (i.e. set the working directory to the project location) in RStudio, before proceeding to the next steps.

Part A: Preliminary Data Preparation

A1: Open the HousingDataForAssignment.csv data file using R and read it into a data frame.

#Set the working directory
setwd("~/Desktop/Assignment1_LMS 5/Assignment1_LMS")
getwd()
## [1] "/Users/jamesmacbook/Desktop/Assignment1_LMS 5/Assignment1_LMS"
##Read into data frame
Houses_df <- read.csv("/Users/jamesmacbook/Desktop/Assignment1_LMS 5/Assignment1_LMS/Data/HousingDataForAssignment.csv")

A2: Select 1200 observations from the data frame using a random number generator, with your student number as the seed. Place these 1200 observations into a new data frame named MyHousingData. You will work with this data only and you can remove (rm()) the data frame you created in the previous step.

#Set seed as the number generator
set.seed(23119479)
#Create index of 1200 
Data <- sample(1:nrow(Houses_df),1200)
#Place into data frame with variables
MyHousingData <- Houses_df[Data, ]
##View the data frame to 'eye off' the data and get a feel for what it looks like
View(MyHousingData)
##Remove data frame 
rm(Houses_df)

A3: Create factors for any variables that require them.

#Create factors for variables and make SALE as numeric. 
MyHousingData$Date <- as.factor(MyHousingData$Date)
MyHousingData$SUBURB <- as.factor(MyHousingData$SUBURB)
MyHousingData$Date <- as.factor(MyHousingData$Date)
MyHousingData$BRICKWALL <- as.factor(MyHousingData$BRICKWALL)
MyHousingData$DETACHED <- as.factor(MyHousingData$DETACHED)
MyHousingData$PERIOD_INDEX <- as.factor(MyHousingData$PERIOD_INDEX)
MyHousingData$BEDS <- as.factor(MyHousingData$BEDS)
MyHousingData$BATHS_TOT <- as.factor(MyHousingData$BATHS_TOT)
MyHousingData$LIVING_TOT <- as.factor(MyHousingData$LIVING_TOT)
MyHousingData$CARPORTT <- as.factor(MyHousingData$CARPORTT)
MyHousingData$POOL <- as.factor(MyHousingData$POOL)
MyHousingData$GARAGET <- as.factor(MyHousingData$GARAGET)
MyHousingData$TILEDROOF <- as.factor(MyHousingData$TILEDROOF)
MyHousingData$VILLA_UNIT <- as.factor(MyHousingData$VILLA_UNIT)
MyHousingData$SALE <- as.numeric(MyHousingData$SALE)
##Check structure of dataframe has changed
str(MyHousingData)
## 'data.frame':    1200 obs. of  35 variables:
##  $ ID_SEQ               : int  109349 124793 94193 125228 118283 98325 85549 96087 120589 115189 ...
##  $ SUBURB               : Factor w/ 211 levels "ALEXANDER HEIGHTS",..: 76 152 133 171 77 78 184 152 185 147 ...
##  $ SALE                 : num  305000 390000 369000 492000 605000 350000 750000 370000 510000 480000 ...
##  $ Date                 : Factor w/ 177 levels "1/03/2013","1/04/2013",..: 86 35 32 46 152 62 66 86 58 45 ...
##  $ PERIOD_INDEX         : Factor w/ 2 levels "3","5": 1 2 1 2 2 1 1 1 2 2 ...
##  $ DETACHED             : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 1 ...
##  $ VILLA_UNIT           : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 1 2 ...
##  $ BRICKWALL            : Factor w/ 3 levels "#NULL!","0","1": 3 3 3 3 3 3 3 3 3 3 ...
##  $ TILEDROOF            : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 2 2 2 ...
##  $ LAND_AREA            : int  101 247 544 492 923 469 308 218 603 128 ...
##  $ AREA_HSE             : int  79 98 174 193 230 163 121 101 196 128 ...
##  $ AGE                  : int  43 11 5 15 11 9 20 6 11 29 ...
##  $ BEDS                 : Factor w/ 7 levels "0","1","2","3",..: 3 4 4 5 4 5 3 4 5 4 ...
##  $ BATHS_TOT            : Factor w/ 5 levels "0","1","2","3",..: 2 3 3 3 3 3 2 3 3 2 ...
##  $ LIVING_TOT           : Factor w/ 8 levels "1","2","3","4",..: 3 4 4 6 5 5 4 4 4 4 ...
##  $ CARPORTT             : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 3 1 1 2 ...
##  $ GARAGET              : Factor w/ 5 levels "0","1","2","3",..: 1 3 3 3 3 3 1 3 3 1 ...
##  $ POOL                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Lat_centroid         : num  -31.9 -31.9 -32.5 -31.9 -32.4 ...
##  $ Long_centroid        : num  116 116 116 116 116 ...
##  $ D_CBD_Drive_Km       : num  6.95 13.16 67.07 14.11 60.88 ...
##  $ D_CBD_PT_Km          : num  8.1 10.7 77.9 12.2 62 ...
##  $ Time_CBD_Drive_Min   : num  8.03 16.8 51.9 16.08 45.92 ...
##  $ Time_CBD_PT_Min      : num  33.1 47.9 95.8 16.1 77.2 ...
##  $ Commercial           : num  1.38 0.27 0.43 1.11 0.23 0.58 3.93 0.27 0.93 0.65 ...
##  $ Education            : num  1.38 1.34 1.07 1.84 0.23 0.65 0 1.34 1.56 1.31 ...
##  $ Health               : num  1.38 1.07 0.64 1.11 0.23 0.52 0 1.07 0.93 0.98 ...
##  $ Manufacturingindustry: num  0 0 0 0 0 0.06 0 0 0 0 ...
##  $ Office               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recreation           : num  4.15 5.36 5.56 3.69 3.17 2.78 8.52 5.36 6.84 6.86 ...
##  $ Residential          : num  0.49 0.77 0.46 0.61 0.32 0.49 0.51 0.77 0.67 0.77 ...
##  $ Retailoutlets        : num  1.38 0.27 0.43 0 0.23 0.32 1.31 0.27 0.62 0.33 ...
##  $ natural_elements     : num  0 0 0 0.02 1.32 1.33 0 0 0 0.08 ...
##  $ walk_score           : num  54.5 49.8 28.1 34 10 ...
##  $ bikeroute_total      : num  4.87 3.71 0.06 4.56 1.26 2.78 6.57 3.71 5.7 3.67 ...

A4: Split the observations in your MyHousingData data frame into a training set of 1000 observations and a test set of 200 observations.

if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(pacman, tidyverse, EnvStats,psych, caret)
library(caret)

##Create Index for training then create data frames for testing and training set
training <- sample(1:nrow(MyHousingData),1000)
train_df <- MyHousingData[training, ]
test_df <- MyHousingData[-training, ]
##training_df will be used from here 

Part B: Descriptive Statistics

From here on in, you will need to write a paragraph to describe each output.

B1: Your data set has nominal (factors) and ratio (numerical) variables. Create the appropriate numerical summary for each of them.

##Creating a different subset for character factors and numeric for data wrangling if needed
factor_df=subset(train_df, select=c("SUBURB","Date", "BRICKWALL","DETACHED", "PERIOD_INDEX", "BATHS_TOT", "LIVING_TOT", "CARPORTT", "POOL", "GARAGET", "TILEDROOF", "VILLA_UNIT", "BEDS"))

##Numeric Data frame for only numerice factors, this can be used for data wrangling
numeric_df=subset(train_df, select=c("SALE", "LAND_AREA", "AREA_HSE", "AGE", "Lat_centroid", "Long_centroid", "D_CBD_Drive_Km", "D_CBD_PT_Km", "Time_CBD_Drive_Min", "Time_CBD_PT_Min", "Commercial", "Education", "Health", "Manufacturingindustry", "Office", "Recreation", "Residential", "Retailoutlets", "natural_elements", "walk_score", "bikeroute_total"))

##Summarise data, we could use below function or simply summary(MyHousingData)
summary(factor_df)
##         SUBURB            Date      BRICKWALL   DETACHED PERIOD_INDEX BATHS_TOT
##  ELLENBROOK: 53   13/05/2013: 16   #NULL!:  2   0:481    3:544        0:  2    
##  NOLLAMARA : 30   20/05/2013: 15   0     : 23   1:519    5:456        1:290    
##  BALDIVIS  : 25   22/05/2013: 15   1     :975                         2:698    
##  MORLEY    : 25   30/04/2013: 15                                      3:  8    
##  MAYLANDS  : 23   27/05/2013: 13                                      4:  2    
##  BUTLER    : 22   29/04/2013: 13                                               
##  (Other)   :822   (Other)   :913                                               
##    LIVING_TOT  CARPORTT POOL    GARAGET TILEDROOF VILLA_UNIT BEDS   
##  4      :354   0:601    0:935   0:452   0:161     0:640      0: 19  
##  3      :222   1:223    1: 65   1: 88   1:839     1:360      1: 10  
##  5      :210   2:176            2:448                        2:119  
##  6      : 97                    3: 11                        3:459  
##  2      : 80                    4:  1                        4:376  
##  7      : 28                                                 5: 15  
##  (Other):  9                                                 6:  2
summary(numeric_df)
##       SALE          LAND_AREA         AREA_HSE          AGE        
##  Min.   :110000   Min.   : 100.0   Min.   : 71.0   Min.   :  0.00  
##  1st Qu.:374250   1st Qu.: 219.0   1st Qu.:106.0   1st Qu.:  6.00  
##  Median :445000   Median : 408.0   Median :137.0   Median : 11.00  
##  Mean   :455463   Mean   : 406.4   Mean   :146.6   Mean   : 15.78  
##  3rd Qu.:526500   3rd Qu.: 569.2   3rd Qu.:184.0   3rd Qu.: 21.00  
##  Max.   :800000   Max.   :1031.0   Max.   :358.0   Max.   :102.00  
##                                                                    
##   Lat_centroid    Long_centroid   D_CBD_Drive_Km   D_CBD_PT_Km   
##  Min.   :-32.63   Min.   :115.6   Min.   : 0.00   Min.   : 2.97  
##  1st Qu.:-32.08   1st Qu.:115.8   1st Qu.:14.07   1st Qu.:13.03  
##  Median :-31.94   Median :115.8   Median :23.72   Median :23.28  
##  Mean   :-31.98   Mean   :115.9   Mean   :27.60   Mean   :27.75  
##  3rd Qu.:-31.84   3rd Qu.:115.9   3rd Qu.:36.66   3rd Qu.:32.19  
##  Max.   :-31.54   Max.   :116.2   Max.   :85.24   Max.   :89.38  
##                                                   NA's   :5      
##  Time_CBD_Drive_Min Time_CBD_PT_Min    Commercial        Education    
##  Min.   : 6.28      Min.   :  6.28   Min.   : 0.0000   Min.   :0.000  
##  1st Qu.:16.80      1st Qu.: 38.38   1st Qu.: 0.2400   1st Qu.:0.560  
##  Median :24.72      Median : 53.92   Median : 0.4700   Median :1.025  
##  Mean   :26.68      Mean   : 53.06   Mean   : 0.8112   Mean   :1.112  
##  3rd Qu.:33.87      3rd Qu.: 70.03   3rd Qu.: 0.9100   3rd Qu.:1.420  
##  Max.   :67.25      Max.   :123.33   Max.   :46.5700   Max.   :5.160  
##  NA's   :5          NA's   :5                                         
##      Health       Manufacturingindustry     Office          Recreation    
##  Min.   :0.0000   Min.   :0.00000       Min.   :0.00000   Min.   : 0.090  
##  1st Qu.:0.4700   1st Qu.:0.00000       1st Qu.:0.00000   1st Qu.: 3.090  
##  Median :0.6300   Median :0.00000       Median :0.00000   Median : 4.420  
##  Mean   :0.7514   Mean   :0.08174       Mean   :0.03403   Mean   : 4.609  
##  3rd Qu.:0.9800   3rd Qu.:0.10000       3rd Qu.:0.00000   3rd Qu.: 5.750  
##  Max.   :5.1600   Max.   :0.91000       Max.   :1.99000   Max.   :15.620  
##                                                                           
##   Residential     Retailoutlets    natural_elements    walk_score   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.000   Min.   : 3.69  
##  1st Qu.:0.3700   1st Qu.:0.1000   1st Qu.:  0.000   1st Qu.:28.84  
##  Median :0.5800   Median :0.2700   Median :  0.160   Median :41.79  
##  Mean   :0.5357   Mean   :0.3521   Mean   :  1.609   Mean   :40.97  
##  3rd Qu.:0.7100   3rd Qu.:0.4600   3rd Qu.:  1.070   3rd Qu.:51.71  
##  Max.   :0.9400   Max.   :4.4300   Max.   :110.560   Max.   :96.50  
##                                                                     
##  bikeroute_total
##  Min.   :0.000  
##  1st Qu.:2.160  
##  Median :3.710  
##  Mean   :3.722  
##  3rd Qu.:5.020  
##  Max.   :8.650  
## 

From the training data frame there are 1000 observations 35 variables or columns. The data represents houses that have been sold and describes details around the sale, including a unique identifier, sale price of property, the date it was sold, suburb it was sold in, distance from city and other key destinations. Some insights derived from the data are as follows, suburb with the most houses sold is Ellenbrook with 53 houses sold, followed by Nollamara 30, and lastly, Baldivis 25. The mean sale price of the sample was $455463 with the maximum sale price of three properties $800000, two in Illuka and one in Wembley. The minimum house price is $110000, a villa in Applecross and one in Hammersely, another four properties in Hammersely were sold for $120000-$128000. The majority of properties have two bathrooms (n=698). In regard to bedrooms, most houses have 3 (n=459) and 10 houses have one.376 houses have 4 bedrooms. Only 65 properties have a pool, 935 do not have a pool. 27.60km is the mean distance from Perth CBD, the average time in drive is 26 minutes. There are many other variables that can be analysed and compared. You could represent this data as percentages rather than total counts. Further analysis using hypothesis tests and regression models will be used to test relationships and correlation.

B2: Create scatter plots for (1) lot-size, (2) distance from the CBD and (3) price.

##Creating plot for House price versus land area. created a new value by dividing house price by 1000 to stop the plot going into scientific notation. Edited scales and added title for aesthetic.
if (!require("ggplot2")) install.packages("ggplot2")
library(dplyr)

Property_Price <- train_df$SALE/1000
ggplot(train_df, aes(x=LAND_AREA, y=Property_Price)) +
  geom_point(size=1.5, shape=16) +
  scale_y_continuous(labels = scales::dollar, breaks = seq(0, 1000, by = 150))+
  scale_x_continuous(breaks = seq(0, 1200, by = 200)) + 
  labs(x="Land Area (sqm)",
       y = "Sale Price ($Hundreds of Thousands)") +
  ggtitle("Sale Price compared to Land Area")

##Created plot to for drive to CBD compared to Land Area, add titles and labels.
ggplot(train_df, aes(x=D_CBD_Drive_Km, y=LAND_AREA)) +
  geom_point(size=1.5, shape=16) +
  labs(x="Distance In Kilometres",
       y="Land Area (sqm)") +
  ggtitle("Drive to CBD compared to Land Area")

##Create plot comparing distance and property price, add titles, labels and breaks.
ggplot(train_df, aes(x=D_CBD_Drive_Km, y=Property_Price)) +
  geom_point(size=1.5, shape=18) +
  scale_y_continuous(labels = scales::dollar, breaks = seq(0, 1000, by = 150)) +
  labs(x="Distance from City (Kilometres)",
       y="Property Price (Hundreds of Thousands)") +
  ggtitle("Price compared to Drive")

We can see from scatter plot 1 that the correlation when tested could be low. As many data points have different Y values are the same as the X Values. Visually, we can see most of the data falls between the $300 - 600k house price mark as well as falling between 100sqm - 600sqm area range. Outside those ranges we can see that there are fewer data points and some outliers. When considering house prices this is logical as there a fewer suburbs with higher price tags and expensive houses. A line could also be used by coding geom_abline().

B3: Estimate the bi-variate correlation coefficients for these variables in the previous question. Are they significantly different from zero? Address this in your paragraph of explanation.

cor(train_df$LAND_AREA, train_df$SALE)
## [1] 0.1766924
cor(train_df$D_CBD_Drive_Km, train_df$SALE)
## [1] -0.2272232
cor(train_df$AREA_HSE,train_df$D_CBD_Drive_Km)
## [1] 0.4211321

The answer to the first correlation is R = 0.1767 (rounded to 4 decimals). This indicates that the relationship between price and land area is extremely weak. The second correlation R= -0.2272, this shows there is no relationship between kilometres from the CBD and the property price. The third correlation R=0.4211, this is still a weaker correlation, but shows a greater relationship than the other tests.

Part C: Preliminary Inferential Statistics

C1: Compute conditional mean prices for houses, where the conditions are (1) number of bedrooms, (2) number of bathrooms and (3) the presence of a swimming pool.

##Install packages if they are needed.
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)

##Create new data frame for conditional means. We are separating the data and grouping it by number of beds and asking R to output the mean price for each level within the factor BEDS. Print the data frame to look at results. Can also use head() or view().
C1_BEDS_temp2 <- group_by(train_df, BEDS)
C1_BED_temp <- summarise(C1_BEDS_temp2,
                             bed_mean_temp = mean(SALE),
                             n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
print(C1_BED_temp)
##   BEDS bed_mean_temp   n
## 1    0      462157.9  19
## 2    1      445450.0  10
## 3    2      424066.4 119
## 4    3      443654.7 459
## 5    4      475659.6 376
## 6    5      528560.7  15
## 7    6      675000.0   2
C1_BATHS_temp2 <- group_by(train_df, BATHS_TOT)
C1_BATH_temp <- summarise(C1_BATHS_temp2,
                             BATH_mean_temp = mean(SALE),
                             n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
print(C1_BATH_temp)
##   BATHS_TOT BATH_mean_temp   n
## 1         0       362500.0   2
## 2         1       412058.3 290
## 3         2       471108.9 698
## 4         3       646125.0   8
## 5         4       619205.0   2
C1_POOL_temp2 <- group_by(train_df, POOL)
C1_POOL_temp <- summarise(C1_POOL_temp2,
                             POOL_mean_temp = mean(SALE),
                             n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
print(C1_POOL_temp)
##   POOL POOL_mean_temp   n
## 1    0       451073.7 935
## 2    1       518606.3  65

Paragraph: C2 The analysis for the conditional means for bed count displays an interesting result to interpret. 19 houses with no beds have a mean price of $462157.9. Potentially this is a data integrity issues and this data should have been removed initially. This data could however, may represent properties that are purely land or projects that have not been finished yet. Properties with 1 bedroom have a mean price of $445450 which more than the mean of properties with two bedrooms which is $445450. There may be other factors affecting this or it may simply be that there are only 119 properties with one bedroom, therefore outliers will affect the mean more adversely. From 2 bedrooms onwards, the average price increases for every bedroom extra. This is naturally what would be expected.

For bathrooms, there are only 2 properties that have no bathrooms. Again this could be land, a development or data integrity. On average properties with 2 bathrooms are worth $447110, add a third bathroom and the mean price increases considerably to $646125. This is a difference of $199015. Although a large difference it should be noted that the extra bathroom is not causing this price increase, but it could be related. Further analysis would need to be undertaken to understand this.

Assuming that 0 represents no pool and that 1 represent a pool. We can determine based on data provided that properties that have pools, on average, are worth $67,533 more.

C2: Use the appropriate test (ANOVA or t-test) to make a preliminary assessment on whether price is dependent on (1) the number of bedrooms; (2) the number of bathrooms; (3) the presence of a swimming pool.

# Compute the analysis of variance
Anova_Combined <- lm(SALE ~ BEDS + BATHS_TOT + POOL, data=train_df)
anova(Anova_Combined) 
confint(Anova_Combined, level=0.95) 
##                   2.5 %     97.5 %
## (Intercept)  206885.635 518114.365
## BEDS1       -133523.089  42217.975
## BEDS2       -125992.488 -11263.420
## BEDS3       -123660.230 -12616.942
## BEDS4       -118589.812  -3720.886
## BEDS5       -100750.239  60401.981
## BEDS6       -120498.563 220833.703
## BATHS_TOT1   -53131.330 275896.036
## BATHS_TOT2     3288.131 334277.935
## BATHS_TOT3   126284.243 494619.615
## BATHS_TOT4    24923.891 485939.028
## POOL1         15687.670  75172.126

ANOVA must be used as t.test is for two variables only. The null hypothesis for the test is that all means are the same and the alternative hypothesis is that at least one of the means are different. All the means are significantly different and we can reject the null hypothesis at the 5% level.

Part D: Regression Models

D1: Run simple linear regressions for price as a function of (1) distance from the city and (2) lot-size.

##This is the equation with results for price/kilometres
Model_Kilometres <- lm(Property_Price ~ D_CBD_Drive_Km, data=train_df)
##Use summary instead of ANOVA, only one dependant variable
summary(Model_Kilometres)
## 
## Call:
## lm(formula = Property_Price ~ D_CBD_Drive_Km, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -373.84  -79.15   -4.94   67.45  351.16 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     496.973      6.694  74.246  < 2e-16 ***
## D_CBD_Drive_Km   -1.504      0.204  -7.371 3.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.4 on 998 degrees of freedom
## Multiple R-squared:  0.05163,    Adjusted R-squared:  0.05068 
## F-statistic: 54.33 on 1 and 998 DF,  p-value: 3.548e-13
##Plotting the model
ggplot(train_df, aes(x=D_CBD_Drive_Km, y=Property_Price)) + 
  geom_point()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

##Equation for Sale versus Land Area
Model_LotSize <- lm(SALE ~ LAND_AREA, data=train_df)
summary(Model_LotSize)
## 
## Call:
## lm(formula = SALE ~ LAND_AREA, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -346227  -80097  -15662   68083  351498 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 414695.65    8065.32  51.417  < 2e-16 ***
## LAND_AREA      100.32      17.69   5.671 1.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115600 on 998 degrees of freedom
## Multiple R-squared:  0.03122,    Adjusted R-squared:  0.03025 
## F-statistic: 32.16 on 1 and 998 DF,  p-value: 1.857e-08
#The model
ggplot(train_df, aes(x=LAND_AREA, y=Property_Price)) +
  geom_point()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Comparing both models, we can see both land area and driving distance from the CBD are significant at the 0.0001 level. For driving distance the error is slighlty larger than the error for the size of the land. The variance in price is only explained by 3% by Land area and 5% by distance by driving to the CBD. This is a low amount of variance explained for price.

If the linear regression model is used we can determine how much land area contributes to price. The intercept for land price is 100.3 from this we can deduce that for every 1 square metre in land size that increases, it adds an extra $100.3 in price.

Due to the low correlation the model is not effective for predicting house price, to get a better model we will need to add more factors that are significant. This is a logical answer as we know that there are more factors that affect a house price rather than simply distance.

D2: Run a multiple linear regression of price as a function of distance from the city, lot size and the appropriate housing attributes.

Model_MLR<- lm(SALE ~ LAND_AREA + D_CBD_Drive_Km + BEDS + BATHS_TOT + POOL + GARAGET + AGE + AREA_HSE + Time_CBD_Drive_Min, data=train_df)
summary(Model_MLR)
## 
## Call:
## lm(formula = SALE ~ LAND_AREA + D_CBD_Drive_Km + BEDS + BATHS_TOT + 
##     POOL + GARAGET + AGE + AREA_HSE + Time_CBD_Drive_Min, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -556195  -47357   -9223   40716  332135 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        271013.964  62103.614   4.364 1.41e-05 ***
## LAND_AREA               1.685     23.419   0.072 0.942658    
## D_CBD_Drive_Km       1014.600    607.801   1.669 0.095380 .  
## BEDS1               52103.697  34780.070   1.498 0.134434    
## BEDS2              -12179.130  24644.662  -0.494 0.621284    
## BEDS3              -43484.970  22984.329  -1.892 0.058795 .  
## BEDS4              -82948.663  24232.664  -3.423 0.000645 ***
## BEDS5              -69028.479  33688.176  -2.049 0.040726 *  
## BEDS6               33425.851  67131.873   0.498 0.618658    
## BATHS_TOT1          83279.901  64126.015   1.299 0.194357    
## BATHS_TOT2         126466.606  64970.822   1.947 0.051880 .  
## BATHS_TOT3         130062.500  72649.331   1.790 0.073720 .  
## BATHS_TOT4          80543.680  90504.988   0.890 0.373720    
## POOL1                7661.621  12087.128   0.634 0.526317    
## GARAGET1            14403.651  10419.361   1.382 0.167168    
## GARAGET2             4176.116   7753.079   0.539 0.590259    
## GARAGET3            42052.613  27115.956   1.551 0.121264    
## GARAGET4            97270.846  89560.132   1.086 0.277706    
## AGE                   579.133    306.894   1.887 0.059447 .  
## AREA_HSE             1684.322    103.312  16.303  < 2e-16 ***
## Time_CBD_Drive_Min  -6207.103    862.639  -7.195 1.24e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84960 on 974 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.4874, Adjusted R-squared:  0.4768 
## F-statistic:  46.3 on 20 and 974 DF,  p-value: < 2.2e-16
anova(Model_MLR)
confint(Model_MLR, level=0.95)
##                            2.5 %       97.5 %
## (Intercept)         149141.67380 392886.25428
## LAND_AREA              -44.27179     47.64162
## D_CBD_Drive_Km        -178.15070   2207.35068
## BEDS1               -16148.79996 120356.19487
## BEDS2               -60541.87708  36183.61766
## BEDS3               -88589.47553   1619.53472
## BEDS4              -130502.90480 -35394.42071
## BEDS5              -135138.24315  -2918.71568
## BEDS6               -98313.90788 165165.61023
## BATHS_TOT1          -42561.15627 209120.95732
## BATHS_TOT2           -1032.30125 253965.51391
## BATHS_TOT3          -12504.73236 272629.73235
## BATHS_TOT4          -97063.54062 258150.90022
## POOL1               -16058.19047  31381.43327
## GARAGET1             -6043.33107  34850.63209
## GARAGET2            -11038.54619  19390.77736
## GARAGET3            -11159.80799  95265.03417
## GARAGET4            -78482.18697 273023.87808
## AGE                    -23.11646   1181.38171
## AREA_HSE              1481.58154   1887.06249
## Time_CBD_Drive_Min   -7899.94773  -4514.25893

The data output for the multiple regression has been determined on the factors that will most likely affect price. We can see from the output, that in fact not many of the factors are significant. With our 20 degrees of freedom the model explains only 47.3% of the variance in price, using the multiple R squared.

Houses with 4 and 5 bedrooms have a significant affect on price. House area and drive time to the CBD also are significant at the 5% confidence level. To improve the model the factors that are below 5% confidence level should be removed.

D3: Note you have detached housing and villas/units delineated in your data. Run a regression for each of the two segments and comment on the differences in parameters.

##Run model based on multiple attributes then run second model
Villas_Regression <- train_df[train_df$VILLA_UNIT==1,]
Villas_Regression_Model <- lm(SALE ~ AREA_HSE + LAND_AREA + D_CBD_Drive_Km  + Time_CBD_Drive_Min + BATHS_TOT + TILEDROOF + LIVING_TOT + CARPORTT + AGE + Education + Health + Retailoutlets + BEDS, data = Villas_Regression)
summary(Villas_Regression_Model)
## 
## Call:
## lm(formula = SALE ~ AREA_HSE + LAND_AREA + D_CBD_Drive_Km + Time_CBD_Drive_Min + 
##     BATHS_TOT + TILEDROOF + LIVING_TOT + CARPORTT + AGE + Education + 
##     Health + Retailoutlets + BEDS, data = Villas_Regression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531757  -49714   -3323   53622  262288 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        300155.33   92648.94   3.240  0.00132 ** 
## AREA_HSE             1847.64     239.53   7.714 1.42e-13 ***
## LAND_AREA             125.66      46.94   2.677  0.00779 ** 
## D_CBD_Drive_Km       2145.79    1402.01   1.531  0.12684    
## Time_CBD_Drive_Min  -8421.86    1857.91  -4.533 8.11e-06 ***
## BATHS_TOT1          60635.59   78230.05   0.775  0.43883    
## BATHS_TOT2          86921.70   80605.02   1.078  0.28165    
## BATHS_TOT3          57925.70  126481.86   0.458  0.64727    
## TILEDROOF1         -29599.66   13957.32  -2.121  0.03468 *  
## LIVING_TOT2         -1266.85   50660.51  -0.025  0.98006    
## LIVING_TOT3         26156.58   52640.74   0.497  0.61959    
## LIVING_TOT4         11400.18   53649.46   0.212  0.83185    
## LIVING_TOT5        -28009.91   56969.36  -0.492  0.62328    
## LIVING_TOT6        -68205.45   72398.21  -0.942  0.34683    
## CARPORTT1           -9541.72   13859.81  -0.688  0.49165    
## CARPORTT2           44612.01   20467.67   2.180  0.02998 *  
## AGE                  -409.88     567.43  -0.722  0.47058    
## Education           10559.26   17322.70   0.610  0.54257    
## Health             -15973.34   21582.17  -0.740  0.45975    
## Retailoutlets        4841.16   12065.81   0.401  0.68851    
## BEDS1               59415.52   54309.31   1.094  0.27473    
## BEDS2                6966.81   46274.13   0.151  0.88042    
## BEDS3              -44088.13   43995.49  -1.002  0.31702    
## BEDS4              -86646.95   49262.20  -1.759  0.07951 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96890 on 334 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4212, Adjusted R-squared:  0.3814 
## F-statistic: 10.57 on 23 and 334 DF,  p-value: < 2.2e-16
Detached_Regression <- train_df[train_df$DETACHED==1, ] 
Detached_Regression_Model <- lm(SALE ~ BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT + BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km + Education + Health + Retailoutlets + Time_CBD_Drive_Min, data = Detached_Regression)
summary(Detached_Regression_Model)
## 
## Call:
## lm(formula = SALE ~ BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT + 
##     BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km + 
##     Education + Health + Retailoutlets + Time_CBD_Drive_Min, 
##     data = Detached_Regression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -219543  -38371   -9222   27331  285252 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         338209.95   97891.72   3.455 0.000598 ***
## BEDS2               241242.00   81952.72   2.944 0.003398 ** 
## BEDS3               176970.81   78444.70   2.256 0.024513 *  
## BEDS4               159619.13   78510.91   2.033 0.042585 *  
## BEDS5               173942.10   80787.32   2.153 0.031801 *  
## BEDS6               271083.38   93577.29   2.897 0.003938 ** 
## AREA_HSE              1661.05     107.90  15.394  < 2e-16 ***
## CARPORTT1           -25265.11   14137.36  -1.787 0.074539 .  
## CARPORTT2           -27469.81    7790.63  -3.526 0.000462 ***
## AGE                   2162.43     474.34   4.559 6.51e-06 ***
## BATHS_TOT2           49151.15   13689.76   3.590 0.000364 ***
## BATHS_TOT3           91924.51   32168.59   2.858 0.004451 ** 
## BATHS_TOT4            1815.18   50748.99   0.036 0.971482    
## BRICKWALL0          -56796.69   75950.23  -0.748 0.454932    
## BRICKWALL1          -75859.44   66261.79  -1.145 0.252835    
## POOL1                12934.06   10043.25   1.288 0.198413    
## LAND_AREA              -16.23      29.36  -0.553 0.580756    
## LIVING_TOT3        -114658.98   44211.28  -2.593 0.009788 ** 
## LIVING_TOT4        -140883.60   42015.36  -3.353 0.000861 ***
## LIVING_TOT5        -152371.97   42145.46  -3.615 0.000331 ***
## LIVING_TOT6        -163027.82   42524.79  -3.834 0.000143 ***
## LIVING_TOT7        -185978.04   44210.66  -4.207 3.08e-05 ***
## TILEDROOF1          -15853.32   11029.31  -1.437 0.151251    
## D_CBD_Drive_Km        -284.59     621.15  -0.458 0.647034    
## Education           -10931.42   12025.86  -0.909 0.363803    
## Health               11591.53   17380.61   0.667 0.505136    
## Retailoutlets       -17178.95   12326.20  -1.394 0.164045    
## Time_CBD_Drive_Min   -3988.75     915.31  -4.358 1.60e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65210 on 488 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.6591, Adjusted R-squared:  0.6402 
## F-statistic: 34.94 on 27 and 488 DF,  p-value: < 2.2e-16

The comparison between villa/unit and detached housing is interesting. Looking at the intercept for each regression mode. The y value, or expected mean value is $300155 for villa/units however for detached housing it is $338209, this meaning that all other variables being zero, detached houses cost $38054 more than villas and units. This is statistically correct and not physically possible; it is to demonstrate that detached houses should cost more in this model.

The model for villas show some factors that are significant, these are: tiled roof (0.03468), 2 carports (0.02998), distance in kilometres (8.11e-06) and house area (1.87e-10). The overall model is significant at the 1% level, the p-value being < 2.2e-16. The R-squared for the model is 0.4212 and the adjusted R-squared is 0.3814 meaning that 38% of the variance in price is explained by the factors. Given all the information discussed it will be necessary to remove factors that are not substantially increasing the R-Squared value.

The factors that significantly affect detached houses are properties with 2, 3, 4, 5, or 6 bedrooms, distance from the city in time and distance, house area, 2 carports and 2 and 3 bathrooms. Age has the greatest significance compared to price where the p-value = 6.51e-06. The effectiveness of the model is good; the multiple R squared is 0.6591 and the adjusted R squared is 0.6402. 64% of the factors explain the variance in price. The small different between the multiple R Squared and the adjusted R Squared also demonstrates the effectiveness of the model. The overall model is significant at the 0.0001 level with a p-value at < 2.2e-16.

Part E: My Regression Model

E1: This is your contribution. You will make sense of your preliminary work and estimate your best regression model. You need to diagnose the errors and validate your models’ performance on the test set of observations. It may take more than one chunk of code. Please remember to write a pargraph about each output.

#Start model
if(!require("car")) install.packages("car")
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:EnvStats':
## 
##     qqPlot
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(car)
if(!require("caret")) install.packages("caret")
library(caret)
#Create MLR for training
MyModel <- lm(SALE ~BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT + BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km + Education + Retailoutlets  + Time_CBD_Drive_Min + Health , data = train_df)
#save predictions
Predictor  <- MyModel %>% predict(test_df)
data.frame(
          squares_1 = RMSE(Predictor, test_df$SALE),
          Rsquared_1= R2(Predictor, test_df$SALE))
##analyse data
summary(MyModel)
## 
## Call:
## lm(formula = SALE ~ BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT + 
##     BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km + 
##     Education + Retailoutlets + Time_CBD_Drive_Min + Health, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -555380  -46530   -5714   38124  290493 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         454109.10   94443.43   4.808 1.77e-06 ***
## BEDS1                 2939.88   40947.88   0.072 0.942779    
## BEDS2               -39591.84   32419.88  -1.221 0.222301    
## BEDS3               -69671.14   31547.47  -2.208 0.027448 *  
## BEDS4              -104268.29   32246.29  -3.233 0.001264 ** 
## BEDS5               -92102.72   39257.01  -2.346 0.019171 *  
## BEDS6                11298.62   70687.47   0.160 0.873041    
## AREA_HSE              1859.38     103.30  17.999  < 2e-16 ***
## CARPORTT1           -12052.22    8642.07  -1.395 0.163459    
## CARPORTT2            -2894.21    7867.81  -0.368 0.713063    
## AGE                   1095.73     329.39   3.327 0.000913 ***
## BATHS_TOT1           45508.73   65048.75   0.700 0.484340    
## BATHS_TOT2           85246.93   65882.61   1.294 0.196003    
## BATHS_TOT3          104981.04   74015.44   1.418 0.156407    
## BATHS_TOT4           24087.38   90289.13   0.267 0.789695    
## BRICKWALL0         -131773.93   62141.57  -2.121 0.034215 *  
## BRICKWALL1         -139072.98   59495.51  -2.338 0.019615 *  
## POOL1                15180.14   11885.05   1.277 0.201823    
## LAND_AREA               21.16      24.08   0.879 0.379691    
## LIVING_TOT2         -23949.11   40419.54  -0.593 0.553646    
## LIVING_TOT3          31900.54   41853.05   0.762 0.446125    
## LIVING_TOT4          21514.74   42317.32   0.508 0.611279    
## LIVING_TOT5            461.20   42898.61   0.011 0.991424    
## LIVING_TOT6         -14955.44   43646.36  -0.343 0.731936    
## LIVING_TOT7         -37470.95   46293.68  -0.809 0.418475    
## LIVING_TOT8        -153533.73  101238.99  -1.517 0.129709    
## TILEDROOF1          -34730.05    8114.98  -4.280 2.06e-05 ***
## D_CBD_Drive_Km         639.36     601.78   1.062 0.288298    
## Education            -2635.19    8824.54  -0.299 0.765294    
## Retailoutlets         3916.64    7647.56   0.512 0.608669    
## Time_CBD_Drive_Min   -5818.00     852.61  -6.824 1.57e-11 ***
## Health                6455.16   12104.15   0.533 0.593948    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82890 on 963 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.5175, Adjusted R-squared:  0.502 
## F-statistic: 33.32 on 31 and 963 DF,  p-value: < 2.2e-16
anova(MyModel)
##Check for multicollinearity
car::vif(MyModel)
##                         GVIF Df GVIF^(1/(2*Df))
## BEDS               11.758196  6        1.227991
## AREA_HSE            3.961633  1        1.990385
## CARPORTT            2.133863  2        1.208625
## AGE                 2.746453  1        1.657243
## BATHS_TOT           4.845079  4        1.218043
## BRICKWALL           1.210889  2        1.049001
## POOL                1.231065  1        1.109534
## LAND_AREA           3.586592  1        1.893830
## LIVING_TOT         12.580137  7        1.198254
## TILEDROOF           1.280346  1        1.131524
## D_CBD_Drive_Km     16.491686  1        4.060996
## Education           6.417424  1        2.533263
## Retailoutlets       1.213998  1        1.101816
## Time_CBD_Drive_Min 16.637289  1        4.078883
## Health              6.118378  1        2.473536
##Show plot for residuals
plot(MyModel)
## Warning: not plotting observations with leverage one:
##   693

##Show table for predictions for the fitted values
predict(MyModel, test_df, type="response")
##     3896     4865     1884     2237     4899     1375      447     1479 
## 417343.9 442811.0 371682.6 425635.2 333286.1 459286.0 418838.3 441128.8 
##     4855       86     3292        1      612     3166      453     4139 
## 493669.9 492599.9 510228.9 438013.9 475957.3 386599.3 427182.5 498425.7 
##     2473      245     2030     3376     3958     2253     3197     4154 
## 401576.9 610315.6 295992.6 479682.7 438245.9 426224.3 459421.7 412011.7 
##     3504     4177     2696      218     3842      274     3741     1260 
## 618256.9 425803.8 618274.9 483186.2 426443.6 414202.1 478126.8 417504.7 
##     2684      267     3007     1430     1359     2401     3676     2543 
## 481348.4 436381.4 434775.8 464816.7 458996.3 514751.1 446627.9 532651.0 
##     2234     3348     2857     3941     4883     2345     1712     1284 
## 372172.7 432269.8 396165.0 578994.7 472232.3 468877.9 368700.8 469995.4 
##     2096     2027     2323     1109      165      581     1793     2904 
## 481039.3 434902.6 626122.0 403174.3 460693.3 522345.6 273689.2 439788.8 
##     1665     2595     1043     1897     2403     1366     4395     1197 
## 392053.1 461184.7 444713.9 548649.6 517775.9 401200.4 266343.4 491785.4 
##     2784     1454     2393     1729     3843     1107      305     3558 
## 490841.1 427992.7 700884.6 438880.7 426507.1 496341.8 389228.0 293864.2 
##     3123     3623     4166       76      847     2223     1195     3965 
## 565494.5 389284.9 444204.5 478490.3 450400.1 448501.0 516249.3 425374.9 
##     4694     4712     1911     2332     4271     4028     2367      695 
## 586195.9 543805.4 369078.7 631636.6       NA 427881.3 533669.7 485775.2 
##     4345     3990     2516     4793      645     4417     3492      674 
## 499440.4 411083.4 641435.2 607407.0 566056.7 450591.0 365487.2 535784.3 
##     4523     2896     3703       40     2739     3172     4535     1889 
## 382227.2 430068.7 470894.5 529792.1 680080.7 439329.7 336896.1 470299.9 
##      934     4349     4612     2424      598     2926     3484      704 
## 606034.7 571957.8 429926.1 421508.1 527784.7 633812.0 542693.7 617961.2 
##     3013     4927     1247     3840     1837     1795     3290     4750 
## 469334.6 392588.2 381459.4 497524.0 462272.7 329418.0 529934.4 306791.6 
##     1213     2716     4813     4741     4390     2321     2944     1796 
## 555777.6 462367.3 527038.0 447758.0 331879.3 547917.6 397117.4 345143.7 
##     1693     3047     2230     4701      503     4060     1422     4444 
## 595141.4 455423.0 365502.8 476460.8 366144.5 424778.1 398672.5 348824.8 
##     2180     4247      185     3448     4934     1307     4249     3995 
## 413897.5 444387.3 488553.3 390999.1 338922.9 643430.4 446070.7 468453.9 
##     3442     3187     1507     4155     1763      978     2371     4912 
## 498900.9 521730.8 405985.1 379545.6 523170.6 577938.0 359649.4 542905.0 
##     1676     1215     1797     1748     2845     1682     4683      535 
## 460169.7 496026.4 272640.3 362266.0 593952.0 398796.6 497305.1 478699.7 
##     3009     1919     3080     3136     4167     1394       97     2646 
## 375822.9 397589.9 465419.0 542853.8 427062.3 348674.7 512559.0 358210.9 
##      123     3875     4183     2737     3637     5026     2079     1844 
## 572475.3 569034.0 422619.7 544718.9 353487.6 405499.4 447925.9 428853.1 
##     3870     1546     4266     4507     1912     3158     4199     4790 
## 631301.4 611545.7       NA 625497.1 372425.4 355194.4 393319.1 600030.2 
##      676     2448     3237     1751     1908      539     4573     4707 
## 466418.0 487895.2 492932.8 531785.8 336152.3 390215.4 501262.4 525427.2 
##     1570     1832     4485     2336     4264     2966     3518     2745 
## 370314.2 397780.8 457906.0 501455.0 402144.0 410924.4 445271.6 537537.4

Looking at the summary of this regression model. The significant factors are, tiled roof, time to CBD, brickwall, 3,4 or 5 beds, house area, 1 carport, age and brickwall or no brickwall. Testing for multicollinearity some of these factors, time to CBD, living area and beds. When looking at the variance inflation factor compared to the degrees of freedom we can see that living area and beds have a high VIF because of the degrees of freedom. Another model will be tested used with the significant factors, one distance function will be removed as this would be the independent variables that would be highly multicollinear. The overall model explains 50% of the variance explained in sale price and is significant overall with a p-value: < 2.2e-16.

MyModel2 <- lm(SALE ~ AREA_HSE + BEDS+ BATHS_TOT + AGE  + D_CBD_Drive_Km + LAND_AREA + Office + Manufacturingindustry + Recreation + natural_elements + Education + walk_score + Residential, data = train_df)

#save predictions
Predictor1  <- MyModel2 %>% predict(test_df)
data.frame(
          squares_2 = RMSE(Predictor, test_df$SALE),
          Rsquared_2= R2(Predictor, test_df$SALE))
##analyse data
summary(MyModel2)
## 
## Call:
## lm(formula = SALE ~ AREA_HSE + BEDS + BATHS_TOT + AGE + D_CBD_Drive_Km + 
##     LAND_AREA + Office + Manufacturingindustry + Recreation + 
##     natural_elements + Education + walk_score + Residential, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -558499  -47385   -2908   42582  300423 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            88994.88   59739.92   1.490  0.13662    
## AREA_HSE                1754.82      87.75  19.998  < 2e-16 ***
## BEDS1                  31190.39   33162.29   0.941  0.34717    
## BEDS2                 -25639.39   23360.55  -1.098  0.27267    
## BEDS3                 -39427.15   21753.90  -1.812  0.07023 .  
## BEDS4                 -71648.29   22797.14  -3.143  0.00172 ** 
## BEDS5                 -53230.49   31442.80  -1.693  0.09079 .  
## BEDS6                  35558.29   63661.99   0.559  0.57660    
## BATHS_TOT1             71805.79   60443.75   1.188  0.23513    
## BATHS_TOT2            109973.38   61215.29   1.797  0.07272 .  
## BATHS_TOT3            112492.83   68621.29   1.639  0.10147    
## BATHS_TOT4             87684.82   85139.30   1.030  0.30331    
## AGE                      610.84     284.11   2.150  0.03180 *  
## D_CBD_Drive_Km         -2493.59     206.24 -12.090  < 2e-16 ***
## LAND_AREA                 20.99      21.68   0.968  0.33305    
## Office                 43566.55   13639.04   3.194  0.00145 ** 
## Manufacturingindustry  54528.84   18215.62   2.994  0.00283 ** 
## Recreation              6929.42    1417.66   4.888 1.19e-06 ***
## natural_elements         887.21     487.41   1.820  0.06903 .  
## Education             -20453.29    4549.76  -4.495 7.77e-06 ***
## walk_score              1807.88     253.59   7.129 1.96e-12 ***
## Residential            35485.58   17689.15   2.006  0.04512 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80600 on 978 degrees of freedom
## Multiple R-squared:  0.5388, Adjusted R-squared:  0.5289 
## F-statistic: 54.41 on 21 and 978 DF,  p-value: < 2.2e-16
anova(MyModel2)
##Check for Multicollinearity
car::vif(MyModel2)
##                           GVIF Df GVIF^(1/(2*Df))
## AREA_HSE              3.037557  1        1.742859
## BEDS                  4.851196  6        1.140654
## BATHS_TOT             3.212048  4        1.157038
## AGE                   2.163649  1        1.470935
## D_CBD_Drive_Km        2.058866  1        1.434875
## LAND_AREA             3.091528  1        1.758274
## Office                1.369528  1        1.170268
## Manufacturingindustry 1.269335  1        1.126648
## Recreation            1.504809  1        1.226707
## natural_elements      1.301045  1        1.140634
## Education             1.811298  1        1.345845
## walk_score            2.660051  1        1.630966
## Residential           2.168411  1        1.472553
##Show plot for residuals
plot(MyModel2)

The second model has successfully explained a 2% increase in the variation of price against the other factors. In this model, CBD drive was removed. When the code is executed for VIF, there is no sign of high levels of multicollinearity. Although the intercept is quite low for model 2, this understandable as living areas have been removed from the model. It should be noted that ideally the predicted data from model 2 would be graphed over the test data, to analyse the accuracy of the predicted data.

Part F: Summary and Conclusion

F1: Write a conclusion to summarise the entirety of your work for this assignment.

William Alonso’s bid rent theory explains that businesses and individuals will compete for land close to the CBD and that the purchase comes with two items: the land and location. The assignment has demonstrated that while this might be true, there are many other factors that affect the price of a property. Distance to CBD has been a factor with high significance throughout tests but low correlation to sale directly. This is because there are many other factors affecting price.

As distance from CBD increases, the land available may increase and individuals and businesses alike may use this extra space to add value. William Alonso’s model also assumes that the CBD is the most ideal place to buy property. This may have been true in the early 1900s but as housing density and population have increased people have sought to buy further away from the city. Increased demand to live near natural elements or beaches would also cloud the correlation. A measure that may be effective is price per square metre. It could be taken a step further and a metric could be created, price per square metre per suburb.

Overall the theory has basic logic to it and generally speaking inner-city suburbs are generally more expensive but land area is not the only sacrifice. Further analysis could be completed on new, more relevant data, this would be an interesting comparison.

To create the web page use knit HTML. Zip the html and associated file of images etc. and upload to LMS.

tinytex::install_tinytex()

End of Assignment One