In 1964, economist William Alonso wrote a book on the bid-rent function of housing value. The central proposition went something like this: households optimise the size of their land ownership against the cost of commuting to work (i.e. to pay for their land ownership). Some households demand more land and therefore are prepared to have higher commuting costs and other households do not require as much land and will live closer to the city centre. This, of course, is a gross oversimplification of how house prices are determined, as the quality of the house (e.g. materials, size, aesthetics and features) and the quality of the neighbourhood (e.g. access to commercial activities, open space, types of housing offered) have a bearing on the implicit value of the lot.
In this assignment, we will not try to solve all the issues related to estimating the price of a house, but we will see if we can determine the values of housing features and whether bigger lots are further from the city, as well as if there is a diminishing value (after controlling for lot size and housing features) of property the further we move away from the city. Our case study uses property sales data from Perth in 2013, which has been supplied as HousingDataForAssignment.csv.
Complete the steps below to complete the assignment.
Firstly, set the appropriate working directory for this project (i.e. set the working directory to the project location) in RStudio, before proceeding to the next steps.
A1: Open the HousingDataForAssignment.csv data file using R and read it into a data frame.
#Set the working directory
setwd("~/Desktop/Assignment1_LMS 5/Assignment1_LMS")
getwd()
## [1] "/Users/jamesmacbook/Desktop/Assignment1_LMS 5/Assignment1_LMS"
##Read into data frame
Houses_df <- read.csv("/Users/jamesmacbook/Desktop/Assignment1_LMS 5/Assignment1_LMS/Data/HousingDataForAssignment.csv")
A2: Select 1200 observations from the data frame using a random number generator, with your student number as the seed. Place these 1200 observations into a new data frame named MyHousingData. You will work with this data only and you can remove (rm()) the data frame you created in the previous step.
#Set seed as the number generator
set.seed(23119479)
#Create index of 1200
Data <- sample(1:nrow(Houses_df),1200)
#Place into data frame with variables
MyHousingData <- Houses_df[Data, ]
##View the data frame to 'eye off' the data and get a feel for what it looks like
View(MyHousingData)
##Remove data frame
rm(Houses_df)
A3: Create factors for any variables that require them.
#Create factors for variables and make SALE as numeric.
MyHousingData$Date <- as.factor(MyHousingData$Date)
MyHousingData$SUBURB <- as.factor(MyHousingData$SUBURB)
MyHousingData$Date <- as.factor(MyHousingData$Date)
MyHousingData$BRICKWALL <- as.factor(MyHousingData$BRICKWALL)
MyHousingData$DETACHED <- as.factor(MyHousingData$DETACHED)
MyHousingData$PERIOD_INDEX <- as.factor(MyHousingData$PERIOD_INDEX)
MyHousingData$BEDS <- as.factor(MyHousingData$BEDS)
MyHousingData$BATHS_TOT <- as.factor(MyHousingData$BATHS_TOT)
MyHousingData$LIVING_TOT <- as.factor(MyHousingData$LIVING_TOT)
MyHousingData$CARPORTT <- as.factor(MyHousingData$CARPORTT)
MyHousingData$POOL <- as.factor(MyHousingData$POOL)
MyHousingData$GARAGET <- as.factor(MyHousingData$GARAGET)
MyHousingData$TILEDROOF <- as.factor(MyHousingData$TILEDROOF)
MyHousingData$VILLA_UNIT <- as.factor(MyHousingData$VILLA_UNIT)
MyHousingData$SALE <- as.numeric(MyHousingData$SALE)
##Check structure of dataframe has changed
str(MyHousingData)
## 'data.frame': 1200 obs. of 35 variables:
## $ ID_SEQ : int 109349 124793 94193 125228 118283 98325 85549 96087 120589 115189 ...
## $ SUBURB : Factor w/ 211 levels "ALEXANDER HEIGHTS",..: 76 152 133 171 77 78 184 152 185 147 ...
## $ SALE : num 305000 390000 369000 492000 605000 350000 750000 370000 510000 480000 ...
## $ Date : Factor w/ 177 levels "1/03/2013","1/04/2013",..: 86 35 32 46 152 62 66 86 58 45 ...
## $ PERIOD_INDEX : Factor w/ 2 levels "3","5": 1 2 1 2 2 1 1 1 2 2 ...
## $ DETACHED : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 1 ...
## $ VILLA_UNIT : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 1 2 ...
## $ BRICKWALL : Factor w/ 3 levels "#NULL!","0","1": 3 3 3 3 3 3 3 3 3 3 ...
## $ TILEDROOF : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 2 2 2 ...
## $ LAND_AREA : int 101 247 544 492 923 469 308 218 603 128 ...
## $ AREA_HSE : int 79 98 174 193 230 163 121 101 196 128 ...
## $ AGE : int 43 11 5 15 11 9 20 6 11 29 ...
## $ BEDS : Factor w/ 7 levels "0","1","2","3",..: 3 4 4 5 4 5 3 4 5 4 ...
## $ BATHS_TOT : Factor w/ 5 levels "0","1","2","3",..: 2 3 3 3 3 3 2 3 3 2 ...
## $ LIVING_TOT : Factor w/ 8 levels "1","2","3","4",..: 3 4 4 6 5 5 4 4 4 4 ...
## $ CARPORTT : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 3 1 1 2 ...
## $ GARAGET : Factor w/ 5 levels "0","1","2","3",..: 1 3 3 3 3 3 1 3 3 1 ...
## $ POOL : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ Lat_centroid : num -31.9 -31.9 -32.5 -31.9 -32.4 ...
## $ Long_centroid : num 116 116 116 116 116 ...
## $ D_CBD_Drive_Km : num 6.95 13.16 67.07 14.11 60.88 ...
## $ D_CBD_PT_Km : num 8.1 10.7 77.9 12.2 62 ...
## $ Time_CBD_Drive_Min : num 8.03 16.8 51.9 16.08 45.92 ...
## $ Time_CBD_PT_Min : num 33.1 47.9 95.8 16.1 77.2 ...
## $ Commercial : num 1.38 0.27 0.43 1.11 0.23 0.58 3.93 0.27 0.93 0.65 ...
## $ Education : num 1.38 1.34 1.07 1.84 0.23 0.65 0 1.34 1.56 1.31 ...
## $ Health : num 1.38 1.07 0.64 1.11 0.23 0.52 0 1.07 0.93 0.98 ...
## $ Manufacturingindustry: num 0 0 0 0 0 0.06 0 0 0 0 ...
## $ Office : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Recreation : num 4.15 5.36 5.56 3.69 3.17 2.78 8.52 5.36 6.84 6.86 ...
## $ Residential : num 0.49 0.77 0.46 0.61 0.32 0.49 0.51 0.77 0.67 0.77 ...
## $ Retailoutlets : num 1.38 0.27 0.43 0 0.23 0.32 1.31 0.27 0.62 0.33 ...
## $ natural_elements : num 0 0 0 0.02 1.32 1.33 0 0 0 0.08 ...
## $ walk_score : num 54.5 49.8 28.1 34 10 ...
## $ bikeroute_total : num 4.87 3.71 0.06 4.56 1.26 2.78 6.57 3.71 5.7 3.67 ...
A4: Split the observations in your MyHousingData data frame into a training set of 1000 observations and a test set of 200 observations.
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(pacman, tidyverse, EnvStats,psych, caret)
library(caret)
##Create Index for training then create data frames for testing and training set
training <- sample(1:nrow(MyHousingData),1000)
train_df <- MyHousingData[training, ]
test_df <- MyHousingData[-training, ]
##training_df will be used from here
From here on in, you will need to write a paragraph to describe each output.
B1: Your data set has nominal (factors) and ratio (numerical) variables. Create the appropriate numerical summary for each of them.
##Creating a different subset for character factors and numeric for data wrangling if needed
factor_df=subset(train_df, select=c("SUBURB","Date", "BRICKWALL","DETACHED", "PERIOD_INDEX", "BATHS_TOT", "LIVING_TOT", "CARPORTT", "POOL", "GARAGET", "TILEDROOF", "VILLA_UNIT", "BEDS"))
##Numeric Data frame for only numerice factors, this can be used for data wrangling
numeric_df=subset(train_df, select=c("SALE", "LAND_AREA", "AREA_HSE", "AGE", "Lat_centroid", "Long_centroid", "D_CBD_Drive_Km", "D_CBD_PT_Km", "Time_CBD_Drive_Min", "Time_CBD_PT_Min", "Commercial", "Education", "Health", "Manufacturingindustry", "Office", "Recreation", "Residential", "Retailoutlets", "natural_elements", "walk_score", "bikeroute_total"))
##Summarise data, we could use below function or simply summary(MyHousingData)
summary(factor_df)
## SUBURB Date BRICKWALL DETACHED PERIOD_INDEX BATHS_TOT
## ELLENBROOK: 53 13/05/2013: 16 #NULL!: 2 0:481 3:544 0: 2
## NOLLAMARA : 30 20/05/2013: 15 0 : 23 1:519 5:456 1:290
## BALDIVIS : 25 22/05/2013: 15 1 :975 2:698
## MORLEY : 25 30/04/2013: 15 3: 8
## MAYLANDS : 23 27/05/2013: 13 4: 2
## BUTLER : 22 29/04/2013: 13
## (Other) :822 (Other) :913
## LIVING_TOT CARPORTT POOL GARAGET TILEDROOF VILLA_UNIT BEDS
## 4 :354 0:601 0:935 0:452 0:161 0:640 0: 19
## 3 :222 1:223 1: 65 1: 88 1:839 1:360 1: 10
## 5 :210 2:176 2:448 2:119
## 6 : 97 3: 11 3:459
## 2 : 80 4: 1 4:376
## 7 : 28 5: 15
## (Other): 9 6: 2
summary(numeric_df)
## SALE LAND_AREA AREA_HSE AGE
## Min. :110000 Min. : 100.0 Min. : 71.0 Min. : 0.00
## 1st Qu.:374250 1st Qu.: 219.0 1st Qu.:106.0 1st Qu.: 6.00
## Median :445000 Median : 408.0 Median :137.0 Median : 11.00
## Mean :455463 Mean : 406.4 Mean :146.6 Mean : 15.78
## 3rd Qu.:526500 3rd Qu.: 569.2 3rd Qu.:184.0 3rd Qu.: 21.00
## Max. :800000 Max. :1031.0 Max. :358.0 Max. :102.00
##
## Lat_centroid Long_centroid D_CBD_Drive_Km D_CBD_PT_Km
## Min. :-32.63 Min. :115.6 Min. : 0.00 Min. : 2.97
## 1st Qu.:-32.08 1st Qu.:115.8 1st Qu.:14.07 1st Qu.:13.03
## Median :-31.94 Median :115.8 Median :23.72 Median :23.28
## Mean :-31.98 Mean :115.9 Mean :27.60 Mean :27.75
## 3rd Qu.:-31.84 3rd Qu.:115.9 3rd Qu.:36.66 3rd Qu.:32.19
## Max. :-31.54 Max. :116.2 Max. :85.24 Max. :89.38
## NA's :5
## Time_CBD_Drive_Min Time_CBD_PT_Min Commercial Education
## Min. : 6.28 Min. : 6.28 Min. : 0.0000 Min. :0.000
## 1st Qu.:16.80 1st Qu.: 38.38 1st Qu.: 0.2400 1st Qu.:0.560
## Median :24.72 Median : 53.92 Median : 0.4700 Median :1.025
## Mean :26.68 Mean : 53.06 Mean : 0.8112 Mean :1.112
## 3rd Qu.:33.87 3rd Qu.: 70.03 3rd Qu.: 0.9100 3rd Qu.:1.420
## Max. :67.25 Max. :123.33 Max. :46.5700 Max. :5.160
## NA's :5 NA's :5
## Health Manufacturingindustry Office Recreation
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. : 0.090
## 1st Qu.:0.4700 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 3.090
## Median :0.6300 Median :0.00000 Median :0.00000 Median : 4.420
## Mean :0.7514 Mean :0.08174 Mean :0.03403 Mean : 4.609
## 3rd Qu.:0.9800 3rd Qu.:0.10000 3rd Qu.:0.00000 3rd Qu.: 5.750
## Max. :5.1600 Max. :0.91000 Max. :1.99000 Max. :15.620
##
## Residential Retailoutlets natural_elements walk_score
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 3.69
## 1st Qu.:0.3700 1st Qu.:0.1000 1st Qu.: 0.000 1st Qu.:28.84
## Median :0.5800 Median :0.2700 Median : 0.160 Median :41.79
## Mean :0.5357 Mean :0.3521 Mean : 1.609 Mean :40.97
## 3rd Qu.:0.7100 3rd Qu.:0.4600 3rd Qu.: 1.070 3rd Qu.:51.71
## Max. :0.9400 Max. :4.4300 Max. :110.560 Max. :96.50
##
## bikeroute_total
## Min. :0.000
## 1st Qu.:2.160
## Median :3.710
## Mean :3.722
## 3rd Qu.:5.020
## Max. :8.650
##
From the training data frame there are 1000 observations 35 variables or columns. The data represents houses that have been sold and describes details around the sale, including a unique identifier, sale price of property, the date it was sold, suburb it was sold in, distance from city and other key destinations. Some insights derived from the data are as follows, suburb with the most houses sold is Ellenbrook with 53 houses sold, followed by Nollamara 30, and lastly, Baldivis 25. The mean sale price of the sample was $455463 with the maximum sale price of three properties $800000, two in Illuka and one in Wembley. The minimum house price is $110000, a villa in Applecross and one in Hammersely, another four properties in Hammersely were sold for $120000-$128000. The majority of properties have two bathrooms (n=698). In regard to bedrooms, most houses have 3 (n=459) and 10 houses have one.376 houses have 4 bedrooms. Only 65 properties have a pool, 935 do not have a pool. 27.60km is the mean distance from Perth CBD, the average time in drive is 26 minutes. There are many other variables that can be analysed and compared. You could represent this data as percentages rather than total counts. Further analysis using hypothesis tests and regression models will be used to test relationships and correlation.
B2: Create scatter plots for (1) lot-size, (2) distance from the CBD and (3) price.
##Creating plot for House price versus land area. created a new value by dividing house price by 1000 to stop the plot going into scientific notation. Edited scales and added title for aesthetic.
if (!require("ggplot2")) install.packages("ggplot2")
library(dplyr)
Property_Price <- train_df$SALE/1000
ggplot(train_df, aes(x=LAND_AREA, y=Property_Price)) +
geom_point(size=1.5, shape=16) +
scale_y_continuous(labels = scales::dollar, breaks = seq(0, 1000, by = 150))+
scale_x_continuous(breaks = seq(0, 1200, by = 200)) +
labs(x="Land Area (sqm)",
y = "Sale Price ($Hundreds of Thousands)") +
ggtitle("Sale Price compared to Land Area")
##Created plot to for drive to CBD compared to Land Area, add titles and labels.
ggplot(train_df, aes(x=D_CBD_Drive_Km, y=LAND_AREA)) +
geom_point(size=1.5, shape=16) +
labs(x="Distance In Kilometres",
y="Land Area (sqm)") +
ggtitle("Drive to CBD compared to Land Area")
##Create plot comparing distance and property price, add titles, labels and breaks.
ggplot(train_df, aes(x=D_CBD_Drive_Km, y=Property_Price)) +
geom_point(size=1.5, shape=18) +
scale_y_continuous(labels = scales::dollar, breaks = seq(0, 1000, by = 150)) +
labs(x="Distance from City (Kilometres)",
y="Property Price (Hundreds of Thousands)") +
ggtitle("Price compared to Drive")
We can see from scatter plot 1 that the correlation when tested could be low. As many data points have different Y values are the same as the X Values. Visually, we can see most of the data falls between the $300 - 600k house price mark as well as falling between 100sqm - 600sqm area range. Outside those ranges we can see that there are fewer data points and some outliers. When considering house prices this is logical as there a fewer suburbs with higher price tags and expensive houses. A line could also be used by coding geom_abline().
B3: Estimate the bi-variate correlation coefficients for these variables in the previous question. Are they significantly different from zero? Address this in your paragraph of explanation.
cor(train_df$LAND_AREA, train_df$SALE)
## [1] 0.1766924
cor(train_df$D_CBD_Drive_Km, train_df$SALE)
## [1] -0.2272232
cor(train_df$AREA_HSE,train_df$D_CBD_Drive_Km)
## [1] 0.4211321
The answer to the first correlation is R = 0.1767 (rounded to 4 decimals). This indicates that the relationship between price and land area is extremely weak. The second correlation R= -0.2272, this shows there is no relationship between kilometres from the CBD and the property price. The third correlation R=0.4211, this is still a weaker correlation, but shows a greater relationship than the other tests.
C1: Compute conditional mean prices for houses, where the conditions are (1) number of bedrooms, (2) number of bathrooms and (3) the presence of a swimming pool.
##Install packages if they are needed.
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
##Create new data frame for conditional means. We are separating the data and grouping it by number of beds and asking R to output the mean price for each level within the factor BEDS. Print the data frame to look at results. Can also use head() or view().
C1_BEDS_temp2 <- group_by(train_df, BEDS)
C1_BED_temp <- summarise(C1_BEDS_temp2,
bed_mean_temp = mean(SALE),
n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
print(C1_BED_temp)
## BEDS bed_mean_temp n
## 1 0 462157.9 19
## 2 1 445450.0 10
## 3 2 424066.4 119
## 4 3 443654.7 459
## 5 4 475659.6 376
## 6 5 528560.7 15
## 7 6 675000.0 2
C1_BATHS_temp2 <- group_by(train_df, BATHS_TOT)
C1_BATH_temp <- summarise(C1_BATHS_temp2,
BATH_mean_temp = mean(SALE),
n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
print(C1_BATH_temp)
## BATHS_TOT BATH_mean_temp n
## 1 0 362500.0 2
## 2 1 412058.3 290
## 3 2 471108.9 698
## 4 3 646125.0 8
## 5 4 619205.0 2
C1_POOL_temp2 <- group_by(train_df, POOL)
C1_POOL_temp <- summarise(C1_POOL_temp2,
POOL_mean_temp = mean(SALE),
n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
print(C1_POOL_temp)
## POOL POOL_mean_temp n
## 1 0 451073.7 935
## 2 1 518606.3 65
Paragraph: C2 The analysis for the conditional means for bed count displays an interesting result to interpret. 19 houses with no beds have a mean price of $462157.9. Potentially this is a data integrity issues and this data should have been removed initially. This data could however, may represent properties that are purely land or projects that have not been finished yet. Properties with 1 bedroom have a mean price of $445450 which more than the mean of properties with two bedrooms which is $445450. There may be other factors affecting this or it may simply be that there are only 119 properties with one bedroom, therefore outliers will affect the mean more adversely. From 2 bedrooms onwards, the average price increases for every bedroom extra. This is naturally what would be expected.
For bathrooms, there are only 2 properties that have no bathrooms. Again this could be land, a development or data integrity. On average properties with 2 bathrooms are worth $447110, add a third bathroom and the mean price increases considerably to $646125. This is a difference of $199015. Although a large difference it should be noted that the extra bathroom is not causing this price increase, but it could be related. Further analysis would need to be undertaken to understand this.
Assuming that 0 represents no pool and that 1 represent a pool. We can determine based on data provided that properties that have pools, on average, are worth $67,533 more.
C2: Use the appropriate test (ANOVA or t-test) to make a preliminary assessment on whether price is dependent on (1) the number of bedrooms; (2) the number of bathrooms; (3) the presence of a swimming pool.
# Compute the analysis of variance
Anova_Combined <- lm(SALE ~ BEDS + BATHS_TOT + POOL, data=train_df)
anova(Anova_Combined)
confint(Anova_Combined, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 206885.635 518114.365
## BEDS1 -133523.089 42217.975
## BEDS2 -125992.488 -11263.420
## BEDS3 -123660.230 -12616.942
## BEDS4 -118589.812 -3720.886
## BEDS5 -100750.239 60401.981
## BEDS6 -120498.563 220833.703
## BATHS_TOT1 -53131.330 275896.036
## BATHS_TOT2 3288.131 334277.935
## BATHS_TOT3 126284.243 494619.615
## BATHS_TOT4 24923.891 485939.028
## POOL1 15687.670 75172.126
ANOVA must be used as t.test is for two variables only. The null hypothesis for the test is that all means are the same and the alternative hypothesis is that at least one of the means are different. All the means are significantly different and we can reject the null hypothesis at the 5% level.
D1: Run simple linear regressions for price as a function of (1) distance from the city and (2) lot-size.
##This is the equation with results for price/kilometres
Model_Kilometres <- lm(Property_Price ~ D_CBD_Drive_Km, data=train_df)
##Use summary instead of ANOVA, only one dependant variable
summary(Model_Kilometres)
##
## Call:
## lm(formula = Property_Price ~ D_CBD_Drive_Km, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -373.84 -79.15 -4.94 67.45 351.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 496.973 6.694 74.246 < 2e-16 ***
## D_CBD_Drive_Km -1.504 0.204 -7.371 3.55e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.4 on 998 degrees of freedom
## Multiple R-squared: 0.05163, Adjusted R-squared: 0.05068
## F-statistic: 54.33 on 1 and 998 DF, p-value: 3.548e-13
##Plotting the model
ggplot(train_df, aes(x=D_CBD_Drive_Km, y=Property_Price)) +
geom_point()+
geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
##Equation for Sale versus Land Area
Model_LotSize <- lm(SALE ~ LAND_AREA, data=train_df)
summary(Model_LotSize)
##
## Call:
## lm(formula = SALE ~ LAND_AREA, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -346227 -80097 -15662 68083 351498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 414695.65 8065.32 51.417 < 2e-16 ***
## LAND_AREA 100.32 17.69 5.671 1.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115600 on 998 degrees of freedom
## Multiple R-squared: 0.03122, Adjusted R-squared: 0.03025
## F-statistic: 32.16 on 1 and 998 DF, p-value: 1.857e-08
#The model
ggplot(train_df, aes(x=LAND_AREA, y=Property_Price)) +
geom_point()+
geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
Comparing both models, we can see both land area and driving distance from the CBD are significant at the 0.0001 level. For driving distance the error is slighlty larger than the error for the size of the land. The variance in price is only explained by 3% by Land area and 5% by distance by driving to the CBD. This is a low amount of variance explained for price.
If the linear regression model is used we can determine how much land area contributes to price. The intercept for land price is 100.3 from this we can deduce that for every 1 square metre in land size that increases, it adds an extra $100.3 in price.
Due to the low correlation the model is not effective for predicting house price, to get a better model we will need to add more factors that are significant. This is a logical answer as we know that there are more factors that affect a house price rather than simply distance.
D2: Run a multiple linear regression of price as a function of distance from the city, lot size and the appropriate housing attributes.
Model_MLR<- lm(SALE ~ LAND_AREA + D_CBD_Drive_Km + BEDS + BATHS_TOT + POOL + GARAGET + AGE + AREA_HSE + Time_CBD_Drive_Min, data=train_df)
summary(Model_MLR)
##
## Call:
## lm(formula = SALE ~ LAND_AREA + D_CBD_Drive_Km + BEDS + BATHS_TOT +
## POOL + GARAGET + AGE + AREA_HSE + Time_CBD_Drive_Min, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -556195 -47357 -9223 40716 332135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 271013.964 62103.614 4.364 1.41e-05 ***
## LAND_AREA 1.685 23.419 0.072 0.942658
## D_CBD_Drive_Km 1014.600 607.801 1.669 0.095380 .
## BEDS1 52103.697 34780.070 1.498 0.134434
## BEDS2 -12179.130 24644.662 -0.494 0.621284
## BEDS3 -43484.970 22984.329 -1.892 0.058795 .
## BEDS4 -82948.663 24232.664 -3.423 0.000645 ***
## BEDS5 -69028.479 33688.176 -2.049 0.040726 *
## BEDS6 33425.851 67131.873 0.498 0.618658
## BATHS_TOT1 83279.901 64126.015 1.299 0.194357
## BATHS_TOT2 126466.606 64970.822 1.947 0.051880 .
## BATHS_TOT3 130062.500 72649.331 1.790 0.073720 .
## BATHS_TOT4 80543.680 90504.988 0.890 0.373720
## POOL1 7661.621 12087.128 0.634 0.526317
## GARAGET1 14403.651 10419.361 1.382 0.167168
## GARAGET2 4176.116 7753.079 0.539 0.590259
## GARAGET3 42052.613 27115.956 1.551 0.121264
## GARAGET4 97270.846 89560.132 1.086 0.277706
## AGE 579.133 306.894 1.887 0.059447 .
## AREA_HSE 1684.322 103.312 16.303 < 2e-16 ***
## Time_CBD_Drive_Min -6207.103 862.639 -7.195 1.24e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84960 on 974 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.4874, Adjusted R-squared: 0.4768
## F-statistic: 46.3 on 20 and 974 DF, p-value: < 2.2e-16
anova(Model_MLR)
confint(Model_MLR, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 149141.67380 392886.25428
## LAND_AREA -44.27179 47.64162
## D_CBD_Drive_Km -178.15070 2207.35068
## BEDS1 -16148.79996 120356.19487
## BEDS2 -60541.87708 36183.61766
## BEDS3 -88589.47553 1619.53472
## BEDS4 -130502.90480 -35394.42071
## BEDS5 -135138.24315 -2918.71568
## BEDS6 -98313.90788 165165.61023
## BATHS_TOT1 -42561.15627 209120.95732
## BATHS_TOT2 -1032.30125 253965.51391
## BATHS_TOT3 -12504.73236 272629.73235
## BATHS_TOT4 -97063.54062 258150.90022
## POOL1 -16058.19047 31381.43327
## GARAGET1 -6043.33107 34850.63209
## GARAGET2 -11038.54619 19390.77736
## GARAGET3 -11159.80799 95265.03417
## GARAGET4 -78482.18697 273023.87808
## AGE -23.11646 1181.38171
## AREA_HSE 1481.58154 1887.06249
## Time_CBD_Drive_Min -7899.94773 -4514.25893
The data output for the multiple regression has been determined on the factors that will most likely affect price. We can see from the output, that in fact not many of the factors are significant. With our 20 degrees of freedom the model explains only 47.3% of the variance in price, using the multiple R squared.
Houses with 4 and 5 bedrooms have a significant affect on price. House area and drive time to the CBD also are significant at the 5% confidence level. To improve the model the factors that are below 5% confidence level should be removed.
D3: Note you have detached housing and villas/units delineated in your data. Run a regression for each of the two segments and comment on the differences in parameters.
##Run model based on multiple attributes then run second model
Villas_Regression <- train_df[train_df$VILLA_UNIT==1,]
Villas_Regression_Model <- lm(SALE ~ AREA_HSE + LAND_AREA + D_CBD_Drive_Km + Time_CBD_Drive_Min + BATHS_TOT + TILEDROOF + LIVING_TOT + CARPORTT + AGE + Education + Health + Retailoutlets + BEDS, data = Villas_Regression)
summary(Villas_Regression_Model)
##
## Call:
## lm(formula = SALE ~ AREA_HSE + LAND_AREA + D_CBD_Drive_Km + Time_CBD_Drive_Min +
## BATHS_TOT + TILEDROOF + LIVING_TOT + CARPORTT + AGE + Education +
## Health + Retailoutlets + BEDS, data = Villas_Regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531757 -49714 -3323 53622 262288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 300155.33 92648.94 3.240 0.00132 **
## AREA_HSE 1847.64 239.53 7.714 1.42e-13 ***
## LAND_AREA 125.66 46.94 2.677 0.00779 **
## D_CBD_Drive_Km 2145.79 1402.01 1.531 0.12684
## Time_CBD_Drive_Min -8421.86 1857.91 -4.533 8.11e-06 ***
## BATHS_TOT1 60635.59 78230.05 0.775 0.43883
## BATHS_TOT2 86921.70 80605.02 1.078 0.28165
## BATHS_TOT3 57925.70 126481.86 0.458 0.64727
## TILEDROOF1 -29599.66 13957.32 -2.121 0.03468 *
## LIVING_TOT2 -1266.85 50660.51 -0.025 0.98006
## LIVING_TOT3 26156.58 52640.74 0.497 0.61959
## LIVING_TOT4 11400.18 53649.46 0.212 0.83185
## LIVING_TOT5 -28009.91 56969.36 -0.492 0.62328
## LIVING_TOT6 -68205.45 72398.21 -0.942 0.34683
## CARPORTT1 -9541.72 13859.81 -0.688 0.49165
## CARPORTT2 44612.01 20467.67 2.180 0.02998 *
## AGE -409.88 567.43 -0.722 0.47058
## Education 10559.26 17322.70 0.610 0.54257
## Health -15973.34 21582.17 -0.740 0.45975
## Retailoutlets 4841.16 12065.81 0.401 0.68851
## BEDS1 59415.52 54309.31 1.094 0.27473
## BEDS2 6966.81 46274.13 0.151 0.88042
## BEDS3 -44088.13 43995.49 -1.002 0.31702
## BEDS4 -86646.95 49262.20 -1.759 0.07951 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96890 on 334 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4212, Adjusted R-squared: 0.3814
## F-statistic: 10.57 on 23 and 334 DF, p-value: < 2.2e-16
Detached_Regression <- train_df[train_df$DETACHED==1, ]
Detached_Regression_Model <- lm(SALE ~ BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT + BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km + Education + Health + Retailoutlets + Time_CBD_Drive_Min, data = Detached_Regression)
summary(Detached_Regression_Model)
##
## Call:
## lm(formula = SALE ~ BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT +
## BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km +
## Education + Health + Retailoutlets + Time_CBD_Drive_Min,
## data = Detached_Regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -219543 -38371 -9222 27331 285252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 338209.95 97891.72 3.455 0.000598 ***
## BEDS2 241242.00 81952.72 2.944 0.003398 **
## BEDS3 176970.81 78444.70 2.256 0.024513 *
## BEDS4 159619.13 78510.91 2.033 0.042585 *
## BEDS5 173942.10 80787.32 2.153 0.031801 *
## BEDS6 271083.38 93577.29 2.897 0.003938 **
## AREA_HSE 1661.05 107.90 15.394 < 2e-16 ***
## CARPORTT1 -25265.11 14137.36 -1.787 0.074539 .
## CARPORTT2 -27469.81 7790.63 -3.526 0.000462 ***
## AGE 2162.43 474.34 4.559 6.51e-06 ***
## BATHS_TOT2 49151.15 13689.76 3.590 0.000364 ***
## BATHS_TOT3 91924.51 32168.59 2.858 0.004451 **
## BATHS_TOT4 1815.18 50748.99 0.036 0.971482
## BRICKWALL0 -56796.69 75950.23 -0.748 0.454932
## BRICKWALL1 -75859.44 66261.79 -1.145 0.252835
## POOL1 12934.06 10043.25 1.288 0.198413
## LAND_AREA -16.23 29.36 -0.553 0.580756
## LIVING_TOT3 -114658.98 44211.28 -2.593 0.009788 **
## LIVING_TOT4 -140883.60 42015.36 -3.353 0.000861 ***
## LIVING_TOT5 -152371.97 42145.46 -3.615 0.000331 ***
## LIVING_TOT6 -163027.82 42524.79 -3.834 0.000143 ***
## LIVING_TOT7 -185978.04 44210.66 -4.207 3.08e-05 ***
## TILEDROOF1 -15853.32 11029.31 -1.437 0.151251
## D_CBD_Drive_Km -284.59 621.15 -0.458 0.647034
## Education -10931.42 12025.86 -0.909 0.363803
## Health 11591.53 17380.61 0.667 0.505136
## Retailoutlets -17178.95 12326.20 -1.394 0.164045
## Time_CBD_Drive_Min -3988.75 915.31 -4.358 1.60e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65210 on 488 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.6591, Adjusted R-squared: 0.6402
## F-statistic: 34.94 on 27 and 488 DF, p-value: < 2.2e-16
The comparison between villa/unit and detached housing is interesting. Looking at the intercept for each regression mode. The y value, or expected mean value is $300155 for villa/units however for detached housing it is $338209, this meaning that all other variables being zero, detached houses cost $38054 more than villas and units. This is statistically correct and not physically possible; it is to demonstrate that detached houses should cost more in this model.
The model for villas show some factors that are significant, these are: tiled roof (0.03468), 2 carports (0.02998), distance in kilometres (8.11e-06) and house area (1.87e-10). The overall model is significant at the 1% level, the p-value being < 2.2e-16. The R-squared for the model is 0.4212 and the adjusted R-squared is 0.3814 meaning that 38% of the variance in price is explained by the factors. Given all the information discussed it will be necessary to remove factors that are not substantially increasing the R-Squared value.
The factors that significantly affect detached houses are properties with 2, 3, 4, 5, or 6 bedrooms, distance from the city in time and distance, house area, 2 carports and 2 and 3 bathrooms. Age has the greatest significance compared to price where the p-value = 6.51e-06. The effectiveness of the model is good; the multiple R squared is 0.6591 and the adjusted R squared is 0.6402. 64% of the factors explain the variance in price. The small different between the multiple R Squared and the adjusted R Squared also demonstrates the effectiveness of the model. The overall model is significant at the 0.0001 level with a p-value at < 2.2e-16.
E1: This is your contribution. You will make sense of your preliminary work and estimate your best regression model. You need to diagnose the errors and validate your models’ performance on the test set of observations. It may take more than one chunk of code. Please remember to write a pargraph about each output.
#Start model
if(!require("car")) install.packages("car")
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:EnvStats':
##
## qqPlot
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(car)
if(!require("caret")) install.packages("caret")
library(caret)
#Create MLR for training
MyModel <- lm(SALE ~BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT + BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km + Education + Retailoutlets + Time_CBD_Drive_Min + Health , data = train_df)
#save predictions
Predictor <- MyModel %>% predict(test_df)
data.frame(
squares_1 = RMSE(Predictor, test_df$SALE),
Rsquared_1= R2(Predictor, test_df$SALE))
##analyse data
summary(MyModel)
##
## Call:
## lm(formula = SALE ~ BEDS + AREA_HSE + CARPORTT + AGE + BATHS_TOT +
## BRICKWALL + POOL + LAND_AREA + LIVING_TOT + TILEDROOF + D_CBD_Drive_Km +
## Education + Retailoutlets + Time_CBD_Drive_Min + Health,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -555380 -46530 -5714 38124 290493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 454109.10 94443.43 4.808 1.77e-06 ***
## BEDS1 2939.88 40947.88 0.072 0.942779
## BEDS2 -39591.84 32419.88 -1.221 0.222301
## BEDS3 -69671.14 31547.47 -2.208 0.027448 *
## BEDS4 -104268.29 32246.29 -3.233 0.001264 **
## BEDS5 -92102.72 39257.01 -2.346 0.019171 *
## BEDS6 11298.62 70687.47 0.160 0.873041
## AREA_HSE 1859.38 103.30 17.999 < 2e-16 ***
## CARPORTT1 -12052.22 8642.07 -1.395 0.163459
## CARPORTT2 -2894.21 7867.81 -0.368 0.713063
## AGE 1095.73 329.39 3.327 0.000913 ***
## BATHS_TOT1 45508.73 65048.75 0.700 0.484340
## BATHS_TOT2 85246.93 65882.61 1.294 0.196003
## BATHS_TOT3 104981.04 74015.44 1.418 0.156407
## BATHS_TOT4 24087.38 90289.13 0.267 0.789695
## BRICKWALL0 -131773.93 62141.57 -2.121 0.034215 *
## BRICKWALL1 -139072.98 59495.51 -2.338 0.019615 *
## POOL1 15180.14 11885.05 1.277 0.201823
## LAND_AREA 21.16 24.08 0.879 0.379691
## LIVING_TOT2 -23949.11 40419.54 -0.593 0.553646
## LIVING_TOT3 31900.54 41853.05 0.762 0.446125
## LIVING_TOT4 21514.74 42317.32 0.508 0.611279
## LIVING_TOT5 461.20 42898.61 0.011 0.991424
## LIVING_TOT6 -14955.44 43646.36 -0.343 0.731936
## LIVING_TOT7 -37470.95 46293.68 -0.809 0.418475
## LIVING_TOT8 -153533.73 101238.99 -1.517 0.129709
## TILEDROOF1 -34730.05 8114.98 -4.280 2.06e-05 ***
## D_CBD_Drive_Km 639.36 601.78 1.062 0.288298
## Education -2635.19 8824.54 -0.299 0.765294
## Retailoutlets 3916.64 7647.56 0.512 0.608669
## Time_CBD_Drive_Min -5818.00 852.61 -6.824 1.57e-11 ***
## Health 6455.16 12104.15 0.533 0.593948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82890 on 963 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.5175, Adjusted R-squared: 0.502
## F-statistic: 33.32 on 31 and 963 DF, p-value: < 2.2e-16
anova(MyModel)
##Check for multicollinearity
car::vif(MyModel)
## GVIF Df GVIF^(1/(2*Df))
## BEDS 11.758196 6 1.227991
## AREA_HSE 3.961633 1 1.990385
## CARPORTT 2.133863 2 1.208625
## AGE 2.746453 1 1.657243
## BATHS_TOT 4.845079 4 1.218043
## BRICKWALL 1.210889 2 1.049001
## POOL 1.231065 1 1.109534
## LAND_AREA 3.586592 1 1.893830
## LIVING_TOT 12.580137 7 1.198254
## TILEDROOF 1.280346 1 1.131524
## D_CBD_Drive_Km 16.491686 1 4.060996
## Education 6.417424 1 2.533263
## Retailoutlets 1.213998 1 1.101816
## Time_CBD_Drive_Min 16.637289 1 4.078883
## Health 6.118378 1 2.473536
##Show plot for residuals
plot(MyModel)
## Warning: not plotting observations with leverage one:
## 693
##Show table for predictions for the fitted values
predict(MyModel, test_df, type="response")
## 3896 4865 1884 2237 4899 1375 447 1479
## 417343.9 442811.0 371682.6 425635.2 333286.1 459286.0 418838.3 441128.8
## 4855 86 3292 1 612 3166 453 4139
## 493669.9 492599.9 510228.9 438013.9 475957.3 386599.3 427182.5 498425.7
## 2473 245 2030 3376 3958 2253 3197 4154
## 401576.9 610315.6 295992.6 479682.7 438245.9 426224.3 459421.7 412011.7
## 3504 4177 2696 218 3842 274 3741 1260
## 618256.9 425803.8 618274.9 483186.2 426443.6 414202.1 478126.8 417504.7
## 2684 267 3007 1430 1359 2401 3676 2543
## 481348.4 436381.4 434775.8 464816.7 458996.3 514751.1 446627.9 532651.0
## 2234 3348 2857 3941 4883 2345 1712 1284
## 372172.7 432269.8 396165.0 578994.7 472232.3 468877.9 368700.8 469995.4
## 2096 2027 2323 1109 165 581 1793 2904
## 481039.3 434902.6 626122.0 403174.3 460693.3 522345.6 273689.2 439788.8
## 1665 2595 1043 1897 2403 1366 4395 1197
## 392053.1 461184.7 444713.9 548649.6 517775.9 401200.4 266343.4 491785.4
## 2784 1454 2393 1729 3843 1107 305 3558
## 490841.1 427992.7 700884.6 438880.7 426507.1 496341.8 389228.0 293864.2
## 3123 3623 4166 76 847 2223 1195 3965
## 565494.5 389284.9 444204.5 478490.3 450400.1 448501.0 516249.3 425374.9
## 4694 4712 1911 2332 4271 4028 2367 695
## 586195.9 543805.4 369078.7 631636.6 NA 427881.3 533669.7 485775.2
## 4345 3990 2516 4793 645 4417 3492 674
## 499440.4 411083.4 641435.2 607407.0 566056.7 450591.0 365487.2 535784.3
## 4523 2896 3703 40 2739 3172 4535 1889
## 382227.2 430068.7 470894.5 529792.1 680080.7 439329.7 336896.1 470299.9
## 934 4349 4612 2424 598 2926 3484 704
## 606034.7 571957.8 429926.1 421508.1 527784.7 633812.0 542693.7 617961.2
## 3013 4927 1247 3840 1837 1795 3290 4750
## 469334.6 392588.2 381459.4 497524.0 462272.7 329418.0 529934.4 306791.6
## 1213 2716 4813 4741 4390 2321 2944 1796
## 555777.6 462367.3 527038.0 447758.0 331879.3 547917.6 397117.4 345143.7
## 1693 3047 2230 4701 503 4060 1422 4444
## 595141.4 455423.0 365502.8 476460.8 366144.5 424778.1 398672.5 348824.8
## 2180 4247 185 3448 4934 1307 4249 3995
## 413897.5 444387.3 488553.3 390999.1 338922.9 643430.4 446070.7 468453.9
## 3442 3187 1507 4155 1763 978 2371 4912
## 498900.9 521730.8 405985.1 379545.6 523170.6 577938.0 359649.4 542905.0
## 1676 1215 1797 1748 2845 1682 4683 535
## 460169.7 496026.4 272640.3 362266.0 593952.0 398796.6 497305.1 478699.7
## 3009 1919 3080 3136 4167 1394 97 2646
## 375822.9 397589.9 465419.0 542853.8 427062.3 348674.7 512559.0 358210.9
## 123 3875 4183 2737 3637 5026 2079 1844
## 572475.3 569034.0 422619.7 544718.9 353487.6 405499.4 447925.9 428853.1
## 3870 1546 4266 4507 1912 3158 4199 4790
## 631301.4 611545.7 NA 625497.1 372425.4 355194.4 393319.1 600030.2
## 676 2448 3237 1751 1908 539 4573 4707
## 466418.0 487895.2 492932.8 531785.8 336152.3 390215.4 501262.4 525427.2
## 1570 1832 4485 2336 4264 2966 3518 2745
## 370314.2 397780.8 457906.0 501455.0 402144.0 410924.4 445271.6 537537.4
Looking at the summary of this regression model. The significant factors are, tiled roof, time to CBD, brickwall, 3,4 or 5 beds, house area, 1 carport, age and brickwall or no brickwall. Testing for multicollinearity some of these factors, time to CBD, living area and beds. When looking at the variance inflation factor compared to the degrees of freedom we can see that living area and beds have a high VIF because of the degrees of freedom. Another model will be tested used with the significant factors, one distance function will be removed as this would be the independent variables that would be highly multicollinear. The overall model explains 50% of the variance explained in sale price and is significant overall with a p-value: < 2.2e-16.
MyModel2 <- lm(SALE ~ AREA_HSE + BEDS+ BATHS_TOT + AGE + D_CBD_Drive_Km + LAND_AREA + Office + Manufacturingindustry + Recreation + natural_elements + Education + walk_score + Residential, data = train_df)
#save predictions
Predictor1 <- MyModel2 %>% predict(test_df)
data.frame(
squares_2 = RMSE(Predictor, test_df$SALE),
Rsquared_2= R2(Predictor, test_df$SALE))
##analyse data
summary(MyModel2)
##
## Call:
## lm(formula = SALE ~ AREA_HSE + BEDS + BATHS_TOT + AGE + D_CBD_Drive_Km +
## LAND_AREA + Office + Manufacturingindustry + Recreation +
## natural_elements + Education + walk_score + Residential,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -558499 -47385 -2908 42582 300423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88994.88 59739.92 1.490 0.13662
## AREA_HSE 1754.82 87.75 19.998 < 2e-16 ***
## BEDS1 31190.39 33162.29 0.941 0.34717
## BEDS2 -25639.39 23360.55 -1.098 0.27267
## BEDS3 -39427.15 21753.90 -1.812 0.07023 .
## BEDS4 -71648.29 22797.14 -3.143 0.00172 **
## BEDS5 -53230.49 31442.80 -1.693 0.09079 .
## BEDS6 35558.29 63661.99 0.559 0.57660
## BATHS_TOT1 71805.79 60443.75 1.188 0.23513
## BATHS_TOT2 109973.38 61215.29 1.797 0.07272 .
## BATHS_TOT3 112492.83 68621.29 1.639 0.10147
## BATHS_TOT4 87684.82 85139.30 1.030 0.30331
## AGE 610.84 284.11 2.150 0.03180 *
## D_CBD_Drive_Km -2493.59 206.24 -12.090 < 2e-16 ***
## LAND_AREA 20.99 21.68 0.968 0.33305
## Office 43566.55 13639.04 3.194 0.00145 **
## Manufacturingindustry 54528.84 18215.62 2.994 0.00283 **
## Recreation 6929.42 1417.66 4.888 1.19e-06 ***
## natural_elements 887.21 487.41 1.820 0.06903 .
## Education -20453.29 4549.76 -4.495 7.77e-06 ***
## walk_score 1807.88 253.59 7.129 1.96e-12 ***
## Residential 35485.58 17689.15 2.006 0.04512 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80600 on 978 degrees of freedom
## Multiple R-squared: 0.5388, Adjusted R-squared: 0.5289
## F-statistic: 54.41 on 21 and 978 DF, p-value: < 2.2e-16
anova(MyModel2)
##Check for Multicollinearity
car::vif(MyModel2)
## GVIF Df GVIF^(1/(2*Df))
## AREA_HSE 3.037557 1 1.742859
## BEDS 4.851196 6 1.140654
## BATHS_TOT 3.212048 4 1.157038
## AGE 2.163649 1 1.470935
## D_CBD_Drive_Km 2.058866 1 1.434875
## LAND_AREA 3.091528 1 1.758274
## Office 1.369528 1 1.170268
## Manufacturingindustry 1.269335 1 1.126648
## Recreation 1.504809 1 1.226707
## natural_elements 1.301045 1 1.140634
## Education 1.811298 1 1.345845
## walk_score 2.660051 1 1.630966
## Residential 2.168411 1 1.472553
##Show plot for residuals
plot(MyModel2)
The second model has successfully explained a 2% increase in the variation of price against the other factors. In this model, CBD drive was removed. When the code is executed for VIF, there is no sign of high levels of multicollinearity. Although the intercept is quite low for model 2, this understandable as living areas have been removed from the model. It should be noted that ideally the predicted data from model 2 would be graphed over the test data, to analyse the accuracy of the predicted data.
F1: Write a conclusion to summarise the entirety of your work for this assignment.
William Alonso’s bid rent theory explains that businesses and individuals will compete for land close to the CBD and that the purchase comes with two items: the land and location. The assignment has demonstrated that while this might be true, there are many other factors that affect the price of a property. Distance to CBD has been a factor with high significance throughout tests but low correlation to sale directly. This is because there are many other factors affecting price.
As distance from CBD increases, the land available may increase and individuals and businesses alike may use this extra space to add value. William Alonso’s model also assumes that the CBD is the most ideal place to buy property. This may have been true in the early 1900s but as housing density and population have increased people have sought to buy further away from the city. Increased demand to live near natural elements or beaches would also cloud the correlation. A measure that may be effective is price per square metre. It could be taken a step further and a metric could be created, price per square metre per suburb.
Overall the theory has basic logic to it and generally speaking inner-city suburbs are generally more expensive but land area is not the only sacrifice. Further analysis could be completed on new, more relevant data, this would be an interesting comparison.
To create the web page use knit HTML. Zip the html and associated file of images etc. and upload to LMS.
tinytex::install_tinytex()
End of Assignment One