Anirudh Mangala Puttaswamy (s3993305), Shubha Shrivastava (s3991957)
Last updated: 15 October, 2023
In the realm of real estate, understanding the factors that influence property prices is of paramount importance to both buyers and sellers. The valuation of residential properties involves a complex interplay of various variables such as location, property type, number of rooms, and distance from essential urban amenities like the Central Business District (CBD). These factors collectively shape the property market and impact the decisions of stakeholders, including investors, homeowners, and policymakers.
This study delves into the exploration and analysis of a comprehensive dataset containing information on residential properties. The dataset encompasses critical attributes like suburb, property type, number of rooms, price, and distance from the CBD, among others. Our objective is to gain valuable insights into the property market dynamics by examining these key variables and their relationships.
The overarching problem driving this investigation is to understand the factors that influence residential property prices and their relationships within the real estate market. Specifically, we aim to answer questions such as:
* What are the key drivers of property prices in the given dataset?
* How do variables like property type, number of rooms, distance to the Central Business District (CBD), and suburb impact property prices?
* Are there any outliers or unusual patterns in the data that require attention?
Use of Statistics to Solve the Problem:
Statistics will play a pivotal role in this investigation by providing
quantitative methods to explore, analyze, and interpret the
dataset.
Here’s how statistics will be used to address the problem:
* Descriptive Statistics: We will employ descriptive statistics to summarize and characterize the main features of the dataset. This includes measures like mean, median, mode, and standard deviation to understand the central tendency and dispersion of property prices, room counts, and distances.
* Data Visualization: We will create visualizations such as histograms, scatterplots, and boxplots to represent the distributions of, and relationships between, variables. Visualizations provide an intuitive way to identify patterns and outliers.
* Hypothesis Testing: Statistical hypothesis tests may be conducted to determine whether certain variables (e.g., property type or suburb) have a significant impact on property prices. This helps answer questions like, “Is there a significant price difference between house and unit properties?”
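As a minimal sketch of the descriptive statistics listed above, the snippet below computes them in base R on a handful of Rooms and Price values taken from the first rows of the dataset. The helper `stat_mode` is a hypothetical name introduced here, since base R has no built-in function for the statistical mode.

```r
# Illustrative values only: the first ten Rooms and Price entries of the raw data.
rooms  <- c(3, 3, 3, 3, 2, 2, 2, 3, 6, 3)
prices <- c(1490000, 1220000, 1420000, 1515000, 670000,
            530000, 540000, 715000, NA, 1925000)

# Hypothetical helper: returns the most frequent value.
# (Base R's mode() reports the storage type, not the statistical mode.)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# na.rm = TRUE skips the missing price instead of returning NA.
price_mean   <- mean(prices, na.rm = TRUE)
price_median <- median(prices, na.rm = TRUE)
price_sd     <- sd(prices, na.rm = TRUE)
rooms_mode   <- stat_mode(rooms)

cat("Mean price:   ", price_mean, "\n")
cat("Median price: ", price_median, "\n")
cat("SD of price:  ", price_sd, "\n")
cat("Mode of rooms:", rooms_mode, "\n")
```

A mean well above or below the median is a first hint that the distribution is skewed, which is worth checking before relying on mean-based tests.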
Variables
Factors
Int/Num
Data Preprocessing
## [1] 63023 13
## [1] "Suburb" "Address" "Rooms" "Type"
## [5] "Price" "Method" "SellerG" "Date"
## [9] "Postcode" "Regionname" "Propertycount" "Distance"
## [13] "CouncilArea"
## 'data.frame': 63023 obs. of 13 variables:
## $ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
## $ Address : chr "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
## $ Rooms : int 3 3 3 3 2 2 2 3 6 3 ...
## $ Type : chr "h" "h" "h" "h" ...
## $ Price : int 1490000 1220000 1420000 1515000 670000 530000 540000 715000 NA 1925000 ...
## $ Method : chr "S" "S" "S" "S" ...
## $ SellerG : chr "Jellis" "Marshall" "Nelson" "Barry" ...
## $ Date : chr "1/04/2017" "1/04/2017" "1/04/2017" "1/04/2017" ...
## $ Postcode : int 3067 3067 3067 3040 3042 3042 3042 3042 3021 3206 ...
## $ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Western Metropolitan" ...
## $ Propertycount: int 4019 4019 4019 1543 3464 3464 3464 3464 1899 3280 ...
## $ Distance : num 3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
## $ CouncilArea : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...
# Load dplyr for the pipe operator (%>%) used below.
library(dplyr)
# Drop Propertycount and CouncilArea, as they are not relevant to our hypothesis.
housing <- subset(housing, select = -c(Propertycount, CouncilArea))
# Check for and drop duplicate observations.
duplicate_entries <- housing %>% duplicated()
duplicate_rows <- housing[duplicate_entries, ]
housing <- housing[!duplicate_entries, ]
# Remove rows that contain NA/NaN in the Price or Distance columns.
housing <- housing[!is.na(housing$Price) & !is.na(housing$Distance), ]
# After this simple preprocessing, check the structure of the data.
str(housing)
## 'data.frame': 48432 obs. of 11 variables:
## $ Suburb : Factor w/ 380 levels "Abbotsford","Aberfeldie",..: 1 1 1 2 3 3 3 3 5 6 ...
## $ Address : chr "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
## $ Rooms : int 3 3 3 3 2 2 2 3 3 3 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 2 3 1 1 3 ...
## $ Price : int 1490000 1220000 1420000 1515000 670000 530000 540000 715000 1925000 515000 ...
## $ Method : Factor w/ 9 levels "PI","PN","S",..: 3 3 3 3 3 3 3 6 3 3 ...
## $ SellerG : Factor w/ 476 levels "@Realty","A",..: 217 274 309 29 309 217 29 309 77 116 ...
## $ Date : Date, format: "1-04-20" "1-04-20" ...
## $ Postcode : Factor w/ 225 levels "3000","3002",..: 55 55 55 31 33 33 33 33 175 13 ...
## $ Regionname: Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 7 7 7 7 7 6 7 ...
## $ Distance : num 3 3 3 7.5 10.4 10.4 10.4 10.4 3 10.5 ...
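The structure above shows several columns stored as factors and Date stored as a date, but the conversion step itself is not shown. The following is a sketch of one way to do it, using a hypothetical three-row data frame `toy` built from values in the raw data. The raw dates are day/month/year strings such as "1/04/2017", so an explicit `format` is assumed here; the Date values printed above ("1-04-20") do not resemble the raw strings, which may indicate that they were parsed without one.

```r
# Hypothetical three-row sample using values from the raw data.
toy <- data.frame(
  Suburb = c("Abbotsford", "Abbotsford", "Aberfeldie"),
  Type   = c("h", "h", "h"),
  Date   = c("1/04/2017", "1/04/2017", "1/04/2017"),
  stringsAsFactors = FALSE
)

# Categorical columns become factors.
toy$Suburb <- as.factor(toy$Suburb)
toy$Type   <- as.factor(toy$Type)

# Parse day/month/four-digit-year strings with an explicit format.
toy$Date <- as.Date(toy$Date, format = "%d/%m/%Y")

str(toy)
```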
The important variables in this dataset are Price, Distance, Suburb, Type and Rooms.
We removed NA values from the variables we will be using for hypothesis testing. In this section we will explore the data further.
We will check for outliers and handle them if needed, then examine the relationships between the important variables and plot graphs to visualize them.
Note: In the previous section we removed rows with NA values in Price or Distance rather than imputing them, because both columns depend on several factors. For example, Price depends on the distance of the property from the city (our hypothesis) as well as on the property type and number of rooms, so replacing NA with the mean, median, or another statistically derived value would introduce unwanted bias into the data. It is therefore better to remove those rows, at least for the scope of this assignment.
# NA values were handled in the previous section, so check for outliers with a box plot.
boxplot(housing$Price, main="Box Plot: Price")
## $stats
## [1] 85000 620000 830000 1220000 2120000
##
## $n
## [1] 48432
##
## $conf
## [1] 825692.3 834307.7
##
## $out
## numeric(0)
From the box plot above, we can see that there are some outliers (extreme values) in the dataset. However, since the data was collected from a single source (domain.com.au) and represents actual sale prices, we retain them: we manually verified that such extreme values exist (https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc).
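For reference, box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the hinges printed in `$stats` above; the variable names are illustrative.

```r
# Lower and upper hinges (approximately Q1 and Q3) from the $stats output above.
q1 <- 620000
q3 <- 1220000

iqr         <- q3 - q1           # interquartile range
lower_fence <- q1 - 1.5 * iqr    # values below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr    # values above this are drawn as outliers

cat("IQR:        ", iqr, "\n")
cat("Lower fence:", lower_fence, "\n")
cat("Upper fence:", upper_fence, "\n")
```

Under this rule, prices above 2,120,000 count as outliers, which matches the upper whisker in `$stats`. The lower fence is negative, so no price can be flagged as a low outlier.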
# Visualize the distribution of prices across suburbs.
ggplot(housing, aes(x=Suburb, y=Price)) +
  geom_boxplot() +
  labs(title="Price Distribution by Suburb", x="Suburb", y="House Price") +
  theme(axis.text.x = element_text(angle = 90))
# Scatter plot of house price by distance from the city.
plot(housing$Distance, housing$Price, xlab="Distance from city", ylab="Price")
# Visualize the relationship between Type, Rooms and Price.
ggplot(housing, aes(x=Rooms, y=Price, color=Type)) +
  geom_point() +
  labs(title="Price vs Rooms by Type", x="Number of Rooms", y="House Price") +
  theme_minimal()
We will perform a t-test and a significance test using linear regression.
The following hypothesis tests examine the relationship between the price of a property and its distance from the city.
Null Hypothesis: The distance of the property from the city has no effect on the price of the property.
Alternative Hypothesis: The distance of the property from the city has a significant effect on the price of the property.
Assumptions:
* A property with a Distance of 10 or less is considered to be near the city; a Distance above 10 is considered far from the city.
# Perform the t-test first.
# Split the data frame in two based on our assumption: properties near the city and properties far from it.
close_to_city <- housing %>% filter(Distance <= 10)
far_from_city <- housing %>% filter(Distance > 10)
# Run Welch's two-sample t-test once (default 95% confidence level) and reuse the result.
t_test_result <- t.test(close_to_city$Price, far_from_city$Price, var.equal = FALSE)
t_statistic <- t_test_result$statistic
p_value <- t_test_result$p.value
cat("t statistic ", t_statistic)
## t statistic 43.63815
cat("p value ", p_value)
## p value 0
if (p_value < 0.05) {
  print("The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away.")
} else {
  print("The p value is greater than 0.05 so we fail to reject our Null hypothesis. We cannot conclude that there is a significant difference between the average house price in houses closer to the city center and the average house price in houses further away.")
}
## [1] "The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away."
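To make the mechanics of the test concrete, here is a small self-contained sketch of Welch's two-sample t-test on made-up prices (the vectors `near` and `far` are hypothetical, not taken from the dataset). It also shows that `t.test()` returns the statistic, the p-value, and the confidence interval in a single object.

```r
# Hypothetical prices for illustration only.
near <- c(900000, 950000, 1000000, 1050000, 1100000)
far  <- c(600000, 650000, 700000, 750000, 800000)

# Welch's t-test (var.equal = FALSE) does not assume equal group variances.
result <- t.test(near, far, var.equal = FALSE)

t_stat <- unname(result$statistic)   # test statistic
p_val  <- result$p.value             # two-sided p-value
ci     <- result$conf.int            # 95% CI for the difference in means

cat("t =", t_stat, " p =", format.pval(p_val), "\n")
cat("95% CI for the mean difference:", ci[1], "to", ci[2], "\n")
```

A p-value printed as 0, as in the output above, generally means the value is too small to be represented or displayed rather than exactly zero; `format.pval()` reports such values as being below a threshold instead.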
Let's use linear regression to verify our t-test results.
# Build the linear regression model.
model <- lm(Price ~ Distance, data = housing)
# Print the summary of the model.
summary(model)
##
## Call:
## lm(formula = Price ~ Distance, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1104285 -346827 -129399 210197 10158173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1251208.3 5105.9 245.05 <2e-16 ***
## Distance -19941.1 345.5 -57.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 574100 on 48430 degrees of freedom
## Multiple R-squared: 0.06435, Adjusted R-squared: 0.06433
## F-statistic: 3331 on 1 and 48430 DF, p-value: < 2.2e-16
# Visualize the relationship between Distance and Price.
ggplot(housing, aes(x=Distance, y=Price)) +
  geom_point() +
  geom_smooth(method=lm, col="red") +
  labs(title="Price vs Distance", x="Distance from City", y="Price")
The following can be interpreted from the output of lm():
* The coefficient for Distance is -19941.1, which indicates that each 1-unit increase in Distance is associated with an average decrease of 19941.1 AUD in the price of the property.
* The p-value (< 2.2e-16) indicates that the relationship between distance and price is statistically significant.
* The R-squared value is 0.06435, i.e., 6.435%. This means our model explains only 6.435% of the variation in price. Even though the R-squared value is low, the relationship is statistically significant; the low R-squared is likely because Price is affected not only by distance but also by other variables.
Based on the above interpretations, we can conclude that as the distance of the property from the city increases, the price of the property decreases.
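As a quick sanity check on these interpretations, the sketch below plugs the reported coefficients into the fitted line and verifies two identities that hold for simple linear regression: the F-statistic equals the squared t value of the slope, and R-squared equals F / (F + residual degrees of freedom). All numbers are copied from the summary output above; `predict_price` is an illustrative helper name.

```r
# Values copied from the summary(model) output.
intercept <- 1251208.3
slope     <- -19941.1
t_slope   <- -57.71
f_stat    <- 3331
df_resid  <- 48430

# Fitted line: expected price at a given distance from the city.
predict_price <- function(distance) intercept + slope * distance
cat("Predicted price at Distance = 10:", predict_price(10), "\n")

# With a single predictor, F is the square of the slope's t value.
f_from_t <- t_slope^2
cat("t^2 =", f_from_t, "vs reported F =", f_stat, "\n")

# R-squared can be recovered from F and the residual degrees of freedom.
r_squared <- f_stat / (f_stat + df_resid)
cat("Recomputed R-squared:", round(r_squared, 5), "\n")
```

The fitted line predicts roughly 1.05 million AUD at a Distance of 10, and the recomputed values agree with the reported F-statistic and R-squared up to rounding.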
The major finding of our investigation is that there is a negative relationship between distance from the city center and house price in Melbourne: houses closer to the city center tend to be more expensive than houses further away. This finding is supported by both the t-test and the regression significance test.
One of the strengths of this investigation is that it uses a large and representative dataset of house prices in Melbourne. This dataset allows us to draw general conclusions about the relationship between distance from the city center and house price in Melbourne.
One major limitation of this investigation is that it only considers the relationship between distance from the city center and house price. There are many other factors that can affect house price, such as the size of the house, the number of bedrooms and bathrooms, the quality of the house, the condition of the house, and the amenities in the neighborhood. This investigation does not account for any of these other factors.
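One way to start addressing this limitation with variables already in the dataset would be to add Rooms and Type as predictors alongside Distance. The sketch below only illustrates the syntax on ten rows copied from the structure output earlier in this report (the data frame name `toy` is hypothetical). With so few rows the estimates are not meaningful, and this model was not fitted as part of this investigation.

```r
# Ten rows copied from the str() output of the cleaned data, for illustration only.
toy <- data.frame(
  Price    = c(1490000, 1220000, 1420000, 1515000, 670000,
               530000, 540000, 715000, 1925000, 515000),
  Distance = c(3, 3, 3, 7.5, 10.4, 10.4, 10.4, 10.4, 3, 10.5),
  Rooms    = c(3, 3, 3, 3, 2, 2, 2, 3, 3, 3),
  Type     = factor(c("h", "h", "h", "h", "h", "t", "u", "h", "h", "u"))
)

# Multiple regression: Distance plus two further predictors.
# Type is a factor, so lm() encodes it as dummy variables (baseline "h").
fit <- lm(Price ~ Distance + Rooms + Type, data = toy)

print(coef(fit))
```

On the full dataset, comparing the adjusted R-squared of such a model with that of the distance-only model would indicate how much additional variation in price the extra variables explain.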
One direction for future investigations would be to consider the relationship between distance from the city center and house price in other cities and markets. It would also be interesting to investigate how the relationship between distance and house price has changed over time.
Another direction for future investigations would be to consider the relationship between distance from the city center and other variables, such as crime rates, school quality, and access to public transportation. This would allow us to better understand the factors that drive house prices in different neighborhoods.
Our conclusion from this investigation is that distance from the city center is a significant factor affecting house price in Melbourne. Houses closer to the city center tend to be more expensive than houses further away. A plausible explanation is that houses closer to the city center are generally more desirable and convenient for buyers, although this investigation did not test that directly.
[1] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 1 (https://rmit.instructure.com/courses/107035/pages/week-8-introduction?module_item_id=5261561)
[2] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 2 (https://rmit.instructure.com/courses/107035/pages/week-9-introduction?module_item_id=5261563)
[3] Applied Analytics, Week 2 (Module 2 in “Course Website”): Descriptive Statistics through Visualisation (https://rmit.instructure.com/courses/107035/pages/week-2-introduction?module_item_id=5261542)
[4] Significance Test for Linear Regression | R Tutorial. (n.d.). Www.r-Tutor.com. https://www.r-tutor.com/elementary-statistics/simple-linear-regression/significance-test-linear-regression
[5] DataFlair Team. (2017, June 30). Introduction to Hypothesis Testing in R - Learn every concept from Scratch! - DataFlair. DataFlair. https://data-flair.training/blogs/hypothesis-testing-in-r/
[6] boxplot.stats function - RDocumentation. (n.d.). Www.rdocumentation.org. https://www.rdocumentation.org/packages/grDevices/versions/3.6.2/topics/boxplot.stats
[7] Melbourne Housing Market. (n.d.). Www.kaggle.com. Retrieved October 8, 2023, from https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data?select=MELBOURNE_HOUSE_PRICES_LESS.csv
[8] Goyal, C. (2021, May 16). Why You Shouldn’t Just Delete Outliers. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/why-you-shouldnt-just-delete-outliers/
[9] Auction. (2023). Domain. https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc
[10] Mcleod, S. (2019). P-values and statistical significance. Simply Psychology. https://www.simplypsychology.org/p-value.html
[11] Quick Guide: Interpreting Simple Linear Model Output in R. (2015). Github.io. https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
[12] Frost, J. (2018). How To Interpret R-squared in Regression Analysis. Statistics by Jim; Jim Frost. https://statisticsbyjim.com/regression/interpret-r-squared-regression/