Introduction

In the realm of real estate, understanding the factors that influence property prices is of paramount importance to both buyers and sellers. The valuation of residential properties involves a complex interplay of various variables such as location, property type, number of rooms, and distance from essential urban amenities like the Central Business District (CBD). These factors collectively shape the property market and impact the decisions of stakeholders, including investors, homeowners, and policymakers.

This study explores and analyses a dataset of residential property sales. The dataset includes attributes such as suburb, property type, number of rooms, price, and distance from the CBD. Our objective is to gain insight into property market dynamics by examining these key variables and their relationships.

Problem Statement

The overarching problem driving this investigation is to understand the factors that influence residential property prices and their relationships within the real estate market. Specifically, we aim to answer questions such as:
What are the key drivers of property prices in the given dataset?
How do variables like property type, number of rooms, distance to the Central Business District (CBD), and suburb impact property prices?
Are there any outliers or unusual patterns in the data that require attention?

Problem Statement Cont.

Use of Statistics to Solve the Problem:
Statistics will play a pivotal role in this investigation by providing quantitative methods to explore, analyze, and interpret the dataset.
Here’s how statistics will be used to address the problem:
Descriptive Statistics: We will employ descriptive statistics to summarize and characterize the main features of the dataset. This includes measures like mean, median, mode, and standard deviation to understand the central tendency and dispersion of property prices, room counts, and distances.
Data Visualization: Statistics will be used to create visualizations such as histograms, scatterplots, and boxplots to represent the distribution and relationships between variables. Visualizations provide an intuitive way to identify patterns and outliers.
Hypothesis Testing: Statistical hypothesis tests may be conducted to determine whether certain variables (e.g., property type or suburb) have a significant impact on property prices. This helps answer questions like, “Is there a significant price difference between house and unit properties?”
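
As a minimal sketch of the descriptive measures listed above, the following R snippet computes the mean, median, mode, and standard deviation of a small made-up price vector. The values and the helper name stat_mode are illustrative assumptions, not part of the dataset or of base R (which has no built-in function for the statistical mode).

```r
# Toy price vector (made-up values, not taken from the dataset)
prices <- c(620000, 830000, 830000, 1220000, 2120000)

# Hypothetical helper: returns the most frequent value
# (the first one encountered if several values tie)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

price_summary <- c(
  mean   = mean(prices),
  median = median(prices),
  mode   = stat_mode(prices),
  sd     = sd(prices)
)
print(price_summary)
```

For a right-skewed variable such as price, the mean typically sits above the median, which is one reason to report both.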

Data

The data we are using, MELBOURNE_HOUSE_PRICES_LESS.csv, were collected from kaggle.com (https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data?select=MELBOURNE_HOUSE_PRICES_LESS.csv).
The dataset contains the pricing details of houses sold in the State of Victoria, Australia, between 2016 and 2018.
This dataset can be used to study house pricing, the factors affecting it, how the housing market changed over the years, and much more.
However, we will concentrate on the Price and Distance variables for our hypothesis/significance testing.

Data Cont.

Variables

Suburb: Suburb where the house is located.
Address: Address of the house.
Rooms: Number of rooms in the house.
Type: Type of the house, e.g., house, unit, or townhouse.
Price: Price of the house in AUD.
Method: Sale status of the property.
SellerG: Name of the selling agency/agent.
Date: Date on which the property was sold.
Postcode: Postcode of the area where the property is located.
Distance: Distance from the CBD in kilometres.
Regionname: General region (West, North West, North, North East, etc.).
Propertycount: Number of properties that exist in the suburb.
CouncilArea: Governing council for the area.

Factors

Suburb: Variable Suburb is a factor containing 380 levels (Unique names of all suburbs).
Type: Type is a factor containing 3 levels (“h” for house, “u” for unit and “t” for townhouse).
Method: Method is a factor with 9 levels.
SellerG: SellerG is a factor with 476 levels.
Postcode: Postcode is a factor with 225 levels.
Regionname: Regionname is a factor with 8 levels.
CouncilArea: CouncilArea is a factor with 34 levels.

Int/Num

Rooms: Rooms is of type integer with minimum value of 1 and maximum value of 31.
Price: Price is of type integer with minimum value of 85000 and maximum value of 11200000.
Propertycount: Propertycount is of type integer with minimum value of 39 and maximum value of 21650.
Distance: Distance is of type numeric with minimum value of 0.00 and maximum value of 64.10.
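
Summaries like the level counts and value ranges listed above can be obtained programmatically. The sketch below uses a small made-up data frame (the object toy and its values are assumptions for illustration) whose column names mirror three of the dataset's variables.

```r
# Toy data frame standing in for the housing data (values are made up)
toy <- data.frame(
  Type     = factor(c("h", "u", "t", "h")),
  Rooms    = c(3L, 2L, 3L, 4L),
  Distance = c(3.0, 7.5, 10.4, 14.0)
)

# Number of levels for every factor column
is_factor <- sapply(toy, is.factor)
level_counts <- sapply(toy[is_factor], nlevels)

# Minimum (row 1) and maximum (row 2) for every integer or numeric column
is_num <- sapply(toy, is.numeric)
value_ranges <- sapply(toy[is_num], range)

print(level_counts)
print(value_ranges)
```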

Data Preprocessing

Drop variables that do not contribute to or affect our hypothesis. Variables like Propertycount and CouncilArea won't affect the result in any way, so they can be dropped.
Remove rows that have Null/NaN/NA as the value of Distance or Price.
Convert variables into appropriate data types.
Remove duplicate entries if present.

Data Cont.

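# load the packages used below (dplyr for %>% and filter(), ggplot2 for the plots)
library(dplyr)
library(ggplot2)

# load the dataset (assumed here to be a local copy of MELBOURNE_HOUSE_PRICES_LESS.csv)
housing <- read.csv('house_prices.csv')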

# Dimensions of the dataset
dim(housing)

## [1] 63023    13

# variables in dataset
names(housing)

##  [1] "Suburb"        "Address"       "Rooms"         "Type"         
##  [5] "Price"         "Method"        "SellerG"       "Date"         
##  [9] "Postcode"      "Regionname"    "Propertycount" "Distance"     
## [13] "CouncilArea"

# structure of the dataset
str(housing)

## 'data.frame':    63023 obs. of  13 variables:
##  $ Suburb       : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
##  $ Address      : chr  "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
##  $ Rooms        : int  3 3 3 3 2 2 2 3 6 3 ...
##  $ Type         : chr  "h" "h" "h" "h" ...
##  $ Price        : int  1490000 1220000 1420000 1515000 670000 530000 540000 715000 NA 1925000 ...
##  $ Method       : chr  "S" "S" "S" "S" ...
##  $ SellerG      : chr  "Jellis" "Marshall" "Nelson" "Barry" ...
##  $ Date         : chr  "1/04/2017" "1/04/2017" "1/04/2017" "1/04/2017" ...
##  $ Postcode     : int  3067 3067 3067 3040 3042 3042 3042 3042 3021 3206 ...
##  $ Regionname   : chr  "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Western Metropolitan" ...
##  $ Propertycount: int  4019 4019 4019 1543 3464 3464 3464 3464 1899 3280 ...
##  $ Distance     : num  3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
##  $ CouncilArea  : chr  "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...

# lets convert variables into proper data types
categorical_columns <- c("Suburb", "Type", "Method", "SellerG", "Regionname", "CouncilArea", "Postcode")

housing[categorical_columns] <- lapply(housing[categorical_columns], as.factor)

# convert Date from day/month/year text (e.g. "1/04/2017") into Date;
# without an explicit format, as.Date() would read the leading "1" as the year
housing$Date <- as.Date(housing$Date, format = "%d/%m/%Y")

Data Cont.

# let's drop Propertycount and CouncilArea as they are not used in our hypothesis test
housing <- subset(housing, select=-c(Propertycount, CouncilArea))

# let's check for and drop duplicate observations
duplicate_entries <- housing %>% duplicated()
duplicate_rows <- housing[duplicate_entries, ]

housing <- housing[!duplicate_entries, ]

# let's remove rows that have NA in the Price or Distance columns
housing <- housing[!is.na(housing$Price) & !is.na(housing$Distance), ]

# after this simple preprocessing, let's check the structure of the data
str(housing)

## 'data.frame':    48432 obs. of  11 variables:
##  $ Suburb    : Factor w/ 380 levels "Abbotsford","Aberfeldie",..: 1 1 1 2 3 3 3 3 5 6 ...
##  $ Address   : chr  "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
##  $ Rooms     : int  3 3 3 3 2 2 2 3 3 3 ...
##  $ Type      : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 2 3 1 1 3 ...
##  $ Price     : int  1490000 1220000 1420000 1515000 670000 530000 540000 715000 1925000 515000 ...
##  $ Method    : Factor w/ 9 levels "PI","PN","S",..: 3 3 3 3 3 3 3 6 3 3 ...
##  $ SellerG   : Factor w/ 476 levels "@Realty","A",..: 217 274 309 29 309 217 29 309 77 116 ...
##  $ Date      : Date, format: "2017-04-01" "2017-04-01" ...
##  $ Postcode  : Factor w/ 225 levels "3000","3002",..: 55 55 55 31 33 33 33 33 175 13 ...
##  $ Regionname: Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 7 7 7 7 7 6 7 ...
##  $ Distance  : num  3 3 3 7.5 10.4 10.4 10.4 10.4 3 10.5 ...
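
The raw Date column stores day/month/year text such as "1/04/2017". When no format is given, as.Date() tries year-first layouts, so strings in this layout can be misread. The sketch below parses with an explicit format and adds a sanity check on the resulting years. The second date string is a made-up example, and the 2016 to 2018 window is taken from the dataset description.

```r
# Date strings in the day/month/year layout used by the raw file
# (the second value is made up for illustration)
raw_dates <- c("1/04/2017", "15/12/2016")

# An explicit format tells as.Date() which field is the day, month, and year
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
print(parsed)

# Sanity check: every parsed year should fall inside the study period
parsed_years <- as.integer(format(parsed, "%Y"))
stopifnot(all(parsed_years >= 2016 & parsed_years <= 2018))
```

A check like the final line is cheap to add after any date conversion and would flag a misparsed column immediately.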

Descriptive Statistics and Visualisation

The important variables in this dataset are Price, Distance, Suburb, Type and Rooms.
We removed NA values from the variables we will be using for hypothesis testing. In this section we will explore the data further.
We will check for outliers and handle them if needed, then examine the relationships between the important variables and plot graphs to visualize them.

Note: In the previous section we removed rows with NA values. The reason is that several factors affect Price and Distance. For example, Price depends on the distance of the property from the city (our hypothesis), and also on the type of property and the number of rooms, so replacing NA with the mean, median, or any other statistically derived value could introduce unwanted bias into the data. It is therefore better to remove those rows, at least for the scope of this assignment.
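
One concrete way in which imputation can distort the data is by understating variability. The sketch below uses simulated prices (the distribution, sample size, and missing rate are arbitrary assumptions) to compare dropping missing rows with replacing them by the observed mean.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated prices (made-up distribution), with 30% set to missing at random
price <- rlnorm(1000, meanlog = 13.6, sdlog = 0.5)
price_missing <- price
price_missing[sample(1000, 300)] <- NA

# Option 1: drop the missing rows (the approach taken in this report)
dropped <- price_missing[!is.na(price_missing)]

# Option 2: replace every missing value with the observed mean
imputed <- ifelse(is.na(price_missing), mean(dropped), price_missing)

# Mean imputation keeps the mean but shrinks the standard deviation,
# because the filled-in values add observations with zero deviation
c(sd_dropped = sd(dropped), sd_imputed = sd(imputed))
```

A smaller standard deviation would in turn make standard errors look smaller than they should be, which matters for the t-test and regression used later.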

# We have fixed NA in previous section. So lets check for outliers by plotting box plot
boxplot(housing$Price, main="Box Plot: Price")

# let's print the boxplot stats (do.out = FALSE leaves the $out list of outlying values empty)
boxplot.stats(housing$Price, do.out=FALSE)

## $stats
## [1]   85000  620000  830000 1220000 2120000
## 
## $n
## [1] 48432
## 
## $conf
## [1] 825692.3 834307.7
## 
## $out
## numeric(0)

From the box plot above, we can see that there are some outliers (extreme values) in the dataset: every sale above the upper whisker of 2,120,000 AUD is drawn as an individual point. However, since the data in this dataset come from a single source (domain.com.au) and represent actual sale prices, we keep these observations; we manually verified that such extreme values exist (https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc).
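
The whisker limits follow Tukey's rule, which boxplot.stats() applies with its default coefficient of 1.5. As a worked sketch, the fences can be recomputed from the hinges in the $stats output; the helper count_outliers is a hypothetical name introduced here for illustration.

```r
# Hinges as reported in the $stats output of boxplot.stats()
lower_hinge <- 620000
upper_hinge <- 1220000

# Tukey's rule: values more than 1.5 times the hinge spread beyond a hinge
# are flagged as outliers
hinge_spread <- upper_hinge - lower_hinge
lower_fence <- lower_hinge - 1.5 * hinge_spread
upper_fence <- upper_hinge + 1.5 * hinge_spread

c(lower_fence = lower_fence, upper_fence = upper_fence)

# Hypothetical helper that counts values outside the fences
count_outliers <- function(x) {
  h <- fivenum(x)  # h[2] and h[4] are the lower and upper hinges
  spread <- h[4] - h[2]
  sum(x < h[2] - 1.5 * spread | x > h[4] + 1.5 * spread)
}
count_outliers(c(1, 2, 3, 4, 100))
```

The upper fence works out to 2,120,000, matching the upper whisker in the output, while the lower fence is negative, so no sale can be flagged as a low outlier.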

Descriptive Statistics Cont.

# lets visualize the distribution of Prices in different Suburbs
ggplot(housing, aes(x=Suburb, y=Price)) +
  geom_boxplot() +
  labs(title="Price Distribution by Suburb", x="Suburb", y="House Price") +
  theme(axis.text.x = element_text(angle = 90))

# Scatter plot of house price by distance from city
plot(housing$Distance, housing$Price, xlab="Distance from city", ylab="Price")

# lets visualize relation between Type, rooms and price
ggplot(housing, aes(x=Rooms, y=Price, color=Type)) +
  geom_point() +
  labs(title="Price vs Rooms by Type", x="Number of Rooms", y="House Price") +
  theme_minimal()

Hypothesis Testing

We will perform a t-test and a significance test using linear regression.

The following hypothesis tests examine the relationship between the price of a property and its distance from the city.
Null Hypothesis: The distance of the property from the city has no effect on the price of the property (the mean price is the same for properties near and far from the city).
Alternative Hypothesis: The distance of the property from the city has a significant effect on the price of the property (the mean prices differ).

Assumptions: * If the distance is at most 10 km, the property is considered near the city; above 10 km it is considered far from the city.
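
To make the two-group comparison concrete, the sketch below computes the two-sample t statistic by hand under the unequal-variance (Welch) form, which is what t.test() uses when var.equal = FALSE, and checks it against the built-in function. The two groups are simulated with made-up parameters; they are not the real near and far samples.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated prices for two groups (made-up parameters, not the real data)
near <- rnorm(200, mean = 1100000, sd = 600000)
far  <- rnorm(300, mean = 900000,  sd = 400000)

# Welch statistic: difference in means divided by its estimated standard error
se <- sqrt(var(near) / length(near) + var(far) / length(far))
t_manual <- (mean(near) - mean(far)) / se

# The built-in test should give the same statistic
t_builtin <- unname(t.test(near, far, var.equal = FALSE)$statistic)

all.equal(t_manual, t_builtin)
```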

# let's perform the t-test first

# let's split the data frame in two based on our assumption: properties near the city and properties far from the city
close_to_city <- housing %>% filter(Distance <= 10)

far_from_city <- housing %>%  filter(Distance > 10)

# let's run Welch's two-sample t-test once, keeping the default 95% confidence level
t_test_result <- t.test(close_to_city$Price, far_from_city$Price, var.equal = FALSE)
t_statistic <- t_test_result$statistic
p_value <- t_test_result$p.value

cat("t statistic ", t_statistic)

## t statistic  43.63815

cat("p value ", p_value)

## p value  0

if (p_value < 0.05) {
  print("The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away.")
} else {
  print("The p value is not less than 0.05 so we fail to reject our Null hypothesis. We cannot conclude that there is a significant difference between the average house price in houses closer to the city center and the average house price in houses further away.")
}

## [1] "The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away."
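
A printed p-value of exactly 0 means the true value is too small to be represented as a double, not that it is literally zero. As a rough sketch, the magnitude can be recovered on the log scale. The degrees of freedom are not shown in the output, so the snippet uses the normal approximation to the t distribution, an assumption that is reasonable here only because both groups are large.

```r
# t statistic as printed above
t_stat <- 43.63815

# Two-sided p-value on the natural-log scale (normal approximation),
# then converted to a base-10 exponent
log_p   <- log(2) + pnorm(-abs(t_stat), log.p = TRUE)
log10_p <- log_p / log(10)
log10_p

# Values far below .Machine$double.xmin (about 2.2e-308) underflow to 0
log10_p < log10(.Machine$double.xmin)
```

Reporting such a result as "p < 2.2e-16", as summary() does for regression output, avoids the misleading impression of an exact zero.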

Hypothesis Testing Cont.

Let's use linear regression to cross-check our t-test results.

# lets build the linear regression model
model <- lm(Price ~ Distance, data = housing)

# lets print the summary of the model
summary(model)

## 
## Call:
## lm(formula = Price ~ Distance, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1104285  -346827  -129399   210197 10158173 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1251208.3     5105.9  245.05   <2e-16 ***
## Distance     -19941.1      345.5  -57.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 574100 on 48430 degrees of freedom
## Multiple R-squared:  0.06435,    Adjusted R-squared:  0.06433 
## F-statistic:  3331 on 1 and 48430 DF,  p-value: < 2.2e-16

# lets visualize the relationship between Distance and Price
ggplot(housing, aes(x=Distance, y=Price)) +
  geom_point() +
  geom_smooth(method=lm, col="red") +
  labs(title="Price vs Distance", x="Distance from City", y="Price")

The following can be interpreted from the output of lm():
* The coefficient for Distance is -19941.1, which indicates that each additional kilometre of Distance is associated with an average decrease of about 19,941 AUD in the price of the property.
* The p-value (< 2.2e-16) indicates that the relationship between distance and price is statistically significant.
* The R-squared value is 0.06435, i.e., 6.435%. This means our model explains only 6.435% of the variation in Price. Even though the R-squared value is low, the relationship is statistically significant; the likely reason for the low value is that Price is affected not only by distance but also by other variables.

Based on the above interpretations, we can conclude that as the distance of the property from the city increases, the price of the property tends to decrease.
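
As a worked example of reading the fitted line, the sketch below plugs the reported coefficients into the regression equation and checks that the printed F-statistic and R-squared agree with the slope's t value. The helper name fitted_price is hypothetical. The identities used (F equals the squared t value, and R-squared equals F / (F + residual df)) hold for a model with a single predictor.

```r
# Coefficients copied from the summary(model) output
intercept <- 1251208.3
slope     <- -19941.1

# Hypothetical helper: fitted mean price (AUD) at a given distance (km)
fitted_price <- function(distance_km) intercept + slope * distance_km

fitted_price(c(0, 10, 20))

# Consistency check for a one-predictor model
t_slope     <- -57.71
df_residual <- 48430
f_stat      <- t_slope^2
r_squared   <- f_stat / (f_stat + df_residual)

round(c(F = f_stat, R2 = r_squared), 4)
```

Small differences from the printed F-statistic (3331) and R-squared (0.06435) come from the t value being rounded to two decimals in the output.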

Discussion

The major finding of our investigation is that there is a negative relationship between distance from the city center and house price in Melbourne: on average, houses closer to the city center are more expensive than houses further away. This finding is supported by both the t-test and the regression significance test.

One of the strengths of this investigation is that it uses a large dataset of house prices in Melbourne (48,432 sales after cleaning). A dataset of this size allows us to draw general conclusions about the relationship between distance from the city center and house price in Melbourne.

One major limitation of this investigation is that it only considers the relationship between distance from the city center and house price. There are many other factors that can affect house price, such as the size of the house, the number of bedrooms and bathrooms, the quality of the house, the condition of the house, and the amenities in the neighborhood. This investigation does not account for any of these other factors.
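
One way a follow-up analysis might address this limitation is to add further predictors to the regression. The sketch below is illustrative only: it uses simulated data with made-up effect sizes (the column names mirror the dataset, but none of the numbers come from it) to compare a distance-only model with one that also adjusts for Rooms and Type.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Simulated data with made-up effects; column names mirror the dataset
n <- 500
sim <- data.frame(
  Distance = runif(n, 0, 40),
  Rooms    = sample(1:5, n, replace = TRUE),
  Type     = factor(sample(c("h", "t", "u"), n, replace = TRUE))
)
sim$Price <- 600000 - 15000 * sim$Distance + 200000 * sim$Rooms +
  ifelse(sim$Type == "u", -250000, 0) + rnorm(n, sd = 200000)

# Distance-only model versus a model that also adjusts for Rooms and Type
fit_simple <- lm(Price ~ Distance, data = sim)
fit_multi  <- lm(Price ~ Distance + Rooms + Type, data = sim)

# Share of variance explained by each model
c(simple = summary(fit_simple)$r.squared,
  multi  = summary(fit_multi)$r.squared)
```

In the adjusted model, the Distance coefficient is read as the change in price per kilometre when comparing properties with the same number of rooms and the same type.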

Discussion Cont.

One direction for future investigations would be to consider the relationship between distance from the city center and house price in other cities and markets. It would also be interesting to investigate how the relationship between distance and house price has changed over time.

Another direction for future investigations would be to consider the relationship between distance from the city center and other variables, such as crime rates, school quality, and access to public transportation. This would allow us to better understand the factors that drive house prices in different neighborhoods.

Our conclusion from this investigation is that distance from the city center is a statistically significant factor associated with house price in Melbourne. On average, houses closer to the city center are more expensive than houses further away. A plausible explanation is that houses closer to the city center are generally more desirable and convenient for buyers, although distance alone explains only about 6% of the variation in price.

References

[1] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 1 (https://rmit.instructure.com/courses/107035/pages/week-8-introduction?module_item_id=5261561)

[2] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 2 (https://rmit.instructure.com/courses/107035/pages/week-9-introduction?module_item_id=5261563)

[3] Applied Analytics, Week 2 (Module 2 in “Course Website”): Descriptive Statistics through Visualisation (https://rmit.instructure.com/courses/107035/pages/week-2-introduction?module_item_id=5261542)

[4] Significance Test for Linear Regression | R Tutorial. (n.d.). Www.r-Tutor.com. https://www.r-tutor.com/elementary-statistics/simple-linear-regression/significance-test-linear-regression

[5] DataFlair Team. (2017, June 30). Introduction to Hypothesis Testing in R - Learn every concept from Scratch! - DataFlair. DataFlair. https://data-flair.training/blogs/hypothesis-testing-in-r/

[6] boxplot.stats function - RDocumentation. (n.d.). Www.rdocumentation.org. https://www.rdocumentation.org/packages/grDevices/versions/3.6.2/topics/boxplot.stats

[7] Melbourne Housing Market. (n.d.). Www.kaggle.com. Retrieved October 8, 2023, from https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data?select=MELBOURNE_HOUSE_PRICES_LESS.csv

[8] Goyal, C. (2021, May 16). Why You Shouldn’t Just Delete Outliers. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/why-you-shouldnt-just-delete-outliers/

[9] Auction. (2023). Domain. https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc

[10] Mcleod, S. (2019). P-values and statistical significance. Simply Psychology. https://www.simplypsychology.org/p-value.html

[11] Quick Guide: Interpreting Simple Linear Model Output in R. (2015). Github.io. https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

[12] Frost, J. (2018). How To Interpret R-squared in Regression Analysis. Statistics by Jim; Jim Frost. https://statisticsbyjim.com/regression/interpret-r-squared-regression/

Location: The Key Factors Driving Property Prices

Understanding the Relationship Between Location and Property Prices
