Student(s) Names and Student ID’s Come Here
Last updated: 08 October, 2023
Variables
-Suburb: Suburb where the house is located.
-Address: Address of the house.
-Rooms: Number of rooms in the house.
-Type: Type of the property, e.g. townhouse, apartment, etc.
-Price: Price of the house in AUD.
-Method: Status of the property sale.
-SellerG: Name of the agency/agent selling the property.
-Date: Date on which the property was sold.
-Postcode: Postcode of the area where the property is located.
-Distance: Distance from the CBD in kilometres.
-Regionname: General region (West, North West, North, North East, etc.).
-Propertycount: Number of properties that exist in the suburb.
-CouncilArea: Governing council for the area.
Factors
-Suburb: Suburb is a factor with 380 levels (the unique names of all suburbs).
-Type: Type is a factor with 3 levels (“h” for house, “u” for unit and “t” for townhouse).
-Method: Method is a factor with 9 levels.
-SellerG: SellerG is a factor with 476 levels.
-Postcode: Postcode is a factor with 225 levels.
-Regionname: Regionname is a factor with 8 levels.
-CouncilArea: CouncilArea is a factor with 34 levels.
Int/Num
-Rooms: Rooms is of type integer with a minimum value of 1 and a maximum
value of 31.
-Price: Price is of type integer with a minimum value of 85000 and a maximum
value of 11200000.
-Propertycount: Propertycount is of type integer with a minimum value of
39 and a maximum value of 21650.
-Distance: Distance is of type numeric with a minimum value of 0.00 and a
maximum value of 64.10.
Data Preprocessing
-Drop variables that do not contribute to or affect our hypothesis.
Variables like Propertycount and CouncilArea will not affect the result in
any way, so they can be dropped.
-Remove rows which have Null/NaN/NA as the value of the variables Distance
or Price.
-Convert variables into appropriate data types.
-Remove duplicate entries if present.
## [1] 63023 13
## [1] "Suburb" "Address" "Rooms" "Type"
## [5] "Price" "Method" "SellerG" "Date"
## [9] "Postcode" "Regionname" "Propertycount" "Distance"
## [13] "CouncilArea"
## 'data.frame': 63023 obs. of 13 variables:
## $ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
## $ Address : chr "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
## $ Rooms : int 3 3 3 3 2 2 2 3 6 3 ...
## $ Type : chr "h" "h" "h" "h" ...
## $ Price : int 1490000 1220000 1420000 1515000 670000 530000 540000 715000 NA 1925000 ...
## $ Method : chr "S" "S" "S" "S" ...
## $ SellerG : chr "Jellis" "Marshall" "Nelson" "Barry" ...
## $ Date : chr "1/04/2017" "1/04/2017" "1/04/2017" "1/04/2017" ...
## $ Postcode : int 3067 3067 3067 3040 3042 3042 3042 3042 3021 3206 ...
## $ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Western Metropolitan" ...
## $ Propertycount: int 4019 4019 4019 1543 3464 3464 3464 3464 1899 3280 ...
## $ Distance : num 3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
## $ CouncilArea : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...
# Convert categorical variables into factors
categorical_columns <- c("Suburb", "Type", "Method", "SellerG", "Regionname", "CouncilArea", "Postcode")
housing[categorical_columns] <- lapply(housing[categorical_columns], as.factor)
# Convert Date from day/month/year strings (e.g. "1/04/2017") into Date objects;
# the explicit format stops as.Date() from reading the day as the year
housing$Date <- as.Date(housing$Date, format = "%d/%m/%Y")
# Drop Propertycount and CouncilArea, as they are not needed for our hypothesis
housing <- subset(housing, select = -c(Propertycount, CouncilArea))
# Check for and drop duplicate observations
duplicate_entries <- duplicated(housing)
duplicate_rows <- housing[duplicate_entries, ]
housing <- housing[!duplicate_entries, ]
# Remove rows which contain NA/NaN in the Price or Distance columns
housing <- housing[!is.na(housing$Price) & !is.na(housing$Distance), ]
# After this simple preprocessing, let's check the structure of the data
str(housing)
## 'data.frame': 48432 obs. of 11 variables:
## $ Suburb : Factor w/ 380 levels "Abbotsford","Aberfeldie",..: 1 1 1 2 3 3 3 3 5 6 ...
## $ Address : chr "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
## $ Rooms : int 3 3 3 3 2 2 2 3 3 3 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 2 3 1 1 3 ...
## $ Price : int 1490000 1220000 1420000 1515000 670000 530000 540000 715000 1925000 515000 ...
## $ Method : Factor w/ 9 levels "PI","PN","S",..: 3 3 3 3 3 3 3 6 3 3 ...
## $ SellerG : Factor w/ 476 levels "@Realty","A",..: 217 274 309 29 309 217 29 309 77 116 ...
## $ Date : Date, format: "1-04-20" "1-04-20" ...
## $ Postcode : Factor w/ 225 levels "3000","3002",..: 55 55 55 31 33 33 33 33 175 13 ...
## $ Regionname: Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 7 7 7 7 7 6 7 ...
## $ Distance : num 3 3 3 7.5 10.4 10.4 10.4 10.4 3 10.5 ...
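The raw Date values are day/month/year strings such as "1/04/2017". When no format is supplied, as.Date() tries year-first formats, so strings like these can be read as dates in year 1. A minimal sketch of explicit parsing follows; the vector `raw_dates` is an illustrative stand-in for the real column, not data taken from the file.

```r
# Illustrative stand-in for housing$Date (day/month/year strings)
raw_dates <- c("1/04/2017", "10/03/2018")

# Tell as.Date() the layout of the strings explicitly
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

print(parsed)             # dates in ISO year-month-day form
print(format(parsed, "%Y"))  # the year component of each date
```

Checking the range of the parsed years is a quick way to confirm that the conversion behaved as intended.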
The important variables in this dataset are Price, Distance,
Suburb, Type and Rooms.
We removed NA values from the variables we will be using for hypothesis
testing. In this section we will explore the data further.
We will check for outliers and fix them if needed, and then we will check
the relationships between the important variables and plot graphs to visualize
them.
Note: In the previous section we removed rows with NA values because multiple factors affect the columns Price and Distance. For example, Price depends on the distance of the property from the city (our hypothesis) and also on the type of the property and its number of rooms, so replacing NA with the mean, median or any other statistically derived value would introduce unwanted bias into the data. It is therefore better to remove those rows, at least within the scope of this assignment.
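The bias mentioned in the note can be illustrated with a small simulation. This is only a sketch on synthetic data: the variable names, the 30% missing rate and the simulated price model are all assumptions made for illustration and are not taken from the housing dataset.

```r
set.seed(1)

# Synthetic data: price falls with distance, plus noise (assumed model)
n <- 1000
sim <- data.frame(distance = runif(n, 0, 40))
sim$price <- 1200000 - 20000 * sim$distance + rnorm(n, sd = 100000)

# Knock out 30% of the prices at random
missing_rows <- sample(n, 300)
sim$price_na <- sim$price
sim$price_na[missing_rows] <- NA

# Option 1: drop incomplete rows (lm() does this by default)
fit_complete <- lm(price_na ~ distance, data = sim)

# Option 2: replace missing prices with the overall mean
sim$price_imputed <- sim$price_na
sim$price_imputed[missing_rows] <- mean(sim$price_na, na.rm = TRUE)
fit_imputed <- lm(price_imputed ~ distance, data = sim)

# Compare the estimated slopes for distance
slope_complete <- coef(fit_complete)[["distance"]]
slope_imputed <- coef(fit_imputed)[["distance"]]
print(c(complete = slope_complete, imputed = slope_imputed))
```

In this setup the mean-imputed rows carry no information about distance, so the slope from the imputed data is pulled towards zero compared with the complete-case slope.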
# NA values were handled in the previous section, so let's check for outliers with a box plot
boxplot(housing$Price, main="Box Plot: Price")
## $stats
## [1] 85000 620000 830000 1220000 2120000
##
## $n
## [1] 48432
##
## $conf
## [1] 825692.3 834307.7
##
## $out
## numeric(0)
From the box plot above, we can see that there are some outliers (extreme values) in the dataset. However, since the data was collected from a single source (domain.com.au) and represents actual sale prices, we can leave these values in the data: we manually verified that such extreme prices exist (https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc).
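As a rough check on where the box plot draws the line for outliers, the fences can be worked out from the five values printed under $stats, assuming the default rule of 1.5 times the interquartile range. The vector `toy_prices` below is a made-up example used only to show boxplot.stats() flagging an extreme value.

```r
# Five-number summary printed under $stats above
price_stats <- c(85000, 620000, 830000, 1220000, 2120000)

lower_hinge <- price_stats[2]
upper_hinge <- price_stats[4]
iqr <- upper_hinge - lower_hinge          # 1220000 - 620000 = 600000

# Default box plot fences: 1.5 * IQR beyond the hinges
upper_fence <- upper_hinge + 1.5 * iqr    # 2120000
lower_fence <- lower_hinge - 1.5 * iqr    # -280000 (below the minimum price)
print(c(lower = lower_fence, upper = upper_fence))

# Made-up vector to show how boxplot.stats() reports points beyond the fences
toy_prices <- c(1:10, 100)
toy_outliers <- boxplot.stats(toy_prices)$out
print(toy_outliers)
```

Under this rule, any price above AUD 2,120,000 would be drawn as an individual point on the box plot.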
# Let's visualize the distribution of prices in different suburbs
ggplot(housing, aes(x=Suburb, y=Price)) +
  geom_boxplot() +
  labs(title="Price Distribution by Suburb", x="Suburb", y="House Price") +
  theme(axis.text.x = element_text(angle = 90))
# Scatter plot of house price by distance from city
plot(housing$Distance, housing$Price, xlab="Distance from city", ylab="Price")
# Let's visualize the relation between Type, Rooms and Price
ggplot(housing, aes(x=Rooms, y=Price, color=Type)) +
  geom_point() +
  labs(title="Price vs Rooms by Type", x="Number of Rooms", y="House Price") +
  theme_minimal()
We will be performing a t-test and a significance test using linear regression.
The following hypothesis tests are conducted to test the
relationship between the price of a property and its distance
from the city.
Null Hypothesis: The distance of the property from the city has no
effect on the price of the property.
Alternative Hypothesis: The distance of the property from the city has a
significant effect on the price of the property.
Assumptions: -If the distance is 10 km or less, the property is considered to be near the city; if it is more than 10 km, the property is considered to be far from the city.
# Let's perform the t-test first
library(dplyr)  # provides %>% and filter()
# Split the data frame in two based on our assumption: properties near the city and properties far from it
close_to_city <- housing %>% filter(Distance <= 10)
far_from_city <- housing %>% filter(Distance > 10)
# Run Welch's t-test once, keeping the default confidence level (95%), and reuse the result
t_test_result <- t.test(close_to_city$Price, far_from_city$Price, var.equal = FALSE)
t_statistic <- t_test_result$statistic
p_value <- t_test_result$p.value
cat("t statistic ", t_statistic, "\n")
cat("p value ", p_value, "\n")
## t statistic 43.63815
## p value 0
if (p_value < 0.05) {
  print("The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away.")
} else {
  print("The p value is not less than 0.05 so we fail to reject our Null hypothesis. We do not have evidence of a significant difference between the average house price in houses closer to the city center and the average house price in houses further away.")
}
## [1] "The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away."
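With var.equal = FALSE, t.test() performs Welch's two-sample t-test. The sketch below computes the Welch statistic by hand and compares it with the value returned by t.test(). The two samples are simulated; their sizes, means and standard deviations are arbitrary assumptions and do not come from the housing data, and `welch_t` is a helper defined here for illustration.

```r
# Welch t statistic: difference in means over its unpooled standard error
welch_t <- function(x, y) {
  standard_error <- sqrt(var(x) / length(x) + var(y) / length(y))
  (mean(x) - mean(y)) / standard_error
}

set.seed(42)
# Simulated prices for two groups (assumed values, for illustration only)
near_prices <- rnorm(200, mean = 1100000, sd = 500000)
far_prices <- rnorm(300, mean = 900000, sd = 400000)

manual_t <- welch_t(near_prices, far_prices)
welch_result <- t.test(near_prices, far_prices, var.equal = FALSE)

# The two values should agree up to floating-point error
print(c(manual = manual_t, t_test = unname(welch_result$statistic)))
```

A positive statistic means the first group has the higher sample mean, which is why the order of the two arguments matters when interpreting the sign.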
Let's use linear regression to verify our t-test results.
# Build the linear regression model
model <- lm(Price ~ Distance, data = housing)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = Price ~ Distance, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1104285 -346827 -129399 210197 10158173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1251208.3 5105.9 245.05 <2e-16 ***
## Distance -19941.1 345.5 -57.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 574100 on 48430 degrees of freedom
## Multiple R-squared: 0.06435, Adjusted R-squared: 0.06433
## F-statistic: 3331 on 1 and 48430 DF, p-value: < 2.2e-16
# Let's visualize the relationship between Distance and Price
ggplot(housing, aes(x=Distance, y=Price)) +
  geom_point() +
  geom_smooth(method=lm, col="red") +
  labs(title="Price vs Distance", x="Distance from City", y="Price")
The following can be interpreted from the output of lm():
-The coefficient for Distance is -19941.1, which indicates that each
additional kilometre from the CBD is associated with a decrease of about
AUD 19,941 in the predicted price of a property.
-The p-value (< 2.2e-16) indicates that the relationship between
distance and price is statistically significant.
-The R-squared value is 0.06435, so the model explains only about 6.4% of
the variation in Price. Although the R-squared value is low, the
relationship is still statistically significant; the low value is probably
because Price is affected not only by Distance but also by other variables
such as Type and Rooms.
Based on the above interpretations, we can conclude that as the distance of a property from the city increases, its price tends to decrease.
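To make the coefficients concrete, the fitted line can be evaluated at a few distances. The sketch below hard-codes the intercept and slope from the summary output above; `predict_price` is a hypothetical helper written for illustration, and Distance is assumed to be in kilometres as described in the variable list.

```r
# Coefficients copied from the lm() summary above
intercept <- 1251208.3
slope <- -19941.1

# Hypothetical helper: predicted price (AUD) at a given distance (km)
predict_price <- function(distance_km) {
  intercept + slope * distance_km
}

predicted <- predict_price(c(0, 10, 20))
print(predicted)
# 1251208.3 1051797.3 852386.3

# Predicted price difference between properties 5 km and 15 km from the CBD
price_gap <- predict_price(5) - predict_price(15)
print(price_gap)
# 199411
```

Given the low R-squared value, these figures describe the average trend only; individual properties can sit far from the fitted line.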
[1] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 1 (https://rmit.instructure.com/courses/107035/pages/week-8-introduction?module_item_id=5261561)
[2] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 2 (https://rmit.instructure.com/courses/107035/pages/week-9-introduction?module_item_id=5261563)
[3] Applied Analytics, Week 2 ( Module 2 in “Course Website”): Descriptive Statistics through Visualisation (https://rmit.instructure.com/courses/107035/pages/week-2-introduction?module_item_id=5261542)
[4] Significance Test for Linear Regression | R Tutorial. (n.d.). www.r-tutor.com. https://www.r-tutor.com/elementary-statistics/simple-linear-regression/significance-test-linear-regression
[5] DataFlair Team. (2017, June 30). Introduction to Hypothesis Testing in R - Learn every concept from Scratch! - DataFlair. DataFlair. https://data-flair.training/blogs/hypothesis-testing-in-r/
[6] boxplot.stats function - RDocumentation. (n.d.). www.rdocumentation.org. https://www.rdocumentation.org/packages/grDevices/versions/3.6.2/topics/boxplot.stats
[7] Melbourne Housing Market. (n.d.). www.kaggle.com. Retrieved October 8, 2023, from https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data?select=MELBOURNE_HOUSE_PRICES_LESS.csv
[8] Goyal, C. (2021, May 16). Why You Shouldn’t Just Delete Outliers. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/why-you-shouldnt-just-delete-outliers/
[9] Auction. (2023). Domain. https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc
[10] Mcleod, S. (2019). P-values and statistical significance. Simply Psychology. https://www.simplypsychology.org/p-value.html
[11] Quick Guide: Interpreting Simple Linear Model Output in R. (2015). Github.io. https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
[12] Frost, J. (2018). How To Interpret R-squared in Regression Analysis. Statistics by Jim; Jim Frost. https://statisticsbyjim.com/regression/interpret-r-squared-regression/