Your Title Comes Here

Subtitle Comes Here

Student(s) Names and Student ID’s Come Here

Last updated: 08 October, 2023

Introduction

The following link will help you with creating R Markdown Slidy Presentations
http://rmarkdown.rstudio.com/slidy_presentation_format.html
Don’t forget about Bootcamp 4
A good introduction provides a brief background to the problem, defines important terms, and leads to a strong rationale.

Introduction Cont.

Keep everything short and straight to the point.
Use bullet points to help minimise text.
Add relevant images to make your presentation more appealing
Remember, you have a maximum of 20 slides to fit everything in.
Ensure each slide fits on one screen. The reader should not have to scroll down.

Problem Statement

State the overall problem/question driving the investigation
Summarise how you will use statistics to solve the problem or answer your question.

Data

The data we are using, MELBOURNE_HOUSE_PRICES_LESS.csv, was collected from kaggle.com (https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data?select=MELBOURNE_HOUSE_PRICES_LESS.csv).
The dataset contains pricing details of houses sold in the State of Victoria, Australia, between 2016 and 2018.
It can be used to study house pricing, the factors that affect it, how the housing market changed over the years, and much more.
However, we will concentrate on the Price and Distance variables for our hypothesis/significance testing.

Data Cont.

Variables

-Suburb: Suburb where the house is located.
-Address: Address of the house.
-Rooms: Number of rooms in the house.
-Type: Type of the house, e.g. townhouse, apartment, etc.
-Price: Price of the house in AUD.
-Method: Sale status of the property.
-SellerG: Name of the property selling agency/agent.
-Date: Date on which the property was sold.
-Postcode: Postcode of the area where the property is located.
-Distance: Distance from CBD in Kilometres.
-Regionname: General Region (West, North West, North, North east …etc).
-Propertycount: Number of properties that exist in the suburb.
-CouncilArea: Governing council for the area.

Factors

-Suburb: Suburb is a factor with 380 levels (unique names of all suburbs).
-Type: Type is a factor with 3 levels (“h” for house, “u” for unit and “t” for townhouse).
-Method: Method is a factor with 9 levels.
-SellerG: SellerG is a factor with 476 levels.
-Postcode: Postcode is a factor with 225 levels.
-Regionname: Regionname is a factor with 8 levels.
-CouncilArea: CouncilArea is a factor with 34 levels.

Int/Num

-Rooms: Rooms is of type integer with minimum value of 1 and maximum value of 31.
-Price: Price is of type integer with minimum value of 85000 and maximum value of 11200000.
-Propertycount: Propertycount is of type integer with minimum value of 39 and maximum value of 21650.
-Distance: Distance is of type numeric with minimum value of 0.00 and maximum value of 64.10.

Data Preprocessing

-Drop variables that do not contribute to or affect our hypothesis. Variables such as Propertycount and CouncilArea won't affect the result in any way, so they can be dropped.
-Remove rows that have Null/NaN/NA as the value of Distance or Price.
-Convert variables into appropriate data types.
-Remove duplicate entries if present.

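# load the packages used below (dplyr for %>% and filter(), ggplot2 for the plots)
library(dplyr)
library(ggplot2)

# load the dataset
housing <- read.csv('house_prices.csv')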

# Dimensions of the dataset
dim(housing)

## [1] 63023    13

# variables in dataset
names(housing)

##  [1] "Suburb"        "Address"       "Rooms"         "Type"         
##  [5] "Price"         "Method"        "SellerG"       "Date"         
##  [9] "Postcode"      "Regionname"    "Propertycount" "Distance"     
## [13] "CouncilArea"

# structure of the dataset
str(housing)

## 'data.frame':    63023 obs. of  13 variables:
##  $ Suburb       : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
##  $ Address      : chr  "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
##  $ Rooms        : int  3 3 3 3 2 2 2 3 6 3 ...
##  $ Type         : chr  "h" "h" "h" "h" ...
##  $ Price        : int  1490000 1220000 1420000 1515000 670000 530000 540000 715000 NA 1925000 ...
##  $ Method       : chr  "S" "S" "S" "S" ...
##  $ SellerG      : chr  "Jellis" "Marshall" "Nelson" "Barry" ...
##  $ Date         : chr  "1/04/2017" "1/04/2017" "1/04/2017" "1/04/2017" ...
##  $ Postcode     : int  3067 3067 3067 3040 3042 3042 3042 3042 3021 3206 ...
##  $ Regionname   : chr  "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Western Metropolitan" ...
##  $ Propertycount: int  4019 4019 4019 1543 3464 3464 3464 3464 1899 3280 ...
##  $ Distance     : num  3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
##  $ CouncilArea  : chr  "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...

# let's convert variables into proper data types
categorical_columns <- c("Suburb", "Type", "Method", "SellerG", "Regionname", "CouncilArea", "Postcode")

housing[categorical_columns] <- lapply(housing[categorical_columns], as.factor)

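# convert Date into Date class
# the dates are stored as day/month/year (e.g. "1/04/2017"), so the format is given explicitly;
# a bare as.Date() call misreads them (seen as "1-04-20" in str() output)
housing$Date <- as.Date(housing$Date, format = "%d/%m/%Y")

# let's drop Propertycount and CouncilArea as they don't affect our hypothesis in any way.
housing <- subset(housing, select=-c(Propertycount, CouncilArea))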

# let's check for and drop duplicate observations
duplicate_entries <- housing %>% duplicated()
duplicate_rows <- housing[duplicate_entries, ]

housing <- housing[!duplicate_entries, ]

# let's remove rows that contain NA/NaN in the Price or Distance columns
housing <- housing[!is.na(housing$Price) & !is.na(housing$Distance), ]

# after performing simple preprocessing, let's check the structure of the data now
str(housing)

## 'data.frame':    48432 obs. of  11 variables:
##  $ Suburb    : Factor w/ 380 levels "Abbotsford","Aberfeldie",..: 1 1 1 2 3 3 3 3 5 6 ...
##  $ Address   : chr  "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
##  $ Rooms     : int  3 3 3 3 2 2 2 3 3 3 ...
##  $ Type      : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 2 3 1 1 3 ...
##  $ Price     : int  1490000 1220000 1420000 1515000 670000 530000 540000 715000 1925000 515000 ...
##  $ Method    : Factor w/ 9 levels "PI","PN","S",..: 3 3 3 3 3 3 3 6 3 3 ...
##  $ SellerG   : Factor w/ 476 levels "@Realty","A",..: 217 274 309 29 309 217 29 309 77 116 ...
##  $ Date      : Date, format: "1-04-20" "1-04-20" ...
##  $ Postcode  : Factor w/ 225 levels "3000","3002",..: 55 55 55 31 33 33 33 33 175 13 ...
##  $ Regionname: Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 7 7 7 7 7 6 7 ...
##  $ Distance  : num  3 3 3 7.5 10.4 10.4 10.4 10.4 3 10.5 ...
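
The Date line in the str() output above reads "1-04-20", which suggests that day/month/year strings such as "1/04/2017" lose information when no format is supplied. The sketch below shows how an explicit format could be checked on a few sample values; the values in `raw_dates` are hypothetical and only mimic the layout of the Date column.

```r
# Hypothetical sample values in the same day/month/year layout as the Date column
raw_dates <- c("1/04/2017", "13/05/2017", "24/02/2018")

# An explicit format makes as.Date() read day, month and year in that order
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
print(parsed)

# The range of parsed years is a quick sanity check after conversion
parsed_years <- as.integer(format(parsed, "%Y"))
print(range(parsed_years))
```

If the parsed years fall outside the expected 2016 to 2018 window, the format string probably does not match the raw data.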

Descriptive Statistics and Visualisation

The important variables in this dataset are Price, Distance, Suburb, Type and Rooms.
We removed NA values from the variables we will be using for hypothesis testing. In this section we will explore the data further.
We will check for outliers and handle them if needed, then examine the relationships between the important variables and plot graphs to visualise them.

Note: In the previous section we removed rows with NA values because multiple factors affect Price and Distance. For example, Price depends on the distance of the property from the city (our hypothesis) as well as on the type of property and the number of rooms, so replacing NA with the mean, median or another statistically derived value would introduce unwanted bias into the data. It is therefore better to remove those rows, at least for the scope of this assignment.
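
Before dropping rows, it can help to see how many values are missing in each column. The sketch below uses a small hypothetical data frame called `toy` in place of `housing`; the values are made up for illustration only.

```r
# Hypothetical toy data frame standing in for `housing`
toy <- data.frame(
  Price    = c(1490000, NA, 670000, NA, 540000),
  Distance = c(3, 3, 10.4, NA, 10.4),
  Rooms    = c(3L, 3L, 2L, 2L, 2L)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows where both Price and Distance are present
complete <- toy[complete.cases(toy[, c("Price", "Distance")]), ]
print(nrow(complete))
```

Comparing `nrow(complete)` with `nrow(toy)` shows how much data the removal costs, which is worth reporting alongside the decision not to impute.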

# We handled NA values in the previous section, so let's check for outliers with a box plot
boxplot(housing$Price, main="Box Plot: Price")

# let's print the boxplot stats (do.out=FALSE leaves $out empty)
boxplot.stats(housing$Price, do.out=FALSE)

## $stats
## [1]   85000  620000  830000 1220000 2120000
## 
## $n
## [1] 48432
## 
## $conf
## [1] 825692.3 834307.7
## 
## $out
## numeric(0)

From the above boxplot, we can see that there are some outliers (extreme values) in the dataset. However, since the data was collected from a single source (domain.com.au) and represents actual sale values, we keep them: we manually verified that such extreme values exist (https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc).
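
A box plot marks a point as an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below applies that rule to a hypothetical `prices` vector, once through boxplot.stats() with `do.out` left at its default so that `$out` is filled in, and once by hand.

```r
# Hypothetical prices: most are typical, one is extreme
prices <- c(520000, 610000, 640000, 700000, 830000,
            910000, 1220000, 1300000, 9000000)

# boxplot.stats() flags points beyond coef (default 1.5) times the hinge spread
bp <- boxplot.stats(prices)
print(bp$out)

# The same rule written out by hand, using the hinges from fivenum()
hinges <- fivenum(prices)[c(2, 4)]
spread <- diff(hinges)
lower  <- hinges[1] - 1.5 * spread
upper  <- hinges[2] + 1.5 * spread
manual_out <- prices[prices < lower | prices > upper]
print(manual_out)

# Share of observations flagged as outliers
print(length(manual_out) / length(prices))
```

Counting the flagged values this way gives a number to quote when arguing that the extreme prices are genuine and should be kept.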

Descriptive Statistics Cont.

# let's visualise the distribution of prices in different suburbs
ggplot(housing, aes(x=Suburb, y=Price)) +
  geom_boxplot() +
  labs(title="Price Distribution by Suburb", x="Suburb", y="House Price") +
  theme(axis.text.x = element_text(angle = 90))

# Scatter plot of house price by distance from city
plot(housing$Distance, housing$Price, xlab="Distance from city", ylab="Price")

# let's visualise the relationship between Type, Rooms and Price
ggplot(housing, aes(x=Rooms, y=Price, color=Type)) +
  geom_point() +
  labs(title="Price vs Rooms by Type", x="Number of Rooms", y="House Price") +
  theme_minimal()

Hypothesis Testing

We will perform a two-sample t-test and a significance test using linear regression.

The following hypothesis tests examine the relationship between the price of a property and its distance from the city.
Null Hypothesis: The distance of the property from the city has no effect on the price of the property.
Alternative Hypothesis: The distance of the property from the city has a significant effect on the price of the property.

Assumptions: -A property with a Distance of 10 km or less is considered near the city; a Distance above 10 km is considered far from the city.

# let's perform the t-test first

# split the data frame in two based on our assumption: properties near the city and properties far from it
close_to_city <- housing %>% filter(Distance <= 10)

far_from_city <- housing %>% filter(Distance > 10)

# run Welch's t-test once (var.equal = FALSE), keeping the default 95% confidence level, and reuse the result
t_result <- t.test(close_to_city$Price, far_from_city$Price, var.equal = FALSE)
t_statistic <- t_result$statistic
p_value <- t_result$p.value

cat("t statistic ", t_statistic)

## t statistic  43.63815

cat("p value ", p_value)

## p value  0

if (p_value < 0.05) {
  print("The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away.")
} else {
  print("The p value is not less than 0.05 so we fail to reject our Null Hypothesis. We do not have evidence of a significant difference between the average house price in houses closer to the city center and the average house price in houses further away.")
}

## [1] "The p value is less than 0.05 so we can reject our Null hypothesis. We can conclude that average house price in houses closer to the city center is significantly higher than the average house price in houses further away."
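
With var.equal = FALSE, t.test() runs Welch's test: the t statistic is the difference in group means divided by the square root of the summed per-group variances of the mean, and the degrees of freedom come from the Welch-Satterthwaite formula. The sketch below reproduces both by hand on simulated data; the vectors `near` and `far` are hypothetical stand-ins for the two price groups. It also shows format.pval(), which is one way to avoid printing a very small p-value as a bare 0.

```r
set.seed(1)  # reproducible hypothetical data
near <- rnorm(200, mean = 1200000, sd = 600000)
far  <- rnorm(300, mean =  900000, sd = 400000)

# Variance of each group mean
v_near <- var(near) / length(near)
v_far  <- var(far) / length(far)

# Welch's t statistic: difference in means over its standard error
t_manual <- (mean(near) - mean(far)) / sqrt(v_near + v_far)

# Welch-Satterthwaite degrees of freedom
df_manual <- (v_near + v_far)^2 /
  (v_near^2 / (length(near) - 1) + v_far^2 / (length(far) - 1))

res <- t.test(near, far, var.equal = FALSE)
print(c(manual = t_manual, t.test = unname(res$statistic)))
print(c(manual = df_manual, t.test = unname(res$parameter)))

# format.pval() reports p-values below eps as "<eps" instead of 0
print(format.pval(res$p.value, eps = 1e-16))
```

A p-value printed as 0 most likely means the true value is too small to be represented, so reporting it as "less than" a threshold is the safer wording.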

Hypothesis Testing Cont.

Let's use linear regression to verify our t-test results.

# let's build the linear regression model
model <- lm(Price ~ Distance, data = housing)

# let's print the summary of the model
summary(model)

## 
## Call:
## lm(formula = Price ~ Distance, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1104285  -346827  -129399   210197 10158173 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1251208.3     5105.9  245.05   <2e-16 ***
## Distance     -19941.1      345.5  -57.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 574100 on 48430 degrees of freedom
## Multiple R-squared:  0.06435,    Adjusted R-squared:  0.06433 
## F-statistic:  3331 on 1 and 48430 DF,  p-value: < 2.2e-16

# let's visualise the relationship between Distance and Price
ggplot(housing, aes(x=Distance, y=Price)) +
  geom_point() +
  geom_smooth(method=lm, col="red") +
  labs(title="Price vs Distance", x="Distance from City", y="Price")

The following can be interpreted from the output of lm():
- The coefficient for Distance is -19941.1, which indicates that each additional kilometre of Distance is associated with a decrease of about 19,941 AUD in the price of the property.
- The p-value (< 2.2e-16) indicates that the relationship between distance and price is statistically significant.
- The R-squared value is 0.06435, so the model explains only about 6.4% of the variation in Price. The relationship is still statistically significant; the low R-squared is likely because Price is affected by other variables besides Distance.

Based on the above interpretations, we can conclude that as the distance of the property from the city increases, the price of the property tends to decrease.
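
The figures in the summary(model) output can be tied together with a little arithmetic. The sketch below copies the reported coefficients, t value, F statistic and degrees of freedom from that output; the distances in `distance_km` are hypothetical values chosen for illustration.

```r
# Coefficients copied from the summary(model) output
intercept <- 1251208.3
slope     <- -19941.1

# Hypothetical distances (km from the CBD) chosen for illustration
distance_km  <- c(0, 5, 10, 20)
fitted_price <- intercept + slope * distance_km
print(data.frame(distance_km, fitted_price))

# With one predictor, the F statistic is the squared t value of the slope,
# and R-squared follows from F and the residual degrees of freedom
t_slope  <- -57.71
f_stat   <- 3331
resid_df <- 48430
print(t_slope^2)                     # close to the reported F statistic

r_squared <- f_stat / (f_stat + resid_df)
print(r_squared)                     # close to the reported 0.06435

# Implied Pearson correlation between Distance and Price (sign follows the slope)
print(-sqrt(r_squared))
```

The fitted prices are averages along the regression line; with a residual standard error of 574100, individual properties can sit far from these values.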

Discussion

Discuss the major findings of your investigation
Discuss any strengths and limitations.
Propose directions for future investigations.
This is a good place to re-state your findings as a final conclusion. What is the one take home message the reader should leave with?
Your final conclusion needs to be very clear.

References

[1] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 1 (https://rmit.instructure.com/courses/107035/pages/week-8-introduction?module_item_id=5261561)

[2] Applied Analytics, Week 8 (Module 7 in “Course Website”): Testing the Null: Data on Trial-Part 2 (https://rmit.instructure.com/courses/107035/pages/week-9-introduction?module_item_id=5261563)

[3] Applied Analytics, Week 2 (Module 2 in “Course Website”): Descriptive Statistics through Visualisation (https://rmit.instructure.com/courses/107035/pages/week-2-introduction?module_item_id=5261542)

[4] Significance Test for Linear Regression | R Tutorial. (n.d.). www.r-tutor.com. https://www.r-tutor.com/elementary-statistics/simple-linear-regression/significance-test-linear-regression

[5] DataFlair Team. (2017, June 30). Introduction to Hypothesis Testing in R - Learn every concept from Scratch! - DataFlair. DataFlair. https://data-flair.training/blogs/hypothesis-testing-in-r/

[6] boxplot.stats function - RDocumentation. (n.d.). www.rdocumentation.org. https://www.rdocumentation.org/packages/grDevices/versions/3.6.2/topics/boxplot.stats

[7] Melbourne Housing Market. (n.d.). www.kaggle.com. Retrieved October 8, 2023, from https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data?select=MELBOURNE_HOUSE_PRICES_LESS.csv

[8] Goyal, C. (2021, May 16). Why You Shouldn’t Just Delete Outliers. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/why-you-shouldnt-just-delete-outliers/

[9] Auction. (2023). Domain. https://www.domain.com.au/sold-listings/?suburb=melbourne-vic-3000,melbourne-vic-3004&excludepricewithheld=1&sort=price-desc

[10] Mcleod, S. (2019). P-values and statistical significance. Simply Psychology. https://www.simplypsychology.org/p-value.html

[11] Quick Guide: Interpreting Simple Linear Model Output in R. (2015). Github.io. https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

[12] Frost, J. (2018). How To Interpret R-squared in Regression Analysis. Statistics by Jim; Jim Frost. https://statisticsbyjim.com/regression/interpret-r-squared-regression/