Is there a correlation between Property price and distance from CBD in Melbourne

A analysis based on the statistics

Student(s) Names: Sovit BAral and Student ID’s:s3870624

Problem Statement

Property price is a hot topic of discussion in realestate. In recent days Melbourne prices has had a significant growth. This is of concern for first home buyers and middle class income earners.The analysis of property prices will help the consumer to make the informed decision.At the first glace it looks like the suburbs that like Toorak and Hawthorn are expensive to buy property and are closer to the CBD. It would be interesting to identify whether the property price in melbourne follow the normal distribution and if the distance determines the price of the property. Statistical analysis like distribution and correlation analysis will help buyers to make the informed decision in buying the property in Melbourne.The buyers can predict the property price based on the distance of the CBD if there is a strong correlation between property price and distance from CBD.

Data

Data Cont.

Descriptive Statistics and Visualisation

# URL of the data for reference only if we wish to scrapre from the web
urldata1 <-"https://www.kaggle.com/anthonypino/melbourne-housing-market/download"
#set the working directory to import the excel file for analysis
filename<-setwd("C:/Users/212458752/Desktop/house")
getwd() # confirm if the default directly has been set
## [1] "C:/Users/212458752/Desktop/house"
#****************DATA Preprocessing START***************
#import the data into R from working directory
options(scipen = 999)
house_data<- data.frame(read.csv(file = "C:/Users/212458752/Desktop/house/Melbourne_housing_FULL.csv"))
house_data<- house_data[, c(1,5,9)]# drop all other unnecessary columns 
house_data <- house_data[!is.na(house_data$Price),]# omit NAs in property price
house_data <- house_data[!is.na(house_data$Distance),]# omit NAs in distance
house_data$Distance <-  as.double(house_data$Distance)# convert distance variable as a factor 
str(house_data)
## 'data.frame':    27247 obs. of  3 variables:
##  $ Suburb  : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
##  $ Price   : int  1480000 1035000 1465000 850000 1600000 941000 1876000 1636000 1000000 745000 ...
##  $ Distance: num  2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
check_infinite<-sapply(house_data, is.infinite)%>% invisible()   # function to Checks if the dataframe has any infite value
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
 check_special<-sapply(house_data, is.special)%>% invisible()# call the function 
 View(house_data)# View the data

 #****************DATA Preprocessing End***************
 
 
 #****************OUtlier Analysis Start***************
boxplot(house_data$Price,outline=TRUE, names ="Outliers analysis on Prices", main="Outliers analysis on Prices ", ylab = "Property Price", xlab = " Box plot")

# From box plot  Pricee variable has an outlier.
# further Z score computation need to be performed on Price Data
#Z Score for average price data 
z.scores_price <- house_data$Price %>%  scores(type = "z")
z.scores_price %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5046 -0.6472 -0.2809  0.0000  0.3817 15.8228
length (which( abs(z.scores_price) >3 ))# number of outliers detected
## [1] 474
outlier<-( which( abs(z.scores_price) >3 ))# Find the location of outlier whose absolute value is greater than 3
house_data<-house_data[-(outlier),]# remove suburbs that has the outliers values
boxplot(house_data$Price,outline=TRUE, names =" Price after Outliers removal", main="Price after outliers removed ", ylab = "Property Price", xlab = " Box plot") # recheck after outliers removal in property price

averprice <- house_data %>% group_by(Suburb) %>%   summarise(Average_price_of_Suburb = sum(Price)/n())
# REclassify the property prices to suburb by calculating the mean price os each suburb.

## OUtlier analysis on distance
## Boxplot for distance
boxplot(as.integer( house_data$Distance),outline=TRUE,names ="OUliers analysis on distnce from CBD", main=" Distance from CBD ", ylab = "Distance", xlab = " Box PLot ")

# we are only interested in calculating distance from CBD which are less than 60 Kms from the CBD
house_data$Distance <- ifelse(house_data$Distance > 60, NA,house_data$Distance)# Drop any distance over 60 Kms


# compute the average of the distance. this is not normally necessary but it will help us to match the data points with price
averdist <- house_data %>% group_by(Suburb) %>%
  summarise(Average_distance_fromCBD = sum(Distance)/n())

 #****************OUtlier Analysis End***************


#****************Data Preprocessing Start****************
# Create a new variable as a data frame to store the price and distance by suburb data.

dist_and_price<- as.data.frame(inner_join(averprice,averdist,by= "Suburb"))# new data frame
dist_and_price <- dist_and_price[!is.na(dist_and_price$Average_price_of_Suburb),]# omit any NAs
dist_and_price <- dist_and_price[!is.na(dist_and_price$Average_distance_fromCBD),]#Omit any NAs
print.data.frame(head(dist_and_price))# to view the crime data as it appear in the excel file
##         Suburb Average_price_of_Suburb Average_distance_fromCBD
## 1   Abbotsford               1033549.0                 2.700980
## 2   Aberfeldie               1269615.9                 8.288406
## 3 Airport West                751364.2                11.898675
## 4    Albanvale                536055.6                14.000000
## 5  Albert Park               1784901.2                 3.181395
## 6       Albion                615123.7                12.174627
#****************Data Preprocessing End****************



#****************Summary Statistics Start****************

summary_stat <- data.frame (mean= c(mean(dist_and_price$Average_price_of_Suburb, na.rm = TRUE),mean(dist_and_price$Average_distance_fromCBD, na.rm = TRUE)),
                  median= c(median(dist_and_price$Average_price_of_Suburb, na.rm = TRUE),median(dist_and_price$Average_distance_fromCBD, na.rm = TRUE)),
                 stdv = c(sd(dist_and_price$Average_price_of_Suburb, na.rm = TRUE),sd(dist_and_price$Average_distance_fromCBD, na.rm = TRUE)),
                 maximum= c(max(dist_and_price$Average_price_of_Suburb),max(dist_and_price$Average_distance_fromCBD)),
                minimum= c(min(dist_and_price$Average_price_of_Suburb),min(dist_and_price$Average_distance_fromCBD)),
                Q1= c(quantile(dist_and_price$Average_price_of_Suburb, probs = 0.25),quantile(dist_and_price$Average_distance_fromCBD, probs = 0.25)),
                Q3= c(quantile(dist_and_price$Average_price_of_Suburb, probs = 0.75),quantile(dist_and_price$Average_distance_fromCBD, probs = 0.75)),
                IQR= c(IQR(dist_and_price$Average_price_of_Suburb),IQR(dist_and_price$Average_distance_fromCBD)) ) -> table1
knitr::kable(table1)
mean median stdv maximum minimum Q1 Q3 IQR
896242.4847 839076.7 296516.82279 1808234.4 380000.000000 659869.652614 1064566.28 404696.63109
17.2836 14.8 10.46858 48.1 1.005085 9.221416 23.05 13.82858
summary_stat_price<-t(summary_stat[1,]) 

barplot(summary_stat_price,
main = "Summary Statistics for Average HOuse Prices in Melbourne ",
xlab = " Data Summary  ",
ylab = "Property Prices",
col = c("red","green","blue","Yellow","Purple","darkred","black","Orange"),
beside = T,
horiz = FALSE)
legend("topright",
c("Mean","Median","St Dev","Maximum","Minimun", "Q1","Q3", "IQR"),
fill =c("red","green","blue","Yellow","Purple","darkred","black","Orange"))
grid(nx=NA, ny=NULL)

#****************Summary Statistics End****************

#***************Data Fitting Start*********************

# This is a chunk for your Distribution Fitting section.
#Analysis for Male
# PLotting histogram for male pre analysis
Price_plot <- ggplot() 
Price_plot <- Price_plot + geom_histogram(aes(x=dist_and_price$Average_price_of_Suburb), bins=100) 
Price_plot <- Price_plot + scale_x_continuous(trans='log10') 
Price_plot <- Price_plot + labs(x="Average property price", y= "Frequency ") 
Price_plot <- Price_plot + geom_vline(xintercept = mean(dist_and_price$Average_price_of_Suburb), size=1.5, colour ="red") 
Price_plot <- Price_plot + geom_vline(xintercept = median(dist_and_price$Average_price_of_Suburb), size=1.5, colour ="blue")
#Display the histogram
Price_plot

shapiro.test(dist_and_price$Average_price_of_Suburb)
## 
##  Shapiro-Wilk normality test
## 
## data:  dist_and_price$Average_price_of_Suburb
## W = 0.94878, p-value = 0.000000001478
kurtosis(dist_and_price$Average_price_of_Suburb)
## [1] 0.1370957
hist(dist_and_price$Average_price_of_Suburb, freq=F, breaks=12) 
lines(density(dist_and_price$Average_price_of_Suburb), col="red")
legend("topright", c("NOrmal Distribution "), fill =c("red","green"))

#***************Data Fitting End*********************

#**************** Correlation analysis Start********************

cor(dist_and_price$Average_price_of_Suburb, dist_and_price$Average_distance_fromCBD)# to calculate correlation coefficient
## [1] -0.4670735
correlation_graph <- ggplot(dist_and_price, aes(x = Average_distance_fromCBD, y = Average_price_of_Suburb)) +
    geom_point(aes()) +
    stat_smooth(method = "lm",
        col = "#C42126",
        se = FALSE,
        size = 1)

correlation_graph+
    scale_x_continuous(breaks = seq(1, 50, by = 2)) +
    scale_y_continuous(breaks = seq(300000, 3000000, by = 200000)) +
    labs(
        x = "Distance from CBD(Kms)",
        y = "Property Price($)",
        color = "Gear",
        title = "Correlation between distance from CBD and property price",
        subtitle = "Average sold price of the property",
        caption = "There exist a liner relationship"
    )

#**************** Correlation analysis End********************

Hypothesis Testing

Assumptions: * The distance and the property price are correlated. * If we know the distance of the property from the CBD we could compute the price. * Independence: Independence was assumed as property prices came from multiple data source. * Linearity: The scatter plot suggested a linear relationship.However all other parameters like land size, property age, sold year were ruled out.

Analysis: * From above analyis it can be seen that there is a liner relationship between distance and price * Property price follow the normal distribution and skewed at the right * We need to compute the statistically significance of the data.

#Hypotheses testing
#Fit a linear regression model to the data and assess the fit to determine if the model statistically significant
options(scipen = 999)# supress scientific notation
Hypotheses_model<-lm( Average_price_of_Suburb~Average_distance_fromCBD  , data = dist_and_price )
Hypotheses_model %>% summary()
## 
## Call:
## lm(formula = Average_price_of_Suburb ~ Average_distance_fromCBD, 
##     data = dist_and_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -455126 -204924  -31449  136248  798682 
## 
## Coefficients:
##                          Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)               1124898      27355  41.122 <0.0000000000000002 ***
## Average_distance_fromCBD   -13230       1354  -9.769 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262600 on 342 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2159 
## F-statistic: 95.43 on 1 and 342 DF,  p-value: < 0.00000000000000022
#Test the main assumptions for linear regression
#Interpret and test the statistical significance of the regression intercept and slope.
lm(formula =Average_price_of_Suburb~Average_distance_fromCBD , data = dist_and_price )
## 
## Call:
## lm(formula = Average_price_of_Suburb ~ Average_distance_fromCBD, 
##     data = dist_and_price)
## 
## Coefficients:
##              (Intercept)  Average_distance_fromCBD  
##                  1124898                    -13230
plot(Hypotheses_model)

Hypotheses_model %>% confint()
##                               2.5 %     97.5 %
## (Intercept)              1071092.61 1178702.85
## Average_distance_fromCBD  -15893.37  -10565.84

Residuals: Min 1Q Median 3Q Max -20.232 -7.800 -0.708 5.666 32.636

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.206e+01 1.593e+00 20.123 <2e-16 Average_price_of_Suburb -1.649e-05 1.688e-06 -9.769 <2e-16 — Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.27 on 342 degrees of freedom Multiple R-squared: 0.2182, Adjusted R-squared: 0.2159 F-statistic: 95.43 on 1 and 342 DF, p-value: < 2.2e-16

Discussion

In conclusion, there was a statistically significant positive linear relationship between property price and distance from the CBD. 22% in variability in property price in Melbourne was estimated to explain by the distance from the CBD. To conclude,data suggests there there exists a liner relationship between property price and the distance from the CBD but only 20% of the data can be calculated this way. If we want to establish the robust models to caludalte the property prices we need to further consider studying other variable like land size, number of bedroom and age of the property. This variables may be the strong predictor of the property price.

References

The data source was taken from: https://www.kaggle.com/anthonypino/melbourne-housing-market/download