Is there a correlation between Property price and distance from CBD in Melbourne

A analysis based on the statistics

Student(s) Names: Sovit BAral and Student ID’s:s3870624

Problem Statement

Property price is a hot topic of discussion in realestate. In recent days Melbourne prices has had a significant growth. This is of concern for first home buyers and middle class income earners.The analysis of property prices will help the consumer to make the informed decision.At the first glace it looks like the suburbs that like Toorak and Hawthorn are expensive to buy property and are closer to the CBD. It would be interesting to identify whether the property price in melbourne follow the normal distribution and if the distance determines the price of the property. Statistical analysis like distribution and correlation analysis will help buyers to make the informed decision in buying the property in Melbourne.The buyers can predict the property price based on the distance of the CBD if there is a strong correlation between property price and distance from CBD.

Data

The URL provides the zip file for property prices in Melbourne https://www.kaggle.com/anthonypino/melbourne-housing-market/download
The author of the data has scraped the data from publicly available results posted every week from Domain.com.au,
The variables included in the data are Address, Type of Real estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and distance from C.B.D.
The zip file was download and excel file was extracted into working directly as processing zip file was out of scope of this course
The zip file has house.xls, whihc has been used for the purpose of this assignment.
The data was filtered to only hold property price and distance and all other information was dropped.
Mean price of each suburb was calculated to simplify the data processing.

Data Cont.

After prepossessing the variable in the datasets are: suburb, distance from CBD and the sales price
Distance over 60 Kms from CBD are omitted as it would not be relevant to consider the suburbs over 60 Kms.
The sale prices of the property is adjusted and omitted if it is presented as and outliers. The analysis has been performed on sale prices to understand if the selling price is an outlier.
The preprocessing steps involved in preprocessing the data are as follows:
- The data is imported and all unreleavent variable are dropped.
- Distance is reclassified as a factor variable
- The data is checked for any missing values, special values or infinite values
- OUtliers were detected by using box plot and property prices are removed if it occurred as a outliers.
- Mean of each suburb sales prices were calculated to reduce the data points
- The data is validated to check if there are any NA in the data for property price
- NA is replaced by mean of the property prices for property prices data

Descriptive Statistics and Visualisation

The important variable in the analysis are distance from CBD and the property price.
Summary statistics are calculated for property price and visually displayed.
Distribution analysis are performed for property price.
Scatter plot with line fitting are performed to understand if the is a correlation between distance from CBD and propety price.

# URL of the data for reference only if we wish to scrapre from the web
urldata1 <-"https://www.kaggle.com/anthonypino/melbourne-housing-market/download"
#set the working directory to import the excel file for analysis
filename<-setwd("C:/Users/212458752/Desktop/house")
getwd() # confirm if the default directly has been set

## [1] "C:/Users/212458752/Desktop/house"

#****************DATA Preprocessing START***************
#import the data into R from working directory
options(scipen = 999)
house_data<- data.frame(read.csv(file = "C:/Users/212458752/Desktop/house/Melbourne_housing_FULL.csv"))
house_data<- house_data[, c(1,5,9)]# drop all other unnecessary columns 
house_data <- house_data[!is.na(house_data$Price),]# omit NAs in property price
house_data <- house_data[!is.na(house_data$Distance),]# omit NAs in distance
house_data$Distance <-  as.double(house_data$Distance)# convert distance variable as a factor 
str(house_data)

## 'data.frame':    27247 obs. of  3 variables:
##  $ Suburb  : chr  "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
##  $ Price   : int  1480000 1035000 1465000 850000 1600000 941000 1876000 1636000 1000000 745000 ...
##  $ Distance: num  2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...

check_infinite<-sapply(house_data, is.infinite)%>% invisible()   # function to Checks if the dataframe has any infite value
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
 check_special<-sapply(house_data, is.special)%>% invisible()# call the function 
 View(house_data)# View the data

 #****************DATA Preprocessing End***************
 
 
 #****************OUtlier Analysis Start***************
boxplot(house_data$Price,outline=TRUE, names ="Outliers analysis on Prices", main="Outliers analysis on Prices ", ylab = "Property Price", xlab = " Box plot")

# From box plot  Pricee variable has an outlier.
# further Z score computation need to be performed on Price Data
#Z Score for average price data 
z.scores_price <- house_data$Price %>%  scores(type = "z")
z.scores_price %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5046 -0.6472 -0.2809  0.0000  0.3817 15.8228

length (which( abs(z.scores_price) >3 ))# number of outliers detected

## [1] 474

outlier<-( which( abs(z.scores_price) >3 ))# Find the location of outlier whose absolute value is greater than 3
house_data<-house_data[-(outlier),]# remove suburbs that has the outliers values
boxplot(house_data$Price,outline=TRUE, names =" Price after Outliers removal", main="Price after outliers removed ", ylab = "Property Price", xlab = " Box plot") # recheck after outliers removal in property price

averprice <- house_data %>% group_by(Suburb) %>%   summarise(Average_price_of_Suburb = sum(Price)/n())
# REclassify the property prices to suburb by calculating the mean price os each suburb.

## OUtlier analysis on distance
## Boxplot for distance
boxplot(as.integer( house_data$Distance),outline=TRUE,names ="OUliers analysis on distnce from CBD", main=" Distance from CBD ", ylab = "Distance", xlab = " Box PLot ")

# we are only interested in calculating distance from CBD which are less than 60 Kms from the CBD
house_data$Distance <- ifelse(house_data$Distance > 60, NA,house_data$Distance)# Drop any distance over 60 Kms


# compute the average of the distance. this is not normally necessary but it will help us to match the data points with price
averdist <- house_data %>% group_by(Suburb) %>%
  summarise(Average_distance_fromCBD = sum(Distance)/n())

 #****************OUtlier Analysis End***************


#****************Data Preprocessing Start****************
# Create a new variable as a data frame to store the price and distance by suburb data.

dist_and_price<- as.data.frame(inner_join(averprice,averdist,by= "Suburb"))# new data frame
dist_and_price <- dist_and_price[!is.na(dist_and_price$Average_price_of_Suburb),]# omit any NAs
dist_and_price <- dist_and_price[!is.na(dist_and_price$Average_distance_fromCBD),]#Omit any NAs
print.data.frame(head(dist_and_price))# to view the crime data as it appear in the excel file

##         Suburb Average_price_of_Suburb Average_distance_fromCBD
## 1   Abbotsford               1033549.0                 2.700980
## 2   Aberfeldie               1269615.9                 8.288406
## 3 Airport West                751364.2                11.898675
## 4    Albanvale                536055.6                14.000000
## 5  Albert Park               1784901.2                 3.181395
## 6       Albion                615123.7                12.174627

#****************Data Preprocessing End****************



#****************Summary Statistics Start****************

summary_stat <- data.frame (mean= c(mean(dist_and_price$Average_price_of_Suburb, na.rm = TRUE),mean(dist_and_price$Average_distance_fromCBD, na.rm = TRUE)),
                  median= c(median(dist_and_price$Average_price_of_Suburb, na.rm = TRUE),median(dist_and_price$Average_distance_fromCBD, na.rm = TRUE)),
                 stdv = c(sd(dist_and_price$Average_price_of_Suburb, na.rm = TRUE),sd(dist_and_price$Average_distance_fromCBD, na.rm = TRUE)),
                 maximum= c(max(dist_and_price$Average_price_of_Suburb),max(dist_and_price$Average_distance_fromCBD)),
                minimum= c(min(dist_and_price$Average_price_of_Suburb),min(dist_and_price$Average_distance_fromCBD)),
                Q1= c(quantile(dist_and_price$Average_price_of_Suburb, probs = 0.25),quantile(dist_and_price$Average_distance_fromCBD, probs = 0.25)),
                Q3= c(quantile(dist_and_price$Average_price_of_Suburb, probs = 0.75),quantile(dist_and_price$Average_distance_fromCBD, probs = 0.75)),
                IQR= c(IQR(dist_and_price$Average_price_of_Suburb),IQR(dist_and_price$Average_distance_fromCBD)) ) -> table1
knitr::kable(table1)

mean	median	stdv	maximum	minimum	Q1	Q3	IQR
896242.4847	839076.7	296516.82279	1808234.4	380000.000000	659869.652614	1064566.28	404696.63109
17.2836	14.8	10.46858	48.1	1.005085	9.221416	23.05	13.82858

summary_stat_price<-t(summary_stat[1,]) 

barplot(summary_stat_price,
main = "Summary Statistics for Average HOuse Prices in Melbourne ",
xlab = " Data Summary  ",
ylab = "Property Prices",
col = c("red","green","blue","Yellow","Purple","darkred","black","Orange"),
beside = T,
horiz = FALSE)
legend("topright",
c("Mean","Median","St Dev","Maximum","Minimun", "Q1","Q3", "IQR"),
fill =c("red","green","blue","Yellow","Purple","darkred","black","Orange"))
grid(nx=NA, ny=NULL)

#****************Summary Statistics End****************

#***************Data Fitting Start*********************

# This is a chunk for your Distribution Fitting section.
#Analysis for Male
# PLotting histogram for male pre analysis
Price_plot <- ggplot() 
Price_plot <- Price_plot + geom_histogram(aes(x=dist_and_price$Average_price_of_Suburb), bins=100) 
Price_plot <- Price_plot + scale_x_continuous(trans='log10') 
Price_plot <- Price_plot + labs(x="Average property price", y= "Frequency ") 
Price_plot <- Price_plot + geom_vline(xintercept = mean(dist_and_price$Average_price_of_Suburb), size=1.5, colour ="red") 
Price_plot <- Price_plot + geom_vline(xintercept = median(dist_and_price$Average_price_of_Suburb), size=1.5, colour ="blue")
#Display the histogram
Price_plot

shapiro.test(dist_and_price$Average_price_of_Suburb)

## 
##  Shapiro-Wilk normality test
## 
## data:  dist_and_price$Average_price_of_Suburb
## W = 0.94878, p-value = 0.000000001478

kurtosis(dist_and_price$Average_price_of_Suburb)

## [1] 0.1370957

hist(dist_and_price$Average_price_of_Suburb, freq=F, breaks=12) 
lines(density(dist_and_price$Average_price_of_Suburb), col="red")
legend("topright", c("NOrmal Distribution "), fill =c("red","green"))

#***************Data Fitting End*********************

#**************** Correlation analysis Start********************

cor(dist_and_price$Average_price_of_Suburb, dist_and_price$Average_distance_fromCBD)# to calculate correlation coefficient

## [1] -0.4670735

correlation_graph <- ggplot(dist_and_price, aes(x = Average_distance_fromCBD, y = Average_price_of_Suburb)) +
    geom_point(aes()) +
    stat_smooth(method = "lm",
        col = "#C42126",
        se = FALSE,
        size = 1)

correlation_graph+
    scale_x_continuous(breaks = seq(1, 50, by = 2)) +
    scale_y_continuous(breaks = seq(300000, 3000000, by = 200000)) +
    labs(
        x = "Distance from CBD(Kms)",
        y = "Property Price($)",
        color = "Gear",
        title = "Correlation between distance from CBD and property price",
        subtitle = "Average sold price of the property",
        caption = "There exist a liner relationship"
    )

#**************** Correlation analysis End********************

Hypothesis Testing

Assumptions: * The distance and the property price are correlated. * If we know the distance of the property from the CBD we could compute the price. * Independence: Independence was assumed as property prices came from multiple data source. * Linearity: The scatter plot suggested a linear relationship.However all other parameters like land size, property age, sold year were ruled out.

Analysis: * From above analyis it can be seen that there is a liner relationship between distance and price * Property price follow the normal distribution and skewed at the right * We need to compute the statistically significance of the data.

#Hypotheses testing
#Fit a linear regression model to the data and assess the fit to determine if the model statistically significant
options(scipen = 999)# supress scientific notation
Hypotheses_model<-lm( Average_price_of_Suburb~Average_distance_fromCBD  , data = dist_and_price )
Hypotheses_model %>% summary()

## 
## Call:
## lm(formula = Average_price_of_Suburb ~ Average_distance_fromCBD, 
##     data = dist_and_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -455126 -204924  -31449  136248  798682 
## 
## Coefficients:
##                          Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)               1124898      27355  41.122 <0.0000000000000002 ***
## Average_distance_fromCBD   -13230       1354  -9.769 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262600 on 342 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2159 
## F-statistic: 95.43 on 1 and 342 DF,  p-value: < 0.00000000000000022

#Test the main assumptions for linear regression
#Interpret and test the statistical significance of the regression intercept and slope.
lm(formula =Average_price_of_Suburb~Average_distance_fromCBD , data = dist_and_price )

## 
## Call:
## lm(formula = Average_price_of_Suburb ~ Average_distance_fromCBD, 
##     data = dist_and_price)
## 
## Coefficients:
##              (Intercept)  Average_distance_fromCBD  
##                  1124898                    -13230

plot(Hypotheses_model)

Hypotheses_model %>% confint()

##                               2.5 %     97.5 %
## (Intercept)              1071092.61 1178702.85
## Average_distance_fromCBD  -15893.37  -10565.84

Regression equation to calculate price: $Price in Dollar = \frac{1124898 - distance *13230}{}$

Residuals: Min 1Q Median 3Q Max -20.232 -7.800 -0.708 5.666 32.636

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.206e+01 1.593e+00 20.123 <2e-16 Average_price_of_Suburb -1.649e-05 1.688e-06 -9.769 <2e-16 — Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.27 on 342 degrees of freedom Multiple R-squared: 0.2182, Adjusted R-squared: 0.2159 F-statistic: 95.43 on 1 and 342 DF, p-value: < 2.2e-16

Discussion

There were no non-linear trends in the Residual vs. fitted plot.
Normality of residuals: Normal Q-Q plot didn’t show any obvious departures from normality.
Influential cases: There appeared to be no influential cases.
Homoscedasticity: Homoscedasticity looked OK according to the scale-location plot. The variance in residuals appeared constant across predicted values.
Residual standard error: 9.306 on 342 degrees of freedom
Multiple R-squared: 0.2121, Adjusted R-squared: 0.2098
F-statistic: 92.05 on 1 and 342 DF, p-value: < 2.2e-16
The linear model was statistically significant, F(1,342) = 95.43, p < .001. Distance explained 21.5 % of the variability in property price.
The equation can be derived as ( )
The estimated property price when distance=0 is not a valid argument in our regression equation.
The intercept of the regression was statistically significant, a = 88.87, p = .0001, 95% CI (44.60, 133.14).
For every KM you go away from CBD the price decreased by $13230.

In conclusion, there was a statistically significant positive linear relationship between property price and distance from the CBD. 22% in variability in property price in Melbourne was estimated to explain by the distance from the CBD. To conclude,data suggests there there exists a liner relationship between property price and the distance from the CBD but only 20% of the data can be calculated this way. If we want to establish the robust models to caludalte the property prices we need to further consider studying other variable like land size, number of bedroom and age of the property. This variables may be the strong predictor of the property price.