In this report, the main focus lies in analysing the relationship between the size of a family house and the selling price of the house in Melbourne. The random sample was collected by the two-stage cluster sampling method. This report will show the graphical presentation of sample data and eventually it will also show the estimation of the multiple regression model. That model will be used to predict the selling price for a house with the specification that is also located in Melbourne east. After that, the report will conduct an F-test and find out if there is any jointly significant relationship between a dependent variable and two independent variables at the 5% level of significance and if any, how they affect each other. Then the report will analyse another important term called a joint term and estimate the regression model again. Then the report will show the significance and implication of the P-value for the joint term. At last, the report will conclude by saying which regression model is the most appropriate for the set of data. In this report programming language and software Rstudio is been used for statistical analysis and graphical representation. Also, LaTex been used for equations.
To get a specific sample, a two-stage cluster sampling method has been used here. First, divide the population into smaller clusters and eventually form a final sample selecting from these clusters. This sampling method is used to get the sample from different suburbs that are geographically dispersed within Melbourne. These suburbs are grouped into two clusters - East and West. From these two sides, two cities have been selected as a representative of sample data – Dandenong and Sunshine, from where we choose randomly 10 houses from each side. To get this we made some assumptions as well. The first assumption is 10 houses from each side chosen randomly and without biases. And the second assumption is a sample chosen from each cluster from either representative suburb assumed small-scaled portrayal of the population on each side. Also considered that previous study reliable enough to differentiate between East and West.
Now convert source data xlsx to csv.
HousingPrices <-read.csv("C:/Users/Tahmina/OneDrive/Desktop/A2/Housing prices.csv")
head(HousingPrices)
## House Selling.Price Location Number.of.Rooms
## 1 1 345 0 8
## 2 2 655 0 9
## 3 3 325 1 7
## 4 4 478 0 4
## 5 5 432 1 10
## 6 6 233 1 6
dim(HousingPrices)
## [1] 20 4
Then Initial statistics on selling price.
summary(HousingPrices$Selling.Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 199.0 321.8 455.0 510.9 658.8 988.0
Checking the linear relationship between number of rooms(size of a family house) and selling price.
cor(HousingPrices$Number.of.Rooms, HousingPrices$Selling.Price)
## [1] 0.8133545
Correlation between number of rooms(size of a family house) and selling price is 0.8133545. Definitely the linear relationship between these two variable is strong. Plotting them to show a strong linear relationship.
plot(HousingPrices$Number.of.Rooms, HousingPrices$Selling.Price)
plot(HousingPrices$Number.of.Rooms, HousingPrices$Selling.Price,
xlab = "Number of Rooms",
ylab = "Selling Price",
main = "Scatterplot between size and price of house")
To represent sample data in the graphical form bubble plotting method has been used here. For this, I use ggplot in Rstudio. Here, the Darkest dots represent East side (Dandenong, marked 0) and the lightest dots represent West side (Sunshine, marked 1). The reason for using this method is to give a clear view of data from each side of the given location. By this visualisation, one can easily and quickly find out what kind of relationship is going on between the selling price, size of a family house and the location. #Representing east and west houses in different colour, so we could search them separately.
library('ggplot2')
## Warning: package 'ggplot2' was built under R version 4.1.3
ggplot(data=HousingPrices, aes(x=Number.of.Rooms, y=Selling.Price)) +
geom_point(aes(size=Location, colour=Location))+
guides(size="none")+
labs(x="Number of Rooms", y="Selling Price")+
scale_colour_viridis_b(option= "F")+
scale_size_continuous(range= c(3,3))
## Multiple Regression Equation
Estimating multiple regression equation.
reg=lm(HousingPrices$Selling.Price~HousingPrices$Number.of.Rooms +
HousingPrices$Location)
summary(reg)
##
## Call:
## lm(formula = HousingPrices$Selling.Price ~ HousingPrices$Number.of.Rooms +
## HousingPrices$Location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249.47 -52.99 -11.03 67.73 159.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 155.29 107.88 1.439 0.168191
## HousingPrices$Number.of.Rooms 54.90 10.60 5.177 7.58e-05 ***
## HousingPrices$Location -222.04 52.59 -4.222 0.000574 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.9 on 17 degrees of freedom
## Multiple R-squared: 0.8348, Adjusted R-squared: 0.8153
## F-statistic: 42.95 on 2 and 17 DF, p-value: 2.257e-07
Here we can see intercept 155.29 and two slopes are respectively 54.90(number of rooms) and -222.04(location). So the equation will be:
\(\text{Estimated Selling Price} = 155.29 + 54.9(\text{Number of Rooms}) − 222.04(\text{Location})\)
Interpretation of slopes:
Predicted selling price for a house with 9 rooms on Melbourne’s East: For 9 rooms on East side the selling price will be:
\(\text{Estimated Selling Price} = 155.29 + 54.9(9) − 222.04(0)\) \(= 649.39\)
So the predicted selling price with 9 rooms house on East side will be $649.39k.
F-statistic always measures the significance of the regression model. So the relationship between the dependent variable (selling price) and independent variables (number of rooms and location) can be shown as follows:
\(Ho: β1 = β2 = 0 (\text{No relationship between dependent and independent variables})\) \(H1:=(\text{At least one of the coefficients β is not zero})\)
To find F-statistic, the previous regression test result again:
reg=lm(HousingPrices$Selling.Price~HousingPrices$Number.of.Rooms +
HousingPrices$Location)
summary(reg)
##
## Call:
## lm(formula = HousingPrices$Selling.Price ~ HousingPrices$Number.of.Rooms +
## HousingPrices$Location)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249.47 -52.99 -11.03 67.73 159.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 155.29 107.88 1.439 0.168191
## HousingPrices$Number.of.Rooms 54.90 10.60 5.177 7.58e-05 ***
## HousingPrices$Location -222.04 52.59 -4.222 0.000574 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.9 on 17 degrees of freedom
## Multiple R-squared: 0.8348, Adjusted R-squared: 0.8153
## F-statistic: 42.95 on 2 and 17 DF, p-value: 2.257e-07
Here the F-statistic is 42.95 and the P-value is <0.05(at a 5% level of significance), so we can reject the null hypothesis. Because all coefficients are with either positive or negative values. That means all the variables have an impact on the regression model. Moreover, R-squared is showing 83.48% which represents 83% of the variation in selling price can be explained by changes in both independent variables. So there is a jointly-significant relationship between the selling price and the two independent variables.
Now determine whether each independent variable affects the dependent variable. In our multiple regression model we have one dependent variable(Selling.Price), two independent variable(Number.of.Rooms and Location).
When dependent variable- selling price, Independent variable- Number of rooms:
res.ftest <- var.test(HousingPrices$Selling.Price, HousingPrices$Number.of.Rooms,
data = HousingPrices)
res.ftest
##
## F test to compare two variances
##
## data: HousingPrices$Selling.Price and HousingPrices$Number.of.Rooms
## F = 8860.4, num df = 19, denom df = 19, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3507.069 22385.460
## sample estimates:
## ratio of variances
## 8860.437
Here p-value is less than 0.05(at a 5% level of significance), so we can reject the null hypothesis. It means a change in one unit of the number of rooms affects the selling price of a family house.
When dependent variable- selling price, Independent variable- Location:
res.ftest <- var.test(HousingPrices$Selling.Price, HousingPrices$Location,
data = HousingPrices)
res.ftest
##
## F test to compare two variances
##
## data: HousingPrices$Selling.Price and HousingPrices$Location
## F = 217967, num df = 19, denom df = 19, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 86273.89 550682.32
## sample estimates:
## ratio of variances
## 217966.8
Here p-value is also less than 0.05(at a 5% level of significance), so we can reject the null hypothesis. It means a change in one unit of the location affects the selling price of a family house.
By adding the joint term(X1 * X2) as an interaction term into the regression model and run again as a new regression model. Interaction term calculated by multiplying number of rooms and location for each of the 20 houses.
HousingPricesx1x2 <-
read.csv("C:/Users/Tahmina/OneDrive/Desktop/A2/Housing prices_x1x2.csv")
head(HousingPricesx1x2)
## House Selling.Price Location Number.of.Rooms Interaction.Term
## 1 1 345 0 8 0
## 2 2 655 0 9 0
## 3 3 325 1 7 7
## 4 4 478 0 4 0
## 5 5 432 1 10 10
## 6 6 233 1 6 6
Now we find out the regression from new CSV file created with interaction term.
reg=lm(HousingPricesx1x2$Selling.Price~HousingPricesx1x2$Number.of.Rooms
+ HousingPricesx1x2$Location + HousingPricesx1x2$Interaction.Term)
summary(reg)
##
## Call:
## lm(formula = HousingPricesx1x2$Selling.Price ~ HousingPricesx1x2$Number.of.Rooms +
## HousingPricesx1x2$Location + HousingPricesx1x2$Interaction.Term)
##
## Residuals:
## Min 1Q Median 3Q Max
## -233.12 -57.50 -5.01 47.19 168.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.997 130.940 0.473 0.642271
## HousingPricesx1x2$Number.of.Rooms 64.516 13.087 4.930 0.000151 ***
## HousingPricesx1x2$Location -5.011 185.099 -0.027 0.978736
## HousingPricesx1x2$Interaction.Term -26.569 21.752 -1.221 0.239620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101.5 on 16 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8205
## F-statistic: 29.96 on 3 and 16 DF, p-value: 8.452e-07
After that the regression equation will be:
\(\text{Estimated Selling Price} = 61.997 + 64.516(\text{Number of Rooms}) − 5.011(\text{Location}) – 26.569(\text{Number of Rooms} * \text{Location})\)
Here we found a different p-value than the previous one which is 8.542e-07. This implies that compared with the original regression model, with a less negative coefficient and dependencies on the other variable (number of rooms), the location variable has a different effect on selling price.
From all of the analyses and based on their results, it can be said that the regression model with a joint term is the most appropriate one. Because it has a higher number of adjusted R-squared value (0.8205) compared to the original (0.8153). Here we get a better estimation of selling price of houses on either side of Melbourne with joint term, if there will be no problem of having more budget and rest of the restrictions could be eased to include more random samples(suburbs and houses), the model will become a more well-founded estimator of the selling price.
Sharpe, N, De, VR, & Velleman, P 2015, Business Statistics PDF Ebook, Global Edition, Pearson Education, Limited, Harlow. Available from: ProQuest Ebook Central. [6 February 2022].