I decided to analyse data on housing price prediction. I selected a dataset from Kaggle.com (https://www.kaggle.com/datasets/muhammadbinimran/housing-price-prediction-data) as the source for my analysis. I cleaned the data and created a factor variable.
library(readxl)
mydata <- read_excel("~/Desktop/housingprice_dataset.xlsx")
colnames(mydata) <- c("ID", "Sq_feet", "Neighbourhood", "Price")
set.seed(1)                                      # make the random sample reproducible
mydata1 <- mydata[sample(nrow(mydata), 300), ]   # random sample of 300 houses
mydata1$Neighbourhood <- factor(mydata1$Neighbourhood,   # recode Neighbourhood as a factor
                                levels = c("Urban", "Rural"),
                                labels = c("Urban", "Rural"))
head(mydata1, 15)
## # A tibble: 15 × 4
## ID Sq_feet Neighbourhood Price
## <dbl> <dbl> <fct> <dbl>
## 1 24388 2891 Urban 324842.
## 2 4050 1337 Urban 143337.
## 3 11571 2786 Urban 331326.
## 4 25173 1464 Rural 177784.
## 5 32618 1024 Rural 84251.
## 6 13903 2682 Urban 306779.
## 7 8229 1114 Rural 175469.
## 8 25305 2513 Urban 300529.
## 9 25061 2642 Rural 264384.
## 10 22306 2303 Rural 291378.
## 11 12204 1066 Urban 80667.
## 12 7075 1326 Rural 158061.
## 13 26954 1412 Rural 177194.
## 14 31276 2906 Rural 272761.
## 15 16044 1120 Urban 123791.
Definitions of all variables: ID is the identifier of the house, Sq_feet is the size of the house in square feet, Neighbourhood is a factor with two levels (Urban and Rural), and Price is the price of the house in dollars.
From my selected dataset I can derive the following regression function: Price = β0 + β1 × Sq_feet + β2 × Neighbourhood
Predicted effects on the dependent variable (Price):
summary(mydata1[colnames(mydata1) %in% c("Sq_feet", "Neighbourhood", "Price")])
## Sq_feet Neighbourhood Price
## Min. :1008 Urban:131 Min. : 17546
## 1st Qu.:1464 Rural:169 1st Qu.:166782
## Median :1972 Median :220478
## Mean :2005 Mean :223819
## 3rd Qu.:2569 3rd Qu.:283923
## Max. :2993 Max. :439471
The average size of a house across both areas combined is 2,005 square feet. The minimum price of a house is $17,546, and 75% of houses were priced at up to $283,923. There are more observed houses in the rural area (169) than in the urban area (131).
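To look at the two neighbourhood groups directly rather than reading them off summary(), a short sketch (using only the columns defined above) could compare the group sizes and group means; only the way the values would be obtained is shown here:
table(mydata1$Neighbourhood)                          # group sizes: 131 Urban, 169 Rural
tapply(mydata1$Price, mydata1$Neighbourhood, mean)    # average price per neighbourhood
tapply(mydata1$Sq_feet, mydata1$Neighbourhood, mean)  # average size per neighbourhood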
Before performing the regression analysis, we have to check if all assumptions and additional requirements are met:
Additional requirements:
There is a linear relationship between Price and Sq_feet.
library(car)
## Loading required package: carData
scatterplotMatrix(mydata1[ , c(-1, -3)],
smooth = FALSE)
scatterplot(Price ~ Sq_feet | Neighbourhood,
xlab = "Size (in square feet)",
ylab = "Price of the house",
main = "Price of the house based on neighbourhood",
smooth = FALSE,
data = mydata1)
Based on the scatterplot above, I can already see that the average price of a house in a rural area does not differ much from the average price of a house in an urban area. There is no visible difference in the gradients of the two lines, so with each additional square foot of house size, the price changes by roughly the same amount in rural as in urban areas.
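To check this visual impression more formally, one possibility (a sketch, not part of the model fitted below) is to add an interaction term between size and neighbourhood and see whether its coefficient is significant; a non-significant interaction would support the conclusion that the slope is the same in both areas:
fit_check <- lm(Price ~ Sq_feet * Neighbourhood, data = mydata1)  # includes Sq_feet:Neighbourhood
summary(fit_check)$coefficients                                   # look at the interaction row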
fit1 <- lm(Price ~ Sq_feet + Neighbourhood,
data = mydata1)
mydata1$StdResid <- round(rstandard(fit1), 3)
mydata1$CooksD <- round(cooks.distance(fit1), 3)
hist(mydata1$StdResid,
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
Based on the Shapiro-Wilk test below, I cannot reject H0 (p-value = 0.121 > 0.05), therefore I conclude that the errors are normally distributed. The histogram above also supports this, and no standardised residuals exceed ±3.
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.99228, p-value = 0.1212
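As a quick numerical check of the statement that no standardised residuals fall outside ±3, a minimal sketch using the StdResid column created above:
range(mydata1$StdResid)         # smallest and largest standardised residual
sum(abs(mydata1$StdResid) > 3)  # number of residuals beyond +/- 3 (expected to be 0)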
hist(mydata1$CooksD,
xlab = "Cook's D",
ylab = "Frequency",
main = "Histogram of Cook's D")
This graph indicates that there are no units with a high influence on the model, because the distribution of Cook's D shows no gaps or isolated values in its right tail.
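Besides the visual check, a common rule of thumb (one of several possible cut-offs) is to flag observations with Cook's D above 4/n; a short sketch of that check:
cutoff <- 4 / nrow(mydata1)       # rule-of-thumb threshold 4/n
which(mydata1$CooksD > cutoff)    # row positions of potentially influential observations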
mydata1$StdFitted <- scale(fit1$fitted.values)
library(car)
scatterplot(x = mydata1$StdFitted, y = mydata1$StdResid,
            xlab = "Standardised fitted values",
            ylab = "Standardised residuals",
            boxplots = TRUE,
            regLine = FALSE,
            smooth = FALSE)
From the graph above I can conclude that there is no heteroscedasticity present, as the points are randomly scattered around the horizontal line at zero. This indicates that the model captures the systematic variance in my dataset. To confirm this, I also ran the Breusch-Pagan test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.3277539
## Prob > Chi2 = 0.5669847
Based on the sample data, I cannot reject H0 (p-value = 0.567 > 0.05) and conclude that the error variance is constant, i.e. homoscedasticity is present.
vif(fit1)
## Sq_feet Neighbourhood
## 1.000185 1.000185
As both VIF statistics are lower than 5, I can conclude that there is no problematic multicollinearity between the explanatory variables.
summary(fit1)
##
## Call:
## lm(formula = Price ~ Sq_feet + Neighbourhood, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -124902 -30137 -3015 27516 131941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23110.307 9668.078 2.390 0.0175 *
## Sq_feet 100.782 4.405 22.877 <2e-16 ***
## NeighbourhoodRural -2407.386 5359.845 -0.449 0.6536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46040 on 297 degrees of freedom
## Multiple R-squared: 0.638, Adjusted R-squared: 0.6355
## F-statistic: 261.7 on 2 and 297 DF, p-value: < 2.2e-16
cor_coef <- sqrt(summary(fit1)$r.squared)
cor_coef
## [1] 0.7987327
From the linear regression model above I can see that only the partial coefficient for the size of the house is statistically significant. If the size of the house increases by one square foot, the price of the house increases by $100.78 on average, assuming the neighbourhood stays the same (p-value < 0.001). The coefficient for neighbourhood is not statistically significant, therefore, based on the sample data, I cannot conclude that the price of the house is affected by the neighbourhood (p-value = 0.654). If the coefficient were statistically significant, I would interpret it as follows: given the size of the house, the price of a house in rural areas is on average $2,407.39 lower than in urban areas.
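To complement these point estimates, 95% confidence intervals for the coefficients can be reported as well; a minimal sketch:
confint(fit1, level = 0.95)  # the interval for NeighbourhoodRural will include 0, matching its p-value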
Based on the F-test, I can reject H0 (p-value < 0.001) and conclude that at least one of the explanatory variables (size or neighbourhood) affects the price of the house. This also answers my research question.
Based on the multiple correlation coefficient (0.80), I can conclude that the relationship between the price and the two explanatory variables, size of the house and neighbourhood, is strong.
I can say that 63.8% of the variability in house prices is explained by the two explanatory variables, size in square feet and neighbourhood.
mydata1$Fitted <- round(fit1$fitted.values, 1)
mydata1$Residual <- round(mydata1$Price - mydata1$Fitted, 1)
head(mydata1[ , c("Sq_feet", "Neighbourhood", "Price", "Fitted", "Residual")])
## # A tibble: 6 × 5
## Sq_feet Neighbourhood Price Fitted Residual
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 2891 Urban 324842. 314472. 10370.
## 2 1337 Urban 143337. 157856. -14519.
## 3 2786 Urban 331326. 303890. 27436
## 4 1464 Rural 177784. 168248. 9536.
## 5 1024 Rural 84251. 123904. -39653.
## 6 2682 Urban 306779. 293409. 13370
sum(mydata1$Residual)
## [1] -0.4
The sum of the residuals is -0.4, which is essentially zero; the small deviation only appears because the fitted values were rounded to one decimal place before the residuals were computed. In an OLS model with an intercept the residuals always sum to (approximately) zero, so this on its own does not tell us how well the model fits. A more relevant observation is that neighbourhood turned out not to be a statistically significant coefficient, so the fit is driven almost entirely by the size of the house.
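To see that the non-zero sum is only a rounding artefact, the residuals can also be summed directly from the fitted model object; a short sketch:
sum(residuals(fit1))   # practically zero (up to floating-point error) for an OLS model with an intercept
sum(mydata1$Residual)  # -0.4, because Fitted was rounded to one decimal place first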