Research Question:
Can the area of a house (area) and whether it has a basement
(basement) explain the differences in house prices (price), expressed
in dollars?
mydata <- read.table("./Housing.csv", header=TRUE, sep=",")
mydata$ID <- seq(1, nrow(mydata)) # Create an ID variable, in case it is needed later
head(mydata)
## price area basement ID
## 1 13300000 7420 no 1
## 2 12250000 8960 no 2
## 3 12250000 9960 yes 3
## 4 12215000 7500 yes 4
## 5 11410000 7420 yes 5
## 6 10850000 7500 yes 6
Unit of observation: one house
Description of data:
The data set was taken from Kaggle.com (Housing prices dataset) and the sample size is 545 houses.
mydata$basement <- factor(mydata$basement,
levels = c("no", "yes"),
labels = c("no", "yes"))
head(mydata)
## price area basement ID
## 1 13300000 7420 no 1
## 2 12250000 8960 no 2
## 3 12250000 9960 yes 3
## 4 12215000 7500 yes 4
## 5 11410000 7420 yes 5
## 6 10850000 7500 yes 6
summary(mydata[c("price", "area", "basement")])
## price area basement
## Min. : 1750000 Min. : 1650 no :354
## 1st Qu.: 3430000 1st Qu.: 3600 yes:191
## Median : 4340000 Median : 4600
## Mean : 4766729 Mean : 5151
## 3rd Qu.: 5740000 3rd Qu.: 6360
## Max. :13300000 Max. :16200
In the explanatory variable basement, “no” is the reference group: it
is the first factor level, and it is also the more common category in
our sample (354 houses vs. 191).
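By default, R uses the first factor level as the reference group. If a different baseline were ever needed, it could be set explicitly; the line below is a sketch of that (my addition, and redundant here, since "no" is already the first level).
mydata$basement <- relevel(mydata$basement, ref = "no") # keep "no" as the baseline (my addition)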
The average price of a house in our sample is just under 5 million
dollars, and the most expensive house in our sample is worth 13.3
million dollars. The smallest area is 1,650 square feet (approximately
150 square meters) and the largest is 16,200 square feet (approximately
1,500 square meters).
In my regression model I will try to explain how strongly the area of a house and the presence of a basement affect its price. I expect house prices to increase with area, so I assume the relationship between price and area is positive. I also assume that having a basement increases the price of a house. I therefore predict that the model including both explanatory variables (area and basement) will have a stronger relationship with price than the model including only area.
# Checking whether the slopes differ between the groups (if so, I need interactions)
library(car)
## Loading required package: carData
scatterplot(price ~ area | basement,
ylab = "Price of a house in $",
xlab = "Area of a house in square feet",
smooth = FALSE,
boxplots = FALSE,
data = mydata)
The slopes seem parallel, but I will still check whether they are statistically different. If so, I will continue the analysis with interactions.
fit_no <- lm(price ~ area,
data = mydata[mydata$basement == "no", ])
summary(fit_no)
##
## Call:
## lm(formula = price ~ area, data = mydata[mydata$basement == "no",
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4467752 -1026195 -240496 779457 7759605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.280e+06 2.138e+05 10.66 <2e-16 ***
## area 4.394e+02 3.875e+01 11.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1578000 on 352 degrees of freedom
## Multiple R-squared: 0.2676, Adjusted R-squared: 0.2655
## F-statistic: 128.6 on 1 and 352 DF, p-value: < 2.2e-16
fit_yes <- lm(price ~ area,
data = mydata[mydata$basement == "yes", ])
summary(fit_yes)
##
## Call:
## lm(formula = price ~ area, data = mydata[mydata$basement == "yes",
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2824988 -865664 -251249 544235 5901283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.678e+06 2.874e+05 9.316 <2e-16 ***
## area 4.848e+02 5.027e+01 9.643 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1507000 on 189 degrees of freedom
## Multiple R-squared: 0.3298, Adjusted R-squared: 0.3262
## F-statistic: 92.98 on 1 and 189 DF, p-value: < 2.2e-16
fit0 <- lm(price ~ area + basement + area:basement,
data = mydata)
summary(fit0)
##
## Call:
## lm(formula = price ~ area + basement + area:basement, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4467752 -967041 -242498 688344 7759605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.280e+06 2.105e+05 10.833 <2e-16 ***
## area 4.394e+02 3.815e+01 11.519 <2e-16 ***
## basementyes 3.980e+05 3.635e+05 1.095 0.274
## area:basementyes 4.535e+01 6.436e+01 0.705 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1553000 on 541 degrees of freedom
## Multiple R-squared: 0.3141, Adjusted R-squared: 0.3103
## F-statistic: 82.59 on 3 and 541 DF, p-value: < 2.2e-16
The interaction term area:basementyes is not statistically significant (p = 0.481), so interactions are not suitable in my case and I continue with an additive model.
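The same conclusion can also be reached with a partial F-test comparing the additive model against the interaction model; the sketch below is my addition (fit_add is a name I introduce here), and for a single added term it reproduces the interaction term's p-value.
# Partial F-test (my addition): does the interaction improve the fit?
fit_add <- lm(price ~ area + basement, data = mydata)
anova(fit_add, fit0)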
library(car)
scatterplot(price ~ area,
ylab = "Price of a house in €",
xlab = "Area of a house in square feet",
smooth = FALSE,
data = mydata)
The scatterplot above suggests that the linearity assumption is met.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(mydata, aes(y = price, x = area)) +
geom_point(colour = "darkred") +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
ylab("Price of a house in $") +
xlab("Area of a house in square feet") +
scale_x_continuous(breaks = seq(0, 10000, 1000), limits = c(0, 10000)) +
scale_y_continuous(breaks = seq(0, 20000000, 1000000), limits = c(0, 20000000)) +
theme_dark() +
geom_text(
label = mydata$ID,
nudge_x = 0.50, nudge_y = 0.3,
check_overlap = TRUE, size = 2.5) +
geom_hline(yintercept = mean(mydata$price), linetype = "dashed", colour = "blue") +
geom_vline(xintercept = mean(mydata$area), linetype = "dashed", colour = "blue")
## Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 18 rows containing missing values (`geom_point()`).
## Warning: Removed 18 rows containing missing values (`geom_text()`).
The warnings above arise because the x-axis limit excludes 18 houses with area above 10,000 square feet. From the ggplot, the unexplained, explained, and total deviations can be seen. The vertical distance from any unit (red dot) to the blue dashed line (the mean price) is the total deviation. The distance from the unit to the orange regression line is the unexplained deviation, or residual, and the distance between the orange line and the blue line is the explained deviation (the distance between the fitted and mean values).
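To make the decomposition concrete, the sketch below (my addition, using a temporary fit of the simple model) computes the three sums of squares and verifies that the total equals explained plus unexplained.
# Deviation decomposition (my addition): TSS = ESS + RSS
fit_tmp <- lm(price ~ area, data = mydata)
TSS <- sum((mydata$price - mean(mydata$price))^2) # total
ESS <- sum((fitted(fit_tmp) - mean(mydata$price))^2) # explained
RSS <- sum(residuals(fit_tmp)^2) # unexplained (residual)
c(TSS = TSS, ESS_plus_RSS = ESS + RSS, R2 = ESS / TSS)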
Assumptions:
- Linearity in parameters (this one is met).
- The expected value of the errors equals 0 (with a properly specified
regression model (all relevant explanatory variables included in the
correct form, and a regression constant included), I can assume that
this assumption is met).
- Homoscedasticity needs to be met (later in the analysis I will show
that it is violated, so I will apply White's robust correction).
- Errors are independent (this assumption is met, because each unit is
observed only once).
- No perfect multicollinearity (met, because there is only one
numerical explanatory variable).
If these assumptions are met, the regression function estimated by
the least squares method is: 1. best (smallest variance of the parameter
estimates), 2. linear, 3. unbiased (E(b_j) = β_j), i.e., the Best Linear
Unbiased Estimator (BLUE).
This is stated in the Gauss–Markov theorem.
Additional assumptions:
- Normal distribution of errors (violated, but I ignore it because my
sample is large enough).
- The number of units is greater than the number of estimated parameters
(n>k) (this assumption is met).
Additional requirements:
- The dependent variable is numerical; the explanatory variables can be
numerical or dichotomous (i.e., dummy variables) (this requirement is
met).
- Each explanatory variable must vary (nonzero variance), and it is
desirable that the explanatory variables cover as wide a range of
values as possible (this requirement is met; see the quick check below).
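As a quick check of the last requirement (my addition):
# Both explanatory variables vary (my addition)
var(mydata$area) # nonzero variance of the numeric predictor
table(mydata$basement) # both levels of the dummy are present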
fit <- lm(price ~ area,
data = mydata)
H0: Homoscedasticity is met.
H1: Homoscedasticity is violated.
mydata$StdFitted <- scale(fit$fitted.values)
mydata$StdResid <- round(rstandard(fit), 3) # standardized residuals, needed for the plot below
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Based on the scatterplot above, I expect that homoscedasticity is violated; if so, I will have to apply White's robust correction later in the analysis.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 105.5301
## Prob > Chi2 = 9.346867e-25
Based on the p-value (p < 0.001) I reject the null hypothesis: homoscedasticity is violated, i.e., heteroscedasticity is present. Because of that, the standard errors are biased, and the t-tests and p-values based on them are unreliable. I therefore need to apply White's robust correction.
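As an aside (my addition), the same HC1 correction can be obtained without refitting, via the sandwich and lmtest packages, assuming they are installed; the analysis below uses estimatr::lm_robust instead.
# HC1 robust standard errors for fit (my addition, alternative route)
library(sandwich)
library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))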
Checking the normal distribution of errors:
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
mydata$CooksD <- round(cooks.distance(fit), 3) # StdResid was already computed above
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Based on the histogram above I cannot be certain that the errors are normally distributed; they appear slightly skewed to the right. If this assumption is violated, I will ignore it, because my sample size is sufficiently large. There are also a few units that fall outside the interval ±3; I will remove those units, because they are outliers (counted below).
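A quick count of the units outside the ±3 band (my addition):
# How many standardized residuals fall outside +/- 3? (my addition)
sum(abs(mydata$StdResid) > 3)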
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95216, p-value = 2.743e-12
Based on the p-value (p < 0.001) I reject the null hypothesis: normality is violated. But because my sample is large enough, I ignore this violation.
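A normal Q-Q plot offers a complementary visual check (my addition):
# Q-Q plot of the standardized residuals (my addition)
qqnorm(mydata$StdResid)
qqline(mydata$StdResid)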
hist(mydata$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
Based on the histogram, I will probably have to drop the units that are separated from the rest of the distribution by large gaps, because those appear to be high-influence units (a rule-of-thumb check follows below).
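A common rule of thumb flags units with Cook's distance above 4/n as potentially influential; the check below is my addition.
# Rule-of-thumb influence check (my addition)
cutoff <- 4 / nrow(mydata)
sum(mydata$CooksD > cutoff) # number of units above the 4/n cutoff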
head(mydata[order(- mydata$StdResid),], 10)
## price area basement ID StdFitted StdResid CooksD
## 1 13300000 7420 no 1 1.0457655 4.745 0.043
## 4 12215000 7500 yes 4 1.0826295 4.034 0.033
## 2 12250000 8960 no 2 1.7553969 3.635 0.050
## 5 11410000 7420 yes 5 1.0457655 3.547 0.024
## 3 12250000 9960 yes 3 2.2161964 3.347 0.062
## 14 9240000 3500 no 14 -0.7605687 3.318 0.016
## 6 10850000 7500 yes 6 1.0826295 3.169 0.020
## 10 9800000 5750 no 10 0.2762303 3.012 0.009
## 12 9681000 6000 yes 12 0.3914302 2.864 0.009
## 19 8890000 4600 no 19 -0.2536892 2.772 0.008
head(mydata[order(mydata$StdResid),], 4)
## price area basement ID StdFitted StdResid CooksD
## 404 3500000 12944 no 404 3.591222 -3.120 0.128
## 126 5943000 15600 no 126 4.815106 -2.363 0.130
## 521 2450000 7700 no 521 1.174789 -2.216 0.011
## 212 4900000 12900 no 212 3.570947 -2.209 0.063
I will remove the units with ID 1, 2, 3, 4, 5, 6, 10, 12, 14, 19, and 404, because they are outliers.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(404, 1, 4, 2, 5, 3, 14, 6, 10, 12, 19)) # Remove the outlier units
# Check that those units were removed and whether any outliers remain
head(mydata[order(- mydata$StdResid),], 10)
## price area basement ID StdFitted StdResid CooksD
## 11 8750000 4320 yes 21 -0.3827130 2.766 0.008
## 14 8645000 4560 yes 24 -0.2721212 2.629 0.007
## 7 9100000 6000 yes 16 0.3914302 2.496 0.007
## 5 9310000 6550 no 13 0.6448699 2.469 0.008
## 1 10150000 8580 no 7 1.5802930 2.411 0.019
## 3 9870000 8100 yes 9 1.3591092 2.373 0.015
## 8 9100000 6600 yes 17 0.6679099 2.321 0.007
## 23 8295000 4880 no 33 -0.1246653 2.314 0.005
## 10 8855000 6420 no 20 0.5849660 2.218 0.006
## 20 8400000 5500 yes 30 0.1610304 2.199 0.005
head(mydata[order(mydata$StdResid),], 4)
## price area basement ID StdFitted StdResid CooksD
## 116 5943000 15600 no 126 4.815106 -2.363 0.130
## 510 2450000 7700 no 521 1.174789 -2.216 0.011
## 202 4900000 12900 no 212 3.570947 -2.209 0.063
## 442 3150000 9000 no 453 1.773829 -2.156 0.018
No more outliers.
head(mydata[order(-mydata$CooksD),], 10)
## price area basement ID StdFitted StdResid CooksD
## 116 5943000 15600 no 126 4.815106 -2.363 0.130
## 202 4900000 12900 no 212 3.570947 -2.209 0.063
## 177 5110000 11410 no 187 2.884356 -1.626 0.023
## 268 4305000 10360 no 278 2.400516 -1.826 0.021
## 392 3500000 9500 no 402 2.004229 -2.082 0.020
## 1 10150000 8580 no 7 1.580293 2.411 0.019
## 442 3150000 9000 no 453 1.773829 -2.156 0.018
## 3 9870000 8100 yes 9 1.359109 2.373 0.015
## 182 5040000 10700 yes 192 2.557188 -1.459 0.015
## 57 6930000 13200 yes 67 3.709187 -0.998 0.014
tail(mydata)
## price area basement ID StdFitted StdResid CooksD
## 529 1855000 2990 no 540 -0.9955764 -1.213 0.003
## 530 1820000 3000 yes 541 -0.9909684 -1.238 0.003
## 531 1767150 2400 no 542 -1.2674482 -1.097 0.003
## 532 1750000 3620 no 543 -0.7052727 -1.463 0.003
## 533 1750000 2910 no 544 -1.0324404 -1.256 0.003
## 534 1750000 3850 no 545 -0.5992888 -1.530 0.003
I will remove the units with ID 126 and 212, because those two units have high influence. (Note that the row names 116 and 202 in the output above are leftover row indices, not IDs.)
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(126, 212)) # Remove the high-influence units
tail(mydata) # The two units were removed: the number of units went from 534 to 532, so everything is OK
## price area basement ID StdFitted StdResid CooksD
## 527 1855000 2990 no 540 -0.9955764 -1.213 0.003
## 528 1820000 3000 yes 541 -0.9909684 -1.238 0.003
## 529 1767150 2400 no 542 -1.2674482 -1.097 0.003
## 530 1750000 3620 no 543 -0.7052727 -1.463 0.003
## 531 1750000 2910 no 544 -1.0324404 -1.256 0.003
## 532 1750000 3850 no 545 -0.5992888 -1.530 0.003
summary(fit)
##
## Call:
## lm(formula = price ~ area, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4867112 -1022228 -200135 683027 7484838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.387e+06 1.745e+05 13.68 <2e-16 ***
## area 4.620e+02 3.123e+01 14.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1581000 on 543 degrees of freedom
## Multiple R-squared: 0.2873, Adjusted R-squared: 0.286
## F-statistic: 218.9 on 1 and 543 DF, p-value: < 2.2e-16
Note that fit above was estimated before the outliers were removed and uses classical standard errors; compared with the corrected model below (fit1), which is refitted on the cleaned data with White's robust correction, its standard errors are biased.
H0: ρ² = 0.
H1: ρ² > 0.
library(estimatr)
fit1 <- lm_robust(price ~ area,
se_type = "HC1",
data = mydata)
summary(fit1)
##
## Call:
## lm_robust(formula = price ~ area, data = mydata, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2425391.2 156787.44 15.47 8.114e-45 2117390.1 2733392.3 530
## area 436.1 33.47 13.03 7.380e-34 370.3 501.8 530
##
## Multiple R-squared: 0.3109 , Adjusted R-squared: 0.3096
## F-statistic: 169.7 on 1 and 530 DF, p-value: < 2.2e-16
sqrt(summary(fit1)$r.squared)
## [1] 0.5575952
Based on the sample data I reject H0 (p < 0.001) and conclude that the area of a house has at least some impact on its price.
Explanation of the Multiple R-squared:
31.09% of the variability in house prices is explained by the
variability in house area.
Explanation of the estimated coefficient (436.1):
If the area increases by 1 square foot, the price of a house increases
on average by $436.10.
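As an illustration (my addition), the estimated coefficients give the expected price of a hypothetical 5,000-square-foot house:
# Expected price of a hypothetical 5,000 sq ft house (my addition)
coef(fit1)["(Intercept)"] + coef(fit1)["area"] * 5000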
Explanation of the multiple correlation coefficient (0.5576):
The relationship between price and area is positive and moderately
strong.
I assume that homoscedasticity is still violated, therefore I use
White's robust correction again.
H0: ρ² = 0.
H1: ρ² > 0.
fit2 <- lm_robust(price ~ area + basement,
se_type = "HC1",
data = mydata)
summary(fit2)
##
## Call:
## lm_robust(formula = price ~ area + basement, data = mydata, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2252535 149839.11 15.03 8.676e-43 1958182.6 2546888 529
## area 431 32.61 13.22 1.153e-34 366.9 495 529
## basementyes 575032 121824.10 4.72 3.021e-06 335714.0 814351 529
##
## Multiple R-squared: 0.3377 , Adjusted R-squared: 0.3352
## F-statistic: 115 on 2 and 529 DF, p-value: < 2.2e-16
sqrt(summary(fit2)$r.squared)
## [1] 0.5810975
Based on the sample data I reject H0 (p < 0.001) and conclude that the area of a house and the presence of a basement have some impact on its price.
Explanation of the Multiple R-squared:
33.77% of the variability in house prices is explained by the
variability in area and basement status.
Explanation of the estimated coefficient (575032):
Holding area constant, houses with a basement are on average $575,032
more expensive than houses without one (p < 0.001).
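As an illustration (my addition), the model can predict the prices of two hypothetical 5,000-square-foot houses, one without and one with a basement:
# Predicted prices for two hypothetical houses (my addition)
newdata <- data.frame(area = 5000,
basement = factor(c("no", "yes"), levels = c("no", "yes")))
predict(fit2, newdata = newdata)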
Explanation of the multiple correlation coefficient (0.5811):
The relationship between price and the two explanatory variables (area
and basement) is positive and moderately strong.
H0: Δρ² = 0.
H1: Δρ² > 0.
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Warning: package 'zoo' was built under R version 4.3.2
waldtest(fit1, fit2, test = "F")
## Wald test
##
## Model 1: price ~ area
## Model 2: price ~ area + basement
## Res.Df Df F Pr(>F)
## 1 530
## 2 529 1 22.28 3.021e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value (p < 0.001) I reject H0 and conclude that model 2 (with area and basement as explanatory variables) is better than model 1 (with only area as an explanatory variable).
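The improvement can also be expressed as the increment in R-squared gained by adding basement (my addition):
# Increment in R-squared from adding basement (my addition)
summary(fit2)$r.squared - summary(fit1)$r.squared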
Based on the sample data I can conclude that the area of a house, and whether it has a basement or not, can explain the differences in house prices (p < 0.001). I can also conclude that these differences are explained better by model 2 (p < 0.001), which includes both explanatory variables (area and basement), than by model 1, which includes only area.