Research Question 1:
Can area of a house (area) and wether a house has a basement or not
(basement) explain the differences in the house prices (price) expressed
in €?
mydata <- read.table("./Housing.csv", header=TRUE, sep=",")
mydata$ID <- seq(1, nrow(mydata)) #Creating new variable ID, just in case if I will need it
head(mydata)
## price area basement ID
## 1 13300000 7420 no 1
## 2 12250000 8960 no 2
## 3 12250000 9960 yes 3
## 4 12215000 7500 yes 4
## 5 11410000 7420 yes 5
## 6 10850000 7500 yes 6
Unit of observation: one house
Description of data:
The data set was taken from Kaggle.com (Housing prices dataset) and the sample size is 545 houses.
mydata$basement <- factor(mydata$basement,
levels = c("no", "yes"),
labels = c("no", "yes"))
head(mydata)
## price area basement ID
## 1 13300000 7420 no 1
## 2 12250000 8960 no 2
## 3 12250000 9960 yes 3
## 4 12215000 7500 yes 4
## 5 11410000 7420 yes 5
## 6 10850000 7500 yes 6
summary(mydata[c("price", "area", "basement")])
## price area basement
## Min. : 1750000 Min. : 1650 no :354
## 1st Qu.: 3430000 1st Qu.: 3600 yes:191
## Median : 4340000 Median : 4600
## Mean : 4766729 Mean : 5151
## 3rd Qu.: 5740000 3rd Qu.: 6360
## Max. :13300000 Max. :16200
In the explanatory variable basement, “no” is a refference group because it is most common in our case (frequencies compared to “yes”).
#Checking if slopes are different (if so I need interactions)
library(car)
## Loading required package: carData
scatterplot(price ~ area | basement,
ylab = "price of a house in €",
xlab = "area of a house in square feet",
smooth = FALSE,
boxplots = FALSE,
data = mydata)
Slopes seem parallel, but I will still check if maybe they are statistically different. If so I will continue my analysis with interactions.
fit_no <- lm(price ~ area,
data = mydata[mydata$basement == "no", ])
summary(fit_no)
##
## Call:
## lm(formula = price ~ area, data = mydata[mydata$basement == "no",
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4467752 -1026195 -240496 779457 7759605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.280e+06 2.138e+05 10.66 <2e-16 ***
## area 4.394e+02 3.875e+01 11.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1578000 on 352 degrees of freedom
## Multiple R-squared: 0.2676, Adjusted R-squared: 0.2655
## F-statistic: 128.6 on 1 and 352 DF, p-value: < 2.2e-16
fit_yes <- lm(price ~ area,
data = mydata[mydata$basement == "yes", ])
summary(fit_yes)
##
## Call:
## lm(formula = price ~ area, data = mydata[mydata$basement == "yes",
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2824988 -865664 -251249 544235 5901283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.678e+06 2.874e+05 9.316 <2e-16 ***
## area 4.848e+02 5.027e+01 9.643 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1507000 on 189 degrees of freedom
## Multiple R-squared: 0.3298, Adjusted R-squared: 0.3262
## F-statistic: 92.98 on 1 and 189 DF, p-value: < 2.2e-16
fit0 <- lm(price ~ area + basement + area:basement,
data = mydata)
summary(fit0)
##
## Call:
## lm(formula = price ~ area + basement + area:basement, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4467752 -967041 -242498 688344 7759605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.280e+06 2.105e+05 10.833 <2e-16 ***
## area 4.394e+02 3.815e+01 11.519 <2e-16 ***
## basementyes 3.980e+05 3.635e+05 1.095 0.274
## area:basementyes 4.535e+01 6.436e+01 0.705 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1553000 on 541 degrees of freedom
## Multiple R-squared: 0.3141, Adjusted R-squared: 0.3103
## F-statistic: 82.59 on 3 and 541 DF, p-value: < 2.2e-16
P-value at area:basementyes is not statistically significant (p=48,1%), therefore interactions are not suitable for my case.
library(car)
scatterplot(price ~ area,
ylab = "Price of a house in €",
xlab = "Area of a house in square feet",
smooth = FALSE,
data = mydata)
It can be shown from the scaterplot above that linearity is met.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(mydata, aes(y = price, x = area)) +
geom_point(colour = "darkred") +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
ylab("Price of a house in €") +
xlab("Area of a house in square feet") +
scale_x_continuous(breaks = seq(0, 10000, 1000), limits = c(0, 10000)) +
scale_y_continuous(breaks = seq(0, 2000000, 100000), limits = c(0, 20000000)) +
theme_dark() +
geom_text(
label = mydata$ID,
nudge_x = 0.50, nudge_y = 0.3,
check_overlap = T, size = 2.5) +
geom_hline(yintercept = mean(mydata$price), linetype = "dashed", colour = "blue") +
geom_vline(xintercept = mean(mydata$area), linetype = "dashed", colour = "blue")
## Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 18 rows containing missing values (`geom_point()`).
## Warning: Removed 18 rows containing missing values (`geom_text()`).
From the ggplot above, the unexplained, explained and total deviation can be seen. the distance from any unit (red dot) to the blue line (the average) is called the total deviation. The distance from the unit to the yellow line is called unexplained deviation or residual, and the distance between the yellow line and the blue line is called the explained deviation (the distance between fitted and average value).
Assumptions:
- Linearity in parameters (this one is met).
- The expected value of errors equals to 0 (with properly specified
regression models (inclusion of all relevant explanatory variables in
correct form and inclusion of regression constant), I can assume that
this assumption is met).
- Homoskedasticity needs to be met (later in the analysis I will show
that it is violated, therefore I will perform White’s robust
correction).
- Errors are independent (this assumption is met, because each unit is
observed only once).
- No perfect multicolinearity (it is met because I have only 1
explanatory numerical variable).
If those assumptions are met, the regression function estimated by
the least squares method is: 1. best (smallest variance of the parameter
estimates), 2. linear, 3. unbiased (𝐸 𝑏𝑗 = 𝛽𝑗) estimator (Best Linear
Unbiased Estimator, BLUE).
This is defined in the Gauss-Mark theorem.
Additional assumptions:
- Normal distribution of errors (Violated but I ignore it, because my
sample is large enough).
- The number of units is greater than the number of estimated parameters
(n>k) (this assumption is met).
Additional requirements:
- The dependent variable is numerical, the explanatory variables can be
numerical or dichotomous (i.e., Dummy variable) (this requirement is
met).
- Each explanatory variable must vary (nonzero variance), but it is
desirable that the explanatory variables assume as wide a range of
possible values as possible (this requirement is met).
fit <- lm(price ~ area,
data = mydata)
H0: Homoscedasticity is met.
H1: Homoscedasticity is violated.
mydata$StdFitted <- scale(fit$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Based on the scatterplot above I can assume that homoscedasticity will be violated and if so, I will have to apply the White’s robust correction in the analysis further on.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 105.5301
## Prob > Chi2 = 9.346867e-25
Based on the p-value (p<0,001) I can reject null hypothesis. Homoscedasticity is violated, therefore heteroscedasticity is present. Because of that standard errors are biased, t-test is biased and p-value would also be biased. Furthermore because of that I need to apply White’s robust correction.
Checking the normal distribution of errors:
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
mydata$StdResid <- round(rstandard(fit), 3)
mydata$CooksD <- round(cooks.distance(fit), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Based of the histogram above I cannot be to sure if errors are normally distributed, but still I assume that they are a little skewed to the right. If this assumption would be violated I would ignore it because my sample size is above 30. There are also a few units which fall out of the interval +/- 3 (I will remove those units, because those units are outliers).
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95216, p-value = 2.743e-12
Based on the p-value (p<0,001) I reject null hypothesis, therefore normality is violated. But because my sample is big enough I ignore this violation.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histrogram of Cooks distances")
Based on the histogram I will probably have to drop units with big gaps in between them, because it seems that those are the units with high impact.
head(mydata[order(- mydata$StdResid),], 10)
## price area basement ID StdFitted StdResid CooksD
## 1 13300000 7420 no 1 1.0457655 4.745 0.043
## 4 12215000 7500 yes 4 1.0826295 4.034 0.033
## 2 12250000 8960 no 2 1.7553969 3.635 0.050
## 5 11410000 7420 yes 5 1.0457655 3.547 0.024
## 3 12250000 9960 yes 3 2.2161964 3.347 0.062
## 14 9240000 3500 no 14 -0.7605687 3.318 0.016
## 6 10850000 7500 yes 6 1.0826295 3.169 0.020
## 10 9800000 5750 no 10 0.2762303 3.012 0.009
## 12 9681000 6000 yes 12 0.3914302 2.864 0.009
## 19 8890000 4600 no 19 -0.2536892 2.772 0.008
head(mydata[order(mydata$StdResid),], 4)
## price area basement ID StdFitted StdResid CooksD
## 404 3500000 12944 no 404 3.591222 -3.120 0.128
## 126 5943000 15600 no 126 4.815106 -2.363 0.130
## 521 2450000 7700 no 521 1.174789 -2.216 0.011
## 212 4900000 12900 no 212 3.570947 -2.209 0.063
I will remove units ID 1,4,2,5,3,14,6,10,12,19 and 404 because those are outliers.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(404, 1, 4, 2, 5, 3, 14, 6, 10, 12, 19)) #Removing units
#Checking if those units got removed and if there are any more outliers
head(mydata[order(- mydata$StdResid),], 10)
## price area basement ID StdFitted StdResid CooksD
## 11 8750000 4320 yes 21 -0.3827130 2.766 0.008
## 14 8645000 4560 yes 24 -0.2721212 2.629 0.007
## 7 9100000 6000 yes 16 0.3914302 2.496 0.007
## 5 9310000 6550 no 13 0.6448699 2.469 0.008
## 1 10150000 8580 no 7 1.5802930 2.411 0.019
## 3 9870000 8100 yes 9 1.3591092 2.373 0.015
## 8 9100000 6600 yes 17 0.6679099 2.321 0.007
## 23 8295000 4880 no 33 -0.1246653 2.314 0.005
## 10 8855000 6420 no 20 0.5849660 2.218 0.006
## 20 8400000 5500 yes 30 0.1610304 2.199 0.005
head(mydata[order(mydata$StdResid),], 4)
## price area basement ID StdFitted StdResid CooksD
## 116 5943000 15600 no 126 4.815106 -2.363 0.130
## 510 2450000 7700 no 521 1.174789 -2.216 0.011
## 202 4900000 12900 no 212 3.570947 -2.209 0.063
## 442 3150000 9000 no 453 1.773829 -2.156 0.018
head(mydata[order(-mydata$CooksD),], 10)
## price area basement ID StdFitted StdResid CooksD
## 116 5943000 15600 no 126 4.815106 -2.363 0.130
## 202 4900000 12900 no 212 3.570947 -2.209 0.063
## 177 5110000 11410 no 187 2.884356 -1.626 0.023
## 268 4305000 10360 no 278 2.400516 -1.826 0.021
## 392 3500000 9500 no 402 2.004229 -2.082 0.020
## 1 10150000 8580 no 7 1.580293 2.411 0.019
## 442 3150000 9000 no 453 1.773829 -2.156 0.018
## 3 9870000 8100 yes 9 1.359109 2.373 0.015
## 182 5040000 10700 yes 192 2.557188 -1.459 0.015
## 57 6930000 13200 yes 67 3.709187 -0.998 0.014
tail(mydata)
## price area basement ID StdFitted StdResid CooksD
## 529 1855000 2990 no 540 -0.9955764 -1.213 0.003
## 530 1820000 3000 yes 541 -0.9909684 -1.238 0.003
## 531 1767150 2400 no 542 -1.2674482 -1.097 0.003
## 532 1750000 3620 no 543 -0.7052727 -1.463 0.003
## 533 1750000 2910 no 544 -1.0324404 -1.256 0.003
## 534 1750000 3850 no 545 -0.5992888 -1.530 0.003
I will remove units ID 116 and 202, because those two units are units with high impact.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(116, 202)) #Removing units
tail(mydata) #Those two units got removed, because the number Of units went from 534 to 532, so everything is OK
## price area basement ID StdFitted StdResid CooksD
## 527 1855000 2990 no 540 -0.9955764 -1.213 0.003
## 528 1820000 3000 yes 541 -0.9909684 -1.238 0.003
## 529 1767150 2400 no 542 -1.2674482 -1.097 0.003
## 530 1750000 3620 no 543 -0.7052727 -1.463 0.003
## 531 1750000 2910 no 544 -1.0324404 -1.256 0.003
## 532 1750000 3850 no 545 -0.5992888 -1.530 0.003
summary(fit)
##
## Call:
## lm(formula = price ~ area, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4867112 -1022228 -200135 683027 7484838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.387e+06 1.745e+05 13.68 <2e-16 ***
## area 4.620e+02 3.123e+01 14.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1581000 on 543 degrees of freedom
## Multiple R-squared: 0.2873, Adjusted R-squared: 0.286
## F-statistic: 218.9 on 1 and 543 DF, p-value: < 2.2e-16
It can be seen that Std. Errors are biased (compared to the corrected model below (fit1)).
H0: Ro^2 is equal to 0.
H1: Ro^2 is not equal to 0.
library(estimatr)
fit1 <- lm_robust(price ~ area,
se_type = "HC1",
data = mydata)
summary(fit1)
##
## Call:
## lm_robust(formula = price ~ area, data = mydata, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2425391.2 156787.44 15.47 8.114e-45 2117390.1 2733392.3 530
## area 436.1 33.47 13.03 7.380e-34 370.3 501.8 530
##
## Multiple R-squared: 0.3109 , Adjusted R-squared: 0.3096
## F-statistic: 169.7 on 1 and 530 DF, p-value: < 2.2e-16
sqrt(summary(fit1)$r.squared)
## [1] 0.5575952
Based on the sample data I reject H0 (p<0,001) and conclude that area of a house has at least some impact on the price of a house.
Explanation of the Multiple R-squared:
31,09% of variability of the price of a house is explained by
variability of the area of a house.
Explanation of estimated coefficient (436.1):
If area increases by 1 square foot, the price of a house increases on
average, 436.1€.
Explanation of multiple corrrelation
coefficient:
The relationship between price and area is positive and semi strong.
I assume that homoscedasticity is still violated, therefore I will
use White’s robust correction again.
H0: Ro^2 is equal to 0.
H1: Ro^2 is not equal to 0.
fit2 <- lm_robust(price ~ area + basement,
data = mydata)
summary(fit2)
##
## Call:
## lm_robust(formula = price ~ area + basement, data = mydata)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2252535 150766.57 14.941 2.305e-42 1956360.6 2548710.0 529
## area 431 32.84 13.125 2.895e-34 366.5 495.5 529
## basementyes 575032 121902.26 4.717 3.064e-06 335560.5 814504.3 529
##
## Multiple R-squared: 0.3377 , Adjusted R-squared: 0.3352
## F-statistic: 113.7 on 2 and 529 DF, p-value: < 2.2e-16
sqrt(summary(fit2)$r.squared)
## [1] 0.5810975
Based on the sample data I reject H0 (p<0,001) and conclude that area of a house and having a basement or not, has some impact on the price of a house.
Explanation of the Multiple R-squared:
33,77% of variability of the price of a house is explained by
variability of the area of a house and basement.
Explanation of estimated coefficient
(5.750032):
Given the value of area, the houses that have a basement have on average
higher price by 575.032€, compared to those houses without the basement
(p<0,001).
Explanation of multiple correlation coefficient: The relationship between price and area is positive and semi strong.
H0: Delta RO^2 is equal to 0.
H1: Delta RO^2 in not equal to 0.
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Warning: package 'zoo' was built under R version 4.3.2
waldtest(fit1, fit2, test = "F")
## Wald test
##
## Model 1: price ~ area
## Model 2: price ~ area + basement
## Res.Df Df F Pr(>F)
## 1 530
## 2 529 1 22.252 3.064e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value I reject H0 at (p<0,001) and conclude that model 2 (with area and basement as explanatory variables) is better than model 1 (with only area as explanatory variable).