Matej Suhalj

Research Question:
Can area of a house (area) and whether a house has a basement or not (basement) explain the differences in the house prices (price) expressed in dollars?

mydata <- read.table("./Housing.csv", header=TRUE, sep=",")
mydata$ID <- seq(1, nrow(mydata)) #Creating new variable ID, just in case if I will need it
head(mydata)
##      price area basement ID
## 1 13300000 7420       no  1
## 2 12250000 8960       no  2
## 3 12250000 9960      yes  3
## 4 12215000 7500      yes  4
## 5 11410000 7420      yes  5
## 6 10850000 7500      yes  6

Unit of observation: one house

Description of data:

The data set was taken from Kaggle.com (Housing prices dataset) and the sample size is 545 houses.

mydata$basement <- factor(mydata$basement,
                         levels = c("no", "yes"),
                         labels = c("no", "yes"))
head(mydata)
##      price area basement ID
## 1 13300000 7420       no  1
## 2 12250000 8960       no  2
## 3 12250000 9960      yes  3
## 4 12215000 7500      yes  4
## 5 11410000 7420      yes  5
## 6 10850000 7500      yes  6
summary(mydata[c("price", "area", "basement")])
##      price               area       basement 
##  Min.   : 1750000   Min.   : 1650   no :354  
##  1st Qu.: 3430000   1st Qu.: 3600   yes:191  
##  Median : 4340000   Median : 4600            
##  Mean   : 4766729   Mean   : 5151            
##  3rd Qu.: 5740000   3rd Qu.: 6360            
##  Max.   :13300000   Max.   :16200

In the explanatory variable basement, “no” is a reference group because it is most common in our case (frequencies compared to “yes”).
The average price of a house in our sample is a little bit less than 5 million dollars, and the most expensive house in our sample is worth 13.3 million dollars. The smallest area is 1650 square feet (approximately 150 square meters) and the biggest is 16200 square feet (approximately 1500 square meters).

REGRESSION MODEL

In my regression model I will be trying to explain how big of a effect on a price of a house, area of the house and having the basement has. I expect that prices of houses would increase, if area of houses will increase as well. Therefore I assume that the relationship between price and area will be positive. Furthermore, I also assume that a factor of having a basement would also increase the price of a house, therefore I predict that the model where I would include both explanatory variables (area and basement) would have a stronger relationship with price, compared to the model where I would include only area as an explanatory variable.

#Checking if slopes are different (if so I need interactions)
library(car)
## Loading required package: carData
scatterplot(price ~ area | basement, 
            ylab = "price of a house in €", 
            xlab = "area of a house in square feet", 
            smooth = FALSE,
            boxplots = FALSE,
            data = mydata)

Slopes seem parallel, but I will still check if maybe they are statistically different. If so I will continue my analysis with interactions.

REGRESSION SPLIT BY BASEMENT (NO, YES)

fit_no <- lm(price ~ area, 
                 data = mydata[mydata$basement == "no", ])

summary(fit_no)
## 
## Call:
## lm(formula = price ~ area, data = mydata[mydata$basement == "no", 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4467752 -1026195  -240496   779457  7759605 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.280e+06  2.138e+05   10.66   <2e-16 ***
## area        4.394e+02  3.875e+01   11.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1578000 on 352 degrees of freedom
## Multiple R-squared:  0.2676, Adjusted R-squared:  0.2655 
## F-statistic: 128.6 on 1 and 352 DF,  p-value: < 2.2e-16
fit_yes <- lm(price ~ area, 
               data = mydata[mydata$basement == "yes", ])

summary(fit_yes)
## 
## Call:
## lm(formula = price ~ area, data = mydata[mydata$basement == "yes", 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2824988  -865664  -251249   544235  5901283 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.678e+06  2.874e+05   9.316   <2e-16 ***
## area        4.848e+02  5.027e+01   9.643   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1507000 on 189 degrees of freedom
## Multiple R-squared:  0.3298, Adjusted R-squared:  0.3262 
## F-statistic: 92.98 on 1 and 189 DF,  p-value: < 2.2e-16
fit0 <- lm(price ~ area + basement + area:basement, 
          data = mydata)

summary(fit0)
## 
## Call:
## lm(formula = price ~ area + basement + area:basement, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4467752  -967041  -242498   688344  7759605 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.280e+06  2.105e+05  10.833   <2e-16 ***
## area             4.394e+02  3.815e+01  11.519   <2e-16 ***
## basementyes      3.980e+05  3.635e+05   1.095    0.274    
## area:basementyes 4.535e+01  6.436e+01   0.705    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1553000 on 541 degrees of freedom
## Multiple R-squared:  0.3141, Adjusted R-squared:  0.3103 
## F-statistic: 82.59 on 3 and 541 DF,  p-value: < 2.2e-16

P-value at area:basementyes is not statistically significant (p=48,1%), therefore interactions are not suitable for my case.

library(car)
scatterplot(price ~ area, 
            ylab = "Price of a house in €", 
            xlab = "Area of a house in square feet", 
            smooth = FALSE, 
            data = mydata)

It can be shown from the scaterplot above that linearity is met.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
ggplot(mydata, aes(y = price, x = area)) +
  geom_point(colour = "darkred") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
  ylab("Price of a house in €") + 
  xlab("Area of a house in square feet") +
  scale_x_continuous(breaks = seq(0, 10000, 1000), limits = c(0, 10000)) +
  scale_y_continuous(breaks = seq(0, 2000000, 100000), limits = c(0, 20000000)) +
  theme_dark() +
  geom_text(
    label = mydata$ID, 
    nudge_x = 0.50, nudge_y = 0.3, 
    check_overlap = T, size = 2.5) +
  geom_hline(yintercept = mean(mydata$price), linetype = "dashed", colour = "blue") +
  geom_vline(xintercept = mean(mydata$area), linetype = "dashed", colour = "blue") 
## Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 18 rows containing missing values (`geom_point()`).
## Warning: Removed 18 rows containing missing values (`geom_text()`).

From the ggplot above, the unexplained, explained and total deviation can be seen. the distance from any unit (red dot) to the blue line (the average) is called the total deviation. The distance from the unit to the yellow line is called unexplained deviation or residual, and the distance between the yellow line and the blue line is called the explained deviation (the distance between fitted and average value).

Assumptions:
- Linearity in parameters (this one is met).
- The expected value of errors equals to 0 (with properly specified regression models (inclusion of all relevant explanatory variables in correct form and inclusion of regression constant), I can assume that this assumption is met).
- Homoskedasticity needs to be met (later in the analysis I will show that it is violated, therefore I will perform White’s robust correction).
- Errors are independent (this assumption is met, because each unit is observed only once).
- No perfect multicolinearity (it is met because I have only 1 explanatory numerical variable).

If those assumptions are met, the regression function estimated by the least squares method is: 1. best (smallest variance of the parameter estimates), 2. linear, 3. unbiased (𝐸 𝑏𝑗 = 𝛽𝑗) estimator (Best Linear Unbiased Estimator, BLUE).
This is defined in the Gauss-Mark theorem.

Additional assumptions:
- Normal distribution of errors (Violated but I ignore it, because my sample is large enough).
- The number of units is greater than the number of estimated parameters (n>k) (this assumption is met).

Additional requirements:
- The dependent variable is numerical, the explanatory variables can be numerical or dichotomous (i.e., Dummy variable) (this requirement is met).
- Each explanatory variable must vary (nonzero variance), but it is desirable that the explanatory variables assume as wide a range of possible values as possible (this requirement is met).

fit <- lm(price ~ area, 
          data = mydata)

CHECKING HOMOSCEDASTICITY

H0: Homoscedasticity is met.
H1: Homoscedasticity is violated.

mydata$StdFitted <- scale(fit$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Based on the scatterplot above I can assume that homoscedasticity will be violated and if so, I will have to apply the White’s robust correction in the analysis further on.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    105.5301 
##  Prob > Chi2   =    9.346867e-25

Based on the p-value (p<0,001) I can reject null hypothesis. Homoscedasticity is violated, therefore heteroscedasticity is present. Because of that standard errors are biased, t-test is biased and p-value would also be biased. Furthermore because of that I need to apply White’s robust correction.

Checking the normal distribution of errors:
H0: Errors are normally distributed.
H1: Errors are not normally distributed.

mydata$StdResid <- round(rstandard(fit), 3) 
mydata$CooksD <- round(cooks.distance(fit), 3)

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Based of the histogram above I cannot be to sure if errors are normally distributed, but still I assume that they are a little skewed to the right. If this assumption would be violated I would ignore it because my sample size is sufficiently large. There are also a few units which fall out of the interval +/- 3 (I will remove those units, because those units are outliers).

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95216, p-value = 2.743e-12

Based on the p-value (p<0,001) I reject null hypothesis, therefore normality is violated. But because my sample is big enough I ignore this violation.

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histrogram of Cooks distances")

Based on the histogram I will probably have to drop units with big gaps in between them, because it seems that those are the units with high impact.

IDENTIFYING AND REMOVING OUTLIERS AND HIGH IMPACT UNITS

head(mydata[order(- mydata$StdResid),], 10)
##       price area basement ID  StdFitted StdResid CooksD
## 1  13300000 7420       no  1  1.0457655    4.745  0.043
## 4  12215000 7500      yes  4  1.0826295    4.034  0.033
## 2  12250000 8960       no  2  1.7553969    3.635  0.050
## 5  11410000 7420      yes  5  1.0457655    3.547  0.024
## 3  12250000 9960      yes  3  2.2161964    3.347  0.062
## 14  9240000 3500       no 14 -0.7605687    3.318  0.016
## 6  10850000 7500      yes  6  1.0826295    3.169  0.020
## 10  9800000 5750       no 10  0.2762303    3.012  0.009
## 12  9681000 6000      yes 12  0.3914302    2.864  0.009
## 19  8890000 4600       no 19 -0.2536892    2.772  0.008
head(mydata[order(mydata$StdResid),], 4)
##       price  area basement  ID StdFitted StdResid CooksD
## 404 3500000 12944       no 404  3.591222   -3.120  0.128
## 126 5943000 15600       no 126  4.815106   -2.363  0.130
## 521 2450000  7700       no 521  1.174789   -2.216  0.011
## 212 4900000 12900       no 212  3.570947   -2.209  0.063

I will remove units ID 1,4,2,5,3,14,6,10,12,19 and 404 because those are outliers.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!ID %in% c(404, 1, 4, 2, 5, 3, 14, 6, 10, 12, 19)) #Removing units
#Checking if those units got removed and if there are any more outliers
head(mydata[order(- mydata$StdResid),], 10)
##       price area basement ID  StdFitted StdResid CooksD
## 11  8750000 4320      yes 21 -0.3827130    2.766  0.008
## 14  8645000 4560      yes 24 -0.2721212    2.629  0.007
## 7   9100000 6000      yes 16  0.3914302    2.496  0.007
## 5   9310000 6550       no 13  0.6448699    2.469  0.008
## 1  10150000 8580       no  7  1.5802930    2.411  0.019
## 3   9870000 8100      yes  9  1.3591092    2.373  0.015
## 8   9100000 6600      yes 17  0.6679099    2.321  0.007
## 23  8295000 4880       no 33 -0.1246653    2.314  0.005
## 10  8855000 6420       no 20  0.5849660    2.218  0.006
## 20  8400000 5500      yes 30  0.1610304    2.199  0.005
head(mydata[order(mydata$StdResid),], 4)
##       price  area basement  ID StdFitted StdResid CooksD
## 116 5943000 15600       no 126  4.815106   -2.363  0.130
## 510 2450000  7700       no 521  1.174789   -2.216  0.011
## 202 4900000 12900       no 212  3.570947   -2.209  0.063
## 442 3150000  9000       no 453  1.773829   -2.156  0.018

No more outliers.

head(mydata[order(-mydata$CooksD),], 10)
##        price  area basement  ID StdFitted StdResid CooksD
## 116  5943000 15600       no 126  4.815106   -2.363  0.130
## 202  4900000 12900       no 212  3.570947   -2.209  0.063
## 177  5110000 11410       no 187  2.884356   -1.626  0.023
## 268  4305000 10360       no 278  2.400516   -1.826  0.021
## 392  3500000  9500       no 402  2.004229   -2.082  0.020
## 1   10150000  8580       no   7  1.580293    2.411  0.019
## 442  3150000  9000       no 453  1.773829   -2.156  0.018
## 3    9870000  8100      yes   9  1.359109    2.373  0.015
## 182  5040000 10700      yes 192  2.557188   -1.459  0.015
## 57   6930000 13200      yes  67  3.709187   -0.998  0.014
tail(mydata)
##       price area basement  ID  StdFitted StdResid CooksD
## 529 1855000 2990       no 540 -0.9955764   -1.213  0.003
## 530 1820000 3000      yes 541 -0.9909684   -1.238  0.003
## 531 1767150 2400       no 542 -1.2674482   -1.097  0.003
## 532 1750000 3620       no 543 -0.7052727   -1.463  0.003
## 533 1750000 2910       no 544 -1.0324404   -1.256  0.003
## 534 1750000 3850       no 545 -0.5992888   -1.530  0.003

I will remove units ID 116 and 202, because those two units are units with high impact.

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(116, 202)) #Removing units
tail(mydata) #Those two units got removed, because the number Of units went from 534 to 532, so everything is OK
##       price area basement  ID  StdFitted StdResid CooksD
## 527 1855000 2990       no 540 -0.9955764   -1.213  0.003
## 528 1820000 3000      yes 541 -0.9909684   -1.238  0.003
## 529 1767150 2400       no 542 -1.2674482   -1.097  0.003
## 530 1750000 3620       no 543 -0.7052727   -1.463  0.003
## 531 1750000 2910       no 544 -1.0324404   -1.256  0.003
## 532 1750000 3850       no 545 -0.5992888   -1.530  0.003

FIT MODEL WITHOUT CORRECTION

summary(fit)
## 
## Call:
## lm(formula = price ~ area, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4867112 -1022228  -200135   683027  7484838 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.387e+06  1.745e+05   13.68   <2e-16 ***
## area        4.620e+02  3.123e+01   14.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1581000 on 543 degrees of freedom
## Multiple R-squared:  0.2873, Adjusted R-squared:  0.286 
## F-statistic: 218.9 on 1 and 543 DF,  p-value: < 2.2e-16

It can be seen that Std. Errors are biased (compared to the corrected model below (fit1)).

WHITE’S ROBUST CORRECTION

H0: Ro^2 is equal to 0.
H1: Ro^2 is greater than 0.

library(estimatr)
fit1 <- lm_robust(price ~ area,  
                  se_type = "HC1",
                  data = mydata)


summary(fit1)
## 
## Call:
## lm_robust(formula = price ~ area, data = mydata, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept) 2425391.2  156787.44   15.47 8.114e-45 2117390.1 2733392.3 530
## area            436.1      33.47   13.03 7.380e-34     370.3     501.8 530
## 
## Multiple R-squared:  0.3109 ,    Adjusted R-squared:  0.3096 
## F-statistic: 169.7 on 1 and 530 DF,  p-value: < 2.2e-16
sqrt(summary(fit1)$r.squared)
## [1] 0.5575952

Based on the sample data I reject H0 (p<0,001) and conclude that area of a house has at least some impact on the price of a house.

Explanation of the Multiple R-squared:
31,09% of variability of the price of a house is explained by variability of the area of a house.

Explanation of estimated coefficient (436.1):
If area increases by 1 square foot, the price of a house increases on average by $436.1.

Explanation of multiple corrrelation coefficient (0.5576):
The relationship between price and area is positive and semi strong.

INCLUDING A CATEGORICAL VARIABLE

I assume that homoscedasticity is still violated, therefore I will use White’s robust correction again.
H0: Ro^2 is equal to 0.
H1: Ro^2 is greater than 0.

fit2 <- lm_robust(price ~ area + basement,
                  se_type = "HC1",
           data = mydata)

summary(fit2)
## 
## Call:
## lm_robust(formula = price ~ area + basement, data = mydata, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper  DF
## (Intercept)  2252535  149839.11   15.03 8.676e-43 1958182.6  2546888 529
## area             431      32.61   13.22 1.153e-34     366.9      495 529
## basementyes   575032  121824.10    4.72 3.021e-06  335714.0   814351 529
## 
## Multiple R-squared:  0.3377 ,    Adjusted R-squared:  0.3352 
## F-statistic:   115 on 2 and 529 DF,  p-value: < 2.2e-16
sqrt(summary(fit2)$r.squared)
## [1] 0.5810975

Based on the sample data I reject H0 (p<0,001) and conclude that area of a house and having a basement or not, has some impact on the price of a house.

Explanation of the Multiple R-squared:
33,77% of variability of the price of a house is explained by variability of the area of a house and basement.

Explanation of estimated coefficient (575032):
Given the value of area, the houses that have a basement have on average higher price by $575.032, compared to those houses without the basement (p<0,001).

Explanation of multiple correlation coefficient (0.5811):
The relationship between price, area and basement is positive and semi strong.

COMPARISON BETWEEN THE 2 MODELS

H0: Delta RO^2 is equal to 0.
H1: Delta RO^2 is greater than 0.

library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Warning: package 'zoo' was built under R version 4.3.2
waldtest(fit1, fit2, test = "F")
## Wald test
## 
## Model 1: price ~ area
## Model 2: price ~ area + basement
##   Res.Df Df     F    Pr(>F)    
## 1    530                       
## 2    529  1 22.28 3.021e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value I reject H0 at (p<0,001) and conclude that model 2 (with area and basement as explanatory variables) is better than model 1 (with only area as an explanatory variable).

CONCLUSION

Based on the sample data I can conclude that area of a house, and whether the house has a basement or not, can explain the differences in house prices (p<0,001). I can also conclude that this differences in house prices are explained better with model 2 at (p<0,001), where I had 2 explanatory variables (area and basement), compared to the model 1, where I included only 1 explanatory variable (area).