Assignment 3

Research question: How is the price of a house affected by area and furnishing status of a house?

#Importing data set
mydata <- read.csv("./Housing.csv")

#Removing variables from the data set, so I can keep the ones I need for the analysis
mydata1 <- mydata[, !(names(mydata) %in% c("bedrooms", "bathrooms", "stories", "mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "parking", "prefarea"))]

#Creating ID variable for each unit in case I need it later
mydata1$ID <- seq(1, nrow(mydata))

head(mydata1)

##      price area furnishingstatus ID
## 1 13300000 7420        furnished  1
## 2 12250000 8960        furnished  2
## 3 12250000 9960   semi-furnished  3
## 4 12215000 7500        furnished  4
## 5 11410000 7420        furnished  5
## 6 10850000 7500   semi-furnished  6

Explanation of a data set:

unit of observation: a house
number of observations: 545
number of variables: 4 (including ID)
source: Kaggle.com (Housing Prices data set)

Explanation of the variables in the data set:

Price: price of the house in $ (numerical)
Area: area of the house in square feet (numerical)
Furnishing status: indicates whether the house is “furnished”, “semi-furnished”, “unfurnished” (categorical)

#Checking which of the values of variable "furnishingstatus" is most commonly selected, so I can see what is the reference category
table(mydata1$furnishingstatus)

## 
##      furnished semi-furnished    unfurnished 
##            140            227            178

Since the most commonly status is “semi-furnished” I will select it as a reference category and I will make sure it is on the first place when factoring it.

#Factoring variable
mydata1$furnishingstatus <- factor(mydata1$furnishingstatus,
                             levels = c("semi-furnished", "unfurnished", "furnished"), 
                             labels = c("semi-furnished", "unfurnished", "furnished"))

head(mydata1)

##      price area furnishingstatus ID
## 1 13300000 7420        furnished  1
## 2 12250000 8960        furnished  2
## 3 12250000 9960   semi-furnished  3
## 4 12215000 7500        furnished  4
## 5 11410000 7420        furnished  5
## 6 10850000 7500   semi-furnished  6

#Descriptive statistics, excluding ID
summary(mydata1[ , -4])

##      price               area             furnishingstatus
##  Min.   : 1750000   Min.   : 1650   semi-furnished:227    
##  1st Qu.: 3430000   1st Qu.: 3600   unfurnished   :178    
##  Median : 4340000   Median : 4600   furnished     :140    
##  Mean   : 4766729   Mean   : 5151                         
##  3rd Qu.: 5740000   3rd Qu.: 6360                         
##  Max.   :13300000   Max.   :16200

Mean of 4766729 for variable “price”: Average price of a house in our sample is 4766729$
Min of 1650 for variable “area”: Minimum area of a house in our sample is 1650 square feet

Regression model

In my regression model I will explain the size of effect of area and furnishing status on price of the house. The price of a house is dependent variable, while area and furnishing status are independent variables. I expect that the price of a house will increase with the increase in area, so I will assume that the relationship between price and area will be positive. When it comes to furnishing status, I expect that furnished and semi furnished houses will positively influence the price of a house, while unfurnished houses will negatively influence the price of the house.

Assumptions of a regression model

Linearity of parameters

#Checking for linearity
library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

scatterplotMatrix(mydata1[ ,c(-3,-4)],
                  smooth = FALSE)

Based on the scatterplot Matrix there is a positive linear relationship between price and area. This means that as area increases, the price also increases. The linearity assumption is met.

Expected value of errors equals 0: for this assumption to be met the model should be based on the theory which means that we should include all theoretically important explanatory variables. However, in this case we will assume that this assumption is met.
Homoscedasticity

#Estimating regression function
fit <- lm(price ~ area + furnishingstatus,
           data = mydata1)

#Adding standardized fitted values, standardized residuals, and Cook’s distances to my data set 
mydata1$StdFitted <- scale(fit$fitted.values) 
mydata1$StdResid <- round(rstandard(fit), 3) 
mydata1$CooksD <- round(cooks.distance(fit), 3)

library(car)

##Checking for homoscedasticity
scatterplot(y = mydata1$StdResid, x = mydata1$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From this picture I can assume that homoscedasticity is violated as we can see a pattern, however, I will check it with formal test.

library(olsrr)

## Warning: package 'olsrr' was built under R version 4.3.2

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    106.6553 
##  Prob > Chi2   =    5.297418e-25

H0: The variance is constant (homoscedasticity)

H1: The variance is not constant (heteroscedasticity)

We reject H0 and conclude that the variance is not constant, we have heteroskedasticity (p<0.001). We can solve this issue using robust standard errors which I will show later.

Normal distribution of errors

#Checking for normal distribution and outliers
hist(mydata1$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col= "darkblue",
     border = "black")

Based on the histogram it seems like the errors are not normally distributed, but I will also check it using formal Shapiro-Wilk test. This could probably be due to outliers which are the units that fall out of the interval of +/-3 and they should be removed.

shapiro.test(mydata1$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.94437, p-value = 1.99e-13

H0: Errors are normally distributed

H1: Errors are not normally distributed

We can reject H0 and conclude that errors are not normally distributed (p<0.001). However, since we have large enough sample, this can be ignored.

#Checking if there are any outliers that fall below the treshold of -3
head(mydata1[order(mydata1$StdResid),], 6)

##       price  area furnishingstatus  ID StdFitted StdResid CooksD
## 404 3500000 12944      unfurnished 404  2.571507   -2.697  0.062
## 212 4900000 12900        furnished 212  3.532277   -2.453  0.043
## 126 5943000 15600   semi-furnished 126  4.267241   -2.313  0.068
## 453 3150000  9000   semi-furnished 453  1.650081   -2.240  0.013
## 355 3780000  8400        furnished 355  1.747850   -1.895  0.009
## 187 5110000 11410        furnished 187  2.941433   -1.881  0.018

#Checking if there are any outliers that fall above +3
head(mydata1[order(-mydata1$StdResid),], 10)

##       price area furnishingstatus ID   StdFitted StdResid CooksD
## 1  13300000 7420        furnished  1  1.35924154    4.638  0.045
## 4  12215000 7500        furnished  4  1.39096469    3.903  0.033
## 2  12250000 8960        furnished  2  1.96991207    3.519  0.036
## 10  9800000 5750      unfurnished 10 -0.28119725    3.503  0.019
## 3  12250000 9960   semi-furnished  3  2.03075893    3.479  0.042
## 5  11410000 7420        furnished  5  1.35924154    3.396  0.024
## 6  10850000 7500   semi-furnished  6  1.05527224    3.242  0.017
## 14  9240000 3500        furnished 14 -0.19519254    3.079  0.022
## 12  9681000 6000   semi-furnished 12  0.46046328    2.895  0.010
## 17  9100000 6600      unfurnished 17  0.05586116    2.805  0.014

#Removing outliers above +3
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata1 <- mydata1 %>%
  filter(!ID == 1)
mydata1 <- mydata1 %>%
  filter(!ID == 4)
mydata1 <- mydata1 %>%
  filter(!ID == 2)
mydata1 <- mydata1 %>%
  filter(!ID == 10)
mydata1 <- mydata1 %>%
  filter(!ID == 3)
mydata1 <- mydata1 %>%
  filter(!ID == 5)
mydata1 <- mydata1 %>%
  filter(!ID == 6)
mydata1 <- mydata1 %>%
  filter(!ID == 14)

head(mydata1[order(-mydata1$StdResid),], 6)

##      price area furnishingstatus ID   StdFitted StdResid CooksD
## 5  9681000 6000   semi-furnished 12  0.46046328    2.895  0.010
## 9  9100000 6600      unfurnished 17  0.05586116    2.805  0.014
## 13 8750000 4320   semi-furnished 21 -0.20572276    2.758  0.009
## 11 8890000 4600        furnished 19  0.24100070    2.536  0.012
## 8  9100000 6000   semi-furnished 16  0.46046328    2.514  0.007
## 6  9310000 6550   semi-furnished 13  0.67855990    2.498  0.008

I removed 8 units in total, so I will create a histogram of standardized residuals once again to check if the distribution is normal.

hist(mydata1$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col= "darkblue",
     border = "black")

From the histogram, we can see that distribution looks much more normal now that I removed outliers. However, I will also need to check for the units of large impact by observing values of Cooks’ distances.

hist(mydata1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency",
     col= "darkblue",
     border = "black",
     main = "Histogram of Cooks distances")

Based on the histogram, there are some gaps between the units, so I will probably need to remove some of them.

head(mydata1[order(-mydata1$CooksD), c("ID", "CooksD")], 10)

##      ID CooksD
## 118 126  0.068
## 396 404  0.062
## 204 212  0.043
## 179 187  0.018
## 1     7  0.014
## 9    17  0.014
## 270 278  0.014
## 445 453  0.013
## 3     9  0.012
## 11   19  0.012

All the values of Cooks distances fall far below 1, however, there are some gaps and those units should be removed.

library(dplyr)

#Removing units of high impact
mydata1 <- mydata1 %>%
  filter(!ID == 126)
mydata1 <- mydata1 %>%
  filter(!ID == 404)
mydata1 <- mydata1 %>%
  filter(!ID == 212)

Errors are independent: this assumption is met because each unit is observed only once, there is no panel or time-series data.
No perfect multicolinearity

#Checking for multicolinearity using VIF statistics
vif(fit)

##                      GVIF Df GVIF^(1/(2*Df))
## area             1.030338  1        1.015056
## furnishingstatus 1.030338  2        1.007500

mean(vif(fit))

## [1] 1.180538

There are no issues of too strong multicolinearity because VIF is lower than 5 and we want VIF to be as close to 1 as possible. The assumption of no perfect colinearity is met.

The number of units is larger than the number of parameters estimated (n>k): the assumption is met

Additional requirements

The dependent variable is numerical, explanatory variables can be numerical or dichotomous (or transformed into dummies for categorical variables). This requirement is met.
Each explanatory variable must vary (have non-zero variance), but it is desirable they assume as wide range of values as possible. This requirement is met.
Removing outliers and units of high impact.
Not too strong multicolinearity. This requirement is met since the assumption of no perfect multicolinearity is met.

White’s robust correction

#install.packages("estimatr")
library(estimatr)

## Warning: package 'estimatr' was built under R version 4.3.2

fit1 <- lm_robust(price ~ area + furnishingstatus,
                  se_type="HC1",
                  data = mydata1)

summary(fit1)

## 
## Call:
## lm_robust(formula = price ~ area + furnishingstatus, data = mydata1, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                             Estimate Std. Error t value  Pr(>|t|)  CI Lower
## (Intercept)                  2630284  161115.36  16.325 7.791e-49 2313780.8
## area                             435      32.37  13.437 1.247e-35     371.4
## furnishingstatusunfurnished  -671330  134957.25  -4.974 8.858e-07 -936446.7
## furnishingstatusfurnished     208660  153658.40   1.358 1.751e-01  -93193.8
##                              CI Upper  DF
## (Intercept)                 2946786.9 530
## area                            498.6 530
## furnishingstatusunfurnished -406213.2 530
## furnishingstatusfurnished    510514.7 530
## 
## Multiple R-squared:  0.3626 ,    Adjusted R-squared:  0.359 
## F-statistic: 89.14 on 3 and 530 DF,  p-value: < 2.2e-16

H0: Ro squared = 0

H1: Ro squared =/ 0

Based on the sample data we reject H0 (at p<0.001) and conclude that we can explain at least some probability in the price with area and furnishing status.

Explanation of partial regression coefficients

H0: beta1 = 0

H1: beta1 =/ 0

Based on the sample data we reject H0 and conclude that area has a statistically significant effect on price (p<0.001).

Interpretation: If area of a house increase by 1 square feet, the price will on average increase by 435$, assuming everything else remains unchanged.

H0: beta2 = 0

H1: beta2 =/ 0

Based on the sample data we reject H0 and conclude that furnishing status of a house has a statistically significant effect on a price of a house (p<0.001).

Interpretation:

- Given the area, unfurnished houses will on average have by 671330$ lower price than semi-furnished         houses (p<0.001).

- Given the area, furnished houses will on average have by 208660$ higher price than semi-furnished 
  (p=0.03).

Coefficient of determination: 36.26% of variability of price can be explained by linear effect of area and furnishing status.

#Coefficient of multiple correlation
sqrt(summary(fit1)$r.squared)

## [1] 0.6021623

Relationship between our dependent variable price and independent variables area and furnishing status is semi-strong.

Conclusion

Based on the sample data I found that area and furnishing status have a statistically significant effect on price.