#Importing data set
mydata <- read.csv("./Housing.csv")
#Removing variables from the data set, so I can keep the ones I need for the analysis
mydata1 <- mydata[, !(names(mydata) %in% c("bedrooms", "bathrooms", "stories", "mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "parking", "prefarea"))]
#Creating ID variable for each unit in case I need it later
mydata1$ID <- seq(1, nrow(mydata))
head(mydata1)
## price area furnishingstatus ID
## 1 13300000 7420 furnished 1
## 2 12250000 8960 furnished 2
## 3 12250000 9960 semi-furnished 3
## 4 12215000 7500 furnished 4
## 5 11410000 7420 furnished 5
## 6 10850000 7500 semi-furnished 6
Explanation of a data set:
unit of observation: a house
number of observations: 545
number of variables: 4 (including ID)
source: Kaggle.com (Housing Prices data set)
Explanation of the variables in the data set:
Price: price of the house in $ (numerical)
Area: area of the house in square feet (numerical)
Furnishing status: indicates whether the house is “furnished”, “semi-furnished”, “unfurnished” (categorical)
#Checking which of the values of variable "furnishingstatus" is most commonly selected, so I can see what is the reference category
table(mydata1$furnishingstatus)
##
## furnished semi-furnished unfurnished
## 140 227 178
Since the most commonly status is “semi-furnished” I will select it as a reference category and I will make sure it is on the first place when factoring it.
#Factoring variable
mydata1$furnishingstatus <- factor(mydata1$furnishingstatus,
levels = c("semi-furnished", "unfurnished", "furnished"),
labels = c("semi-furnished", "unfurnished", "furnished"))
head(mydata1)
## price area furnishingstatus ID
## 1 13300000 7420 furnished 1
## 2 12250000 8960 furnished 2
## 3 12250000 9960 semi-furnished 3
## 4 12215000 7500 furnished 4
## 5 11410000 7420 furnished 5
## 6 10850000 7500 semi-furnished 6
#Descriptive statistics, excluding ID
summary(mydata1[ , -4])
## price area furnishingstatus
## Min. : 1750000 Min. : 1650 semi-furnished:227
## 1st Qu.: 3430000 1st Qu.: 3600 unfurnished :178
## Median : 4340000 Median : 4600 furnished :140
## Mean : 4766729 Mean : 5151
## 3rd Qu.: 5740000 3rd Qu.: 6360
## Max. :13300000 Max. :16200
Mean of 4766729 for variable “price”: Average price of a house in our sample is 4766729$
Min of 1650 for variable “area”: Minimum area of a house in our sample is 1650 square feet
In my regression model I will explain the size of effect of area and furnishing status on price of the house. The price of a house is dependent variable, while area and furnishing status are independent variables. I expect that the price of a house will increase with the increase in area, so I will assume that the relationship between price and area will be positive. When it comes to furnishing status, I expect that furnished and semi furnished houses will positively influence the price of a house, while unfurnished houses will negatively influence the price of the house.
#Checking for linearity
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
scatterplotMatrix(mydata1[ ,c(-3,-4)],
smooth = FALSE)
Based on the scatterplot Matrix there is a positive linear relationship between price and area. This means that as area increases, the price also increases. The linearity assumption is met.
Expected value of errors equals 0: for this assumption to be met the model should be based on the theory which means that we should include all theoretically important explanatory variables. However, in this case we will assume that this assumption is met.
Homoscedasticity
#Estimating regression function
fit <- lm(price ~ area + furnishingstatus,
data = mydata1)
#Adding standardized fitted values, standardized residuals, and Cook’s distances to my data set
mydata1$StdFitted <- scale(fit$fitted.values)
mydata1$StdResid <- round(rstandard(fit), 3)
mydata1$CooksD <- round(cooks.distance(fit), 3)
library(car)
##Checking for homoscedasticity
scatterplot(y = mydata1$StdResid, x = mydata1$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From this picture I can assume that homoscedasticity is violated as we can see a pattern, however, I will check it with formal test.
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.2
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 106.6553
## Prob > Chi2 = 5.297418e-25
H0: The variance is constant (homoscedasticity)
H1: The variance is not constant (heteroscedasticity)
We reject H0 and conclude that the variance is not constant, we have heteroskedasticity (p<0.001). We can solve this issue using robust standard errors which I will show later.
#Checking for normal distribution and outliers
hist(mydata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col= "darkblue",
border = "black")
Based on the histogram it seems like the errors are not normally distributed, but I will also check it using formal Shapiro-Wilk test. This could probably be due to outliers which are the units that fall out of the interval of +/-3 and they should be removed.
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.94437, p-value = 1.99e-13
H0: Errors are normally distributed
H1: Errors are not normally distributed
We can reject H0 and conclude that errors are not normally distributed (p<0.001). However, since we have large enough sample, this can be ignored.
#Checking if there are any outliers that fall below the treshold of -3
head(mydata1[order(mydata1$StdResid),], 6)
## price area furnishingstatus ID StdFitted StdResid CooksD
## 404 3500000 12944 unfurnished 404 2.571507 -2.697 0.062
## 212 4900000 12900 furnished 212 3.532277 -2.453 0.043
## 126 5943000 15600 semi-furnished 126 4.267241 -2.313 0.068
## 453 3150000 9000 semi-furnished 453 1.650081 -2.240 0.013
## 355 3780000 8400 furnished 355 1.747850 -1.895 0.009
## 187 5110000 11410 furnished 187 2.941433 -1.881 0.018
#Checking if there are any outliers that fall above +3
head(mydata1[order(-mydata1$StdResid),], 10)
## price area furnishingstatus ID StdFitted StdResid CooksD
## 1 13300000 7420 furnished 1 1.35924154 4.638 0.045
## 4 12215000 7500 furnished 4 1.39096469 3.903 0.033
## 2 12250000 8960 furnished 2 1.96991207 3.519 0.036
## 10 9800000 5750 unfurnished 10 -0.28119725 3.503 0.019
## 3 12250000 9960 semi-furnished 3 2.03075893 3.479 0.042
## 5 11410000 7420 furnished 5 1.35924154 3.396 0.024
## 6 10850000 7500 semi-furnished 6 1.05527224 3.242 0.017
## 14 9240000 3500 furnished 14 -0.19519254 3.079 0.022
## 12 9681000 6000 semi-furnished 12 0.46046328 2.895 0.010
## 17 9100000 6600 unfurnished 17 0.05586116 2.805 0.014
#Removing outliers above +3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata1 <- mydata1 %>%
filter(!ID == 1)
mydata1 <- mydata1 %>%
filter(!ID == 4)
mydata1 <- mydata1 %>%
filter(!ID == 2)
mydata1 <- mydata1 %>%
filter(!ID == 10)
mydata1 <- mydata1 %>%
filter(!ID == 3)
mydata1 <- mydata1 %>%
filter(!ID == 5)
mydata1 <- mydata1 %>%
filter(!ID == 6)
mydata1 <- mydata1 %>%
filter(!ID == 14)
head(mydata1[order(-mydata1$StdResid),], 6)
## price area furnishingstatus ID StdFitted StdResid CooksD
## 5 9681000 6000 semi-furnished 12 0.46046328 2.895 0.010
## 9 9100000 6600 unfurnished 17 0.05586116 2.805 0.014
## 13 8750000 4320 semi-furnished 21 -0.20572276 2.758 0.009
## 11 8890000 4600 furnished 19 0.24100070 2.536 0.012
## 8 9100000 6000 semi-furnished 16 0.46046328 2.514 0.007
## 6 9310000 6550 semi-furnished 13 0.67855990 2.498 0.008
I removed 8 units in total, so I will create a histogram of standardized residuals once again to check if the distribution is normal.
hist(mydata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col= "darkblue",
border = "black")
From the histogram, we can see that distribution looks much more normal now that I removed outliers. However, I will also need to check for the units of large impact by observing values of Cooks’ distances.
hist(mydata1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
col= "darkblue",
border = "black",
main = "Histogram of Cooks distances")
Based on the histogram, there are some gaps between the units, so I will probably need to remove some of them.
head(mydata1[order(-mydata1$CooksD), c("ID", "CooksD")], 10)
## ID CooksD
## 118 126 0.068
## 396 404 0.062
## 204 212 0.043
## 179 187 0.018
## 1 7 0.014
## 9 17 0.014
## 270 278 0.014
## 445 453 0.013
## 3 9 0.012
## 11 19 0.012
All the values of Cooks distances fall far below 1, however, there are some gaps and those units should be removed.
library(dplyr)
#Removing units of high impact
mydata1 <- mydata1 %>%
filter(!ID == 126)
mydata1 <- mydata1 %>%
filter(!ID == 404)
mydata1 <- mydata1 %>%
filter(!ID == 212)
Errors are independent: this assumption is met because each unit is observed only once, there is no panel or time-series data.
No perfect multicolinearity
#Checking for multicolinearity using VIF statistics
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## area 1.030338 1 1.015056
## furnishingstatus 1.030338 2 1.007500
mean(vif(fit))
## [1] 1.180538
There are no issues of too strong multicolinearity because VIF is lower than 5 and we want VIF to be as close to 1 as possible. The assumption of no perfect colinearity is met.
The dependent variable is numerical, explanatory variables can be numerical or dichotomous (or transformed into dummies for categorical variables). This requirement is met.
Each explanatory variable must vary (have non-zero variance), but it is desirable they assume as wide range of values as possible. This requirement is met.
Removing outliers and units of high impact.
Not too strong multicolinearity. This requirement is met since the assumption of no perfect multicolinearity is met.
#install.packages("estimatr")
library(estimatr)
## Warning: package 'estimatr' was built under R version 4.3.2
fit1 <- lm_robust(price ~ area + furnishingstatus,
se_type="HC1",
data = mydata1)
summary(fit1)
##
## Call:
## lm_robust(formula = price ~ area + furnishingstatus, data = mydata1,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 2630284 161115.36 16.325 7.791e-49 2313780.8
## area 435 32.37 13.437 1.247e-35 371.4
## furnishingstatusunfurnished -671330 134957.25 -4.974 8.858e-07 -936446.7
## furnishingstatusfurnished 208660 153658.40 1.358 1.751e-01 -93193.8
## CI Upper DF
## (Intercept) 2946786.9 530
## area 498.6 530
## furnishingstatusunfurnished -406213.2 530
## furnishingstatusfurnished 510514.7 530
##
## Multiple R-squared: 0.3626 , Adjusted R-squared: 0.359
## F-statistic: 89.14 on 3 and 530 DF, p-value: < 2.2e-16
H0: Ro squared = 0
H1: Ro squared =/ 0
Based on the sample data we reject H0 (at p<0.001) and conclude that we can explain at least some probability in the price with area and furnishing status.
H0: beta1 = 0
H1: beta1 =/ 0
Based on the sample data we reject H0 and conclude that area has a statistically significant effect on price (p<0.001).
Interpretation: If area of a house increase by 1 square feet, the price will on average increase by 435$, assuming everything else remains unchanged.
H0: beta2 = 0
H1: beta2 =/ 0
Based on the sample data we reject H0 and conclude that furnishing status of a house has a statistically significant effect on a price of a house (p<0.001).
Interpretation:
- Given the area, unfurnished houses will on average have by 671330$ lower price than semi-furnished houses (p<0.001).
- Given the area, furnished houses will on average have by 208660$ higher price than semi-furnished
(p=0.03).
Coefficient of determination: 36.26% of variability of price can be explained by linear effect of area and furnishing status.
#Coefficient of multiple correlation
sqrt(summary(fit1)$r.squared)
## [1] 0.6021623
Relationship between our dependent variable price and independent variables area and furnishing status is semi-strong.
Based on the sample data I found that area and furnishing status have a statistically significant effect on price.