Student: Katja Volk Štefić
mydata <- read.table("./Housing.csv", header=TRUE, sep = ",", dec = ".")
head(mydata)
## price area bedrooms bathrooms stories mainroad guestroom
## 1 13300000 7420 4 2 3 yes no
## 2 12250000 8960 4 4 4 yes no
## 3 12250000 9960 3 2 2 yes no
## 4 12215000 7500 4 2 2 yes no
## 5 11410000 7420 4 1 2 yes yes
## 6 10850000 7500 3 3 1 yes no
## basement hotwaterheating airconditioning parking prefarea
## 1 no no yes 2 yes
## 2 no no yes 3 no
## 3 yes no no 2 yes
## 4 yes no yes 3 yes
## 5 yes no yes 2 no
## 6 yes no yes 2 yes
## furnishingstatus
## 1 furnished
## 2 furnished
## 3 semi-furnished
## 4 furnished
## 5 furnished
## 6 semi-furnished
Unit of observation is one house.
Sample size is 545.
Explanation of data:
Price: The price of the house in euros.
Area: The area or size of the house in square feet.
Bedrooms: The number of bedrooms in the house.
Bathrooms: The number of bathrooms in the house.
Stories: The number of stories or floors in the house.
Mainroad: Categorical variable indicating whether the house is located near the main road or not.
Guestroom: Categorical variable indicating whether the house has a guest room or not.
Basement: Categorical variable indicating whether the house has a basement or not.
Hotwaterheating: Categorical variable indicating whether the house has hot water heating or not.
Airconditioning: Categorical variable indicating whether the house has air conditioning or not.
Parking: The number of parking spaces available with the house.
Prefarea: Categorical variable indicating whether the house is in a preferred area or not.
Furnishingstatus: The furnishing status of the house (e.g., unfurnished, semi-furnished, fully furnished).
The source of the data is Kaggle.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>% mutate(ID = row_number()) #Creating ID
mydata$furnishingstatusF <- factor(mydata$furnishingstatus,
levels = c("semi-furnished", "unfurnished", "furnished"),
labels = c("semi-furnished", "unfurnished", "furnished"))
mydata$mainroadF <- factor(mydata$mainroad,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$guestroomF <- factor(mydata$guestroom,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$basementF <- factor(mydata$basement,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$mainroadF <- factor(mydata$mainroad,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$hotwaterheatingF <- factor(mydata$hotwaterheating,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$airconditioningF <- factor(mydata$airconditioning,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$prefareaF <- factor(mydata$prefarea,
levels = c("yes", "no"),
labels = c("yes", "no"))
summary(mydata[ , c(-6, -7, -8, -9, -10, -12, -13, -14)])
## price area bedrooms bathrooms
## Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
## Median : 4340000 Median : 4600 Median :3.000 Median :1.000
## Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286
## 3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000
## stories parking furnishingstatusF mainroadF
## Min. :1.000 Min. :0.0000 semi-furnished:227 yes:468
## 1st Qu.:1.000 1st Qu.:0.0000 unfurnished :178 no : 77
## Median :2.000 Median :0.0000 furnished :140
## Mean :1.806 Mean :0.6936
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.000 Max. :3.0000
## guestroomF basementF hotwaterheatingF airconditioningF prefareaF
## yes: 97 yes:191 yes: 25 yes:172 yes:128
## no :448 no :354 no :520 no :373 no :417
##
##
##
##
#Descriptive statistics
Descriptive statistics:
Bedrooms: Min. :1.000 The minimum number of bedrooms in a house is 1.
Price: Mean : 4766729 The average price of a house is 4 766 729 euros.
Bathrooms: 3rd Qu.:2.000 75% of houses have 2 bathrooms or less and 25% of houses has more than 2 bathrooms.
I chose price as dependent variable and 2 explanatory variables:
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(mydata[ , c( -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20, -21)]), 2)
## price area
## nbr.val 5.450000e+02 545.00
## nbr.null 0.000000e+00 0.00
## nbr.na 0.000000e+00 0.00
## min 1.750000e+06 1650.00
## max 1.330000e+07 16200.00
## range 1.155000e+07 14550.00
## sum 2.597867e+09 2807045.00
## median 4.340000e+06 4600.00
## mean 4.766729e+06 5150.54
## SE.mean 8.012083e+04 92.96
## CI.mean.0.95 1.573841e+05 182.60
## var 3.498544e+12 4709512.06
## std.dev 1.870440e+06 2170.14
## coef.var 3.900000e-01 0.42
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata[ , c(1, 2)],
smooth=FALSE) #Checking linearity in parameters
As we can see from the scatterplot, first assumption of linerarity in parameters is met.
fit <- lm(price ~ area + furnishingstatusF,
data = mydata)
summary(fit) #Establishment of a regression model
##
## Call:
## lm(formula = price ~ area + furnishingstatusF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4054257 -979671 -225799 645075 7059845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2686769.16 188072.74 14.286 < 2e-16
## area 429.85 30.65 14.023 < 2e-16
## furnishingstatusFunfurnished -696501.42 153673.99 -4.532 7.19e-06
## furnishingstatusFfurnished 363892.47 165034.61 2.205 0.0279
##
## (Intercept) ***
## area ***
## furnishingstatusFunfurnished ***
## furnishingstatusFfurnished *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1529000 on 541 degrees of freedom
## Multiple R-squared: 0.3359, Adjusted R-squared: 0.3322
## F-statistic: 91.2 on 3 and 541 DF, p-value: < 2.2e-16
vif(fit) #Checking for multicolinearity
## GVIF Df GVIF^(1/(2*Df))
## area 1.030338 1 1.015056
## furnishingstatusF 1.030338 2 1.007500
There is no problem with multicolinearity. VIF is less than 5.
mydata$StdResid <- round(rstandard(fit), 3) #Standardized residuals
mydata$CooksD <- round(cooks.distance(fit), 3) #Cooks distances
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals") #Identifying outliers
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.94437, p-value = 1.99e-13
HO:Errors are normally distributed.
H1:Errors are not normally distributed.
We can reject H0 (p-value < 0.001) and conclude that errors are not normally distributed. Since we have a sample of 545 units, the violation of this assumption is not problematic.
hist(mydata$CooksD,
xlab = "Cook's distances",
ylab = "Frequency",
main = "Histogram of Cook's distances") #Identifying units with high impact
head(mydata[order(mydata$StdResid), c("ID", "StdResid")], 3)
## ID StdResid
## 404 404 -2.697
## 212 212 -2.453
## 126 126 -2.313
head(mydata[order(-mydata$StdResid), c("ID", "StdResid")], 10)
## ID StdResid
## 1 1 4.638
## 4 4 3.903
## 2 2 3.519
## 10 10 3.503
## 3 3 3.479
## 5 5 3.396
## 6 6 3.242
## 14 14 3.079
## 12 12 2.895
## 17 17 2.805
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 10)
## ID CooksD
## 126 126 0.068
## 404 404 0.062
## 1 1 0.045
## 212 212 0.043
## 3 3 0.042
## 2 2 0.036
## 4 4 0.033
## 5 5 0.024
## 14 14 0.022
## 10 10 0.019
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(1, 4, 2, 10, 3, 5, 6, 14, 126, 404)) #Removing units
fit <- lm(price ~ area + furnishingstatusF,
data = mydata)
mydata$StdFitted <- scale(fit$fitted.values)
library(car)
scatterplot(StdResid ~ StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
smooth = FALSE,
boxplot = FALSE,
data = mydata) #Checking for heteroskedasticity and linearity
As we can see from the scatterplot linearity in parameters is met. The scatterplot does not look homoskedastic. Consequently, we do Breusch-Pagan test to check for heteroskedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 56.81202
## Prob > Chi2 = 4.795257e-14
HO: Homoskedasticity is present.
H1: Heteroskedasticity is present.
We reject H0 at p < 0.001 and conclude that heteroskedasticity is present.
library(estimatr)
fit <- lm_robust(price ~ area + furnishingstatusF,
se_type = "HC1",
data = mydata) #Using robust standard errors since heteroskedasticity is present.
summary(fit)
##
## Call:
## lm_robust(formula = price ~ area + furnishingstatusF, data = mydata,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2688478.5 167047.65 16.094 9.321e-48
## area 423.6 33.26 12.733 1.338e-32
## furnishingstatusFunfurnished -676283.2 135176.73 -5.003 7.687e-07
## furnishingstatusFfurnished 188685.5 154500.17 1.221 2.225e-01
## CI Lower CI Upper DF
## (Intercept) 2360323.2 3016633.9 531
## area 358.2 488.9 531
## furnishingstatusFunfurnished -941830.0 -410736.5 531
## furnishingstatusFfurnished -114821.1 492192.0 531
##
## Multiple R-squared: 0.3546 , Adjusted R-squared: 0.351
## F-statistic: 79.9 on 3 and 531 DF, p-value: < 2.2e-16
If size of the house is increased by 1 square meter, the price of the house on average increases by 423.6 euros (p<0.001), assuming all other variables remain unchanged.
Given the size of the house, the house that is unfurnished have on average price lower by 676283.2 euros compared to semi-furnished house (p<0.001).
We cannot say that the house that is furnished has a significantly different average price compared to a semi-furnished house.
Coefficient of determination (R2): 35.46% of variability of price is explained by variability of area (size of the house) and furnishing status.
F-statistic: HO: ρ2 =0 H1: ρ2 >0 We reject H0 (p<0.001) and conclude that at least one of the explanatory variables (area and furnishing status) impact salary.