Student: Katja Volk Štefić

mydata <- read.table("./Housing.csv", header=TRUE, sep = ",", dec = ".")

head(mydata)
##      price area bedrooms bathrooms stories mainroad guestroom
## 1 13300000 7420        4         2       3      yes        no
## 2 12250000 8960        4         4       4      yes        no
## 3 12250000 9960        3         2       2      yes        no
## 4 12215000 7500        4         2       2      yes        no
## 5 11410000 7420        4         1       2      yes       yes
## 6 10850000 7500        3         3       1      yes        no
##   basement hotwaterheating airconditioning parking prefarea
## 1       no              no             yes       2      yes
## 2       no              no             yes       3       no
## 3      yes              no              no       2      yes
## 4      yes              no             yes       3      yes
## 5      yes              no             yes       2       no
## 6      yes              no             yes       2      yes
##   furnishingstatus
## 1        furnished
## 2        furnished
## 3   semi-furnished
## 4        furnished
## 5        furnished
## 6   semi-furnished

Unit of observation is one house.

Sample size is 545.

Explanation of data:

The source of the data is Kaggle.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>% mutate(ID = row_number()) #Creating ID
mydata$furnishingstatusF <- factor(mydata$furnishingstatus, 
                         levels = c("semi-furnished", "unfurnished", "furnished"), 
                         labels = c("semi-furnished", "unfurnished", "furnished"))
mydata$mainroadF <- factor(mydata$mainroad,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$guestroomF <- factor(mydata$guestroom,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$basementF <- factor(mydata$basement,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$mainroadF <- factor(mydata$mainroad,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$hotwaterheatingF <- factor(mydata$hotwaterheating,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$airconditioningF <- factor(mydata$airconditioning,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$prefareaF <- factor(mydata$prefarea,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
summary(mydata[ , c(-6, -7, -8, -9, -10, -12, -13, -14)]) 
##      price               area          bedrooms       bathrooms    
##  Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 4340000   Median : 4600   Median :3.000   Median :1.000  
##  Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286  
##  3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000  
##     stories         parking            furnishingstatusF mainroadF
##  Min.   :1.000   Min.   :0.0000   semi-furnished:227     yes:468  
##  1st Qu.:1.000   1st Qu.:0.0000   unfurnished   :178     no : 77  
##  Median :2.000   Median :0.0000   furnished     :140              
##  Mean   :1.806   Mean   :0.6936                                   
##  3rd Qu.:2.000   3rd Qu.:1.0000                                   
##  Max.   :4.000   Max.   :3.0000                                   
##  guestroomF basementF hotwaterheatingF airconditioningF prefareaF
##  yes: 97    yes:191   yes: 25          yes:172          yes:128  
##  no :448    no :354   no :520          no :373          no :417  
##                                                                  
##                                                                  
##                                                                  
## 
#Descriptive statistics

Descriptive statistics:

Bedrooms: Min. :1.000 The minimum number of bedrooms in a house is 1.

Price: Mean : 4766729 The average price of a house is 4 766 729 euros.

Bathrooms: 3rd Qu.:2.000 75% of houses have 2 bathrooms or less and 25% of houses has more than 2 bathrooms.

Research question: How is the price of a house affected by the size of the house and furnishing status?

I chose price as dependent variable and 2 explanatory variables:

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
round(stat.desc(mydata[ , c( -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20, -21)]), 2)
##                     price       area
## nbr.val      5.450000e+02     545.00
## nbr.null     0.000000e+00       0.00
## nbr.na       0.000000e+00       0.00
## min          1.750000e+06    1650.00
## max          1.330000e+07   16200.00
## range        1.155000e+07   14550.00
## sum          2.597867e+09 2807045.00
## median       4.340000e+06    4600.00
## mean         4.766729e+06    5150.54
## SE.mean      8.012083e+04      92.96
## CI.mean.0.95 1.573841e+05     182.60
## var          3.498544e+12 4709512.06
## std.dev      1.870440e+06    2170.14
## coef.var     3.900000e-01       0.42
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata[ , c(1, 2)],
                  smooth=FALSE) #Checking linearity in parameters

As we can see from the scatterplot, first assumption of linerarity in parameters is met.

fit <- lm(price ~ area + furnishingstatusF,
          data = mydata)

summary(fit) #Establishment of a regression model
## 
## Call:
## lm(formula = price ~ area + furnishingstatusF, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4054257  -979671  -225799   645075  7059845 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  2686769.16  188072.74  14.286  < 2e-16
## area                             429.85      30.65  14.023  < 2e-16
## furnishingstatusFunfurnished -696501.42  153673.99  -4.532 7.19e-06
## furnishingstatusFfurnished    363892.47  165034.61   2.205   0.0279
##                                 
## (Intercept)                  ***
## area                         ***
## furnishingstatusFunfurnished ***
## furnishingstatusFfurnished   *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1529000 on 541 degrees of freedom
## Multiple R-squared:  0.3359, Adjusted R-squared:  0.3322 
## F-statistic:  91.2 on 3 and 541 DF,  p-value: < 2.2e-16
vif(fit) #Checking for multicolinearity
##                       GVIF Df GVIF^(1/(2*Df))
## area              1.030338  1        1.015056
## furnishingstatusF 1.030338  2        1.007500

There is no problem with multicolinearity. VIF is less than 5.

mydata$StdResid <- round(rstandard(fit), 3) #Standardized residuals
mydata$CooksD <- round(cooks.distance(fit), 3) #Cooks distances

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals") #Identifying outliers

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.94437, p-value = 1.99e-13

HO:Errors are normally distributed.

H1:Errors are not normally distributed.

We can reject H0 (p-value < 0.001) and conclude that errors are not normally distributed. Since we have a sample of 545 units, the violation of this assumption is not problematic.

hist(mydata$CooksD,
     xlab = "Cook's distances",
     ylab = "Frequency",
     main = "Histogram of Cook's distances") #Identifying units with high impact

head(mydata[order(mydata$StdResid), c("ID", "StdResid")], 3)
##      ID StdResid
## 404 404   -2.697
## 212 212   -2.453
## 126 126   -2.313
head(mydata[order(-mydata$StdResid), c("ID", "StdResid")], 10)
##    ID StdResid
## 1   1    4.638
## 4   4    3.903
## 2   2    3.519
## 10 10    3.503
## 3   3    3.479
## 5   5    3.396
## 6   6    3.242
## 14 14    3.079
## 12 12    2.895
## 17 17    2.805
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 10)
##      ID CooksD
## 126 126  0.068
## 404 404  0.062
## 1     1  0.045
## 212 212  0.043
## 3     3  0.042
## 2     2  0.036
## 4     4  0.033
## 5     5  0.024
## 14   14  0.022
## 10   10  0.019
library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(1, 4, 2, 10, 3, 5, 6, 14, 126, 404)) #Removing units
fit <- lm(price ~ area + furnishingstatusF,
          data = mydata)

mydata$StdFitted <- scale(fit$fitted.values)

library(car)
scatterplot(StdResid ~ StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            boxplot = FALSE,
            data = mydata) #Checking for heteroskedasticity and linearity

As we can see from the scatterplot linearity in parameters is met. The scatterplot does not look homoskedastic. Consequently, we do Breusch-Pagan test to check for heteroskedasticity.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    56.81202 
##  Prob > Chi2   =    4.795257e-14

HO: Homoskedasticity is present.

H1: Heteroskedasticity is present.

We reject H0 at p < 0.001 and conclude that heteroskedasticity is present.

library(estimatr)
fit <- lm_robust(price ~ area + furnishingstatusF,
                 se_type = "HC1",
                 data = mydata) #Using robust standard errors since heteroskedasticity is present.

summary(fit)
## 
## Call:
## lm_robust(formula = price ~ area + furnishingstatusF, data = mydata, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                  2688478.5  167047.65  16.094 9.321e-48
## area                             423.6      33.26  12.733 1.338e-32
## furnishingstatusFunfurnished -676283.2  135176.73  -5.003 7.687e-07
## furnishingstatusFfurnished    188685.5  154500.17   1.221 2.225e-01
##                               CI Lower  CI Upper  DF
## (Intercept)                  2360323.2 3016633.9 531
## area                             358.2     488.9 531
## furnishingstatusFunfurnished -941830.0 -410736.5 531
## furnishingstatusFfurnished   -114821.1  492192.0 531
## 
## Multiple R-squared:  0.3546 ,    Adjusted R-squared:  0.351 
## F-statistic:  79.9 on 3 and 531 DF,  p-value: < 2.2e-16

If size of the house is increased by 1 square meter, the price of the house on average increases by 423.6 euros (p<0.001), assuming all other variables remain unchanged.

Given the size of the house, the house that is unfurnished have on average price lower by 676283.2 euros compared to semi-furnished house (p<0.001).

We cannot say that the house that is furnished has a significantly different average price compared to a semi-furnished house.

Coefficient of determination (R2): 35.46% of variability of price is explained by variability of area (size of the house) and furnishing status.

F-statistic: HO: ρ2 =0 H1: ρ2 >0 We reject H0 (p<0.001) and conclude that at least one of the explanatory variables (area and furnishing status) impact salary.