MVA HOMEWORK 3 (regression)

RESEARCH QUESTION: HOW DO AREA AND BASEMENT INFLUENCE THE PRICE OF A HOUSE?

Larisa O.

house_price <- read.csv("Property prices.csv")
head(house_price, 10) #Table with first 10 rows.
##       price  area bedrooms bathrooms stories mainroad guestroom basement
## 1  13300000  7420        4         2       3      yes        no       no
## 2  12250000  8960        4         4       4      yes        no       no
## 3  12250000  9960        3         2       2      yes        no      yes
## 4  12215000  7500        4         2       2      yes        no      yes
## 5  11410000  7420        4         1       2      yes       yes      yes
## 6  10850000  7500        3         3       1      yes        no      yes
## 7  10150000  8580        4         3       4      yes        no       no
## 8  10150000 16200        5         3       2      yes        no       no
## 9   9870000  8100        4         1       2      yes       yes      yes
## 10  9800000  5750        3         2       4      yes       yes       no
##    hotwaterheating airconditioning parking prefarea furnishingstatus
## 1               no             yes       2      yes        furnished
## 2               no             yes       3       no        furnished
## 3               no              no       2      yes   semi-furnished
## 4               no             yes       3      yes        furnished
## 5               no             yes       2       no        furnished
## 6               no             yes       2      yes   semi-furnished
## 7               no             yes       2      yes   semi-furnished
## 8               no              no       0       no      unfurnished
## 9               no             yes       2      yes        furnished
## 10              no             yes       1      yes      unfurnished

Description of data:

  • ‘price’: The price of the house in Rupees (target variable).

  • ‘area’: The area or size of the house in square feet.

  • ‘bedrooms’: The number of bedrooms in the house.

  • ‘bathrooms’: The number of bathrooms in the house.

  • ‘stories’: The number of stories or floors in the house.

  • ‘mainroad’: Categorical variable indicating whether the house is located near the main road or not.

  • ‘guestroom’: Categorical variable indicating whether the house has a guest room or not.

  • ‘basement’: Categorical variable indicating whether the house has a basement or not.

  • ‘hotwaterheating’: Categorical variable indicating whether the house has hot water heating or not.

  • ‘airconditioning’: Categorical variable indicating whether the house has air conditioning or not.

  • ‘parking’: The number of parking spaces available with the house.

  • ‘prefarea’: Categorical variable indicating whether the house is in a preferred area or not.

  • ‘furnishingstatus’: The furnishing status of the house (furnished, semi-furnished, or unfurnished).

This is a collection of data on 545 houses for sale. The unit of observation is a house for sale, and the sample size is 545. There are 13 variables (described above).

The dataset is from the Kaggle website (https://www.kaggle.com/datasets/ashydv/housing-dataset).

any(is.na(house_price)) #I checked whether the dataset contains any missing data.
## [1] FALSE

There is no missing data in my dataset.

house_price <- house_price[, c("price", "area", "basement")] #I kept only the three variables used in the analysis.
house_price$basement <- factor(house_price$basement, 
                                levels = c("yes", "no")) #I converted basement to a factor; "yes" is the reference level.
head(house_price, 10)
##       price  area basement
## 1  13300000  7420       no
## 2  12250000  8960       no
## 3  12250000  9960      yes
## 4  12215000  7500      yes
## 5  11410000  7420      yes
## 6  10850000  7500      yes
## 7  10150000  8580       no
## 8  10150000 16200       no
## 9   9870000  8100      yes
## 10  9800000  5750       no
summary(house_price)
##      price               area       basement 
##  Min.   : 1750000   Min.   : 1650   yes:191  
##  1st Qu.: 3430000   1st Qu.: 3600   no :354  
##  Median : 4340000   Median : 4600            
##  Mean   : 4766729   Mean   : 5151            
##  3rd Qu.: 5740000   3rd Qu.: 6360            
##  Max.   :13300000   Max.   :16200

Price: The minimum house price is 1,750,000 Rupees and the maximum is 13,300,000 Rupees. The average house price is 4,766,729 Rupees.

Area: The minimum house area is 1,650 square feet and the maximum is 16,200 square feet. 25% of the houses have an area of 3,600 square feet or less. The median area is 4,600 square feet, meaning that half of the houses have an area of 4,600 square feet or less.

Basement: There are 191 houses with a basement and 354 houses without one.

#install.packages("psych")
library(psych)
describe(house_price)
##           vars   n       mean         sd  median    trimmed        mad     min
## price        1 545 4766729.25 1870439.62 4340000 4559299.43 1556730.00 1750000
## area         2 545    5150.54    2170.14    4600    4908.41    2060.81    1650
## basement*    3 545       1.65       0.48       2       1.69       0.00       1
##                max    range  skew kurtosis       se
## price     13300000 11550000  1.21     1.91 80120.83
## area         16200    14550  1.31     2.69    92.96
## basement*        2        1 -0.63    -1.61     0.02

For price we observe positive skewness (1.21), which indicates a right-skewed distribution (the tail is longer on the right side). The positive excess kurtosis (1.91) suggests that the distribution has heavier tails than a normal distribution.
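To show what these two statistics measure, here is a minimal sketch that computes moment-based skewness and excess kurtosis by hand. It uses a simulated right-skewed sample (an assumption for illustration, not the housing data), and the psych package may apply a slightly different normalization, so its values need not match these formulas exactly.

```r
# Toy right-skewed sample (exponential draws), not the housing data.
set.seed(1)
x <- rexp(1000)

# Moment-based skewness and excess kurtosis.
m <- mean(x)
s <- sqrt(mean((x - m)^2))            # population-style standard deviation
skewness        <- mean((x - m)^3) / s^3
excess_kurtosis <- mean((x - m)^4) / s^4 - 3

skewness          # positive for a right-skewed sample
excess_kurtosis   # positive when the tails are heavier than the normal
```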

hist(house_price$price,
     main = "Histogram of House Prices",
     xlab = "Price of houses",
     ylab = "Number of houses",
     col = "lightblue",
     border = "black")

CHAPTER: THE REGRESSION MODEL

In my regression model I will use area and basement to explain the price of a house. I expect that a larger area and having a basement will positively affect house price.

RQ: HOW DO AREA AND BASEMENT INFLUENCE THE PRICE OF A HOUSE?

Price = β0 + β1 × Area + β2 × Basement + Error

fit <- lm(price ~ area + basement,
          data = house_price)
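To make the Basement term in the model concrete, the sketch below shows how R turns a two-level factor into a dummy column. The data frame `toy` and its values are hypothetical; only the column names and the factor levels follow this homework.

```r
# Hypothetical toy data with the same column names as the homework.
toy <- data.frame(
  price    = c(5.0, 4.2, 6.1, 3.9),
  area     = c(5000, 4000, 6000, 3500),
  basement = factor(c("yes", "no", "yes", "no"), levels = c("yes", "no"))
)

# The first level ("yes") is the reference category, so R creates
# a single dummy column named "basementno" (1 = no basement).
X <- model.matrix(price ~ area + basement, data = toy)
X
```

Because "yes" is the reference level, the basement coefficient in the model output compares houses without a basement to houses with one.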

Assumptions

LINEARITY

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(price ~ area,
            ylab = "Price of house",
            xlab = "Area size",
            smooth = FALSE, 
            data = house_price)

library(car)
numeric <- house_price[, c("price", "area")]
scatterplotMatrix(numeric, smooth = FALSE) #Same plot as before, plus the version with the axes swapped.

As observed, the assumption of linearity is met (no curved trend).

HOMOSKEDASTICITY

house_price$Stdfitted <- scale(fit$fitted.values)
house_price$StdResid <- round(rstandard(fit), 3)

library(car)
scatterplot(y = house_price$StdResid, x = house_price$Stdfitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

#install.packages("olsrr")
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    90.88678 
##  Prob > Chi2   =    1.521327e-21

Ho: Variance is constant.

H1: Variance is not constant.

Based on the BP test, we can reject the null hypothesis (p<0.001), meaning the model suffers from heteroskedasticity. That is why I will use robust standard errors in the final model.
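The idea behind the test can be sketched in base R: regress the squared residuals on the fitted values and check whether that auxiliary regression explains anything. The sketch below uses simulated data and the studentized (Koenker) form of the statistic, which is an assumption on my part and may differ numerically from the version olsrr reports.

```r
# Synthetic data whose error spread grows with x (not the housing data).
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
toy_fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the fitted values.
e2  <- resid(toy_fit)^2
fv  <- fitted(toy_fit)
aux <- lm(e2 ~ fv)

# Studentized (Koenker) statistic: n * R^2, chi-squared with 1 df.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```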

MULTICOLLINEARITY

vif(fit)
##     area basement 
## 1.002253 1.002253
mean(vif(fit))
## [1] 1.002253

In our case the VIFs of the individual variables and the mean VIF are all close to 1, far below the commonly used threshold of 5, so multicollinearity is not a problem.
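The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors both auxiliary regressions have the same R², which is why both VIFs above are identical. A minimal sketch on simulated predictors (assumed values, not the housing data):

```r
# Synthetic predictors (not the housing data).
set.seed(7)
area     <- runif(200, 1500, 16000)
basement <- rbinom(200, 1, 0.35)        # 1 = has basement

# VIF for area: regress it on the other predictor.
r2_area  <- summary(lm(area ~ basement))$r.squared
vif_area <- 1 / (1 - r2_area)

# VIF for basement: regress it on area.
r2_base  <- summary(lm(basement ~ area))$r.squared
vif_base <- 1 / (1 - r2_base)

c(area = vif_area, basement = vif_base)
```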

NORMALITY

house_price$StdResid <- round(rstandard(fit), 3)

hist(house_price$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency", 
     main = "Histogram of Standardized Residuals")

Some units fall outside the ±3 threshold.

shapiro.test(house_price$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  house_price$StdResid
## W = 0.94268, p-value = 1.166e-13

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

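Based on the Shapiro-Wilk test, we can reject the null hypothesis (p<0.001), so the errors are not normally distributed. However, because the sample is large (545 units), the tests on the regression coefficients are approximately valid even without normal errors, so I will continue with the analysis.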

UNITS WITH HIGH IMPACT AND REMOVAL OF UNITS

house_price$CooksD <- round(cooks.distance(fit), 3)

hist(house_price$CooksD,
     xlab = "Cook's distances",
     ylab = "Frequency", 
     main = "Histogram of Cook's distances")

In this histogram I observe a gap between roughly 0.05 and 0.07. Let's continue by checking the values…

head(house_price[order(-house_price$CooksD), c(1,2,3,5,6)], 10)
##        price  area basement StdResid CooksD
## 404  3500000 12944       no   -2.999  0.083
## 126  5943000 15600       no   -2.216  0.079
## 3   12250000  9960      yes    3.166  0.047
## 2   12250000  8960       no    3.863  0.044
## 1   13300000  7420       no    4.986  0.041
## 212  4900000 12900       no   -2.072  0.039
## 4   12215000  7500      yes    3.856  0.036
## 5   11410000  7420      yes    3.359  0.027
## 6   10850000  7500      yes    2.974  0.021
## 7   10150000  8580       no    2.615  0.018

Cook's distance shows which rows in the data set have the highest influence on our model.

house_price <- house_price[!(house_price$StdResid < -3),]
house_price <- house_price[!(house_price$StdResid > 3),] #Removed all units with values bigger than 3, and smaller than -3.

house_price <- house_price[!(house_price$CooksD > 0.050),] #The first two seem to have the most impact (house nr. 404 and 126).
head(house_price[order(-house_price$CooksD), c(1,2,3,5,6)])
##        price  area basement StdResid CooksD
## 212  4900000 12900       no   -2.072  0.039
## 6   10850000  7500      yes    2.974  0.021
## 7   10150000  8580       no    2.615  0.018
## 192  5040000 10700      yes   -1.731  0.017
## 67   6930000 13200      yes   -1.252  0.016
## 187  5110000 11410       no   -1.485  0.014

Here we removed observations where the standardized residuals are less than -3 (lower end) and observations where the standardized residuals are greater than 3 (upper end).

We also removed the observations where Cook's distance is greater than 0.050.

house_price <- house_price[!(house_price$CooksD > 0.021),] #Unit 212 (Cook's distance 0.039) still stands out from the rest, so I removed it as well.
head(house_price[order(-house_price$CooksD), c(1,2,3,5,6)])
##        price  area basement StdResid CooksD
## 6   10850000  7500      yes    2.974  0.021
## 7   10150000  8580       no    2.615  0.018
## 192  5040000 10700      yes   -1.731  0.017
## 67   6930000 13200      yes   -1.252  0.016
## 187  5110000 11410       no   -1.485  0.014
## 402  3500000  9500       no   -1.959  0.014

INDEPENDENT ERRORS

This assumption is met: each unit is observed only once, with a single value for price, area and basement. Since the observations are independent, we assume the errors are independent as well.

MORE UNITS THAN PARAMETERS ESTIMATED

This assumption is met: the final model estimates 3 parameters from 535 units.

THE FINAL REGRESSION MODEL

#install.packages("estimatr")
library(estimatr)
fit_robust <- lm_robust(price ~ area + basement,
                        se_type = "HC1", 
                        data = house_price)
summary(fit_robust)
## 
## Call:
## lm_robust(formula = price ~ area + basement, data = house_price, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept) 2722172.4  174846.01  15.569 2.618e-45 2378699.1 3065645.7 532
## area            460.3      31.61  14.565 1.105e-40     398.2     522.4 532
## basementno  -585356.8  125206.06  -4.675 3.728e-06 -831315.7 -339397.9 532
## 
## Multiple R-squared:  0.3449 ,    Adjusted R-squared:  0.3425 
## F-statistic: 129.5 on 2 and 532 DF,  p-value: < 2.2e-16
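To illustrate what the HC1 correction does, the sketch below builds the sandwich variance estimator by hand in base R and compares the resulting standard errors with the classical ones. It runs on simulated heteroskedastic data (an assumption for illustration), so the numbers are unrelated to the housing results.

```r
# Synthetic heteroskedastic data (not the housing data).
set.seed(11)
n <- 300
x <- runif(n, 1, 10)
d <- rbinom(n, 1, 0.4)
y <- 1 + 2 * x - 1.5 * d + rnorm(n, sd = x)

toy_fit <- lm(y ~ x + d)
X <- model.matrix(toy_fit)
e <- resid(toy_fit)
k <- ncol(X)

# Sandwich: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1, scaled by n / (n - k).
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)              # equals X' diag(e^2) X
vc_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
se_hc1 <- sqrt(diag(vc_hc1))

# Classical standard errors for comparison.
se_ols <- sqrt(diag(vcov(toy_fit)))
cbind(se_ols, se_hc1)
```

The coefficient estimates themselves are unchanged by this correction; only the standard errors, and therefore the t values and confidence intervals, differ.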

All three coefficients have p-values of less than 0.05 showing that they are statistically significant, so I will interpret them.

Intercept: The value 2,722,172.4 is the predicted price of a house with a basement (the reference category) and an area of zero. It has no practical interpretation, because a house cannot have zero area.

Area: The coefficient is 460.3, which means that for every additional square foot of area the predicted house price increases on average by 460.3 Rupees, holding basement constant.

Basement: The coefficient is -585,356.8, which means that houses without a basement have, on average, a predicted price that is 585,356.8 Rupees lower than houses with a basement, holding area constant.
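As a worked example of these interpretations, the sketch below plugs the estimated coefficients into the model for a hypothetical house of 5,000 square feet (the house size is an assumption chosen for illustration).

```r
# Coefficients copied from the lm_robust output above.
b0     <- 2722172.4    # intercept (reference: basement = "yes")
b_area <- 460.3        # Rupees per additional square foot
b_no   <- -585356.8    # shift for basement = "no"

area <- 5000           # hypothetical house size in square feet

price_with_basement    <- b0 + b_area * area
price_without_basement <- b0 + b_area * area + b_no

price_with_basement      # 5,023,672.4 Rupees
price_without_basement   # 4,438,315.6 Rupees
```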

Multiple R-squared is 0.3449, meaning that 34.49% of the variability in house price is explained by area and basement.

F-statistic:

  • H0: ρ² = 0

  • H1: ρ² > 0

Based on the F-test, we can reject the null hypothesis (p<0.001). The value of the F-statistic is 129.5. This indicates that the regression model is statistically significant, and that at least one of the explanatory variables is related to house price.

sqrt(summary(fit_robust)$r.squared)
## [1] 0.5873179

The square root of R-squared is the multiple correlation coefficient, which shows the strength of the linear relationship between house price and the explanatory variables taken together. This relationship (0.587) is moderately strong.
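For an ordinary least squares model with an intercept, the multiple correlation coefficient equals the correlation between the observed and the fitted values. A minimal check on simulated data (assumed values, not the housing data):

```r
# Synthetic data (not the housing data).
set.seed(5)
x1 <- rnorm(200)
x2 <- rbinom(200, 1, 0.5)
y  <- 1 + 0.8 * x1 - 0.5 * x2 + rnorm(200)

toy_fit    <- lm(y ~ x1 + x2)
r_squared  <- summary(toy_fit)$r.squared
multiple_r <- cor(y, fitted(toy_fit))

# The two quantities agree for OLS with an intercept.
all.equal(sqrt(r_squared), multiple_r)
```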

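To answer the RQ: Both a larger area and having a basement are associated with higher house prices. Each additional square foot raises the predicted price by about 460 Rupees, and a house without a basement is predicted to cost about 585,357 Rupees less than a comparable house with one.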