house_price <- read.csv("Property prices.csv")
head(house_price, 10) #Table with first 10 rows.
## price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420 4 2 3 yes no no
## 2 12250000 8960 4 4 4 yes no no
## 3 12250000 9960 3 2 2 yes no yes
## 4 12215000 7500 4 2 2 yes no yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes no yes
## 7 10150000 8580 4 3 4 yes no no
## 8 10150000 16200 5 3 2 yes no no
## 9 9870000 8100 4 1 2 yes yes yes
## 10 9800000 5750 3 2 4 yes yes no
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 1 no yes 2 yes furnished
## 2 no yes 3 no furnished
## 3 no no 2 yes semi-furnished
## 4 no yes 3 yes furnished
## 5 no yes 2 no furnished
## 6 no yes 2 yes semi-furnished
## 7 no yes 2 yes semi-furnished
## 8 no no 0 no unfurnished
## 9 no yes 2 yes furnished
## 10 no yes 1 yes unfurnished
Description of data:
‘price’: The price of the house in Rupees (target variable).
‘area’: The area or size of the house in square feet.
‘bedrooms’: The number of bedrooms in the house.
‘bathrooms’: The number of bathrooms in the house.
‘stories’: The number of stories or floors in the house.
‘mainroad’: Categorical variable indicating whether the house is located near the main road or not.
‘guestroom’: Categorical variable indicating whether the house has a guest room or not.
‘basement’: Categorical variable indicating whether the house has a basement or not.
‘hotwaterheating’: Categorical variable indicating whether the house has hot water heating or not.
‘airconditioning’: Categorical variable indicating whether the house has air conditioning or not.
‘parking’: The number of parking spaces available with the house.
‘prefarea’: Categorical variable indicating whether the house is in a preferred area or not.
‘furnishingstatus’: The furnishing status of the house (unfurnished, semi-furnished or furnished).
This is a collection of data pertaining to 545 houses for sale. The unit of observation is a house for sale; the sample size is 545. There are 13 variables (described above).
The dataset is from the Kaggle website (https://www.kaggle.com/datasets/ashydv/housing-dataset).
any(is.na(house_price)) # Check whether the dataset contains any missing values.
## [1] FALSE
There is no missing data in my dataset.
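If missing values had been present, a per-column count would show where they are. The following is only a minimal sketch on a hypothetical toy data frame (`toy` is not the Kaggle file), using base R functions:

```r
# Hypothetical toy data with two missing values (illustration only)
toy <- data.frame(
  price    = c(100, NA, 300),
  area     = c(50, 60, NA),
  basement = c("yes", "no", "yes")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows without any missing value
n_complete <- sum(complete.cases(toy))

print(na_per_column)
print(n_complete)
```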
house_price <- house_price[, c("price", "area", "basement")] # Keep only the three variables used in the model; all other columns are dropped.
house_price$basement <- factor(house_price$basement,
levels = c("yes", "no"),
labels = c("yes", "no")) # Convert basement to a factor; "yes" is the first (reference) level.
head(house_price, 10)
## price area basement
## 1 13300000 7420 no
## 2 12250000 8960 no
## 3 12250000 9960 yes
## 4 12215000 7500 yes
## 5 11410000 7420 yes
## 6 10850000 7500 yes
## 7 10150000 8580 no
## 8 10150000 16200 no
## 9 9870000 8100 yes
## 10 9800000 5750 no
summary(house_price)
## price area basement
## Min. : 1750000 Min. : 1650 yes:191
## 1st Qu.: 3430000 1st Qu.: 3600 no :354
## Median : 4340000 Median : 4600
## Mean : 4766729 Mean : 5151
## 3rd Qu.: 5740000 3rd Qu.: 6360
## Max. :13300000 Max. :16200
Price: The minimum house price is 1,750,000 Rupees and the maximum is 13,300,000 Rupees. The average house price is 4,766,729 Rupees.
Area: The minimum house area is 1,650 square feet and the maximum is 16,200 square feet. 25% of the houses have an area below 3,600 square feet. The median area is 4,600 square feet, meaning that half of the houses have an area of 4,600 square feet or less.
Basement: There are 191 houses with a basement and 354 houses without a basement.
#install.packages("psych")
library(psych)
describe(house_price) # describe() gives overall statistics; describeBy() is meant for summaries by group.
## vars n mean sd median trimmed mad min
## price 1 545 4766729.25 1870439.62 4340000 4559299.43 1556730.00 1750000
## area 2 545 5150.54 2170.14 4600 4908.41 2060.81 1650
## basement* 3 545 1.65 0.48 2 1.69 0.00 1
## max range skew kurtosis se
## price 13300000 11550000 1.21 1.91 80120.83
## area 16200 14550 1.31 2.69 92.96
## basement* 2 1 -0.63 -1.61 0.02
For price we observe positive skewness (1.21), which indicates a right-skewed distribution (the tail is longer on the right side). The positive excess kurtosis (1.91) suggests that the distribution has heavier tails than a normal distribution.
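To make the two statistics concrete, here is a minimal sketch of the plain moment-based formulas, run on simulated right-skewed data rather than on the house prices. The values from `psych` can differ slightly because the package applies a small-sample adjustment by default.

```r
set.seed(42)
x <- rexp(1000) # simulated right-skewed data (illustration only)

# Central moments of order 2, 3 and 4
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

skewness <- m3 / m2^1.5            # > 0 means a longer right tail
excess_kurtosis <- m4 / m2^2 - 3   # > 0 means heavier tails than a normal distribution

print(c(skewness = skewness, excess_kurtosis = excess_kurtosis))
```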
hist(house_price$price,
main = "Histogram of House Prices",
xlab = "Price of houses",
ylab = "Number of houses",
col = "lightblue",
border = "black")
In my regression model I will use area and basement to examine whether these two variables influence the price of a house. I expect that a larger area and having a basement will positively affect the house price.
RQ: HOW DO AREA AND BASEMENT INFLUENCE THE PRICE OF A HOUSE?
fit <- lm(price ~ area + basement,
data = house_price)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(price ~ area,
ylab = "Price of house",
xlab = "Area size",
smooth = FALSE,
data = house_price)
numeric_vars <- house_price[, c("price", "area")] # Renamed so the object does not mask the base function numeric().
scatterplotMatrix(numeric_vars, smooth = FALSE) # Same relationship as before, also shown with the axes swapped.
As observed, the assumption of linearity is met (no curved trend).
house_price$Stdfitted <- as.numeric(scale(fit$fitted.values)) # scale() returns a matrix, so convert it to a plain vector.
house_price$StdResid <- round(rstandard(fit), 3)
scatterplot(y = house_price$StdResid, x = house_price$Stdfitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 90.88678
## Prob > Chi2 = 1.521327e-21
H0: Variance is constant.
H1: Variance is not constant.
Based on the BP test, we can reject the null hypothesis (p<0.001), meaning that heteroskedasticity is present. That is why I will use robust standard errors.
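The idea behind the test can be sketched by hand: regress the squared residuals on the fitted values and check whether that auxiliary regression explains anything. The sketch below uses simulated data and the studentized (n times R-squared) form of the statistic, which is an assumption on my part and may not reproduce the exact number printed by `ols_test_breusch_pagan()`.

```r
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x) # error spread grows with x (simulated)

m <- lm(y ~ x)

# Auxiliary regression: squared residuals on the fitted values
aux <- lm(residuals(m)^2 ~ fitted(m))

# Test statistic and p-value (chi-squared with 1 degree of freedom)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value here leads to the same conclusion as in the report: the constant-variance hypothesis is rejected.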
vif(fit)
## area basement
## 1.002253 1.002253
mean(vif(fit))
## [1] 1.002253
In our case the VIFs of the individual variables and the mean VIF are all close to 1 (which is what we want), so multicollinearity is not a concern.
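The two VIFs are identical because, with only two explanatory variables, each one is regressed on the other and both regressions have the same R-squared. A minimal sketch on simulated data (the variables `area` and `basement_no` below are made up and only mimic the structure of the model):

```r
set.seed(7)
n <- 300
area <- runif(n, 1500, 16000)      # simulated areas
basement_no <- rbinom(n, 1, 0.65)  # simulated dummy: 1 = no basement

# R-squared of each explanatory variable regressed on the other
r2_area <- summary(lm(area ~ basement_no))$r.squared
r2_basement <- summary(lm(basement_no ~ area))$r.squared

# VIF = 1 / (1 - R-squared)
vif_area <- 1 / (1 - r2_area)
vif_basement <- 1 / (1 - r2_basement)

print(c(area = vif_area, basement = vif_basement))
```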
house_price$StdResid <- round(rstandard(fit), 3)
hist(house_price$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals")
Some units fall outside the ±3 threshold.
shapiro.test(house_price$StdResid)
##
## Shapiro-Wilk normality test
##
## data: house_price$StdResid
## W = 0.94268, p-value = 1.166e-13
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
Based on the Shapiro-Wilk test, we can reject the null hypothesis (p<0.001). However, because the sample is large (n = 545), I will proceed as if the normality assumption holds.
house_price$CooksD <- round(cooks.distance(fit), 3)
hist(house_price$CooksD,
xlab = "Cook's distances",
ylab = "Frequency",
main = "Histogram of Cook's distances")
In this histogram I observe a gap (between 0.05 and 0.07). Let's continue by checking the values…
head(house_price[order(-house_price$CooksD), c(1,2,3,5,6)], 10)
## price area basement StdResid CooksD
## 404 3500000 12944 no -2.999 0.083
## 126 5943000 15600 no -2.216 0.079
## 3 12250000 9960 yes 3.166 0.047
## 2 12250000 8960 no 3.863 0.044
## 1 13300000 7420 no 4.986 0.041
## 212 4900000 12900 no -2.072 0.039
## 4 12215000 7500 yes 3.856 0.036
## 5 11410000 7420 yes 3.359 0.027
## 6 10850000 7500 yes 2.974 0.021
## 7 10150000 8580 no 2.615 0.018
Cook's distance shows which rows in the data set have the greatest influence on the model.
house_price <- house_price[!(house_price$StdResid < -3),]
house_price <- house_price[!(house_price$StdResid > 3),] # Remove all units with standardized residuals above 3 or below -3.
house_price <- house_price[!(house_price$CooksD > 0.050),] # Remove units with Cook's distance above 0.05 (houses 404 and 126 have the most influence).
head(house_price[order(-house_price$CooksD), c(1,2,3,5,6)])
## price area basement StdResid CooksD
## 212 4900000 12900 no -2.072 0.039
## 6 10850000 7500 yes 2.974 0.021
## 7 10150000 8580 no 2.615 0.018
## 192 5040000 10700 yes -1.731 0.017
## 67 6930000 13200 yes -1.252 0.016
## 187 5110000 11410 no -1.485 0.014
Here we removed the observations with standardized residuals below -3 (lower end) or above 3 (upper end).
We also removed the observations with Cook's distances greater than 0.050.
house_price <- house_price[!(house_price$CooksD > 0.021),] # House 212 (Cook's distance 0.039) now stands out at the top of the table, so I remove it as well.
head(house_price[order(-house_price$CooksD), c(1,2,3,5,6)])
## price area basement StdResid CooksD
## 6 10850000 7500 yes 2.974 0.021
## 7 10150000 8580 no 2.615 0.018
## 192 5040000 10700 yes -1.731 0.017
## 67 6930000 13200 yes -1.252 0.016
## 187 5110000 11410 no -1.485 0.014
## 402 3500000 9500 no -1.959 0.014
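The filtering steps can also be written as one condition, followed by a refit so that the diagnostics describe the cleaned data. This is only a sketch on simulated data with one artificial outlier; the object names (`toy`, `fit_all`, `fit_clean`) are hypothetical, and the refit step is my addition, since the report keeps the residuals and Cook's distances from the first fit.

```r
set.seed(99)
n <- 100
toy <- data.frame(area = runif(n, 1500, 10000))
toy$price <- 2e6 + 450 * toy$area + rnorm(n, sd = 1e6)
toy$price[1] <- 3e7 # one artificial outlier (illustration only)

fit_all <- lm(price ~ area, data = toy)
toy$StdResid <- rstandard(fit_all)
toy$CooksD <- cooks.distance(fit_all)

# Keep rows inside the +/- 3 band and below the Cook's distance cut-off
keep <- abs(toy$StdResid) <= 3 & toy$CooksD <= 0.05
toy_clean <- toy[keep, ]

# Refit and recompute the diagnostics on the cleaned data
fit_clean <- lm(price ~ area, data = toy_clean)
toy_clean$StdResid <- rstandard(fit_clean)
toy_clean$CooksD <- cooks.distance(fit_clean)

print(nrow(toy) - nrow(toy_clean)) # number of removed rows
```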
The assumption of independent errors is met: each house is observed only once, with a single value for price, area and basement. If the observations are independent, we assume the errors are independent as well.
This assumption is met.
#install.packages("estimatr")
library(estimatr)
fit_robust <- lm_robust (price ~ area + basement,
se_type = "HC1",
data = house_price)
summary(fit_robust)
##
## Call:
## lm_robust(formula = price ~ area + basement, data = house_price,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2722172.4 174846.01 15.569 2.618e-45 2378699.1 3065645.7 532
## area 460.3 31.61 14.565 1.105e-40 398.2 522.4 532
## basementno -585356.8 125206.06 -4.675 3.728e-06 -831315.7 -339397.9 532
##
## Multiple R-squared: 0.3449 , Adjusted R-squared: 0.3425
## F-statistic: 129.5 on 2 and 532 DF, p-value: < 2.2e-16
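The HC1 standard errors in this output come from a "sandwich" formula that lets each observation have its own error variance. The following is a minimal sketch of that calculation in base R on simulated data; it illustrates the formula and is not the internal code of `lm_robust()`.

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x) # heteroskedastic errors (simulated)

m <- lm(y ~ x)
X <- model.matrix(m) # design matrix with intercept
e <- residuals(m)
k <- ncol(X)

bread <- solve(crossprod(X)) # (X'X)^-1
meat <- crossprod(X * e)     # X' diag(e^2) X

# HC1: sandwich estimator with the n / (n - k) small-sample correction
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
se_hc1 <- sqrt(diag(vcov_hc1))
se_classic <- sqrt(diag(vcov(m)))

print(rbind(classic = se_classic, HC1 = se_hc1))
```

The coefficient estimates are the same as in an ordinary `lm()` fit; only the standard errors, and therefore the t values and confidence intervals, change.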
All three coefficients have p-values below 0.05, which shows that they are statistically significant, so I will interpret them.
Intercept: The value 2,722,172.4 is the predicted price of a house with a basement (the reference category) and an area of zero, which is not a realistic case (a house must have an area greater than 0).
Area: The coefficient is 460.3, which means that for every additional square foot of area the predicted house price increases on average by 460.3 Rupees, holding basement constant.
Basement: The coefficient is -585,356.8, which means that for houses without a basement, compared to those with a basement, the predicted house price is on average lower by 585,357 Rupees, holding area constant.
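As a worked example of these interpretations, the reported coefficients can be plugged into the regression equation. The area of 5,000 square feet and the helper `predict_price()` are hypothetical choices made only for this illustration.

```r
# Coefficients copied from the lm_robust output
b_intercept <- 2722172.4
b_area <- 460.3
b_basement_no <- -585356.8

# Hypothetical helper: predicted price for a given area and basement status
predict_price <- function(area, has_basement) {
  b_intercept + b_area * area + b_basement_no * (!has_basement)
}

price_with <- predict_price(5000, has_basement = TRUE)
price_without <- predict_price(5000, has_basement = FALSE)

print(c(with_basement = price_with, without_basement = price_without))
```

The difference between the two predictions equals the basement coefficient, whatever area is chosen.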
Multiple R-squared is 0.3449, suggesting that 34.49% of the variability in house price is explained by the linear effect of the area and basement variables.
F-statistic:
H0: ρ² = 0
H1: ρ² > 0
Based on the F-test, we can reject the null hypothesis (p<0.001). The value of the F-statistic is 129.5. This indicates that the regression model is statistically significant and that at least one of the explanatory variables affects house price.
sqrt(summary(fit_robust)$r.squared)
## [1] 0.5873179
The square root of the multiple R-squared is the multiple correlation coefficient, which shows the strength of the relationship between house price and the explanatory variables. This relationship (0.587) is semi-strong.
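Another way to read this number: for a least-squares model with an intercept, it equals the correlation between the observed and the fitted values. A minimal sketch on simulated data (not the house data):

```r
set.seed(2024)
n <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y <- 1 + 2 * x1 - x2 + rnorm(n) # simulated response

m <- lm(y ~ x1 + x2)

multiple_r <- sqrt(summary(m)$r.squared) # square root of R-squared
cor_y_fitted <- cor(y, fitted(m))        # correlation of observed and fitted values

print(c(multiple_r = multiple_r, cor_y_fitted = cor_y_fitted))
```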
To answer the RQ: Both a larger area and having a basement have a positive effect on house price. Not having a basement negatively affects house price.