Homework 3

Sara Bračun Duhovnik

About the data:

I decided to analyse data on housing price prediction. The dataset comes from Kaggle.com (https://www.kaggle.com/datasets/muhammadbinimran/housing-price-prediction-data). I cleaned the data and converted the neighbourhood variable into a factor.

data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/housingprice_dataset.xlsx")
colnames(mydata) <- c("ID", "Sq_feet", "Neighbourhood", "Price")
set.seed(1)
mydata1 <- mydata[sample(nrow(mydata), 300), ]
mydata1$Neighbourhood <- factor(mydata1$Neighbourhood,
        levels = c("Urban", "Rural"),
        labels = c("Urban", "Rural"))
head(mydata1, 15)
## # A tibble: 15 × 4
##       ID Sq_feet Neighbourhood   Price
##    <dbl>   <dbl> <fct>           <dbl>
##  1 24388    2891 Urban         324842.
##  2  4050    1337 Urban         143337.
##  3 11571    2786 Urban         331326.
##  4 25173    1464 Rural         177784.
##  5 32618    1024 Rural          84251.
##  6 13903    2682 Urban         306779.
##  7  8229    1114 Rural         175469.
##  8 25305    2513 Urban         300529.
##  9 25061    2642 Rural         264384.
## 10 22306    2303 Rural         291378.
## 11 12204    1066 Urban          80667.
## 12  7075    1326 Rural         158061.
## 13 26954    1412 Rural         177194.
## 14 31276    2906 Rural         272761.
## 15 16044    1120 Urban         123791.

Description:

  • Unit of observation: one house
  • Sample size: 300 observations
  • Number of variables: 4

Definitions of all variables:

  • ID: identification number of the house
  • Sq_feet: size of the house (in square feet)
  • Neighbourhood: area where the house is located (Urban or Rural)
  • Price: price of the house (in $)

From my selected dataset I can derive the following regression function: Price = beta0 + beta1 x Sq_feet + beta2 x Neighbourhood_Rural, where Neighbourhood_Rural is a dummy variable equal to 1 for houses in rural areas and 0 for houses in urban areas (R labels it NeighbourhoodRural in the output below).
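
A minimal sketch (not part of the original analysis) of how R's default treatment coding turns the Neighbourhood factor into this dummy variable, with Urban as the reference level:

# Inspect the design matrix R builds for the model above;
# the column NeighbourhoodRural is a 0/1 dummy (1 = Rural, 0 = Urban)
head(model.matrix(~ Sq_feet + Neighbourhood, data = mydata1), 3)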

Predicted effects on the dependent variable (Price):

  • the larger the house (in square feet), the higher the price
  • houses in urban areas are expected to have a higher price than houses in rural areas
summary(mydata1[colnames(mydata1) %in% c("Sq_feet", "Neighbourhood", "Price")])
##     Sq_feet     Neighbourhood     Price       
##  Min.   :1008   Urban:131     Min.   : 17546  
##  1st Qu.:1464   Rural:169     1st Qu.:166782  
##  Median :1972                 Median :220478  
##  Mean   :2005                 Mean   :223819  
##  3rd Qu.:2569                 3rd Qu.:283923  
##  Max.   :2993                 Max.   :439471
Interpretation:

The average size of a house across both areas combined is 2005 square feet. The minimum price of a house is $17,546 and 75% of the houses are priced up to $283,923. There are more observed houses in rural areas (169) than in urban areas (131).
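
To back up these descriptive statements, the counts and mean prices could also be computed by neighbourhood (a small optional sketch using base R, not part of the original output):

# Number of houses and average price in each neighbourhood
table(mydata1$Neighbourhood)
aggregate(Price ~ Neighbourhood, data = mydata1, FUN = mean)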

Research question 1: Do size and location (neighbourhood) of a house influence its price?

- H0: There is no relationship between price, size and location of a house.
- H1: There is a relationship between price, size and location of a house.

Before performing the regression analysis, we have to check whether all assumptions and additional requirements are met:

  • Linearity in parameters: I checked this with a scatterplot –> MET
  • The expected value of the error is 0: all relevant explanatory variables are included in the right form and the regression constant is included - it is important that the model is built on theory
  • Homoscedasticity: constant variance of the errors - I checked this with the Breusch-Pagan test –> MET
  • Normal distribution of errors: my sample is big enough, but I will still check this with the Shapiro-Wilk test –> MET
  • Errors are independent: each unit is observed only once –> MET
  • No perfect multicollinearity: no explanatory variable is an exact linear combination of the remaining explanatory variables –> MET
  • The number of units is greater than the number of estimated parameters –> MET

Additional requirements:

  • The dependent variable has to be numeric, while the explanatory variables can be numeric or categorical (dummy)
  • Non-zero variance of the explanatory variables
  • Absence of too strong multicollinearity: I checked that VIF < 5 for all variables
  • No outliers and no units with high impact: I checked that the standardised residuals lie in the interval [-3, +3] for outliers, and for high-impact units I looked at Cook's D
Scatterplot:

There is a linear relationship between Price and Sq_feet.

library(car)
## Loading required package: carData
scatterplotMatrix(mydata1[ , c(-1, -3)],
                  smooth = FALSE)

scatterplot(Price ~ Sq_feet | Neighbourhood,
            xlab = "Size (in square feet)",
            ylab = "Price of the house",
            main = "Price of the house based on neighbourhood",
            smooth = FALSE,
            data = mydata1)

Already based on the scatterplot above, I can see that the average price of a house in a rural area does not differ much from the average price of a house in an urban area. The slopes of the two regression lines are practically the same, so the increase in price per additional square foot is roughly equal in rural and urban areas.
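
This visual impression of parallel lines could also be checked formally with an interaction term. The sketch below (fit_int is an illustrative name, not part of the model used later) tests whether the Sq_feet slope differs between neighbourhoods; a non-significant interaction coefficient supports parallel lines:

# Optional check: does the effect of size differ between neighbourhoods?
fit_int <- lm(Price ~ Sq_feet * Neighbourhood, data = mydata1)
summary(fit_int)$coefficients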

fit1 <- lm(Price ~ Sq_feet + Neighbourhood,
           data = mydata1)
Shapiro-Wilk test:
mydata1$StdResid <- round(rstandard(fit1), 3)
mydata1$CooksD <- round(cooks.distance(fit1), 3)

hist(mydata1$StdResid,
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

  • H0: Errors are normally distributed
  • H1: Errors are not normally distributed.

I cannot reject H0 (p-value = 0.1212 > 0.05), therefore I can conclude that the errors are approximately normally distributed. The histogram above also looks roughly symmetric and bell-shaped, and no standardised residuals fall outside +/- 3.

shapiro.test(mydata1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.99228, p-value = 0.1212
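
As an additional visual check of normality (a sketch, not part of the original assignment), a normal Q-Q plot of the standardised residuals should lie close to the reference line:

# Q-Q plot of the standardised residuals against the normal distribution
qqnorm(mydata1$StdResid,
       main = "Normal Q-Q plot of standardised residuals")
qqline(mydata1$StdResid)
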
Cook's D:
hist(mydata1$CooksD,
     xlab = "Cook's D",
     ylab = "Frequency",
     main = "Histogram of Cook's D")

This histogram indicates that there are no units with a particularly high impact, because there are no isolated observations with much larger Cook's D values than the rest.
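
To go beyond the histogram, the observations flagged by the usual rules of thumb could be listed explicitly (a sketch; the 4/n cutoff for Cook's D is one common convention, not the only one):

# Potential outliers: |standardised residual| > 3
mydata1[abs(mydata1$StdResid) > 3,
        c("ID", "Sq_feet", "Neighbourhood", "Price", "StdResid")]
# Potential high-impact units: Cook's D above the common 4/n cutoff
mydata1[mydata1$CooksD > 4 / nrow(mydata1), c("ID", "CooksD")]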

mydata1$StdFitted <- scale(fit1$fitted.values)

library(car)
# Standardised residuals (y-axis) against standardised fitted values (x-axis)
scatterplot(x = mydata1$StdFitted, y = mydata1$StdResid,
            xlab = "Standardised fitted values",
            ylab = "Standardised residuals",
            boxplots = TRUE,
            smooth = FALSE)

Breusch-Pagan test:

From the graph above I can conclude that there is no heteroscedasticity present, as the points are randomly scattered around the horizontal line at zero with no funnel-shaped pattern, which suggests the model leaves no obvious systematic pattern in the residuals. To confirm this, I also performed the Breusch-Pagan test.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.3277539 
##  Prob > Chi2   =    0.5669847
  • H0: Variances are constant (homoscedasticity).
  • H1: Variances are not constant (heteroscedasticity).

Based on the sample data, I cannot reject H0 (p-value = 0.567 > 0.05) and I conclude that the error variances are constant, i.e. homoscedasticity is present.

VIF:
vif(fit1)
##       Sq_feet Neighbourhood 
##      1.000185      1.000185

As both VIF statistics are lower than 5, I can conclude that there is no excessively strong multicollinearity between the explanatory variables.

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Sq_feet + Neighbourhood, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -124902  -30137   -3015   27516  131941 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        23110.307   9668.078   2.390   0.0175 *  
## Sq_feet              100.782      4.405  22.877   <2e-16 ***
## NeighbourhoodRural -2407.386   5359.845  -0.449   0.6536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46040 on 297 degrees of freedom
## Multiple R-squared:  0.638,  Adjusted R-squared:  0.6355 
## F-statistic: 261.7 on 2 and 297 DF,  p-value: < 2.2e-16
cor_coef <- sqrt(summary(fit1)$r.squared)
cor_coef
## [1] 0.7987327

From the linear regression model above I can see that only the partial regression coefficient for the size of the house is statistically significant. If the size of the house increases by one square foot, the price of the house increases on average by $100.78, holding the neighbourhood constant (p-value < 0.001). The coefficient for neighbourhood is not statistically significant, therefore, based on the sample data, I cannot conclude that the price of the house is affected by the neighbourhood (p-value = 0.654). If the coefficient were statistically significant, I would interpret it like this: given the size of the house, the price of a house in a rural area is on average $2,407.39 lower than in an urban area.
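
The same conclusions can be read off 95% confidence intervals for the coefficients (a small optional sketch; an interval that contains zero corresponds to a coefficient that is not statistically significant at the 5% level):

# 95% confidence intervals for the regression coefficients
round(confint(fit1, level = 0.95), 2)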

  • H0: rho squared = 0
  • H1: rho squared > 0
  • p-value < 0.001

Based on sample data, I can reject H0 (p-value < 0.001) and conclude that at least one of the variables (size or neighbourhood) impacts the price of the house. This also answers my research question.
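
The F-test reported in summary(fit1) can also be obtained by comparing the model with an intercept-only model (a sketch; fit0 is an illustrative name):

# Overall F-test: full model vs. model with only the intercept
fit0 <- lm(Price ~ 1, data = mydata1)
anova(fit0, fit1)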

Based on the multiple correlation coefficient (0.80), I can conclude that the linear relationship between the price of a house and the two explanatory variables is strong.

I can say that 63.8% of the variability in house prices is explained by the two explanatory variables, size in square feet and neighbourhood.

mydata1$Fitted <- round(fit1$fitted.values, 1)
mydata1$Residual <- round(mydata1$Price - mydata1$Fitted, 1)
head(mydata1[ , c("Sq_feet", "Neighbourhood", "Price", "Fitted", "Residual")])
## # A tibble: 6 × 5
##   Sq_feet Neighbourhood   Price  Fitted Residual
##     <dbl> <fct>           <dbl>   <dbl>    <dbl>
## 1    2891 Urban         324842. 314472.   10370.
## 2    1337 Urban         143337. 157856.  -14519.
## 3    2786 Urban         331326. 303890.   27436 
## 4    1464 Rural         177784. 168248.    9536.
## 5    1024 Rural          84251. 123904.  -39653.
## 6    2682 Urban         306779. 293409.   13370
sum(mydata1$Residual)
## [1] -0.4

The sum of the (rounded) residuals is -0.4, which is essentially zero given that prices are in the hundreds of thousands of dollars; the small deviation from zero comes only from rounding the fitted values. This is expected, because OLS residuals always sum to zero when the model contains an intercept, so this check says little about the quality of the fit; the R-squared and the significance of the coefficients reported above are more informative for that.
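
The tiny deviation from zero indeed disappears if the unrounded residuals are summed directly (a quick sketch):

# OLS residuals sum to (numerically) zero when the model includes an intercept
sum(residuals(fit1))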