Anna Kostiukovych

1. Import the dataset Apartments.xlsx

The readxl package has to be loaded first:

library(readxl)
Adata <- read_xlsx("Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

Which features most significantly affect the price of an apartment?

Does the availability of parking or a balcony influence the price more?

Do newer apartments command a price premium, even if they are farther from the city center?

3. Change categorical variables into factors. (0.5 p)

Adata$Parking <- factor(Adata$Parking, levels = c(0, 1), labels = c("No", "Yes"))
Adata$Balcony <- factor(Adata$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
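A quick check of the frequencies confirms the recoding (output omitted here):

table(Adata$Parking) # counts of "No" and "Yes"
table(Adata$Balcony)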

4. Test the hypothesis H0: Mu_Price = EUR 1900. What can you conclude? (1 p)

H0: The average price per m2 is EUR 1900.

H1: The average price per m2 is not EUR 1900.

To decide between the parametric t-test and a non-parametric alternative, the Shapiro-Wilk normality test is performed first:

shapiro.test(Adata$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  Adata$Price
## W = 0.94017, p-value = 0.0006513

Based on the p-value, which is below the typical threshold of 0.05, I reject the null hypothesis that the prices per m2 are normally distributed. Consequently, I proceed with the non-parametric Wilcoxon signed-rank test:

H0: The median price per m2 is equal to EUR 1900.

H1: The median price per m2 is not equal to EUR 1900.

wilcox.test(Adata$Price, mu=1900, alternative = "two.sided")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Adata$Price
## V = 2328, p-value = 0.02844
## alternative hypothesis: true location is not equal to 1900

The null hypothesis can be rejected at the 5% significance level (p = 0.02844): the median price per m2 is not equal to EUR 1900.
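For reference, the sample median itself can be computed directly (output omitted here):

median(Adata$Price) # sample median of price per m2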

5. Estimate the simple regression function: Price = f(Age). Save the results in object fit1 and explain the estimate of the regression coefficient, the coefficient of correlation and the coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = Adata)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Adata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Regression Coefficient (Slope): -8.975.

For every additional year of apartment age, the price per m2 decreases by approximately EUR 8.98, on average, assuming a linear relationship.

cor(Adata$Price, Adata$Age)
## [1] -0.230255

Coefficient of Correlation: -0.23.

There is a weak negative relationship between the price per m2 and age of the apartments.

Coefficient of Determination: 0.053.

About 5.3% of the variation in price per m2 is explained by the age of the apartment.
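Note that in a simple regression the coefficient of determination equals the squared coefficient of correlation, which can be verified directly:

cor(Adata$Price, Adata$Age)^2 # (-0.230255)^2 = approx. 0.053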

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
scatterplotMatrix(Adata[ , c(3,1,2)], smooth = FALSE)

A slight negative linear relationship exists between price and age: as age increases, the price per m2 tends to decrease. A clear negative relationship between the price and the distance from the city centre is also apparent: apartments farther from the city center have lower prices. The predictor variables (Age, Distance) are not highly correlated with each other, which means multicollinearity should not distort the regression.
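As a numeric counterpart to the scatterplot matrix, the pairwise correlations can be printed directly (output omitted here):

cor(Adata[ , c("Price", "Age", "Distance")]) # pairwise Pearson correlations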

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Adata)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Adata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

None of the explanatory variables cause multicollinearity as all VIF statistics are very close to 1.
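As a sketch of where these numbers come from: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. With only two predictors, both VIFs coincide:

r2_age <- summary(lm(Age ~ Distance, data = Adata))$r.squared # R^2 of Age on Distance
1 / (1 - r2_age) # approx. 1.0018, matching vif(fit2)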

9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

Adata$StdResid <- round(rstandard(fit2), 3) # standardized residuals
Adata$CooksD <- round(cooks.distance(fit2), 3) # Cook's distances

hist(Adata$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of standardized residuals")

The histogram of the standardized residuals indicates that the errors are most likely not normally distributed. It can also be seen from the histogram that the sample does not contain any units that are outliers (standardized residuals outside ±3).
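This can be confirmed numerically by counting residuals outside the ±3 band:

sum(abs(Adata$StdResid) > 3) # 0 means no outliers by this criterion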

hist(Adata$CooksD, xlab = "Cook's distance", ylab = "Frequency", main = "Histogram of Cook's distances")

From the distribution of Cook's distances I observe that some units have a considerably stronger impact on the estimation of the partial regression coefficients (Cook's distance of 0.10 and higher).

head(Adata[order(-Adata$CooksD),])
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32 
## 2    43       37  1740 No      No          1.44  0.104
## 3     2       11  2790 Yes     No          2.05  0.069
## 4     7        2  1760 No      Yes        -2.15  0.066
## 5    37        3  2540 Yes     Yes         1.58  0.061
## 6    40        2  2400 No      Yes         1.09  0.038

The two observations with high influence (Cook's distance of 0.10 and higher) can now be removed from the sample:

Adata <- Adata[Adata$CooksD < 0.1, ]

10. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

The regression is now re-estimated:

fit2 <- lm(Price ~ Age + Distance, data = Adata)
Adata$StdResid <- round(rstandard(fit2), 3) # standardized residuals
Adata$CooksD <- round(cooks.distance(fit2), 3) # Cook's distances
Adata$StdFitted <- scale(fit2$fitted.values) # standardized fitted values
scatterplot(y = Adata$StdResid, x = Adata$StdFitted, 
            ylab = "Standardized residuals", 
            xlab = "Standardized fitted values", 
            boxplots = FALSE, regLine = FALSE, smooth = FALSE)

From the scatter plot it cannot be concluded that homoskedasticity is violated, as the variance of the errors appears to be relatively constant across the fitted values. I additionally perform the Breusch-Pagan test for heteroskedasticity:

library(olsrr)
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

The Breusch-Pagan test for heteroskedasticity shows that the null hypothesis (the variance of the errors is constant) cannot be rejected at the 5% level (p = 0.052).
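As a cross-check, the same hypothesis can be tested with bptest() from the lmtest package; note that bptest() studentizes the test by default, so its p-value may differ slightly from the olsrr output:

library(lmtest)
bptest(fit2) # studentized Breusch-Pagan test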

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(Adata$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of standardized residuals")

The histogram of the standardized residuals indicates that the errors are most likely not normally distributed. I also formally test this assumption:

shapiro.test(Adata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Adata$StdResid
## W = 0.95952, p-value = 0.01044

The null hypothesis that the errors are normally distributed should be rejected (p = 0.01044 < 0.05).
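A normal Q-Q plot would offer a complementary visual check (plot not shown here):

qqnorm(Adata$StdResid, main = "Q-Q plot of standardized residuals")
qqline(Adata$StdResid) # reference line under normality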

12. Estimate fit2 again without the potentially problematic units and show the summary of the model. Explain all coefficients. (1 p)

The regression is now re-estimated:

fit2 <- lm(Price ~ Age + Distance, data = Adata)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Adata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

For each additional year of building age, the price per m2 decreases on average by approximately EUR 7.85, holding distance from the city centre constant.

For each additional km from the city center, the price drops by approximately EUR 23.95, on average, holding the age of the apartment constant.

About 49.7% of the variation in price per m2 is explained by age and distance from the centre together. The model as a whole is highly statistically significant (F = 39.49, p < 0.001), meaning at least one predictor has a non-zero coefficient.
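To illustrate, the fitted equation can be used to predict the price per m2 for a hypothetical apartment, say 10 years old and 5 km from the centre:

predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# 2490.112 - 7.850*10 - 23.945*5 = approx. EUR 2292 per m2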

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Adata)
# Parking and Balcony are included directly: they were previously
# converted to factors, with "No" as the reference category for both
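The dummy coding can be inspected explicitly; with R's default treatment contrasts, "No" is the reference level coded 0 (output omitted here):

contrasts(Adata$Parking) # "Yes" column: No = 0, Yes = 1
contrasts(Adata$Balcony)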

14. With the function anova, check if model fit3 fits the data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value = 0.028, I reject the null hypothesis that both models fit the data equally well. The more complex model fit3 provides a statistically significant improvement in model fit over fit2. That means the additional variables in fit3 add explanatory power to my regression.
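As a sketch of what anova computes here, the partial F-statistic can be reproduced from the table above:

# F = ((RSS_fit2 - RSS_fit3) / 2) / (RSS_fit3 / 78)
((5982100 - 5458696) / 2) / (5458696 / 78) # approx. 3.7395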

15. Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis which is being tested with the F-statistic, shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Adata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingYes   168.921     62.166   2.717  0.00811 ** 
## BalconyYes    -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

On average, having parking increases the price per m2 by approximately EUR 169, compared to not having parking, holding all else constant.

On average, having a balcony decreases the price per m2 by about EUR 7, but this effect is not statistically significant, so no real impact of a balcony on the price per m2 can be concluded.

The F-test checks the overall significance of the regression model.

H0: All slope coefficients are zero.

H1: At least one of the coefficients is not zero.

Since the p-value = 1.449e-12 < 0.001, I reject the null hypothesis. At least one of the predictors (Age, Distance, Parking, or Balcony) contributes significantly to explaining variation in price per m2.
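The point estimates could be complemented with 95% confidence intervals for all coefficients (output omitted here):

confint(fit3) # 95% confidence intervals by default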

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

fitted_val <- fit3$fitted.values

The residual is the difference between the actual price and the predicted (fitted) value:

Adata$Price[2] - fitted_val[2]
##        2 
## 422.9572

The result can be cross-checked against the residuals stored in the model object:

fit3$residuals[2]
##        2 
## 422.9572