HW 3

Daniil Li

1. Import the dataset Apartments.xlsx

library(readxl)

apartments <- read_excel("Apartments.xlsx")
head(apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2.What could be a possible research question given the data you analyze? (1 p)

How do apartment’s age, proximity to the city center, along with conveniences like having parking and balcony, affect its price per square meter?

3. Change categorical variables into factors. (0.5 p)

apartments$ParkingF <- factor(apartments$Parking, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

apartments$BalconyF <- factor(apartments$Balcony,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

str(apartments)

## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
##  $ ParkingF: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ BalconyF: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

This is a one-sample t-test with:

\(H_0\): The true mean price \(= 1900\)
\(H_1\): The true mean price \(\ne 1900\)

t.test(apartments$Price, 
       mu = 1900, 
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The average price per square meter is significantly different from 1900 EUR (at p-value = 0.005), with a 95% confidence interval ranging from 1937 to 2100 EUR.

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = apartments)

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(apartments$Price, apartments$Age) #check for the sign

## [1] -0.230255

sqrt(summary(fit1)$r.squared)

## [1] 0.230255

The slope estimate for Age is -8.975, meaning that for each additional year of apartment age, the price per m\(^2\) decreases by approximately EUR 8.98, on average. The correlation (\(r\)) of -0.23 indicates a weak negative linear relationship between the variables, and the \(R^2\) of 0.053 means that about 5.3% of the variation in apartment prices can be explained by its age.

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

scatterplotMatrix(~ Price + Age + Distance, data = apartments,
                  smooth = FALSE)

From the scatterplot of Age and Distance, we can clearly see there is almost no correlation between two, hence multicollinearity is not a concern in our case.

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Both Age and Distance have VIF values very close to 1, therefore we observe no multicollinearity.

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

apartments$StdResid <- round(rstandard(fit2), 3)  
apartments$CooksD <- round(cooks.distance(fit2), 3)  

hist(apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  apartments$StdResid
## W = 0.95303, p-value = 0.003645

According to Shapiro-Wilk normality test, we would reject the null hypothesis that errors are normally distributed. However, our sample is fairly large (>80), so the central limit theorem helps the t and F statistics remain reasonably robust, even with mild departures from normality. Also, with big samples, any small deviation from normality can show up as significant in Shapiro-Wilk test. So the assumption of normality is less important to us and we can continue even while it is violated, in our case.

hist(apartments$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

We can see a big jump in the histogram, so we will definitely have to remove some potential observations with high influence.

apartments$ID <- seq(1, nrow(apartments))

head(apartments[order(apartments$StdResid), c("ID", "StdResid")], 3)

## # A tibble: 3 × 2
##      ID StdResid
##   <int>    <dbl>
## 1    53    -2.15
## 2    13    -1.50
## 3    72    -1.50

head(apartments[order(-apartments$StdResid), c("ID", "StdResid")], 3)

## # A tibble: 3 × 2
##      ID StdResid
##   <int>    <dbl>
## 1    38     2.58
## 2    33     2.05
## 3     2     1.78

We can see that the standard residuals fall between [-3; 3] interval. So, we do not observe any outliers.

head(apartments[order(-apartments$CooksD), c("ID", "CooksD")], 6)

## # A tibble: 6 × 2
##      ID CooksD
##   <int>  <dbl>
## 1    38  0.32 
## 2    55  0.104
## 3    33  0.069
## 4    53  0.066
## 5    22  0.061
## 6    39  0.038

However, we can clearly see that ID 38 is an observation with a high influence, even though it is less than 0.5 or 1 (general rule of thumb), because of the huge difference in values between the next biggest value (0.32 vs 0.10). Therefore, we remove the ID 38.

library(dplyr)

apartments <- apartments %>%
  filter(!ID == 38)

Now, when we dropped one observation, let’s fit the function and calculate the values again.

fit2_new <- lm(Price ~ Age + Distance, data = apartments)

apartments$StdResid <- round(rstandard(fit2_new), 3)  
apartments$CooksD <- round(cooks.distance(fit2_new), 3)

hist(apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

head(apartments[order(apartments$StdResid), c("ID", "StdResid")], 3)

## # A tibble: 3 × 2
##      ID StdResid
##   <int>    <dbl>
## 1    53    -2.24
## 2    13    -1.49
## 3    72    -1.49

head(apartments[order(-apartments$StdResid), c("ID", "StdResid")], 3)

## # A tibble: 3 × 2
##      ID StdResid
##   <int>    <dbl>
## 1    33     2.22
## 2    25     1.82
## 3     2     1.78

Standard residuals are between [-3; 3].

hist(apartments$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

No big jumps observed, so we can safely move on.

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (0.5 p)

library(car)

apartments$StdFitted <- scale(fit2_new$fitted.values)

scatterplot(y = apartments$StdResid, x = apartments$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)

ols_test_breusch_pagan(fit2_new)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

From Breusch Pagan Test for Heteroskedasticity, we reject the null hypothesis that the variance is constant (p-value = 0.09), so we would have to use robust standard errors.

11. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

As already done in task 9, we concluded that standardized residuals are not normally distributed. Let’s check again, but this time with Q-Q plot.

library(ggpubr)
ggqqplot(rstandard(fit2_new))

shapiro.test(rstandard(fit2_new))

## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(fit2_new)
## W = 0.9565, p-value = 0.00636

We can see that some of the points clearly fall outside the CI, and the Shapiro-Wilk normality test confirms non-normal distribution of errors. However, as mentioned already, our sample is fairly large (>80), so the central limit theorem helps the t and F statistics remain reasonably robust, even with mild departures from normality. Also, with big samples, any small deviation from normality can show up as significant in Shapiro-Wilk test. So the assumption of normality is less important to us and we can continue even while it is violated, in our case.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

library(estimatr)

fit2_robust <- lm_robust(Price ~ Age + Distance,
                        se_type = "HC1",  # White's robust standard errors
                        data = apartments)

summary(fit2_robust)

## 
## Call:
## lm_robust(formula = Price ~ Age + Distance, data = apartments, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower  CI Upper DF
## (Intercept) 2456.076     86.333  28.449 6.358e-44  2284.30 2627.8528 81
## Age           -6.464      3.338  -1.936 5.633e-02   -13.11    0.1786 81
## Distance     -22.955      2.598  -8.837 1.672e-13   -28.12  -17.7860 81
## 
## Multiple R-squared:  0.4838 ,    Adjusted R-squared:  0.4711 
## F-statistic: 40.12 on 2 and 81 DF,  p-value: 7.781e-13

Intercept: The estimated price when Age = 0 and Distance = 0 is 2456.
Coefficient for Age: The price decreases by 6.464 (in EUR/m\(^2\)) on average for each additional year of apartment age, holding distance constant (p-value = 0.056, not significant).
Coefficient for Distance: The price decreases by 22.955 (in EUR/m\(^2\)) on average for each additional 1 km from the center, holding age constant (p-value < 0.001, highly significant).
48.38% of variability in Price is explained by the linear effect of Age and Distance.

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm_robust(Price ~ Age + Distance + ParkingF + BalconyF,
                  se_type = "HC1",
                  data = apartments)
summary(fit3)

## 
## Call:
## lm_robust(formula = Price ~ Age + Distance + ParkingF + BalconyF, 
##     data = apartments, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower  CI Upper DF
## (Intercept) 2329.724    102.132 22.8109 1.659e-36  2126.43 2533.0123 79
## Age           -5.821      3.227 -1.8038 7.507e-02   -12.24    0.6022 79
## Distance     -20.279      2.860 -7.0902 5.021e-10   -25.97  -14.5861 79
## ParkingFYes  167.531     64.716  2.5887 1.146e-02    38.72  296.3457 79
## BalconyFYes  -15.207     59.330 -0.2563 7.984e-01  -133.30  102.8873 79
## 
## Multiple R-squared:  0.5275 ,    Adjusted R-squared:  0.5035 
## F-statistic: 24.02 on 4 and 79 DF,  p-value: 5.078e-13

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2_robust, fit3)

## Error in UseMethod("anova"): no applicable method for 'anova' applied to an object of class "lm_robust"

Since we are using lm_robust() instead of lm(), the resulting object is of class “lm_robust”, which does not have a built‐in anova() method, therefore we get the error: Error in UseMethod(“anova”) : no applicable method for ‘anova’ applied to an object of class “lm_robust”. In this case, we could just compare adjusted R-squared – and if we do, then model 3 fits data better than model 2 according to the adjusted R-squared (0.5035 > 0.4711). However, let’s run the models without using robust standard errors, in order to make use of anova() function (we will see later that all the coefficients remain the same).

library(stargazer)
fit2_lm <- lm(Price ~ Age + Distance, data = apartments)
fit3_lm <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)

stargazer(fit2_lm, fit3_lm, type="text",
          title = "Fit 2 vs Fit 3",
          align = TRUE, 
          no.space = TRUE)

## 
## Fit 2 vs Fit 3
## =================================================================
##                                  Dependent variable:             
##                     ---------------------------------------------
##                                         Price                    
##                              (1)                    (2)          
## -----------------------------------------------------------------
## Age                        -6.464**               -5.821*        
##                            (3.159)                (3.074)        
## Distance                  -22.955***             -20.279***      
##                            (2.786)                (2.886)        
## ParkingFYes                                      167.531***      
##                                                   (62.864)       
## BalconyFYes                                       -15.207        
##                                                   (59.201)       
## Constant                 2,456.076***           2,329.724***     
##                            (73.931)               (93.066)       
## -----------------------------------------------------------------
## Observations                  84                     84          
## R2                          0.484                  0.527         
## Adjusted R2                 0.471                  0.504         
## Residual Std. Error   276.146 (df = 81)      267.536 (df = 79)   
## F Statistic         37.959*** (df = 2; 81) 22.045*** (df = 4; 79)
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

anova(fit2_lm, fit3_lm)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that all the coefficients in both models are the same in both cases (with and without White’s robust standard errors).

Model fit3_lm does indeed fit data better than model fit2_lm (p-value = 0.0306).

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm_robust(formula = Price ~ Age + Distance + ParkingF + BalconyF, 
##     data = apartments, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower  CI Upper DF
## (Intercept) 2329.724    102.132 22.8109 1.659e-36  2126.43 2533.0123 79
## Age           -5.821      3.227 -1.8038 7.507e-02   -12.24    0.6022 79
## Distance     -20.279      2.860 -7.0902 5.021e-10   -25.97  -14.5861 79
## ParkingFYes  167.531     64.716  2.5887 1.146e-02    38.72  296.3457 79
## BalconyFYes  -15.207     59.330 -0.2563 7.984e-01  -133.30  102.8873 79
## 
## Multiple R-squared:  0.5275 ,    Adjusted R-squared:  0.5035 
## F-statistic: 24.02 on 4 and 79 DF,  p-value: 5.078e-13

Intercept (\(\beta_0\)):
The predicted price per m\(^2\) when \(\text{Age} = 0\), \(\text{Distance} = 0\) and with the baseline categories (Parking = No, Balcony = No) is about 2329.72 EUR/m\(^2\).
Age (\(\beta_1\)):
Each additional 1 year in apartment age is associated with 5.82 EUR less in price per m\(^2\), on average, holding everything else constant.
The p-value \(\approx 0.075\) suggests this effect is not significant.
Distance (\(\beta_2\)):
For every 1 km further from the city center, the price per m\(^2\) decreases by about 20.28 EUR, on average, holding everything else constant (\(p < 0.001\), highly significant).
ParkingFYes (\(\beta_3\)):
Holding age, distance, and balcony the same, having a parking spot adds about 167.53 EUR to the price per m\(^2\) versus no parking, on average (\(p \approx 0.012\), highly significant).
BalconyFYes (\(\beta_4\)):
Holding age, distance, and parking the same, having a balcony reduces price by about 15.21 EUR per m\(^2\) relative to no balcony, on average (\(p \approx 0.8\), not significant).
Null Hypothesis \(H_0\):
All of partial regression coefficients are 0.
Alternative Hypothesis \(H_1\):
At least one of the coefficients is not zero.

Because the p-value is extremely small (\(5.078 \times 10^{-13}\)), we reject \(H_0\) and conclude that, collectively, these predictors do indeed help explain variability in apartment price.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

apartments$Fit3_Fitted <- fit3$fitted.values

apartments$Price[apartments$ID=="2"] - apartments$Fit3_Fitted[apartments$ID=="2"]

##        2 
## 427.8029