library(readxl)
mydata <- read_xlsx("Apartments.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: the dataset contains 85 apartments with their Age (in years), Distance, Price (in € per square meter), and two 0/1 indicators, Parking and Balcony. We first recode the two indicators as labelled factors:
library(dplyr)
mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
head(mydata)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Conclusion
Since the p-value is 0.005, we reject the null hypothesis.
This shows that the average price per square meter of an apartment is significantly different from 1900€.
The sample mean is 2018.94€, which suggests the true mean is higher than 1900€.
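As a quick cross-check, the t-statistic can be reproduced by hand from the sample mean, standard deviation, and sample size; a minimal base-R sketch:
# Reproduce the one-sample t-test by hand (sketch; uses mydata from above)
xbar <- mean(mydata$Price)                  # sample mean, ≈ 2018.94
s    <- sd(mydata$Price)                    # sample standard deviation
n    <- nrow(mydata)                        # sample size, 85 (so df = 84)
t_stat <- (xbar - 1900) / (s / sqrt(n))     # should match t = 2.9022
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value, ≈ 0.0047
c(t = t_stat, p = p_val)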
fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata$Price, mydata$Age)
## [1] -0.230255
Conclusion
The estimated regression function is: Price = 2185.45 − 8.98 × Age
Regression coefficient: for each additional year of apartment age, the price per square meter decreases by approximately 8.98€ on average, assuming all other factors remain the same.
Coefficient of correlation: the correlation between Price and Age is approximately −0.23, indicating a weak negative linear relationship.
Coefficient of determination (R²): the R² value of 0.053 shows that about 5.3% of the variation in apartment prices is explained by their age.
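In simple linear regression R² is just the squared correlation coefficient, which is easy to verify directly:
# R² equals the squared Price-Age correlation in simple regression
cor(mydata$Price, mydata$Age)^2   # (-0.230255)^2 ≈ 0.053, matches Multiple R-squared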
library(car)
scatterplotMatrix(mydata[, c("Price", "Age", "Distance")],
                  smooth = FALSE,
                  main = "Scatterplot Matrix: Price, Age, Distance")
Conclusion
Based on a visual check of the scatterplot matrix, there is no strong correlation between the independent variables, so we do not observe any strong signs of multicollinearity.
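To back the visual impression with numbers, the pairwise correlations can be inspected as well (a small sketch):
# Pairwise correlations among the three plotted variables
cor(mydata[, c("Price", "Age", "Distance")])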
fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Conclusion:
In this case, both variables Age and Distance have VIF values of 1.0018, and the mean VIF is also 1.0018.
Since all VIF values are well below the threshold of 5, there is no indication of multicollinearity in the model.
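For intuition, each VIF can be reproduced by hand as 1 / (1 − R²j), where R²j comes from regressing one predictor on the other(s); a minimal sketch for Age:
# VIF for Age by hand: regress Age on the remaining predictor
r2_age <- summary(lm(Age ~ Distance, data = mydata))$r.squared
1 / (1 - r2_age)   # ≈ 1.0018, matches vif(fit2)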
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
mydata$ID <- seq(1, nrow(mydata))
head(mydata[order(mydata$StdResid), c("ID", "StdResid")], 3)
## ID StdResid
## 53 53 -2.152
## 13 13 -1.499
## 72 72 -1.499
head(mydata[order(-mydata$StdResid), c("ID", "StdResid")], 3)
## ID StdResid
## 38 38 2.577
## 33 33 2.051
## 2 2 1.783
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 6)
## ID CooksD
## 38 38 0.320
## 55 55 0.104
## 33 33 0.069
## 53 53 0.066
## 22 22 0.061
## 39 39 0.038
mydata_clean <- mydata %>%
  filter(!ID %in% c(38, 55))
Conclusion:
We used standardized residuals and Cook's distance to check for outliers and high-impact units in the model fit2.
We then created a new dataset without the two high-impact units (ID 38 and ID 55) to improve the model.
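As a complement to the visual inspection, a common rule of thumb (not used in the source analysis, added here for illustration) flags observations with Cook's distance above 4/n:
# Rule-of-thumb check: Cook's distance > 4/n (a common convention, assumption)
cutoff <- 4 / nrow(mydata)                         # 4/85 ≈ 0.047
mydata[mydata$CooksD > cutoff, c("ID", "CooksD")]  # flags ID 38 and 55, among a few others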
mydata$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid,
            x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
Conclusion:
From the plot we do not see any clear pattern, such as a funnel shape or increasing spread.
The residuals seem to be randomly scattered around zero across all fitted values. Therefore we assume homoscedasticity.
Below you can also see the formal test for homoscedasticity, which confirms that the error variance is constant (p = 0.33).
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95303, p-value = 0.003645
Conclusion
Since the p-value is 0.004, we reject the null hypothesis of the Shapiro-Wilk test. This means the errors are not normally distributed.
With a relatively large sample, the test detects even modest deviations from normality, so this result was expected, but it still needs to be considered when checking the assumptions of linear regression.
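A Q-Q plot is a useful visual complement to the Shapiro-Wilk test:
# Visual normality check of the standardized residuals
qqnorm(mydata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResid)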
The estimated regression function of fit2 is: Price = 2460.10 − 7.93 × Age − 20.67 × Distance
Interpretation of coefficients: holding Distance constant, each additional year of age lowers the expected price per square meter by about 7.93€ on average; holding Age constant, each additional unit of Distance lowers it by about 20.67€ on average.
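To illustrate how the regression function is used, here is a small prediction sketch for a hypothetical apartment (the values Age = 10 and Distance = 5 are made up for illustration):
# Predicted price per m² for a hypothetical apartment
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# by hand: 2460.101 - 7.934 * 10 - 20.667 * 5 ≈ 2277.4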
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
fit2_clean <- lm(Price ~ Age + Distance, data = mydata_clean)
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata_clean)
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion
Since the p-value is 0.03, we reject the null hypothesis. This means that adding the categorical variables Parking and Balcony leads to a significant improvement in the model. These variables therefore explain additional variation in apartment prices beyond what is already explained by Age and Distance.
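The partial F-statistic in the table can be reproduced by hand from the two residual sums of squares reported above:
# Partial F-test by hand: (ΔRSS / Δdf) / (RSS_full / df_full)
F_stat <- ((5982100 - 5458696) / 2) / (5458696 / 78)   # ≈ 3.74
pf(F_stat, df1 = 2, df2 = 78, lower.tail = FALSE)      # ≈ 0.028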
Categorical variables (from the fit3 summary above):
The coefficient of ParkingFYes is 168.92 (p-value = 0.0081): apartments with parking cost on average 168.92€ per m² more than those without parking, holding all other variables constant.
The coefficient of BalconyFYes is −6.99 (p-value = 0.9057), which is not statistically significant: there is no significant effect of having a balcony on the apartment price.
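The parking effect can be made concrete by predicting prices for two hypothetical apartments that differ only in parking (the values Age = 15 and Distance = 10 are made up for illustration):
# Two otherwise identical hypothetical apartments, without and with parking
newapts <- data.frame(Age = 15, Distance = 10,
                      ParkingF = factor(c("No", "Yes"), levels = c("No", "Yes")),
                      BalconyF = factor(c("No", "No"), levels = c("No", "Yes")))
diff(predict(fit3, newdata = newapts))   # ≈ 168.92€, the ParkingFYes coefficient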
The F-test at the bottom of the summary output tests the hypothesis H0: all regression coefficients are equal to zero, against H1: at least one coefficient differs from zero.
Since the p-value is 1.449e-12, we reject H0. This means the model with Age, Distance, Parking, and Balcony explains a significant amount of variation in average apartment prices.
fitted_vals <- fitted(fit3)
residual_vals <- resid(fit3)
fitted_vals[2]
## 2
## 2377.043
residual_vals[2]
## 2
## 422.9572
Manually:
2358.900 + (−7.197 ⋅ 18) + (−21.241 ⋅ 1) + (168.921 ⋅ 1) + (−6.985 ⋅ 0) = 2377.03€ (the small difference from 2377.043 is due to coefficient rounding)
Residual for ID 2: 2800€ − 2377.03€ ≈ 423€
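The same fitted value is returned directly by predict(), confirming the manual calculation:
# Cross-check: row 2 of the cleaned data is apartment ID 2
predict(fit3, newdata = mydata_clean[2, ])   # ≈ 2377.04, as above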