Lars Binger

1. Import the dataset Apartments.xlsx

library(readxl)
data <- read_xlsx("Apartments(1).xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

data$BalconyFactor <- factor(x=data$Balcony,
                      levels = c(1, 0),
                      labels = c("Balcony", "No Balcony"))

data$ParkingFactor <- factor(x=data$Parking,
                      levels = c(1, 0),
                      labels = c("Parking", "No Parking"))

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

shapiro.test(data$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Price
## W = 0.94017, p-value = 0.0006513
#P-value < 0.001, hence we reject the Null Hypothesis of normality.
#As the normality assumption is not fulfilled, a Wilcoxon Signed Rank Test is conducted instead of a t-test.

wilcox.test(data$Price,
            mu=1900,
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  data$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
#We reject the Null Hypothesis at p = 0.03. 

H0: Mu_Price = EUR 1900
H1: Mu_Price ≠ EUR 1900

Assumptions and Requirements:

1. Variable is numeric - TRUE

2. Normality

H0: Variable normally distributed. H1: Variable not normally distributed.

P-value < 0.001, hence we reject the Null Hypothesis of normality.

As the normality assumption is not fulfilled, a Wilcoxon Signed Rank Test is conducted instead of a t-test. We reject the Null Hypothesis at p = 0.03 and conclude that the (median) price per m2 differs from EUR 1900.
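
To see the direction of the deviation, the same test can be rerun with a confidence interval for the pseudo-median (a sketch; the exact interval depends on the data):

wilcox.test(data$Price,
            mu = 1900,
            correct = FALSE,
            conf.int = TRUE)  # adds a point estimate and CI for the pseudo-median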

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age,
          data = data)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(data$Price, data$Age)
## [1] -0.230255

Regression coefficient: Everything else equal, a one year increase in age decreases price per square meter by EUR 8.975, on average (p<0.05).

Coefficient of correlation: There is a weak negative correlation (r = -0.230) between price per square meter and age of the apartment. It can also be obtained as the square root of the R-squared value, with the sign taken from the slope.

Coefficient of determination: Age explains 5.302% of the total variation in price. This is a small-to-medium effect size.
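
As a cross-check, the correlation can be recovered from the model object (the sign is taken from the negative slope):

r2 <- summary(fit1)$r.squared  # 0.05302
-sqrt(r2)                      # -0.2303, matching cor(data$Price, data$Age)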

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Loading required package: carData
scatterplotMatrix(data[,1:3], smooth = FALSE)

The pairwise correlation between Distance and Age appears comparatively low, so I would not expect multicollinearity to be a serious problem. Nevertheless, to be sure, the VIF is calculated below.
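
A numeric complement to the matrix plot (assuming, as in the scatterplotMatrix() call above, that the first three columns are Age, Distance and Price):

round(cor(data[, 1:3]), 3)  # pairwise correlations between Age, Distance and Price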

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
          data = data)

7. Check the multicollinearity with the VIF statistic. Explain the findings. (1 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

With VIF values well below 5, we can assume that there is no problematic multicollinearity.
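
The VIF can also be reproduced by hand as 1/(1 - R2_j), where R2_j is the R-squared from regressing one predictor on the other (a sketch):

r2_age <- summary(lm(Age ~ Distance, data = data))$r.squared
1 / (1 - r2_age)  # reproduces vif(fit2)["Age"] = 1.001845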

8. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

data$StdResid <- round(rstandard(fit2), 3)
data$CooksDist <- round(cooks.distance(fit2), 3)

hist(data$CooksDist,
     xlab = "Cook's Distance",
     ylab = "Frequency",
     main = "Histogram of Cook's Distance")

#Removing problematic values
#For the standardized residuals we would only remove values outside the interval [-3, 3]. There are no such values in the dataset.
#For Cook's Distance, we remove values that are large relative to the others, for example the one between 0.30 and 0.35 in the histogram above. We do so until the histogram no longer shows isolated extreme values.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Removing the flagged units. Caution: filtering on Age/Price values removes every
#unit with that value; here each value identifies exactly one unit (5 rows in total).
data1 <- data %>% 
  filter(!Age %in% c(2, 5, 37, 43),
         Price != 1760)

hist(data1$CooksDist,
     xlab = "Cook's Distance",
     ylab = "Frequency",
     main = "Histogram of Cook's Distance")

9. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (1 p)

#fit2 re-estimated on data1, which excludes the outliers.
removed_values_fit2 <- lm(Price ~ Age + Distance,
          data = data1)

summary(removed_values_fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
fitted_values <- fitted(removed_values_fit2)
std_fitted_values <- (fitted_values - mean(fitted_values)) / sd(fitted_values)

#Note: data1$StdResid still holds the standardized residuals saved from fit2 before
#the exclusions; recomputing them with rstandard(removed_values_fit2) would match
#the re-estimated model exactly.

# Create scatterplot
plot(std_fitted_values, data1$StdResid, 
     xlab = "Standardized Fitted Values", 
     ylab = "Standardized Residuals", 
     main = "Scatterplot of Standardized Residuals vs Standardized Fitted Values")

#install.packages("olsrr")
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.2
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(removed_values_fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

The plot does not suggest heteroskedasticity: the points fall roughly between two parallel horizontal lines, and there is no sign of a non-linear pattern. The Breusch-Pagan test confirms that we cannot reject the Null Hypothesis of constant variance (p = 0.19 > 0.05).

10. Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (1 p)

shapiro.test(data1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  data1$StdResid
## W = 0.93418, p-value = 0.0004761
hist(data1$StdResid, 
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     col = "grey",
     border = "black",
     breaks = 20)

No, the standardized residuals are not normally distributed: we reject the Null Hypothesis of the Shapiro-Wilk test (which assumes normality) at p < 0.001. This means the normality assumption of linear regression is violated. Since n = 80, the central limit theorem still lets us conduct the regression analysis, but we should be more conservative and require the p-values of the coefficients to be well below 0.05.
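
A Q-Q plot is a common companion graph here (a sketch):

qqnorm(data1$StdResid,
       main = "Normal Q-Q Plot of Standardized Residuals")
qqline(data1$StdResid)  # reference line; systematic departures indicate non-normality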

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Coefficients:

  • Age: Holding everything else constant, a one year increase in the age of the apartment decreases its price per square meter by, on average, EUR 7.934.
  • Distance: Holding everything else constant, a one kilometer increase in the distance of the apartment from the city center decreases the price per square meter by, on average, EUR 20.667.
  • Intercept: A brand new apartment (Age = 0) in the city center (Distance = 0) costs, on average, EUR 2460.101 per square meter.
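
As an illustration, the fitted equation can be evaluated for a hypothetical apartment (the values of 20 years and 5 km are assumptions chosen for illustration only):

# Hypothetical apartment: 20 years old, 5 km from the center
predict(fit2, newdata = data.frame(Age = 20, Distance = 5))
# = 2460.101 - 7.934*20 - 20.667*5, i.e. about EUR 2198.09 per m2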

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

#fit3 is estimated on data1, i.e. with the outliers excluded.
removed_outliers_fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
          data = data1)

summary(removed_outliers_fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2528.049     80.942  31.233  < 2e-16 ***
## Age                       -7.970      3.191  -2.498   0.0147 *  
## Distance                 -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFactorNo Parking -128.700     60.801  -2.117   0.0376 *  
## BalconyFactorNo Balcony   -6.032     57.307  -0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

#I compare the two models estimated on data1, i.e. with the outliers excluded.
anova(removed_values_fit2, removed_outliers_fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Including the two dummy variables does not significantly increase the fit of the model (p=0.11).
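
The F-statistic in the table can be reproduced by hand from the residual sums of squares:

# F = ((RSS_restricted - RSS_full) / df_difference) / (RSS_full / df_full)
((5077362 - 4791128) / 2) / (4791128 / 75)  # = 2.2403, as reported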

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(removed_outliers_fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2528.049     80.942  31.233  < 2e-16 ***
## Age                       -7.970      3.191  -2.498   0.0147 *  
## Distance                 -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFactorNo Parking -128.700     60.801  -2.117   0.0376 *  
## BalconyFactorNo Balcony   -6.032     57.307  -0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Regression coefficient “ParkingFactorNo Parking”: Everything else equal, an apartment without parking costs, on average, EUR 128.700 less per square meter than a comparable apartment with parking (p < 0.05).

Regression coefficient “BalconyFactorNo Balcony”: This coefficient is not statistically significant (p = 0.92), hence no interpretation.

Null hypothesis of the F-statistic: none of the explanatory variables included in the model has a predictive relationship with the dependent variable, i.e. all slope coefficients are zero (beta_Age = beta_Distance = beta_Parking = beta_Balcony = 0); equivalently, R² is equal to zero.

We reject the Null hypothesis at a p-value of less than 0.001.
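
This overall F-statistic can also be reproduced from R-squared (a sketch; k = 4 predictors, n - k - 1 = 75 residual degrees of freedom):

# F = (R2 / k) / ((1 - R2) / (n - k - 1))
(0.5623 / 4) / ((1 - 0.5623) / 75)  # about 24.08, as reported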

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

data1$fitted_values_fit3 <- fitted(removed_outliers_fit3)
residual_apartment2 <- 2800 - 2356.597

The fitted value of apartment ID2 is EUR 2356.597 per square meter and the actual value is EUR 2800. Hence, the residual is EUR 443.403.
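
A programmatic check (assuming apartment ID2 corresponds to row 2 of data1, i.e. it was not among the removed units):

# Assumes apartment ID2 is row 2 of data1
data1$Price[2] - data1$fitted_values_fit3[2]
# equivalently: resid(removed_outliers_fit3)[2]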