HW 3

Tobias Schöfbeck

1. Import the dataset Apartments.xlsx

library(readxl)
data <- read_xlsx("Apartments.xlsx")

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2 in euros
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

Does the distance from the city center have a significant influence on the price of the apartment?

3. Change categorical variables into factors. (0.5 p)

data$Parking <- factor(data$Parking, levels = c(0, 1), labels = c("No", "Yes"))
data$Balcony <- factor(data$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t.test(data$Price, muh = 1900, alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  data$Price
## t = 49.263, df = 84, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

It can be concluded that the arithmetic mean of price per m² is significantly different from 1900 (p<0.001). It is estimated to be higher at 2018.941.

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = data)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(data$Age, data$Price)

## [1] -0.230255

The value for b0 is 2185.455. This means that if the apartment is 0 years old, its price per m² is on average 2185.455 euros (p<0.001).
The value for b1 is -8.975. This means that if an apartment age increases by one year, its price per m² decreases by 8.975 euros on average (p=0.035).
The coefficient of correlation is -0.230255. This means that Age and Price are weakly negatively correlated. Meaning that when the Age of the apartment risis, the Price per m² tends to go down.
The coefficient of determination is 0.05302. This means that 5% of the variability of Price per m² can be explained by the age of the apartment. As this number is really low, we should consider adding a few variables.

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)
scatterplotMatrix(data[ ,c(-4,-5), smooth = FALSE])

There does not seem to be a potential problem with multicolinearity. There is only a slight negative relationship between the age of the apartment in years and the price in m². There is a stronger negative correlation between the distance from the city center and the price per m², but it also does not look like there is a problem with multicolinearity.

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = data)

8. Chech the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The values are very close to one. This means that there is almost no multicolinearity in the model.

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

data$StdResid <- round(rstandard(fit2), 3)
data$CooksD <- round(cooks.distance(fit2), 3)
hist(data$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of Standardized Residuals") #no problematic units

hist(data$CooksD, xlab = "Cooks Distance", ylab = "Frequency", main = "Histogram of Cooks distances")

head(data[order(-data$CooksD),],5 )

## # A tibble: 5 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32 
## 2    43       37  1740 No      No          1.44  0.104
## 3     2       11  2790 Yes     No          2.05  0.069
## 4     7        2  1760 No      Yes        -2.15  0.066
## 5    37        3  2540 Yes     Yes         1.58  0.061

which(data$Price == 2180) # checking if any other observation has this value

## [1] 38

library(dplyr)
data <- data %>% filter(!Price == 2180)
hist(data$CooksD, xlab = "Cooks Distance", ylab = "Frequency", main = "Histogram of Cooks distances") # There are still "jumps"

which(data$Price == 1740) #checking if there are any observations which have these values as well

## [1] 54

which(data$Price == 2790)

## [1] 33

which(data$Price == 1760)

## [1] 52

which(data$Price == 2540)

## [1] 22

data <- data %>% filter(!Price %in% c(1740, 2790, 1760, 2540))
hist(data$CooksD, xlab = "Cooks Distance", ylab = "Frequency", main = "Histogram of Cooks distances")

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standardized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance, data = data)
data$StdResid <- round(rstandard(fit2), 3)
data$StdFitted <- scale(fit2$fitted.values)
plot(x = data$StdFitted, y = data$StdResid, xlab = "Standardized residuals", ylab = "Standardized fitted values") #There might be Heteroskedasticity present
abline(0,0, col = "red")

library(olsrr)
ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

H0: The variance of errors is constant
HA: The variance of errors is not constant

The nullhypothesis can not be rejected (p=0.188). Therefore the assumption of homoscedasticity is not violated.

11. Are the standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(data$StdResid, xlab = "Standardized residuals", ylab = "Frequency", main = "Histogram of standardized residuals")

shapiro.test(data$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$StdResid
## W = 0.94154, p-value = 0.001166

H0: The errors are normally distributed
HA: The errors are not normally distributed

As p<0.005 we reject the nullhypothesis that the errors are normally distributed. We can also see this when looking at the distribution of the standardized residuals in the Histogramm, as it does not resemble a normal distribution.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

#fit2 was already estimated again in 10.
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

The intercept is 2502.467 and is the point where the regression line intersects with the y-axis. This means that when an apartment is 0 years old and is in the city center, its price per m² is on average 2329.724 euros (p>0.001).
The value for b1 is -8.674. This means, when an apartments ages by one more year, its price per m² decreases by -8.674 euros on average, when all other variables stay the same (p=0.009).
The value for b2 is -24.063. This means that the price of an apartment per m² decreases by 24.063 euros on average, for every additional kilometer it is distanted from the city center, when all other variables stay the same (p<0.001).

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = data)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

H0: Both models are equally good
HA: Model fit3 fits the data better than model fit2

We can not reject the nullhypothesis. This means that the model fit3 does not fit the data significantly better than model fit2.

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

The value for b3 is 128.7. This means that when controlling for all other variables, an apartment is on average 128.7 euros per m² more expensive, when it has parking, compared to when it has not (p=0.038)
The value for b4 is 6.032. The test for the linear regression coefficient is not significant. This means that it wether the apartment has a balcony or not, has no significant influence on the price per m².

F-Test

H0: No explanatory variable has a significant influence on the price per m² of an apartment.
HA: At least one explanatory variable has a significant influence on the price per m² of an apartment.

We can reject the nullhypothesis and can say that at least one explanatory variable has a significant influence on the price per m² of an apartment.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

data$Fitted <- fitted.values(fit3)
data$Residuals <- residuals(fit3)
data[2, c(-6,-7,-8)]

## # A tibble: 1 × 7
##     Age Distance Price Parking Balcony Fitted Residuals
##   <dbl>    <dbl> <dbl> <fct>   <fct>    <dbl>     <dbl>
## 1    18        1  2800 Yes     No       2357.      443.