Oliwer Rzepka

1. Import the dataset Apartments.xlsx

library(readxl)
## Warning: package 'readxl' was built under R version 4.4.1
df <- read_xlsx('./Apartments.xlsx')
df <- as.data.frame(df)
head(df)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2 in EUR
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

  • Do apartment characteristics such as age, distance from the city center, and the presence of parking and/or a balcony affect the price per m2?

3. Change categorical variables into factors. (0.5 p)

df$Parking <- factor(df$Parking,
                     levels = c(0,1),
                     labels = c('No', 'Yes'))

df$Balcony <- factor(df$Balcony,
                     levels = c(0,1),
                     labels = c('No','Yes'))
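
A quick way to verify the recoding (not required by the task, but a useful habit) is to inspect the structure of the converted columns:

# both columns should now be factors with levels No/Yes
str(df[, c('Parking', 'Balcony')])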

4. Test the hypothesis \(H_0\): \(\mu_{Price} = 1900\) EUR. What can you conclude? (1 p)

shapiro.test(df$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$Price
## W = 0.94017, p-value = 0.0006513
# The Shapiro-Wilk test rejects normality (p < 0.001), but we assume
# normality regardless of its outcome; with n = 85 the t-test is also
# fairly robust to mild departures from normality, so we proceed.

t.test(df$Price, mu = 1900, alternative = 'two.sided')
## 
##  One Sample t-test
## 
## data:  df$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
  • Based on the t-test we reject \(H_0\) (p-value = 0.005); since the entire 95% confidence interval (1937.4, 2100.4) lies above 1900, we conclude that the mean price per \(m^2\) is higher than 1900 EUR.
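
Because the sample mean (2018.9) and the whole confidence interval lie above 1900, a one-sided test makes this direction explicit; a minimal sketch:

# directional alternative: mean price per m2 greater than 1900 EUR
t.test(df$Price, mu = 1900, alternative = 'greater')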

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age,
           data = df)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(df$Price, df$Age)
## [1] -0.230255
  • Regression coefficient \(b_1\): The coefficient suggests that increasing the age of an apartment by one year results in an average decrease in price of 8.975 EUR per \(m^2\) (p-value = 0.034).

  • Coefficient of correlation: The coefficient of correlation is -0.23, which signifies a weak negative linear relationship between the age and price of an apartment.

  • Coefficient of determination \(R^2\): Suggests that 5.3% of the variability in price can be explained by the age of an apartment.
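
In simple regression, \(R^2\) is just the square of the correlation coefficient, which we can verify directly:

cor(df$Price, df$Age)^2  # (-0.2303)^2 = 0.053, matching Multiple R-squared above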

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
scatterplotMatrix(df[,c(1:3)], smooth = F)

  • Based on the graph we notice that there is no strong correlation between the two explanatory variables (Age, Distance). Therefore, there should not be any problem with multicollinearity.
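
The visual impression can be backed up numerically with the pairwise correlation matrix (a quick cross-check, not required by the task):

cor(df[, c('Age', 'Distance', 'Price')])  # the Age-Distance entry should be near 0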

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = df)

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
  • Both VIF values are very close to 1, so there does not appear to be multicollinearity among the variables. None of them has to be dropped.
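
With only two explanatory variables, both VIFs reduce to \(1/(1-r^2)\), where r is the correlation between Age and Distance; a sketch of the manual computation:

r <- cor(df$Age, df$Distance)
1 / (1 - r^2)  # reproduces the reported VIF of 1.001845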

9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

df$stdresid <- round(rstandard(fit2), 3)
df$cooksd <- round(cooks.distance(fit2), 3)

hist(df$stdresid)

  • No standardized residuals exceed 3 in absolute value. Therefore, there are no outliers to remove.
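
The same check can be done programmatically:

sum(abs(df$stdresid) > 3)  # returns 0: no standardized residuals beyond |3|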
hist(df$cooksd)

  • We see empty space in the histogram, so there is at least one unit with high influence.
head(df[order(-df$cooksd),], 10)
##    Age Distance Price Parking Balcony stdresid cooksd
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038
## 58   8        2  2820     Yes      No    1.655  0.037
## 25   8       26  2300     Yes     Yes    1.571  0.034
## 57  10        1  2810      No      No    1.601  0.032
## 2   18        1  2800     Yes      No    1.783  0.030
# remove the five units whose Cook's distances are clearly separated from the rest
df <- df[-c(38,55,33,53,22),]
  • Removing the units with much higher Cook's distances than the rest.
hist(df$cooksd)

  • Now there are no empty spaces in the histogram. Hence, there should not be any units with high influence left.
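
As an additional check, one common (though not the only) convention flags units with Cook's distance above 4/n:

which(df$cooksd > 4 / nrow(df))  # flags nothing: the largest remaining value, 0.038, is below 4/80 = 0.05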

10. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance,
           data = df)

# recompute the standardized residuals for the model refitted without the removed units
df$stdresid <- round(rstandard(fit2), 3)
df$stdfitted <- scale(fit2$fitted.values)

scatterplot(y = df$stdresid, x = df$stdfitted,
            boxplots = F,
            regLine = F,
            smooth = F,
            ylab = 'Standardized Residuals',
            xlab = 'Standardized fitted values')

  • We can see that the variance of the standardized residuals opens up as the standardized fitted values increase, so there is potential heteroskedasticity; we check this formally with the Breusch-Pagan test.
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.4.3
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174
  • Based on the Breusch-Pagan test, we cannot reject the null hypothesis of constant variance (p-value = 0.19). Thus, we assume that the variance of the standardized residuals is constant.
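
As a cross-check, the same hypothesis can be tested with bptest() from the lmtest package (assuming it is installed); note that its default studentized version may give a slightly different statistic than olsrr:

library(lmtest)
bptest(fit2)  # studentized Breusch-Pagan test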

11. Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(df$stdresid)

  • Based on the histogram alone we cannot immediately tell whether the residuals are normally distributed, so we test formally.
shapiro.test(df$stdresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$stdresid
## W = 0.93418, p-value = 0.0004761
  • We reject the null hypothesis of normality (p-value < 0.001). Thus, the standardized residuals are not normally distributed.
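
A normal Q-Q plot is another standard graphical check that complements the histogram:

qqnorm(df$stdresid)  # points should lie on the line under normality
qqline(df$stdresid)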

12. Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance,
           data = df)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
  • Age: An increase of one year in the age of an apartment, while controlling for all other variables, results in an average decrease in the price per square meter of 8.674 EUR (p-value = 0.009)

  • Distance: An increase of 1 km in the distance from the city center, while controlling for all other variables, results in an average decrease in the price per square meter of an apartment of 24.063 EUR (p-value < 0.001)

  • Coefficient of Determination: The \(R^2\) tells us that 53.6% of the variability in price per square meter of an apartment can be explained by linear effects of Age and Distance.
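
For illustration, the estimated function can be used to predict the price for a hypothetical apartment (the values below are chosen arbitrarily):

# hypothetical unit: 10 years old, 5 km from the city center
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# 2502.467 - 8.674*10 - 24.063*5 = 2295.41 EUR per m2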

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = df)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135
  • At p-value = 0.11, we cannot reject the null hypothesis that the coefficients of the added variables (Parking, Balcony) are zero. The increase in \(R^2\) between the simple and the complex model is not significant, so we prefer the simpler model fit2.
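
The F statistic in the table can be reproduced by hand from the residual sums of squares of the two models:

((5077362 - 4791128) / 2) / (4791128 / 75)  # = 2.2403, the partial F above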

15. Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13
  • Parking: Keeping all other variables constant, an apartment with parking costs on average 128.7 EUR per square meter more compared to apartments without parking (p-value = 0.04)

  • Balcony: Based on the p-value, we cannot say that having a balcony has a significant impact on the price per square meter of an apartment.

  • F-statistic: it tests the overall significance of the regression:

  • \(H_0\): \(\rho^2 = 0\) (all regression coefficients equal zero)

  • \(H_1\): \(\rho^2 > 0\) (at least one coefficient differs from zero)
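
This F statistic can be recovered from \(R^2\) via \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\), with k = 4 predictors and n = 80 units:

(0.5623 / 4) / ((1 - 0.5623) / 75)  # about 24.09, matching the reported F up to rounding of R^2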

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

df$fittedvals <- fit3$fitted.values

# residual for apartment ID2: observed price minus fitted value
df$Price[2] - df$fittedvals[2]
## [1] 443.4026
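
Equivalently, the residual can be read directly from the model object (positional index 2 still refers to apartment ID2 here, since no earlier rows were removed):

fit3$residuals[2]  # same value as the manual calculation above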