Abel Mate Nagy

1. Import the dataset Apartments.xlsx

library(readxl)

df <- read_xlsx("Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

Can differences in price per square meter of apartments be explained by the distance of the apartment from the city center and whether there is parking available?

3. Change categorical variables into factors. (0.5 p)

df$Parking <- factor(df$Parking, levels = c(0,1), labels = c("No", "Yes"))
df$Balcony <- factor(df$Balcony, levels = c(0,1), labels = c("No", "Yes"))

4. Test the hypothesis \(H_0: \mu_{Price} = 1900\) EUR. What can you conclude? (1 p)

t.test(df$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  df$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Since the exercise requires a test of the arithmetic mean, I use the t-test. Since the sample size is 85, it is safe to assume that the Central Limit Theorem kicks in, which justifies the use of the t-test.

The conclusion of the test is that \(H_0\) can be rejected (\(p = 0.005\)), meaning \(\mu_{Price} \neq 1900\). Looking at the sample mean (2018.94), we see that the mean price is higher than 1900 EUR.
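As a sanity check, the confidence interval from the output can be reproduced by hand from the usual formula \(\bar{x} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}\) (a minimal sketch using the data already loaded):

n <- nrow(df)                      # 85 observations
m <- mean(df$Price)                # sample mean, 2018.941 in the output above
se <- sd(df$Price) / sqrt(n)       # standard error of the mean
m + c(-1, 1) * qt(0.975, df = n - 1) * se
# should reproduce the interval (1937.443, 2100.440) reported above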

5. Estimate the simple regression function: Price = f(Age). Save the results in an object named fit1 and explain the estimate of the regression coefficient, the coefficient of correlation and the coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = df)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
  • Regression coefficient estimate: a one-year increase in the age of the apartment is associated with an 8.975 EUR decrease in the price per square meter, on average.

  • Coefficient of correlation: taking the square root of R squared gives \(\sqrt{0.05302} = 0.2303\); since the slope is negative, the correlation coefficient is \(r = -0.2303\). This means there is a weak negative correlation between the age and the price per square meter of apartments (a direct check follows this list).

  • Coefficient of determination: the coefficient of determination, or R squared, shows that \(5.3\%\) of the variability in price per square meter can be explained by the age of the apartment.
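As a quick check of the sign, the correlation can also be computed directly on the data (run on the full dataset, before any units are removed):

cor(df$Price, df$Age)
# expected: approximately -0.23, i.e. -sqrt(0.05302)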

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
scatterplotMatrix(df[, c("Price", "Age", "Distance")], smooth = FALSE)

As Age and Distance are the explanatory variables, I look at the plot between these two. There does not seem to be a strong correlation between them; the scatterplot looks rather random.

There is a clearer correlation between e.g. Price and Age, but as Price is the dependent variable, this is not a concern: multicollinearity is about correlation among the explanatory variables.
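To back the visual impression with a number, we can compute the pairwise correlation between the two explanatory variables:

cor(df$Age, df$Distance)
# should be close to 0, consistent with the random-looking scatterplot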

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = df)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

As the VIF statistic is close to 1 for both variables, there is no indication of multicollinearity. If either of them were above 5, it would be necessary to drop one of the variables.
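For intuition, the VIF can be reproduced by hand as \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing one explanatory variable on the other (a sketch):

# R^2 from regressing Age on the other explanatory variable, Distance
r2_age <- summary(lm(Age ~ Distance, data = df))$r.squared
1 / (1 - r2_age)   # should match the VIF of about 1.0018 reported above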

9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

df$std_res <- round(rstandard(fit2), 3)
df$cook_d <- round(cooks.distance(fit2), 3)

hist(df$std_res)

hist(df$cook_d)

Based on the standardized residuals, we see that the highest is between 2.5 and 3, so none of them is more than 3 standard deviations away from the mean. To double-check, we can list the units with the highest standardized residuals.

head(df[order(-df$std_res), ])
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony std_res cook_d
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes        2.58  0.32 
## 2     2       11  2790 Yes     No         2.05  0.069
## 3    18        1  2800 Yes     No         1.78  0.03 
## 4    18        1  2800 Yes     Yes        1.78  0.03 
## 5     8        2  2820 Yes     No         1.66  0.037
## 6    10        1  2810 No      No         1.60  0.032

The top unit, however, also has a noticeably large Cook's distance (0.32), so we can suspect a unit with high influence. We can also list the highest Cook's distances.

head(df[order(-df$cook_d), ])
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony std_res cook_d
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes        2.58  0.32 
## 2    43       37  1740 No      No         1.44  0.104
## 3     2       11  2790 Yes     No         2.05  0.069
## 4     7        2  1760 No      Yes       -2.15  0.066
## 5    37        3  2540 Yes     Yes        1.58  0.061
## 6    40        2  2400 No      Yes        1.09  0.038

As this unit is very distinct from the others based on Cook's distance (0.32 against at most 0.104 for the rest), we can drop it.

library(dplyr)
# the rounded Cook's distance of this unit (0.320) is unique, so we can filter on it
df <- df %>% filter(cook_d != 0.320)
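For reference, an index-based alternative that avoids hard-coding the rounded value (shown commented out, since the unit has already been removed above):

# df <- df[-which.max(df$cook_d), ]   # drop the row with the largest Cook's distance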

10. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

# fit the model again after dropping the observation

fit2_new <- lm(Price ~ Age + Distance, data = df)

# note: df$std_res still holds the standardized residuals from the original fit2;
# strictly, one could recompute them with rstandard(fit2_new)
df$std_fitted <- scale(fit2_new$fitted.values)
car::scatterplot(x = df$std_fitted, y = df$std_res,
                 boxplots = FALSE,
                 smooth = FALSE,
                 regLine = FALSE)

The variance of the errors appears more or less constant, so the homoskedasticity assumption seems to be met.
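As a formal complement to the visual check, assuming the lmtest package is installed, the Breusch-Pagan test can be run on the refitted model (its \(H_0\) is homoskedasticity):

library(lmtest)
bptest(fit2_new)
# a large p-value would be consistent with the visual impression of constant variance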

11. Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(df$std_res)

shapiro.test(df$std_res)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$std_res
## W = 0.94879, p-value = 0.002187

Based on the histogram, the errors do not seem to be normally distributed. The Shapiro-Wilk test confirms this: we reject the null hypothesis that the errors are normally distributed (\(p = 0.002\)).
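A normal Q-Q plot makes the departure from normality easier to judge than the histogram:

# standardized residuals against theoretical normal quantiles
qqnorm(df$std_res)
qqline(df$std_res)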

12. Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance, data = df)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

Explanation of coefficients:

  • Age: a one-year increase in the age of the apartment is associated with a 6.464 EUR decrease in the price per square meter, on average, holding all other variables constant.

  • Distance: a one-km increase in the distance from the city center is associated with a 22.955 EUR decrease in the price per square meter, on average, holding all other variables constant (a worked prediction using these estimates follows below).
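As a sketch of how these coefficients combine, a prediction for a hypothetical apartment that is 10 years old and 5 km from the center (both values made up for illustration):

predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# by hand: 2456.076 - 6.464*10 - 22.955*5 = approximately 2276.7 EUR per m2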

13. Estimate the linear regression function Price = f(Age, Distance, Parking, Balcony). Be careful to correctly include the categorical variables. Save the result in an object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = df)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

14. With the function anova, check if model fit3 fits the data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we reject \(H_0\) at \(p = 0.031\), we can conclude that fit3 fits the data significantly better than fit2.
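The F-statistic in the table can be reproduced from the two residual sums of squares, with \(q = 2\) restrictions and 79 residual degrees of freedom in the full model:

# F = ((RSS_restricted - RSS_full) / q) / (RSS_full / df_full)
((6176767 - 5654480) / 2) / (5654480 / 79)
# approximately 3.6485, matching the ANOVA table above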

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12
  • Parking: Holding all other variables constant, apartments with a parking space have 167.531 EUR higher price per m2 on average than those without.

  • Balcony: Holding all other variables constant, apartments with a balcony have 15.207 EUR lower price per m2 on average than those without.

Hypothesis being tested:

\[H_0: \rho^2 = 0\] \[H_1: \rho^2 > 0\]

where \(\rho^2\) is the population coefficient of determination. Since \(\rho^2\) cannot be negative, the alternative is one-sided. Equivalently, \(H_0\) states that all regression coefficients are zero: \(\beta_{Age} = \beta_{Distance} = \beta_{Parking} = \beta_{Balcony} = 0\).
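Equivalently, the F-statistic can be computed from the sample R squared, with \(k = 4\) explanatory variables and \(n = 84\) units:

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
(0.5275 / 4) / ((1 - 0.5275) / (84 - 4 - 1))
# approximately 22.05, matching the 22.04 in the output up to rounding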

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

df$fitted_3 <- fit3$fitted.values
# ids are assigned after the influential unit was removed,
# so id 2 refers to the second row of the cleaned data
df$id <- seq(1, nrow(df))

res_id2 <- df$Price[df$id == 2] - df$fitted_3[df$id == 2]
res_id2
##        2 
## 427.8029
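The same residual is also stored in the model object, so we can cross-check (assuming the row order of df has not changed since fit3 was estimated):

fit3$residuals[2]
# should equal res_id2, about 427.80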