1. Import the dataset Apartments.xlsx

library(readxl)

data <- read_xlsx("./Apartments.xlsx")

head(data)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description: each row is one apartment (n = 85). Age is the apartment's age in years, Distance is its distance from the city center, Price is the price per square meter in EUR, and Parking and Balcony are 0/1 indicators for a parking space and a balcony.

2. What could be a possible research question given the data you analyze? (1 p)

A possible research question given the data could be:

“How do the age of the apartment, its distance from the city center, and the presence of parking or a balcony affect the price per square meter?”

3. Change categorical variables into factors. (0.5 p)

data$ParkingF <- factor(data$Parking,
                        levels = c(0, 1),
                        labels = c("No_Parking", "Parking")) # No_Parking -> reference category

data$BalconyF <- factor(data$Balcony,
                        levels = c(0, 1),
                        labels = c("No_Balcony", "Balcony")) # No_Balcony -> reference category

4. Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude? (1 p)

t.test(data$Price, mu = 1900, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We reject the null hypothesis (p = 0.005): the mean price per square meter differs significantly from 1900 euros. Since the 95% confidence interval (1937.4, 2100.4) lies entirely above 1900, the mean price is in fact higher.
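
Since the sample mean lies above 1900, a one-sided test in the same direction is a natural follow-up (a sketch; output omitted):

t.test(data$Price, mu = 1900, alternative = "greater")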

5. Estimate the simple regression function: Price = f(Age). Save the results in an object fit1 and explain the estimated regression coefficient, the coefficient of correlation, and the coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = data)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
# in simple regression, |r| = sqrt(R^2) and r takes the sign of the slope
r <- sqrt(summary(fit1)$r.squared)
if (coef(fit1)["Age"] < 0) {r <- -r}

r
## [1] -0.230255

The estimated regression coefficient for Age is -8.98: on average, each additional year of age is associated with a price about 9 euros per square meter lower. The coefficient of correlation r = -0.23 indicates a weak negative linear relationship between age and price. The coefficient of determination R² = 0.053 means that age alone explains only about 5.3% of the variability in price.
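
As a sanity check, the same value should come directly from the sample correlation:

cor(data$Price, data$Age)  # should equal r = -0.230255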

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Loading required package: carData
scatterplotMatrix(data[, c("Price", "Age", "Distance")], smooth = FALSE, col = "mediumslateblue")

Based on the scatterplot matrix, there is no indication of an exact linear relationship between any pair of variables. Therefore, the assumption of no perfect multicollinearity, i.e. that there are no constants \(\lambda_1, \dots, \lambda_k\), not all zero, such that

\[ \lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0, \]

is not violated.
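
A numeric complement to the scatterplot matrix is the pairwise correlation matrix (a quick sketch; output omitted):

round(cor(data[, c("Price", "Age", "Distance")]), 3)  # low off-diagonal values support the conclusion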

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in an object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = data)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Since neither Variance Inflation Factor (VIF) exceeds 5 and the average VIF is close to 1, we can conclude that there is no problematic multicollinearity in the model. (With only two predictors the two VIFs are necessarily identical, because each is computed from the same pairwise R².)
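
The VIF can also be reconstructed by hand: for each predictor it equals 1/(1 − R²) from the auxiliary regression of that predictor on the remaining ones. A sketch for Age:

# auxiliary regression of Age on Distance; VIF_Age = 1 / (1 - R^2)
1 / (1 - summary(lm(Age ~ Distance, data = data))$r.squared)  # should match vif(fit2)["Age"]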

9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

Standardized Residuals

data$Standardized_Residuals <- round(rstandard(fit2), 3)

hist(data$Standardized_Residuals, 
     xlab = "Standard Residuals", 
     ylab = "Frequency", 
     main = "Histogram of Standardized Residuals",
     col = "mediumslateblue")

# First three rows sorted by standardized residuals
head(data[order(data$Standardized_Residuals),], 3)
## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony ParkingF   BalconyF   Standardized_Residuals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>      <fct>                       <dbl>
## 1     7        2  1760       0       1 No_Parking Balcony                     -2.15
## 2    12       14  1650       0       1 No_Parking Balcony                     -1.50
## 3    12       14  1650       0       0 No_Parking No_Balcony                  -1.50
# Last three rows sorted by standardized residuals
tail(data[order(data$Standardized_Residuals),], 3)
## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony ParkingF BalconyF   Standardized_Residuals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>                       <dbl>
## 1    18        1  2800       1       1 Parking  Balcony                      1.78
## 2     2       11  2790       1       0 Parking  No_Balcony                   2.05
## 3     5       45  2180       1       1 Parking  Balcony                      2.58

There are no outliers, as no data points have standardized residuals greater than 3 or less than -3.
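
The same check can be done programmatically:

sum(abs(data$Standardized_Residuals) > 3)  # 0 means no unit exceeds the |3| cutoff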

Units with High Influence

data$CooksD <- round(cooks.distance(fit2), 3)

hist(data$CooksD, 
     xlab = "Cook's Distance", 
     ylab = "Frequency", 
     main = "Histogram of Cook's Distances",
     col = "mediumslateblue")

head(data[order(-data$CooksD),], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF   BalconyF   Standardized_Residuals CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>      <fct>                       <dbl>  <dbl>
## 1     5       45  2180       1       1 Parking    Balcony                      2.58  0.32 
## 2    43       37  1740       0       0 No_Parking No_Balcony                   1.44  0.104
## 3     2       11  2790       1       0 Parking    No_Balcony                   2.05  0.069
## 4     7        2  1760       0       1 No_Parking Balcony                     -2.15  0.066
## 5    37        3  2540       1       1 Parking    Balcony                      1.58  0.061
## 6    40        2  2400       0       1 No_Parking Balcony                      1.09  0.038

I will remove the unit with a Cook's distance above 0.3: the histogram shows a clear jump, and only this unit lies beyond that threshold.
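
Before filtering, the affected unit can be inspected directly (a sketch using the CooksD column computed above):

data[data$CooksD > 0.3, c("Age", "Distance", "Price", "CooksD")]  # the single unit to be dropped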

data <- data[data$CooksD <= 0.3, ]

10. Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings. (0.5 p)

fit2_clean <- lm(Price ~ Age + Distance, data = data)  # refit on the cleaned data
data$Standardized_Fitted <- scale(fit2_clean$fitted.values)
data$Standardized_Residuals <- round(rstandard(fit2_clean), 3)  # recompute so the residuals come from the same refitted model

library(car)
scatterplot(y = data$Standardized_Residuals, x = data$Standardized_Fitted, 
            ylab = "Standardized Residuals", 
            xlab = "Standardized Fitted Values", 
            boxplots = FALSE, regLine = FALSE, smooth = FALSE, col = "mediumslateblue")

The plot shows that the variance of the errors is roughly constant, indicating homoskedasticity: the spread of the residuals remains similar across all levels of the fitted values, with no funnel-shaped “opening” or “closing” pattern as the fitted values increase.
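
A formal complement to the visual check is a constant-variance score test, for example car::ncvTest applied to the refitted model (a sketch; output omitted):

ncvTest(fit2_clean)  # H0: constant error variance (homoskedasticity)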

11. Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

data$Standardized_Residuals <- round(rstandard(fit2_clean), 3)  # same refitted model as above

hist(data$Standardized_Residuals, 
     xlab = "Standard Residuals", 
     ylab = "Frequency", 
     main = "Histogram of Standardized Residuals",
     col = "mediumslateblue")

The graph does not necessarily indicate a normal distribution, so I will conduct a Shapiro-Wilk test to formally assess normality.

shapiro.test(data$Standardized_Residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Standardized_Residuals
## W = 0.95649, p-value = 0.006355

We reject the null hypothesis (p = 0.006), indicating that the standardized residuals are not normally distributed. However, since we have a fairly large sample of 84 observations, the central limit theorem makes the normality assumption less critical for inference.
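
A normal Q-Q plot is another standard visual check (a supplementary sketch):

qqnorm(data$Standardized_Residuals, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(data$Standardized_Residuals, col = "mediumslateblue")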

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance, data = data)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12
r <- sqrt(summary(fit2)$r.squared)  # multiple correlation coefficient R

r
## [1] 0.6955609

Interpretation of the coefficients: the intercept (2456.08) is the expected price per square meter of a new apartment (Age = 0) located in the city center (Distance = 0). The coefficient for Age (-6.46) means that, holding distance constant, each additional year of age lowers the expected price by about 6.5 euros per square meter. The coefficient for Distance (-22.96) means that, holding age constant, each additional unit of distance from the center lowers the expected price by about 23 euros per square meter. The multiple correlation coefficient is R = 0.70, and R² = 0.48 means that age and distance together explain about 48% of the variability in price.
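
To illustrate the fitted equation, the expected price of a hypothetical apartment (the values below are made up for illustration) can be computed with predict:

# hypothetical apartment: 10 years old, 5 distance units from the center
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# 2456.08 - 6.46 * 10 - 22.96 * 5 is approximately 2276.7 euros per square meter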

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = data)
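
Since Parking and Balcony are 0/1 dummies with 0 as the reference level, including them as numeric variables is equivalent to using the factors created in step 3. The factor specification (fit3f is just an alternative name used in this sketch) would be:

fit3f <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = data)  # identical fit to fit3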

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis (p = 0.031): adding Parking and Balcony significantly improves the fit, i.e. fit3 explains significantly more of the variability in Price than fit2.
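
The F statistic in the table can be reconstructed from the two residual sums of squares:

# F = ((RSS_fit2 - RSS_fit3) / 2) / (RSS_fit3 / 79)
((6176767 - 5654480) / 2) / (5654480 / 79)  # approximately 3.6485, as reported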

15. Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## Parking      167.531     62.864   2.665  0.00933 ** 
## Balcony      -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

The coefficient for Parking (167.53) means that, holding the other variables constant, apartments with a parking space cost on average about 168 euros per square meter more than comparable apartments without one; this effect is statistically significant (p = 0.009). The coefficient for Balcony (-15.21) suggests that apartments with a balcony cost about 15 euros per square meter less, but the effect is not statistically significant (p = 0.798), so there is no evidence that a balcony affects the price.

The F-statistic tests the overall significance of the regression model: whether at least one of the predictors has a non-zero coefficient, i.e. whether the model fits the data better than a model with no predictors.

The hypotheses being tested with the F-statistic are:

\[ H_0\colon \beta_{Age} = \beta_{Distance} = \beta_{Parking} = \beta_{Balcony} = 0 \]

\[ H_1\colon \text{at least one } \beta_j \neq 0 \]

Since the p-value (3.0e-12) is far below 0.05, we reject H0 and conclude that the model as a whole is statistically significant.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

data$ID <- seq(1, nrow(data))

data$fit3_fitted_values <- fit3$fitted.values

data$residuals <- residuals(fit3)

residual_id2 <- data$residuals[data$ID == 2]
fitted_id2 <- data$fit3_fitted_values[data$ID == 2]

cat("Fitted Value for ID 2:", fitted_id2, "\n")
## Fitted Value for ID 2: 2372.197
cat("Residual for ID 2:", residual_id2, "\n")
## Residual for ID 2: 427.8029
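
As a consistency check, the observed price equals the fitted value plus the residual: 2372.20 + 427.80 = 2800.

fitted_id2 + residual_id2  # should reproduce data$Price[data$ID == 2]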