1. Import the dataset Apartments.xlsx

library(readxl)

data <- read_xlsx("./Apartments.xlsx")

head(data)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description: each row is one apartment (n = 85). Age is the apartment's age in years, Distance is its distance from the city center, Price is the price per square meter in EUR, and Parking and Balcony are 0/1 indicators for a parking space and a balcony.

2. What could be a possible research question given the data you analyze? (1 p)

A possible research question given the data could be:

“How do the age of the apartment, its distance from the city center, and the presence of parking or a balcony affect the price per square meter?”

3. Change categorical variables into factors. (0.5 p)

data$ParkingF <- factor(data$Parking,
                        levels = c(0, 1),
                        labels = c("No_Parking", "Parking")) # No_Parking -> reference category

data$BalconyF <- factor(data$Balcony,
                        levels = c(0, 1),
                        labels = c("No_Balcony", "Balcony")) # No_Balcony -> reference category

4. Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude? (1 p)

t.test(data$Price, mu = 1900, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We reject the null hypothesis (p = 0.005): the mean price per square meter differs significantly from 1900 euros. Since the 95% confidence interval (1937.4, 2100.4) lies entirely above 1900, the mean price is in fact higher.
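
Since the sample mean lies above 1900, a one-sided test in the same direction is a natural follow-up (a sketch; output omitted):

t.test(data$Price, mu = 1900, alternative = "greater")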

5. Estimate the simple regression function: Price = f(Age). Save the results in an object fit1 and explain the estimated regression coefficient, the coefficient of correlation, and the coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = data)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
# in simple regression, |r| = sqrt(R^2) and r takes the sign of the slope
r <- sqrt(summary(fit1)$r.squared)
if (coef(fit1)["Age"] < 0) {r <- -r}

r
## [1] -0.230255

The estimated regression coefficient for Age is -8.98: on average, each additional year of age is associated with a price about 9 euros per square meter lower. The coefficient of correlation r = -0.23 indicates a weak negative linear relationship between age and price. The coefficient of determination R² = 0.053 means that age alone explains only about 5.3% of the variability in price.
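
As a sanity check, the same value should come directly from the sample correlation:

cor(data$Price, data$Age)  # should equal r = -0.230255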

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Loading required package: carData
scatterplotMatrix(data[, c("Price", "Age", "Distance")], smooth = FALSE, col = "mediumslateblue")

Based on the scatterplot matrix, there is no indication of an exact linear relationship between any pair of variables. Therefore, the assumption of no perfect multicollinearity, i.e. that there are no constants \(\lambda_1, \dots, \lambda_k\), not all zero, such that

\[ \lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0, \]

is not violated.
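
A numeric complement to the scatterplot matrix is the pairwise correlation matrix (a quick sketch; output omitted):

round(cor(data[, c("Price", "Age", "Distance")]), 3)  # low off-diagonal values support the conclusion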

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in an object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = data)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Since neither Variance Inflation Factor (VIF) exceeds 5 and the average VIF is close to 1, we can conclude that there is no problematic multicollinearity in the model. (With only two predictors the two VIFs are necessarily identical, because each is computed from the same pairwise R².)
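
The VIF can also be reconstructed by hand: for each predictor it equals 1/(1 − R²) from the auxiliary regression of that predictor on the remaining ones. A sketch for Age:

# auxiliary regression of Age on Distance; VIF_Age = 1 / (1 - R^2)
1 / (1 - summary(lm(Age ~ Distance, data = data))$r.squared)  # should match vif(fit2)["Age"]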

9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

Standardized Residuals

data$Standardized_Residuals <- round(rstandard(fit2), 3)

hist(data$Standardized_Residuals, 
     xlab = "Standard Residuals", 
     ylab = "Frequency", 
     main = "Histogram of Standardized Residuals",
     col = "mediumslateblue")

# First three rows sorted by standardized residuals
head(data[order(data$Standardized_Residuals),], 3)
## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony ParkingF   BalconyF   Standardized_Residuals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>      <fct>                       <dbl>
## 1     7        2  1760       0       1 No_Parking Balcony                     -2.15
## 2    12       14  1650       0       1 No_Parking Balcony                     -1.50
## 3    12       14  1650       0       0 No_Parking No_Balcony                  -1.50
# Last three rows sorted by standardized residuals
tail(data[order(data$Standardized_Residuals),], 3)
## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony ParkingF BalconyF   Standardized_Residuals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>                       <dbl>
## 1    18        1  2800       1       1 Parking  Balcony                      1.78
## 2     2       11  2790       1       0 Parking  No_Balcony                   2.05
## 3     5       45  2180       1       1 Parking  Balcony                      2.58

There are no outliers, as no data points have standardized residuals greater than 3 or less than -3.
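
The same check can be done programmatically:

sum(abs(data$Standardized_Residuals) > 3)  # 0 means no unit exceeds the |3| cutoff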

Units with High Influence

data$CooksD <- round(cooks.distance(fit2), 3)

hist(data$CooksD, 
     xlab = "Cook's Distance", 
     ylab = "Frequency", 
     main = "Histogram of Cook's Distances",
     col = "mediumslateblue")

head(data[order(-data$CooksD),], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF   BalconyF   Standardized_Residuals CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>      <fct>                       <dbl>  <dbl>
## 1     5       45  2180       1       1 Parking    Balcony                      2.58  0.32 
## 2    43       37  1740       0       0 No_Parking No_Balcony                   1.44  0.104
## 3     2       11  2790       1       0 Parking    No_Balcony                   2.05  0.069
## 4     7        2  1760       0       1 No_Parking Balcony                     -2.15  0.066
## 5    37        3  2540       1       1 Parking    Balcony                      1.58  0.061
## 6    40        2  2400       0       1 No_Parking Balcony                      1.09  0.038

I will remove the unit with a Cook's distance above 0.3: the histogram shows a clear jump, and only this unit lies beyond that threshold.
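
Before filtering, the affected unit can be inspected directly (a sketch using the CooksD column computed above):

data[data$CooksD > 0.3, c("Age", "Distance", "Price", "CooksD")]  # the single unit to be dropped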

data <- data[data$CooksD <= 0.3, ]

10. Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings. (0.5 p)

fit2_clean <- lm(Price ~ Age + Distance, data = data)  # refit on the cleaned data
data$Standardized_Fitted <- scale(fit2_clean$fitted.values)
data$Standardized_Residuals <- round(rstandard(fit2_clean), 3)  # recompute so the residuals come from the same refitted model

library(car)
scatterplot(y = data$Standardized_Residuals, x = data$Standardized_Fitted, 
            ylab = "Standardized Residuals", 
            xlab = "Standardized Fitted Values", 
            boxplots = FALSE, regLine = FALSE, smooth = FALSE, col = "mediumslateblue")

The plot shows that the variance of the errors is roughly constant, indicating homoskedasticity: the spread of the residuals remains similar across all levels of the fitted values, with no funnel-shaped “opening” or “closing” pattern as the fitted values increase.
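
A formal complement to the visual check is a constant-variance score test, for example car::ncvTest applied to the refitted model (a sketch; output omitted):

ncvTest(fit2_clean)  # H0: constant error variance (homoskedasticity)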

11. Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

data$Standardized_Residuals <- round(rstandard(fit2_clean), 3)  # same refitted model as above

hist(data$Standardized_Residuals, 
     xlab = "Standard Residuals", 
     ylab = "Frequency", 
     main = "Histogram of Standardized Residuals",
     col = "mediumslateblue")

The graph does not necessarily indicate a normal distribution, so I will conduct a Shapiro-Wilk test to formally assess normality.

shapiro.test(data$Standardized_Residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Standardized_Residuals
## W = 0.95649, p-value = 0.006355

We reject the null hypothesis (p = 0.006), indicating that the standardized residuals are not normally distributed. However, since we have a fairly large sample of 84 observations, the central limit theorem makes the normality assumption less critical for inference.
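
A normal Q-Q plot is another standard visual check (a supplementary sketch):

qqnorm(data$Standardized_Residuals, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(data$Standardized_Residuals, col = "mediumslateblue")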

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance, data = data)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12
r <- sqrt(summary(fit2)$r.squared)  # multiple correlation coefficient R

r
## [1] 0.6955609

Interpretation of the coefficients: the intercept (2456.08) is the expected price per square meter of a new apartment (Age = 0) located in the city center (Distance = 0). The coefficient for Age (-6.46) means that, holding distance constant, each additional year of age lowers the expected price by about 6.5 euros per square meter. The coefficient for Distance (-22.96) means that, holding age constant, each additional unit of distance from the center lowers the expected price by about 23 euros per square meter. The multiple correlation coefficient is R = 0.70, and R² = 0.48 means that age and distance together explain about 48% of the variability in price.
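
To illustrate the fitted equation, the expected price of a hypothetical apartment (the values below are made up for illustration) can be computed with predict:

# hypothetical apartment: 10 years old, 5 distance units from the center
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# 2456.08 - 6.46 * 10 - 22.96 * 5 is approximately 2276.7 euros per square meter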

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = data)
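
Since Parking and Balcony are 0/1 dummies with 0 as the reference level, including them as numeric variables is equivalent to using the factors created in step 3. The factor specification (fit3f is just an alternative name used in this sketch) would be:

fit3f <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = data)  # identical fit to fit3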

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis (p = 0.031): adding Parking and Balcony significantly improves the fit, i.e. fit3 explains significantly more of the variability in Price than fit2.
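
The F statistic in the table can be reconstructed from the two residual sums of squares:

# F = ((RSS_fit2 - RSS_fit3) / 2) / (RSS_fit3 / 79)
((6176767 - 5654480) / 2) / (5654480 / 79)  # approximately 3.6485, as reported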

15. Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## Parking      167.531     62.864   2.665  0.00933 ** 
## Balcony      -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

The coefficient for Parking (167.53) means that, holding the other variables constant, apartments with a parking space cost on average about 168 euros per square meter more than comparable apartments without one; this effect is statistically significant (p = 0.009). The coefficient for Balcony (-15.21) suggests that apartments with a balcony cost about 15 euros per square meter less, but the effect is not statistically significant (p = 0.798), so there is no evidence that a balcony affects the price.

The F-statistic tests the overall significance of the regression model: whether at least one of the predictors has a non-zero coefficient, i.e. whether the model fits the data better than a model with no predictors.

The hypotheses being tested with the F-statistic are:

\[ H_0\colon \beta_{Age} = \beta_{Distance} = \beta_{Parking} = \beta_{Balcony} = 0 \]

\[ H_1\colon \text{at least one } \beta_j \neq 0 \]

Since the p-value (3.0e-12) is far below 0.05, we reject H0 and conclude that the model as a whole is statistically significant.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

data$ID <- seq(1, nrow(data))

data$fit3_fitted_values <- fit3$fitted.values

data$residuals <- residuals(fit3)

residual_id2 <- data$residuals[data$ID == 2]
fitted_id2 <- data$fit3_fitted_values[data$ID == 2]

cat("Fitted Value for ID 2:", fitted_id2, "\n")
## Fitted Value for ID 2: 2372.197
cat("Residual for ID 2:", residual_id2, "\n")
## Residual for ID 2: 427.8029
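
As a consistency check, the observed price equals the fitted value plus the residual: 2372.20 + 427.80 = 2800.

fitted_id2 + residual_id2  # should reproduce data$Price[data$ID == 2]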