Yong-Quan Su

1. Import the dataset Apartments.xlsx

library(readxl)
## Warning: package 'readxl' was built under R version 4.2.3
df <- as.data.frame(read_xlsx("./Apartments.xlsx"))

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

“Does distance from city centre affect the price of an apartment? If so, how big is its impact?”

3. Change categorical variables into factors. (0.5 p)

df$Parking <- factor(df$Parking, levels = c(0, 1), labels = c("No", "Yes"))
df$Balcony <- factor(df$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
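
A quick sketch to verify the conversion (both variables should now be factors with levels No/Yes):

# Check that Parking and Balcony are now factors with the intended labels
str(df[, c("Parking", "Balcony")])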

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

A one-sample t-test is used to check whether the population mean of Price deviates from the hypothesised value of 1900 EUR. Based on the p-value, \(0.004731\), the null hypothesis can be rejected at \(p < 0.05\). We conclude that the mean price per m2 differs from 1900 EUR; the sample mean (2018.94) and the 95% confidence interval (1937.44 to 2100.44) indicate that it lies above 1900.

t.test(df$Price, 
       mu = 1900, 
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  df$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

  • Regression coefficient: The regression coefficient shows the change in the dependent variable when the independent variable changes by one unit, holding the other variables constant.

    • Age: The p-value of this independent variable is \(0.034 < 0.05\), meaning that the variable is statistically significant. Holding all else constant, if the age of the apartment increases by 1 year, its price per m2 decreases by 8.975 monetary units, on average.
  • Correlation coefficient: The correlation coefficient (\(\rho\)) measures the strength and direction of the linear relationship between two variables. If \(\rho\) is positive, the two variables move together; if \(\rho\) is negative, they move in opposite directions.

    • The correlation coefficient between Price and Age is \(-0.2303\), a weak negative linear relationship: the older the apartment, the lower its price tends to be.
  • Coefficient of determination: The coefficient of determination (\(R^2\)) is used to determine the proportion of variance in the dependent variable that can be predicted by the independent variable.

    • The \(R^2\) in this case is quite small, namely 5.3%. This means that only 5.3% of the variability in the apartment price can be explained by the linear effect of age.
fit1 <- lm(Price ~ Age, data = df)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
print(paste("Regression Coefficient:", round(coef(fit1)[2], 4))) 
## [1] "Regression Coefficient: -8.9751"
print(paste("Correlation Coefficient:", round(cor(df$Price, df$Age), 4)))
## [1] "Correlation Coefficient: -0.2303"
print(paste("Coefficient of Determination:", round(summary(fit1)$r.squared, 4)))
## [1] "Coefficient of Determination: 0.053"

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

Multicollinearity describes a situation where two or more independent variables are strongly correlated with one another.

Based on the scatterplot matrix, the correlation between the two independent variables, Age and Distance, is nearly zero, so the predictors are essentially uncorrelated. The correlation between Distance and Price appears negative, but Price is the dependent variable, so this is not a multicollinearity issue. As such, multicollinearity does not appear to be a concern for this model.

library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
scatterplotMatrix(df[, c(1, 2, 3)], smooth = FALSE)
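
To complement the visual inspection, the pairwise correlations underlying the matrix can be computed directly:

# Numeric counterpart of the scatterplot matrix
round(cor(df[, c("Age", "Distance", "Price")]), 3)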

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

Having fitted the model, it is necessary to check the assumptions before analysing the results from the regression model.

fit2 <- lm(Price ~ Age + Distance, data = df)

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity.

  • The VIF in this case is 1.002, well below the common cut-offs of 5 or 10 above which multicollinearity is usually considered problematic. This confirms the impression from the scatterplot matrix above that multicollinearity is not an issue for this model.
vif(fit2)
##      Age Distance 
## 1.001845 1.001845
print(paste("VIF Statistics:", mean(vif(fit2))))
## [1] "VIF Statistics: 1.0018445034083"

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

Before removing any potential outliers or units with high influence, the data were manually examined by ordering the dataframe according to each criterion: standardised residuals and Cook’s distance.

A commonly used cut-off point for identifying potential outliers based on standardised residuals is ±3. Upon inspection, all standardised residuals in the dataset fall within this range. Therefore, no records were removed on the basis of residual size, as none of the observations exhibited extreme deviations from the model’s expectations.

In contrast, units with high influence are assessed using Cook’s distance. In this dataset, two observations, namely those indexed 38 and 55, display Cook’s distance values that noticeably deviate from the rest. Although they do not exceed the conventional threshold of one, their relative magnitude suggests that they could have a substantial impact on the regression results. Including such points may increase model instability and reduce the explanatory power of the regression model.

df$fit2_StdResid <- round(rstandard(fit2), 3) 
df$fit2_CooksD <- round(cooks.distance(fit2), 3) 

head(df[order(abs(df$fit2_StdResid), decreasing = TRUE),], 5)
##    Age Distance Price Parking Balcony fit2_StdResid fit2_CooksD
## 38   5       45  2180     Yes     Yes         2.577       0.320
## 53   7        2  1760      No     Yes        -2.152       0.066
## 33   2       11  2790     Yes      No         2.051       0.069
## 2   18        1  2800     Yes      No         1.783       0.030
## 61  18        1  2800     Yes     Yes         1.783       0.030
head(df[order(df$fit2_CooksD, decreasing = TRUE),], 5)
##    Age Distance Price Parking Balcony fit2_StdResid fit2_CooksD
## 38   5       45  2180     Yes     Yes         2.577       0.320
## 55  43       37  1740      No      No         1.445       0.104
## 33   2       11  2790     Yes      No         2.051       0.069
## 53   7        2  1760      No     Yes        -2.152       0.066
## 22  37        3  2540     Yes     Yes         1.576       0.061
hist(x = df$fit2_CooksD, xlab = "Cook's Distance", ylab = "Frequency", main = "Histogram of Cook's Distance")
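
As a cross-check on the manual inspection, a common rule of thumb flags observations whose Cook’s distance exceeds 4/n; a minimal sketch (the threshold is a convention, not part of the analysis above):

# Flag observations above the 4/n rule-of-thumb cut-off
cutoff <- 4 / nrow(df)
which(cooks.distance(fit2) > cutoff)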

Based on the preceding analysis, the most appropriate course of action is to remove records that may adversely affect the performance of the regression model. Accordingly, the observations at indices 38 and 55 were excluded from the dataset.

library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df_cleaned <- df %>% 
  filter(!row_number() %in% c(38, 55))

head(df_cleaned[order(df_cleaned$fit2_CooksD, decreasing = TRUE),], 5)
##    Age Distance Price Parking Balcony fit2_StdResid fit2_CooksD
## 33   2       11  2790     Yes      No         2.051       0.069
## 52   7        2  1760      No     Yes        -2.152       0.066
## 22  37        3  2540     Yes     Yes         1.576       0.061
## 38  40        2  2400      No     Yes         1.091       0.038
## 56   8        2  2820     Yes      No         1.655       0.037

10. Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

Based on the scatterplot, the spread of the standardised residuals appears to widen as the fitted values increase, suggesting potential heteroscedasticity in the regression model. However, the Breusch-Pagan test does not reject the null hypothesis of constant variance at the 5% significance level (\(p = 0.052\), only marginally above 0.05), so there is no strong formal evidence against homoscedasticity.

fit2_cleaned <- lm(Price ~ Age + Distance, data = df_cleaned)

df_cleaned$fit2_StdFitted <- scale(fit2_cleaned$fitted.values)

library(car)

scatterplot(y = df_cleaned$fit2_StdResid, x = df_cleaned$fit2_StdFitted, 
            ylab = "Standardised residuals", 
            xlab = "Standardised fitted values", 
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2_cleaned)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969
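
The same hypothesis can also be tested with bptest() from the lmtest package (assuming it is installed); its default studentised version may yield a slightly different statistic than the olsrr variant above:

# Studentised Breusch-Pagan test on the cleaned model
library(lmtest)
bptest(fit2_cleaned)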

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

The histogram below displays the distribution of the standardised residuals. At first glance, the distribution does not appear to follow a normal shape: it is positively skewed.

To formally assess normality, the Shapiro-Wilk test was conducted. The resulting p-value, \(0.0036\), is below 0.05, providing statistical evidence that the standardised residuals deviate from a normal distribution.

Given this result, the normality assumption is formally violated. However, the sample is reasonably large (over 80 observations), so by the Central Limit Theorem the sampling distributions of the coefficient estimates are approximately normal, and inference remains approximately valid despite the non-normal residuals.

hist(df_cleaned$fit2_StdResid,
     xlab = "Standardised residuals", 
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

shapiro.test(df$fit2_StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$fit2_StdResid
## W = 0.95303, p-value = 0.003645

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

Regression coefficients:

  • Intercept: The expected price per m2 of a new apartment (Age = 0) located in the city centre (Distance = 0) is approximately 2490.11.

  • Age: Holding all else constant, the price per m2 of an apartment declines, on average, by approximately 7.85 monetary units for each additional year of the apartment’s age.

  • Distance: Holding all else constant, the price per m2 of an apartment declines, on average, by approximately 23.95 monetary units for each additional kilometre from the city centre.

fit2_cleaned <- lm(Price ~ Age + Distance,
           data = df_cleaned)

summary(fit2_cleaned)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = df_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12
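
Interval estimates for these coefficients can be obtained with confint(); for example:

# 95% confidence intervals for the coefficients of the cleaned model
round(confint(fit2_cleaned), 2)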

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = df)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

Based on the ANOVA output comparing the two nested models, fit3 has greater explanatory power than fit2, as is evident from its lower residual sum of squares. The improvement in fit is statistically significant (p = 0.01007 < 0.05), suggesting that the additional predictors (Parking and Balcony) contribute meaningfully to explaining the variation in Price.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
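
The partial F statistic can be reproduced by hand from the printed residual sums of squares, \(F = \frac{(RSS_1 - RSS_2)/q}{RSS_2/df_2}\), where \(q = 2\) is the number of added parameters:

# Reproduce the F statistic from the ANOVA table above
rss_fit2 <- 6720983
rss_fit3 <- 5991088
((rss_fit2 - rss_fit3) / 2) / (rss_fit3 / 80)  # approx. 4.8732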

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

Regression Coefficient:

  • Parking: Holding all else constant, apartments with parking cost, on average, approximately 196.17 monetary units more per m2 than apartments without parking. This effect is statistically significant (\(p = 0.0025\)).

  • Balcony: Holding all else constant, apartments with a balcony cost, on average, approximately 1.94 monetary units more per m2 than apartments without a balcony. However, the coefficient is not statistically significant (\(p = 0.974\)), indicating no meaningful effect of having a balcony on price in this model.

The F-statistic tests whether at least one of the independent variables has a linear relationship with the dependent variable. The hypotheses are:

\[H_0: \beta_1 = \beta_2 = \beta_3=\beta_4=0\]

\[H_1: \text{at least one } \beta_j \neq 0\]

The p-value is extremely small, \(1.849 \times 10^{-11}\), so the null hypothesis can be rejected: at least one independent variable is significantly related to the variation in the dependent variable.

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11
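
The reported p-value of the overall F-test can be reproduced from the F distribution:

# Upper-tail probability of F = 20.03 with 4 and 80 degrees of freedom
pf(20.03, df1 = 4, df2 = 80, lower.tail = FALSE)  # approx. 1.85e-11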

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

df$Fitted3 <- fitted(fit3)            # save the fitted values from fit3
df$Resid3 <- df$Price - df$Fitted3    # residual = observed - fitted

index <- 2

print(paste("The residual for apartment", index, "is", round(df$Resid3[index], 2)))
## [1] "The residual for apartment 2 is 442.59"