HW 3

ROSTYSLAV MYKHALCHUK

1. Import the dataset Apartments.xlsx

library(readxl) #Activating library 
mydata <- read_xlsx("./Apartments.xlsx") #Reading data set in .xlsx format

mydata <- as.data.frame(mydata) #Convert data table to data frame format 
head(mydata)

##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2 in euros
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2.What could be a possible research question given the data you analyze? (1 p)

How do apartment characteristics (age, distance to city center, availability of parking and balcony) affect the price per m² (on avg.)?

3. Change categorical variables into factors. (0.5 p)

mydata$Parking <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

mydata$Balcony <- factor(mydata$Balcony,
                         levels = c(0, 1),
                         labels = c("No", "Yes"))

head(mydata)

##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No
## 4  28       29  1850      No     Yes
## 5  18       18  1640     Yes     Yes
## 6  28       12  1770      No     Yes

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t.test(mydata$Price, 
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Hypotheses for one-sample t-test:

Null hypothesis (H0): μ = 1900
Alternative hypothesis (H1): μ ≠ 1900
We reject the null hypothesis at the p = 0.05, meaning that there is statistically significant evidence that the average apartment price differs from 1900 EUR

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = mydata)


summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The estimate of regression coefficient

The estimate of regression coefficient (b1) equals to -8.975 (the slope is negative -> negative relationship between Price and Age)
For each additional year of apartment age, the price decreases on average by approximately 8.98 euros per m2 (H0 rejected at p = 0.04 - this relationship is statistically significant)

Coefficient of determination

R^2 equals to 0.05302
Only 5.3% of the variability of the price is explained by the apartment’s age
there are also other variables that describe the price

Coefficient of correlation

cor(mydata$Price, mydata$Age)

## [1] -0.230255

Correlation coefficient between price and age equals to -0.23
the linear realationship between price and age is negative and weak (Pearson)

6. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

#install.packages("carData")

library(car)

## Loading required package: carData

scatterplotMatrix(mydata[ , -c(4, 5)],
                  smooth = FALSE,
                  main = "Scatterplot Matrix: Price, Age, and Distance")

There is no strong linear relationship between the explanatory variables (i.e., no highly correlated variables), suggesting no multicollinearity issue
However, a formal test, such as the Variance Inflation Factor (VIF), is recommended to confirm this

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The VIF values for both Age and Distance are around 1.00, confirming that there is no multicollinearity in the model

mean(vif(fit2))

## [1] 1.001845

the mean of VIF is around 1, supporting previous conclusion

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

mydata$std_resid <- rstandard(fit2)
mydata$cooks_d <- cooks.distance(fit2)

hist(mydata$std_resid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata$std_resid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$std_resid
## W = 0.95306, p-value = 0.00366

The distribution is roughly symmetric but shows some skewness and deviation from a perfect bell shape
A few residuals appear above +2 or below -2, but most are concentrated around zero, which is expected in a well-fitting model
However, What about Shapiro-Wilk normality test, p-value is less than 0.05
but we have big sample (sample size = 85), and the larger is the sample size, less problematic is the violation of normality distribution
We assume that errors are normally distributed

hist(mydata$cooks_d,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

Although most Cook’s distances are very small, there’s a notable jump from around 0.15 to over 0.30, which could signal a few units with disproportionately high influence on the model

head(mydata[order(-mydata$cooks_d),], 6)

##    Age Distance Price Parking Balcony std_resid    cooks_d
## 38   5       45  2180     Yes     Yes  2.576772 0.31973058
## 55  43       37  1740      No      No  1.444768 0.10420445
## 33   2       11  2790     Yes      No  2.050586 0.06913379
## 53   7        2  1760      No     Yes -2.151787 0.06625775
## 22  37        3  2540     Yes     Yes  1.575982 0.06086868
## 39  40        2  2400      No     Yes  1.091176 0.03750987

mydata <- mydata[order(-mydata$cooks_d)[-1], ] # [-1] removes the first index (i.e., the most influential unit)
head(mydata[order(-mydata$cooks_d),], 6)

##    Age Distance Price Parking Balcony std_resid    cooks_d
## 55  43       37  1740      No      No  1.444768 0.10420445
## 33   2       11  2790     Yes      No  2.050586 0.06913379
## 53   7        2  1760      No     Yes -2.151787 0.06625775
## 22  37        3  2540     Yes     Yes  1.575982 0.06086868
## 39  40        2  2400      No     Yes  1.091176 0.03750987
## 58   8        2  2820     Yes      No  1.655343 0.03654413

hist(mydata$cooks_d,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

there are 2 more units with high impact that should be removed

mydata <- mydata[order(-mydata$cooks_d)[-c(1, 2, 3, 4)], ]

hist(mydata$cooks_d,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

head(mydata[order(-mydata$cooks_d),], 6)

##    Age Distance Price Parking Balcony std_resid    cooks_d
## 39  40        2  2400      No     Yes  1.091176 0.03750987
## 58   8        2  2820     Yes      No  1.655343 0.03654413
## 25   8       26  2300     Yes     Yes  1.570985 0.03412790
## 57  10        1  2810      No      No  1.600668 0.03199717
## 2   18        1  2800     Yes      No  1.783288 0.03036543
## 61  18        1  2800     Yes     Yes  1.783288 0.03036543

this is pretty close, so I would rather not remove other variables

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance, data = mydata)

mydata$std_fit <- scale(fit2$fitted.values)


library(car)
scatterplot(std_resid ~ std_fit,
            data = mydata,
            xlab = "Standardized Fitted Values",
            ylab = "Standardized Residuals",
            main = "Scatterplot: Residuals vs Fitted Values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE,
            grid = FALSE)

There is no clear visual evidence of heteroskedasticity (there is relatively constant distribution of the points)
However, to be more rigorous, you could perform a formal test such as the Breusch-Pagan test to confirm this assumption

11. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

# install.packages("olsrr")

library(ggpubr)

## Loading required package: ggplot2

ggqqplot(rstandard(fit2))

shapiro.test(rstandard(fit2))

## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(fit2)
## W = 0.94156, p-value = 0.001168

The residuals show some deviations from normality — both visually (in the Q-Q plot) and statistically (via Shapiro–Wilk test)
However, since the sample size is quite large (n = 85), the violation of normality is less problematic
We can proceed with linear regression

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2_clean <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2_clean)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Intercept (b0)

This is the estimated price per m2 when both Age and Distance are 0
While this exact situation (age = 0 years and distance = 0 km) may not be realistic, the intercept serves as a baseline
However, it could be interpreted as a newly built apartment located right in the city center, so a newly built apartment located at the city center is expected to cost approximately €2502 per m2 (on average)
But it is more theoretical baseline than a practical one, because no apartment in our dataset is newer than 2 years old, or nothing is located exactly at 0 km from the center

Age - Regression Coefficient (b1)

For each additional year in apartment age, the price per m2 decreases on average by ~8.67 euros (ceteris paribus)
Statistically significant (p = 0.009), so age has a meaningful effect on price

Distance - Regression Coefficient (b1)

For each additional kilometer away from the city center, the price drops on average by ~24.06 euros/m2 (p < 0.001) (ceteris paribus)

Coefficient of determination

About 53.61% of the variation in apartment price is explained by the combination of age and distance

F-statistic

The overall model is statistically significant (p < 0.001)

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

We fail to reject the null hypothesis, meaning model fit3 does not significantly improve the model compared to fit2

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Interpretation of Categorical Coefficients:

ParkingYes (b3):

Holding all other variables constant, apartments with parking cost on avg. €128.70 more per square meter than those without parking
The p-value is less than 0.05 –> the effect of parking is statistically significant

BalconyYes (b4):

holding all other variables constant, apartments with balconies are estimated to cost €6.03 more per square meter, compared to those without
However, p-value is very high –> this effect is not statistically significant. The presence of a balcony doesn’t have a meaningful impact on price in this sample

F-statistic Hypothesis Test

Null hypothesis (H0): All regression coefficients are zero (i.e., the model does not explain anything)
Alternative hypothesis (H1): At least one explanatory variable in the model significantly explains variation in the dependent variable
Since the p-value is extremely small, we reject the null hypothesis. At least one explanatory variable in the model significantly explains variation in the dependent variable