Task 3, Neža Podpeskar

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/Downloads/R data/R Take Home Exam 2025/Task 3/Apartments.xlsx")

head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking, levels= c (0,1), labels =c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony, levels = c(0,1), labels = c("No", "Yes"))
head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 No      Yes    
## 2    18        1  2800 Yes     No     
## 3     7       28  1660 No      No     
## 4    28       29  1850 No      Yes    
## 5    18       18  1640 Yes     Yes    
## 6    28       12  1770 No      Yes

In this step I’ve changed categorical variables with categories 1 and 0. “No” means the apartment does not have parking or a balcony, and “yes” means the apartment has parking or balcony.

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price,
       mu = 1900, 
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

HO: MU = 1900 (the true mean apartment price is 1900 EUR) H1: MU not equal 1900 (the true mean apartment price is different from 1900 EUR)

Since the p-value is 0.004731 < 0.05, I can reject null hypothesis at the 5% significance level. I am 95% confident that true average price of the apartments lies between 1937.44 EUR and 2100.44 EUR. (Out of 100 samples, there exist 5 bad samples).

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm (Price ~ Age, data = Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

corr_coeff <- cor(Apartments$Age, Apartments$Price, 
                  method ="pearson")
print(corr_coeff)

## [1] -0.230255

In this step, regression function is –> Price= 2185.455 - 8.975 x Age

Estimate of regression coefficient says that if an apartment has Age=O (a newly built apartment), the estimated average price is 2185.455 EUR, assuming all other variables remain unchanged. Since the age coefficient is negative, means that for each additional year an apartment gets older, the price will on average decrease by 8.975, assuming all other variables remain unchanged. I can also reject null hypothesis, since the coefficients are not equal to 0, because p value is 0.03401<0.05. Coefficient of correlation tells us that there is a weak negative correlation between apartment age and price. Lastly, coefficient of determination indicates that only about 5.3% of variation in apartment prices can be explained by age.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

#install.packages("car")
library(car)

## Loading required package: carData

scatterplotMatrix(Apartments[  , c(1, 2, 3)], smooth = FALSE)

There is no multicolinearity, as the slope of regression line in the scatterplot matrix for explanatory variables (Age & Distance) is close to zero, indicates that there is no significant correlation between these two variables.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicolinearity with VIF statistics.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

Based on VIF statistics, there is no multicolinearity between Age and Distance, since both VIF values are below 5 and their average is close to 1. This further confirms my assumption from the previous task, where the scatterplot matrix suggested that there was no potential problem with multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdRes <- round (rstandard(fit2), 3)
head(Apartments[order(Apartments$StdRes),],10)

## # A tibble: 10 × 6
##      Age Distance Price Parking Balcony StdRes
##    <dbl>    <dbl> <dbl> <fct>   <fct>    <dbl>
##  1     7        2  1760 No      Yes      -2.15
##  2    12       14  1650 No      Yes      -1.50
##  3    12       14  1650 No      No       -1.50
##  4    13        8  1800 No      No       -1.38
##  5    14       16  1660 No      Yes      -1.26
##  6    24        5  1830 Yes     No       -1.19
##  7    30       17  1560 No      No       -1.10
##  8    18       18  1640 Yes     Yes      -1.07
##  9    18       18  1640 Yes     Yes      -1.07
## 10    18       19  1620 Yes     No       -1.07

hist(Apartments$StdRes, 
     main= "Histogram of standardized residuals",
     col= "#FFB6C1", #pastel pink
     border ="navy",
     xlab = "Standardized residuals",
     ylab = "Frequency",
     xlim = c(-3,3))

Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
      main = "Histogram of Cooks distances",
      col= "lightgreen", 
         border = "navy",
      xlab = "Cooks distances",
      ylab = " Frequency")

head(Apartments[order(-Apartments$CooksD), ])

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdRes CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>    <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes       2.58  0.32 
## 2    43       37  1740 No      No        1.44  0.104
## 3     2       11  2790 Yes     No        2.05  0.069
## 4     7        2  1760 No      Yes      -2.15  0.066
## 5    37        3  2540 Yes     Yes       1.58  0.061
## 6    40        2  2400 No      Yes       1.09  0.038

The Histogram of standardized residuals shows no extreme outliers, as all values are between -3 and 3. However, one observation has a noticeably higher Cook’s distance (look at Histogram of Cooks distances), which indicates stronger influence on the regression model. Its standardized residual is 2.577, which is close to 3, indicating that I will have to remove it.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standardized fitted values. Explain the findings.

Apartments$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(x = Apartments$StdFitted, y = Apartments$StdRes,
             xlab = "Standardized fitted values",
             ylab = "Standardized residuals",
             boxplots= FALSE,
             regline = FALSE,
             smooth = FALSE)

Apartments$StdFitted <- scale(fit2$fitted.values)

library(car)

scatterplot  (x = Apartments$StdFitted, y = Apartments$StdRes,
             xlab = "Standardized fitted values",
             ylab = "Standardized residuals",
             boxplots= FALSE,
             regline = FALSE,
             smooth = FALSE, 
             col = "deeppink",
              pch = 19)

This is example of homoscedasticity, because variance of the standardized residuals looks constant, which indicates that heteroscedasticity is not present.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments$StdRes,
     main = "Histogram of standardized residuals",
     xlab = "Standardized residuals",
     ylab = "Density",
     col = "lightblue",
     border = "navy",
     prob = TRUE,
     xlim = c(-3, 3))

shapiro.test(Apartments$StdRes)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdRes
## W = 0.95303, p-value = 0.003645

The Histogram of standardized residuals suggests that the distribution is not normal. To check this assumption I also added Shapiro-Wilk normality test, where: - H0: standardized residuals are distributed normally - H1: standardized residuals are not distributed normally

Since the p-value is 0.003645 < 0.05, I can reject null hypothesis, which means that at 5% significance level standardized residuals are not normally distributed.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Apartments <- Apartments [!(Apartments$CooksD == 0.320), ]

fit2 <- lm(Price ~Age + Distance,
           data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

sqrt(summary(fit2)$r.squared)

## [1] 0.6955609

Firstly, I removed rows from a data frame where Cooks Distance equals 0.320. Secondly, I estimated fit2 again to calculate new regression values without that value having strong influence on the regression model.

New regression function: Price= 2456.076 - 6.464 x Age - 22.955 x Distance

Estimate of regression coefficient says that if an apartment has Age=O (a newly built apartment), the estimated average price is 2456.076, assuming all other variables remain unchanged. Since the age coefficient is negative, means that for each additional year an apartment gets older, the price will on average decrease by 6.464, assuming all other variables remain unchanged. Moreover, regression coefficient for Distance tells us that if distance increases for 1 unit and everything else stays constant, Price will decrease by 22.955 EUR.

Lastly, coefficient of determination indicates that about 48.38% of variability in apartment prices can be explained by age and distance.

To explain, p-value: H0: coefficients are equal to 0 H1: coefficients are not equal to 0

Since the p-value is 2.339e-12 < 0.05, I can reject null hypothesis, which means that at 5% significance level coefficients are not equal to 0.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <-lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: model 1 is better (fit2) H1: model 2 is better (fit3)

I would reject null hypothesis at 5% significance level, because p-value is less than 0.05 and ANOVA says that model 2 (fit3) is better.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Regression coefficients for categorical variables show that apartments with parking are predicted to have price higher by 167.531 EUR compared to apartments without parking, assuming all other variables remain unchanged. Moreover, apartments with a balcony are predicted to have price lower by 15.21 EUR, compared to apartments without a balcony, controlling for all other variables. F-statistic: H0: population coefficient of determination = 0 H1: population coefficient of determination > 0

Since F= 22.04, and p-value is 3.018e-12 < 0.05, I can reject null hypothesis, meaning model explains a significant amount of the variance in Price.

Save fitted values and calculate the residual for apartment ID2.

Apartments$FittedValues <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[  , colnames(Apartments) %in% c ("Age", "Price", "Balcony", "Distance", "FittedValues", "Residuals")])

## # A tibble: 6 × 6
##     Age Distance Price Balcony FittedValues Residuals
##   <dbl>    <dbl> <dbl> <fct>          <dbl>     <dbl>
## 1     7       28  1640 Yes            1706.     -66.0
## 2    18        1  2800 No             2372.     428. 
## 3     7       28  1660 No             1721.     -61.2
## 4    28       29  1850 Yes            1563.     287. 
## 5    18       18  1640 Yes            2012.    -372. 
## 6    28       12  1770 Yes            1908.    -138.

A residual of 427.803 means that the actual price of Apartment 2 is 427.803 EUR higher than the model predicted.