library(readxl)
Apartments <- read_excel("~/Downloads/R data/R Take Home Exam 2025/Task 3/Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Apartments$Parking <- factor(Apartments$Parking, levels= c (0,1), labels =c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony, levels = c(0,1), labels = c("No", "Yes"))
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
In this step I’ve changed categorical variables with categories 1 and 0. “No” means the apartment does not have parking or a balcony, and “yes” means the apartment has parking or balcony.
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
HO: MU = 1900 (the true mean apartment price is 1900 EUR) H1: MU not equal 1900 (the true mean apartment price is different from 1900 EUR)
Since the p-value is 0.004731 < 0.05, I can reject null hypothesis at the 5% significance level. I am 95% confident that true average price of the apartments lies between 1937.44 EUR and 2100.44 EUR. (Out of 100 samples, there exist 5 bad samples).
fit1 <- lm (Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
corr_coeff <- cor(Apartments$Age, Apartments$Price,
method ="pearson")
print(corr_coeff)
## [1] -0.230255
In this step, regression function is –> Price= 2185.455 - 8.975 x Age
Estimate of regression coefficient says that if an apartment has Age=O (a newly built apartment), the estimated average price is 2185.455 EUR, assuming all other variables remain unchanged. Since the age coefficient is negative, means that for each additional year an apartment gets older, the price will on average decrease by 8.975, assuming all other variables remain unchanged. I can also reject null hypothesis, since the coefficients are not equal to 0, because p value is 0.03401<0.05. Coefficient of correlation tells us that there is a weak negative correlation between apartment age and price. Lastly, coefficient of determination indicates that only about 5.3% of variation in apartment prices can be explained by age.
#install.packages("car")
library(car)
## Loading required package: carData
scatterplotMatrix(Apartments[ , c(1, 2, 3)], smooth = FALSE)
There is no multicolinearity, as the slope of regression line in the
scatterplot matrix for explanatory variables (Age & Distance) is
close to zero, indicates that there is no significant correlation
between these two variables.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Based on VIF statistics, there is no multicolinearity between Age and Distance, since both VIF values are below 5 and their average is close to 1. This further confirms my assumption from the previous task, where the scatterplot matrix suggested that there was no potential problem with multicolinearity.
Apartments$StdRes <- round (rstandard(fit2), 3)
head(Apartments[order(Apartments$StdRes),],10)
## # A tibble: 10 × 6
## Age Distance Price Parking Balcony StdRes
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 2 1760 No Yes -2.15
## 2 12 14 1650 No Yes -1.50
## 3 12 14 1650 No No -1.50
## 4 13 8 1800 No No -1.38
## 5 14 16 1660 No Yes -1.26
## 6 24 5 1830 Yes No -1.19
## 7 30 17 1560 No No -1.10
## 8 18 18 1640 Yes Yes -1.07
## 9 18 18 1640 Yes Yes -1.07
## 10 18 19 1620 Yes No -1.07
hist(Apartments$StdRes,
main= "Histogram of standardized residuals",
col= "#FFB6C1", #pastel pink
border ="navy",
xlab = "Standardized residuals",
ylab = "Frequency",
xlim = c(-3,3))
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
main = "Histogram of Cooks distances",
col= "lightgreen",
border = "navy",
xlab = "Cooks distances",
ylab = " Frequency")
head(Apartments[order(-Apartments$CooksD), ])
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdRes CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
The Histogram of standardized residuals shows no extreme outliers, as all values are between -3 and 3. However, one observation has a noticeably higher Cook’s distance (look at Histogram of Cooks distances), which indicates stronger influence on the regression model. Its standardized residual is 2.577, which is close to 3, indicating that I will have to remove it.
Apartments$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(x = Apartments$StdFitted, y = Apartments$StdRes,
xlab = "Standardized fitted values",
ylab = "Standardized residuals",
boxplots= FALSE,
regline = FALSE,
smooth = FALSE)
Apartments$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot (x = Apartments$StdFitted, y = Apartments$StdRes,
xlab = "Standardized fitted values",
ylab = "Standardized residuals",
boxplots= FALSE,
regline = FALSE,
smooth = FALSE,
col = "deeppink",
pch = 19)
This is example of homoscedasticity, because variance of the standardized residuals looks constant, which indicates that heteroscedasticity is not present.
hist(Apartments$StdRes,
main = "Histogram of standardized residuals",
xlab = "Standardized residuals",
ylab = "Density",
col = "lightblue",
border = "navy",
prob = TRUE,
xlim = c(-3, 3))
shapiro.test(Apartments$StdRes)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdRes
## W = 0.95303, p-value = 0.003645
The Histogram of standardized residuals suggests that the distribution is not normal. To check this assumption I also added Shapiro-Wilk normality test, where: - H0: standardized residuals are distributed normally - H1: standardized residuals are not distributed normally
Since the p-value is 0.003645 < 0.05, I can reject null hypothesis, which means that at 5% significance level standardized residuals are not normally distributed.
Apartments <- Apartments [!(Apartments$CooksD == 0.320), ]
fit2 <- lm(Price ~Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
Firstly, I removed rows from a data frame where Cooks Distance equals 0.320. Secondly, I estimated fit2 again to calculate new regression values without that value having strong influence on the regression model.
New regression function: Price= 2456.076 - 6.464 x Age - 22.955 x Distance
Estimate of regression coefficient says that if an apartment has Age=O (a newly built apartment), the estimated average price is 2456.076, assuming all other variables remain unchanged. Since the age coefficient is negative, means that for each additional year an apartment gets older, the price will on average decrease by 6.464, assuming all other variables remain unchanged. Moreover, regression coefficient for Distance tells us that if distance increases for 1 unit and everything else stays constant, Price will decrease by 22.955 EUR.
Lastly, coefficient of determination indicates that about 48.38% of variability in apartment prices can be explained by age and distance.
To explain, p-value: H0: coefficients are equal to 0 H1: coefficients are not equal to 0
Since the p-value is 2.339e-12 < 0.05, I can reject null hypothesis, which means that at 5% significance level coefficients are not equal to 0.
fit3 <-lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: model 1 is better (fit2) H1: model 2 is better (fit3)
I would reject null hypothesis at 5% significance level, because p-value is less than 0.05 and ANOVA says that model 2 (fit3) is better.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Regression coefficients for categorical variables show that apartments with parking are predicted to have price higher by 167.531 EUR compared to apartments without parking, assuming all other variables remain unchanged. Moreover, apartments with a balcony are predicted to have price lower by 15.21 EUR, compared to apartments without a balcony, controlling for all other variables. F-statistic: H0: population coefficient of determination = 0 H1: population coefficient of determination > 0
Since F= 22.04, and p-value is 3.018e-12 < 0.05, I can reject null hypothesis, meaning model explains a significant amount of the variance in Price.
Apartments$FittedValues <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[ , colnames(Apartments) %in% c ("Age", "Price", "Balcony", "Distance", "FittedValues", "Residuals")])
## # A tibble: 6 × 6
## Age Distance Price Balcony FittedValues Residuals
## <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 7 28 1640 Yes 1706. -66.0
## 2 18 1 2800 No 2372. 428.
## 3 7 28 1660 No 1721. -61.2
## 4 28 29 1850 Yes 1563. 287.
## 5 18 18 1640 Yes 2012. -372.
## 6 28 12 1770 Yes 1908. -138.
A residual of 427.803 means that the actual price of Apartment 2 is 427.803 EUR higher than the model predicted.