library(readxl)
data <- read_xlsx("./Apartments.xlsx")
head(data)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
A possible research question given the data could be:
“How do the age of the apartment, its distance from the city center, and the presence of parking or a balcony affect the price per square meter?”
data$ParkingF <- factor(data$Parking,
levels = c("0", "1"),
labels = c("No_Parking", "Parking")) # No_Parking -> Reference Category
data$BalconyF <- factor(data$Balcony,
levels = c("0", "1"),
labels = c("No_Balcony", "Balcony")) # No_Balcony -> Reference Category
t.test(data$Price, mu = 1900, alternative = "two.sided")
##
## One Sample t-test
##
## data: data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We reject the null hypothesis at the p-value of 0.005, indicating that the mean price per square meter is significantly different from 1900 euros, and specifically, it appears to be higher.
fit1 <- lm(Price ~ Age, data = data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Intercept (β₀): The estimated price per m², when the apartment is 0 years old (brand new), is 2185.455 euros.
Slope (β₁): If the apartment’s age increases by one year, the price is expected to decrease by 8.98 euros, on average.
r <- sqrt(summary(fit1)$r.squared)
if (coef(fit1)["Age"] < 0) {r <- -r}
r
## [1] -0.230255
Coefficient of Correlation (r): This means there’s a weak negative correlation between apartment age and price.
Coefficient of Determination (R²): Approximately 5.3% of the variability in apartment prices is explained by the age of the apartment. (R² = 0.05302)
library(car)
## Loading required package: carData
scatterplotMatrix(data[, c("Price", "Age", "Distance")], smooth = FALSE, col = "mediumslateblue")
Based on the scatterplot matrix, there is no indication of a perfect linear relationship between any pair of variables. Therefore, the assumption of no perfect multicollinearity, expressed as:
\[ \lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0 \]
is not violated.
fit2 <- lm(Price ~ Age + Distance, data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Since none of the Variance Inflation Factor (VIF) statistics exceed 5, and the average VIF is close to 1, we can conclude that there is no strong multicollinearity present in the model.
Standardized Residuals
data$Standardized_Residuals <- round(rstandard(fit2), 3)
hist(data$Standardized_Residuals,
xlab = "Standard Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals",
col = "mediumslateblue")
# First three rows sorted by standardized residuals
head(data[order(data$Standardized_Residuals),], 3)
## # A tibble: 3 × 8
## Age Distance Price Parking Balcony ParkingF BalconyF Standardized_Residuals
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 2 1760 0 1 No_Parking Balcony -2.15
## 2 12 14 1650 0 1 No_Parking Balcony -1.50
## 3 12 14 1650 0 0 No_Parking No_Balcony -1.50
# Last three rows sorted by standardized residuals
tail(data[order(data$Standardized_Residuals),], 3)
## # A tibble: 3 × 8
## Age Distance Price Parking Balcony ParkingF BalconyF Standardized_Residuals
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 18 1 2800 1 1 Parking Balcony 1.78
## 2 2 11 2790 1 0 Parking No_Balcony 2.05
## 3 5 45 2180 1 1 Parking Balcony 2.58
There are no outliers, as no data points have standardized residuals greater than 3 or less than -3.
Units with High Impact
data$CooksD<- round(cooks.distance(fit2), 3)
hist(data$CooksD,
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distances",
col = "mediumslateblue")
head(data[order(-data$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF Standardized_Residuals CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Parking Balcony 2.58 0.32
## 2 43 37 1740 0 0 No_Parking No_Balcony 1.44 0.104
## 3 2 11 2790 1 0 Parking No_Balcony 2.05 0.069
## 4 7 2 1760 0 1 No_Parking Balcony -2.15 0.066
## 5 37 3 2540 1 1 Parking Balcony 1.58 0.061
## 6 40 2 2400 0 1 No_Parking Balcony 1.09 0.038
I will remove the unit with a Cooks Distance higher than 0.3, as the graph clearly shows a jump, and this unit appears after that threshold.
data <- data[data$CooksD <= 0.3, ]
data$Standardized_Fitted <- scale(lm(Price ~ Age + Distance, data = data)$fitted.values)
library(car)
scatterplot(y = data$Standardized_Residuals, x = data$Standardized_Fitted,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted Values",
boxplots = FALSE, regLine = FALSE, smooth = FALSE, col = "mediumslateblue")
The plot shows that the variance of the errors is constant, indicating homoskedasticity. This means that the spread of the residuals remains consistent across all levels of the fitted values. In other words, the points in the scatterplot do not exhibit any “opening” or “closing” pattern, suggesting that there is no change in the variability of the errors as the fitted values increase.
data$Standardized_Residuals <- round(rstandard(lm(Price ~ Age + Distance, data = data)), 3)
hist(data$Standardized_Residuals,
xlab = "Standard Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals",
col = "mediumslateblue")
The graph does not necessarily indicate a normal distribution, so I will conduct a Shapiro-Wilk test to formally assess normality.
shapiro.test((data$Standardized_Residuals))
##
## Shapiro-Wilk normality test
##
## data: (data$Standardized_Residuals)
## W = 0.95649, p-value = 0.006355
We reject the null hypothesis at the p-value of 0.007, indicating that the standardized residuals are not normally distributed. However, since we have a large sample size of 84 observations, the assumption of normality is less critical.
fit2 <- lm(Price ~ Age + Distance, data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Intercept (β₀): The estimated price per m², when the apartment is 0 years old (brand new) and at a distance of 0 km from the city center, is 2456.076 euros.
Slope (β₁ for Age): If the apartment’s age increases by one year, the price is expected to decrease by 6.464 euros, on average, keeping every other variable unchanged.
Slope (β₂ for Distance): If the apartment’s distance from the city center increases by one kilometer, the price is expected to decrease by 22.955 euros, on average, keeping every other variable unchanged.
r <- sqrt(summary(fit2)$r.squared)
r
## [1] 0.6955609
Coefficient of Correlation (r): This means there’s a moderate linear relationship between the dependent variable (price) and the explanatory variables (age and distance).
Coefficient of Determination (R²): Approximately 48.4% of the variability in apartment prices is explained by the age and distance from the city center of the apartment. (R² = 0.4838)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = data)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null hypothesis at the p-value of 0.03, which indicates that the second model (fit3) has a significantly higher coefficient of determination. In other words, fit3 provides a better fit to the data compared to fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## Parking 167.531 62.864 2.665 0.00933 **
## Balcony -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Parking (β₃ = 167.531): Given Age and Distance, apartments with parking (compared to those without) have, on average, a price per m² that is 167.531 euros higher. This coefficient is statistically significant, with a p-value of 0.00933.
Balcony (β₄ = -15.207): Given Age and Distance, apartments with a balcony (compared to those without) have, on average, a price per m² that is 15.207 euros lower. However, this coefficient is not statistically significant, with a p-value of 0.79795.
The F-statistic tests the overall significance of the regression model. Specifically, it evaluates whether at least one of the predictors (independent variables) has a non-zero coefficient, meaning the model provides a better fit to the data than a model with no predictors.
The hypotheses being tested with the F-statistic are:
H₀: ρ² = 0, which means that none of the independent variables in the regression model are significantly related to the dependent variable, and the model does not explain the variance in the data better than a model with no predictors.
H₁: ρ² ≠ 0, which means that at least one of the independent variables in the regression model has a significant relationship with the dependent variable, and the model explains the variance in the data better than a model with no predictors.
data$ID <- seq(1, nrow(data))
data$fit3_fitted_values <- fit3$fitted.values
data$residuals <- residuals(fit3)
residual_id2 <- data$residuals[data$ID == 2]
fitted_id2 <- data$fit3_fitted_values[data$ID == 2]
cat("Fitted Value for ID 2:", fitted_id2, "\n")
## Fitted Value for ID 2: 2372.197
cat("Residual for ID 2:", residual_id2, "\n")
## Residual for ID 2: 427.8029