library(readxl)
apartments <- read_excel("~/Desktop/SpringFiles.2025/5708 Applied DA w R/Applied Data Analysis in R - HW3 (data)/Apartments.xlsx")
View(apartments)
Description:
A possible research question could be: “Does the age of an apartment (years) impact the price more or less than Distance in km from the city center?”
apartments$ParkingF <- factor(apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
apartments$BalconyF <- factor(apartments$Balcony,
levels = c(0,1),
labels = c("No","Yes"))
t.test(apartments$Price, mu = 1900)
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
You would reject the null hypothesis since the p < 0.05, implying that the evidence is statistically significant that the average apartment price is not equal to 1900 EUR. In other words the confidence interval for the true mean does not include 1900.
library(ggplot2)
ggplot(apartments, aes(x = Price)) +
geom_histogram(binwidth = 50, colour = "gray") +
ylab("Frequency") +
xlab("Price per m2")
Since the Histogram is relatively normally distributed, we can assume normality. Therefore the Wilcoxon Signed-Rank test is not needed (There is minor asymmetry to the right, but since there’s not major outliers, I think the t-test suffices).
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Estimate of regression coefficient: Price = 2185.455 + -8.975 * Age “If the Age of the apartment increases by 1 year, Price per m2 on average decreases by -8.975 EUR (holding all other explanatory variables constant/unchanged).”
Intercept: 2185.455 EUR “The expected price per m2 for a brand new apartment (Age = 0 years) is 2185.455 EUR, holding all explanatory variables constant/unchanged.”
R^2 (Coefficient of Determination): 0.053 “5.3% of variability of Price is explained by the linear effect of Age.” Despite being statistically significant, Age has a very low R^2, meaning it has a weak effect on the Price per m2 of an apartment
Coefficient of Correlation (R): sq-rt of R^2 = 0.23 “The linear relationship between Price and Age is negative, with a correlation between Price and Age of 0.23 which indicates a weak negative linear relationship.”
library(car)
## Loading required package: carData
scatterplotMatrix(apartments[, c(-4, -5, -6, -7)],
smooth = FALSE)
Based on the scatterplot matrix, there doesn’t appear to be a huge problem of multicolinearity. For Age against Distance, the relationship shows low correlation (slight positive trend, spread-out). Additionally for Price against Distance and Price against Age, there is a negative relationship. As Age increases, Price decreases; as Distance increases, Price also decreases. Based on these relationships and on the low correlation between Age and Distance, there is no problem of multicollinearity present.
fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Note: I wasn’t sure if I needed to explain the coefficients for this model like we were instructed to on the last one
Estimate of regression coefficient: Price = 2460.1 + -7.93 * Age + -20.67 * Distance “If the Age of the apartment increases by 1 year, Price per m2 on average decreases by -7.93 EUR holding all other explanatory variables constant/unchanged.” “If the Distance from city center of the apartment increases by 1 km, Price per m2 on average decreases by -20.67 EUR holding all other explanatory variables constant/unchanged.”
Intercept: 2460.1 EUR “The expected price per m2 for a brand new apartment in city center (Age = 0 years, Distance from City center = 0km) is 2460.10 EUR, holding all explanatory variables constant/unchanged.”
R^2 (Coefficient of Determination): 0.4396 “43.96% of variability of Price is explained by the linear effect of Age.” Compared to fit1, Distance and Age have a much higher R^2, meaning it has a stronger effect on the Price per m2 of an apartment, but is not very strong.
Coefficient of Correlation (R): sq-rt of R^2 = 0.66 “The linear relationship between Price, Age and Distance is negative, with a correlation between Price and Age and Distance of 0.66 which indicates a weak negative linear relationship.”
vif(fit2)
## Age Distance
## 1.001845 1.001845
The two explanatory variables have an equal VIF of 1.001845 which indicates nearly no multicollinearity between the predictor variables given the range of VIF (1-5). Since the variables are extremely close to the minimum of 1, there is nearly no correlation between predictor variables meaning they are completely independent of each other. In other words, the predictors share .18% of their variance so they contribute mostly unique information to the regression with very little overlap.
apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(apartments$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals")
Given the -/+3 threshold range, no outliers would be dropped as all standardized residuals fall within this range.
shapiro.test(apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: apartments$StdResid
## W = 0.95303, p-value = 0.003645
I also used the Shapiro-Wilk test to see whether my standardized residuals are normally distributed or not. Since the p < 0.05 I would reject the null hypothesis that my residuals are normally distributed despite the w = 0.95 being close to 1. This means the residuals aren’t normally distributed.
hist(apartments$CooksD,
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distances")
There are a few observations with Cook’s distances higher ( 0.1, 0.15, and 0.3) than the majority which are below 0.05, indicating some influential points.
head(apartments[order(-apartments$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
Since the top 2 apartment’s Cooks Distances stand out as higher than the rest, I will remove them using their age as they are the only apartments with the ages 5 and 43.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
apartments <- apartments %>%
filter(!Age == "5")
library(dplyr)
apartments <- apartments %>%
filter(!Age == "43")
fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
Here I redefined fit2 to account for the removal of the units of high impact.
apartments$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = apartments$StdResid, x = apartments$StdFitted,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted Values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Based on the shape of the Scatterplot, there is apparent heteroskendasticity as the spread along the y-axis gets broader around 1 on the x-axis.
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
Based on the Breusch Pagan test, this model fails to reject the null hypothesis of homoskendasticity because the standard alpha is above 0.05 (it is 0.052). This means that the variance might not be perfectly constant across all fitted values. Because of how close the standard alpha is to the threshold, further testing is needed.
apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(apartments$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals")
Based on the shape of the histogram, the distribution appears to be bimodel meaning it doesn’t follow the normal bell-shaped distribution. It also varies from the original fit2 which did not exclude the two units of high impact.
shapiro.test(apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: apartments$StdResid
## W = 0.95952, p-value = 0.01044
The Shapiro-Wilk test provides a p-value of 0.01044 which is less than the ideal significance level of 0.05, therefore we reject the null hypothesis that the standardized residuals are normally distributed.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
Since I already removed the units of high impact in Question 9, I will just explain all coefficients here:
Price = 2490.11 + -7.85 * Age + -23.95 * Distance
Intercept: 2490.11 EUR “The expected price per m2 for a brand new apartment in city center (Age = 0 years, Distance from City center = 0km) is 2490.11 EUR, holding all explanatory variables constant/unchanged.” Additionally, the p-value is extremely small, indicating the intercept as statistically significant, and the base value as reliable
Correlation Coefficients Explanations: “If the Age of the apartment increases by 1 year, Price per m2 on average decreases by -7.85 EUR (p<0.05) holding all other explanatory variables constant/unchanged.” “If the Distance from city center of the apartment increases by 1 km, Price per m2 on average decreases by -23.95 EUR (p<0.01) holding all other explanatory variables constant/unchanged.” (Both are statistically significant, with Distance having the smallest p-value)
R^2 (Coefficient of Determination): 0.4968 “49.68% of variability of Price is explained by the linear effect of Age.” Compared to the original fit2, Distance and Age have a slightly higher R^2, meaning it has a stronger effect on the Price per m2 of an apartment, but is not very strong.
Coefficient of Correlation (R): sq-rt of R^2 = 0.70 “The linear relationship between Price, Age and Distance is negative, with a correlation between Price and Age and Distance of 0.70 which indicates a strong negative linear relationship.”
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (0.02813) is less than the significance level of 0.05, we reject the null hypothesis that Model 1 (fit2) fits the data as well as Model 2. This indicates that Model 2 (fit3), which includes the dummy variables: ParkingF and BalconyF, fits the data far better than Model 1 (fit2). Therefore, the inclusion of the parking and balcony factors improves the model’s fit significantly.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
ParkingFYes: 168.92 EUR “Given the values of other explanatory variables, then apartments with parking have a higher price by 168.92 EUR on average compared to aparments without parking (p<0.01).”
BalconyFYes: -6.99 “Given the values of other explanatory variables, then apartments with Balconies have a lower price by 6.99 EUR on average compared to aparments without Balconies.” The p-value is very high ( p-value = 0.91) indicating this is not a statistically significant effect.
Test of Significance of Regression: H0: p^2 = 0 H1: p^2 > 0
Since the p-value of the the F-statistic is very small (1.449e-12) and less than 0.05. We reject the null hypothesis at reject p < 0.001. By rejecting the null we know that the model is statistically significant, and at least one of the predictor variables has a statistically significant relationship with Price.
apartments <- apartments %>%
mutate(ID = row_number())
apartments$Fitted <- fitted.values(fit3)
apartments$Residuals <- residuals(fit3)
head(apartments[colnames(apartments) %in% c("ID", "Price","Fitted","Residuals")])
## # A tibble: 6 × 4
## Price ID Fitted Residuals
## <dbl> <int> <dbl> <dbl>
## 1 1640 1 1707. -66.8
## 2 2800 2 2377. 423.
## 3 1660 3 1714. -53.8
## 4 1850 4 1534. 316.
## 5 1640 5 2009. -369.
## 6 1770 6 1896. -126.
ID2 is the second on the list and shows that the actual price of ID2 is 422.96 EUR higher than the price that my model predicted in the fit3 model.