library(readxl)
Apartments <- read_excel("~/Desktop/R Take Home Exam 2025/Task 3/Apartments.xlsx")
Description:
mydata <-force(Apartments)
mydata$Parking <- factor(mydata$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata$Balcony <-factor(mydata$Balcony,
levels = c(0,1),
labels = c("No","Yes"))
mean(mydata$Price)
## [1] 2018.941
sd(mydata$Price)
## [1] 377.8417
qt(p = 0.025, df = 84, lower.tail = FALSE)
## [1] 1.98861
qt(p = 0.025, df = 84, lower.tail = TRUE)
## [1] -1.98861
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on a sample data we can reject the H0 at p<0.005. The average price for the apartment has increased.
fit1 <-lm(Price~Age,
data=mydata)
coef(fit1)
## (Intercept) Age
## 2185.454892 -8.975058
cor(mydata$Price, mydata$Age)
## [1] -0.230255
summary(fit1)$r.squared
## [1] 0.05301737
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Regression coefficient: The intercept indicates that when Age is zero and all other factors remain unchanged, the average Price is about €2185.46. The coefficient for Age shows that for every additional year of Age, the Price decreases on average by €8.98, assuming everything else stays constant. Since the p-value is below 0.05, we can say with 95% confidence that these coefficients are statistically significant and not equal to zero, which means we reject the null hypothesis. Correlation coefficient: This value measures both the strength and the direction of the linear relationship between two variables. In this case, Age and Price have a weak negative relationship, as indicated by the correlation coefficient of –0.23. Coefficient of determination: The R-squared statistic shows that only 5.3% of the variation in Price can be explained by the Age variable.
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[ , c("Price","Age","Distance")],
smooth = FALSE)
No potential problem with multicolinearity
fit2 <-lm(Price~Age+Distance,
data=mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
VIF statistics are below 5, while the average is equal to 1, which means that in the model fit2 there is no problem with multicolinearity.
mydata$StdResid <- round(rstandard(fit2), 3)
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95303, p-value = 0.003645
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
we will remove all observations that appear on the graph after this
break, as they can be considered being influential points and contribute
to the result
mydata$ID <- seq_len(nrow(mydata))
head(mydata[order(-mydata$CooksD),],)
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony StdResid CooksD ID
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 5 45 2180 Yes Yes 2.58 0.32 38
## 2 43 37 1740 No No 1.44 0.104 55
## 3 2 11 2790 Yes No 2.05 0.069 33
## 4 7 2 1760 No Yes -2.15 0.066 53
## 5 37 3 2540 Yes Yes 1.58 0.061 22
## 6 40 2 2400 No Yes 1.09 0.038 39
head(mydata[order(mydata$StdResid),],)
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony StdResid CooksD ID
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 7 2 1760 No Yes -2.15 0.066 53
## 2 12 14 1650 No Yes -1.50 0.013 13
## 3 12 14 1650 No No -1.50 0.013 72
## 4 13 8 1800 No No -1.38 0.012 20
## 5 14 16 1660 No Yes -1.26 0.008 35
## 6 24 5 1830 Yes No -1.19 0.012 36
By looking at the standardized residuals we can remove all the variables that have value close to -3 or 3 (these values often used to detect outliers)
Removing outliers
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
drop_ids <-c(38,55,33,53,22)
mydata <- mydata %>% filter (!ID %in% drop_ids)
print(mydata)
## # A tibble: 80 × 8
## Age Distance Price Parking Balcony StdResid CooksD ID
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 7 28 1640 No Yes -0.665 0.007 1
## 2 18 1 2800 Yes No 1.78 0.03 2
## 3 7 28 1660 No No -0.594 0.006 3
## 4 28 29 1850 No Yes 0.754 0.008 4
## 5 18 18 1640 Yes Yes -1.07 0.005 5
## 6 28 12 1770 No Yes -0.778 0.005 6
## 7 14 20 1850 No Yes -0.302 0.001 7
## 8 18 6 1970 Yes Yes -0.787 0.004 8
## 9 22 7 2270 Yes No 0.455 0.001 9
## 10 25 2 2570 Yes No 1.24 0.017 10
## # ℹ 70 more rows
fit2 <- lm(Price ~ Age + Distance, data = mydata)
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The points in the scatter plot of standardized residuals versus
standardized fitted values should appear randomly scattered within a
horizontal band, showing a consistent spread across all fitted values.
Here variability doesn’t change so we can say that heteroskedastiity is
not present.
standardized_residuals <- rstandard(fit2)
shapiro.test(standardized_residuals)
##
## Shapiro-Wilk normality test
##
## data: standardized_residuals
## W = 0.94156, p-value = 0.001168
hist(rstandard(fit2))
From the graph we can see that the distribution is not normally
distributed and its slightly skewed to the right. Below the distribution
is formally tested with Shapiro-Wilk test.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.94154, p-value = 0.001166
Ho- distribution is normal N1- distribution is not normal According to results from Shapiro-Wilk test, p-value < 0,05 which means that we can reject the null hypothesis. Data is not normally distributed.
fit2 <- lm (Price~ Age +Distance,
data=mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
The regression model shows that both Age and Distance have significant negative effects on Price. A newly built apartment located in the city center is expected to cost approximately €2,502.47 (Intercept) For each additional year of age, price decreases by about $8.67, and for each additional unit of distance, price decrease by $24.06. R²: 53.6% of variability of apartments is explained by linear effect of Age, Distance.
fit3 <- lm(Price~ Age + Distance + Parking + Balcony,
data = mydata)
anova (fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
`
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
The regression coefficient for paring tells us that if an apartment has a parking space and everything else remains unchanged, then the price is 128.7 euros higher on average. Same for balcony, if the appartment has a balcony price decrease by 6.03 euros on average.
F hypothesis: Ho: the population coefficient of determination=0 H1: the population coefficient of determintaion >0
Fitted_ID2 <- fitted(fit3)[mydata$ID == 2]
Residual_ID2 <- resid(fit3)[mydata$ID == 2]
round(c(Fitted = Fitted_ID2, Residual = Residual_ID2), 3)
## Fitted.2 Residual.2
## 2356.597 443.403
A positive residual means the apartment’s actual price is higher than the model predicts. The actual price for the apartment is by 443.403 euros higher that the estimated value from the regression