library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1
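Description:
Age: age of the apartment in years
Distance: distance of the apartment from the city center in km
Price: price per m2 in EUR
Parking: 0 = No, 1 = Yes
Balcony: 0 = No, 1 = Yes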
mydata$Parking <- factor(mydata$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata$Balcony <- factor(mydata$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No
## 4  28       29  1850      No     Yes
## 5  18       18  1640     Yes     Yes
## 6  28       12  1770      No     Yes
summary(mydata)
##       Age           Distance         Price      Parking  Balcony
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400   No :42   No :48
##  1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:1710   Yes:43   Yes:37
##  Median :18.00   Median :12.00   Median :1950
##  Mean   :18.55   Mean   :14.22   Mean   :2019
##  3rd Qu.:24.00   3rd Qu.:20.00   3rd Qu.:2290
##  Max.   :45.00   Max.   :45.00   Max.   :2820
t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0: μ_Price = 1900 EUR
H1: μ_Price ≠ 1900 EUR
We reject H0 at p-value < 0.005.
There is statistically significant evidence that the average price of the apartments differs from 1900 EUR. The sample mean (2018.941 EUR) is higher, and the 95% confidence interval (1937.443 to 2100.440 EUR) does not contain 1900.
Source(t.test): https://www.denis-statistika.si/_files/ugd/5f6427_497986749e164071812524365d91dd7c.pdf, p.7.
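The t statistic reported by t.test can be checked by hand as (sample mean - hypothesized mean) / (sd / sqrt(n)). With the values above, (2018.941 - 1900) / 2.9022 implies a standard error of about 40.98 EUR. The sketch below repeats the calculation on simulated prices; the data, the seed and the variable names are assumptions for illustration, not the Apartments data.

```r
# Simulated prices (an assumption for illustration; not the Apartments data)
set.seed(1)
price <- rnorm(85, mean = 2000, sd = 380)
mu0 <- 1900

# One-sample t statistic and two-sided p-value computed by hand
n <- length(price)
se <- sd(price) / sqrt(n)
t_manual <- (mean(price) - mu0) / se
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# The same test with the built-in function
tt <- t.test(price, mu = mu0, alternative = "two.sided")
c(manual = t_manual, t.test = unname(tt$statistic))
c(manual = p_manual, t.test = tt$p.value)
```

Both lines should print matching pairs, which confirms how the statistic and the p-value in the output above are constructed.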
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor_coef <- sqrt(summary(fit1)$r.squared)
cor_coef
## [1] 0.230255
Price per m2 = 2185.455 - 8.975 × Age
Intercept (2185.455): This is the estimated price per m2 when the age of the apartment is zero, i.e. a baseline of 2185.455 EUR for new apartments (age = 0 years).
Age (-8.975): This coefficient is the average change in price per m2 for each additional year of age. On average, the price decreases by about 8.975 EUR for each year the apartment ages.
H0: β1 = 0 (Age has no effect on price per m2)
H1: β1 ≠ 0 (Age has an effect on price per m2)
We reject H0 at p-value < 0.04 (p = 0.034); the effect is statistically significant.
Multiple R-squared (0.05302): About 5.3% of the variability in apartment prices is explained by the linear effect of age. This is a low value, so the model leaves most of the variation in prices unexplained.
Correlation Coefficient (0.230255): This is the square root of R-squared, which is always non-negative, so the sign must be taken from the slope. Because the slope is negative, the Pearson correlation between Age and Price is about -0.23: a weak negative relationship.
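Because sqrt() drops the sign, one way to recover the signed correlation in a simple regression is to multiply the square root of R-squared by the sign of the slope, or to call cor() directly. The sketch below uses simulated data (the variables age and price and the seed are assumptions, not the Apartments data) to show that both routes agree.

```r
# Simulated data (an assumption for illustration; not the Apartments data)
set.seed(2)
age <- runif(85, min = 1, max = 45)
price <- 2200 - 9 * age + rnorm(85, sd = 370)

fit <- lm(price ~ age)

# Square root of R-squared is always non-negative
r_unsigned <- sqrt(summary(fit)$r.squared)

# Attach the sign of the slope to get the Pearson correlation
r_signed <- sign(coef(fit)[["age"]]) * r_unsigned

# Direct calculation for comparison
r_pearson <- cor(age, price)

c(unsigned = r_unsigned, signed = r_signed, pearson = r_pearson)
```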
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[ ,c(-4, -5)],
smooth = FALSE)
Age and Distance: The scatterplot between “Age” and “Distance” does not show a strong linear relationship.
Age and Price: The plot between “Age” and “Price” indicates a very mild downward trend. However, the relationship does not appear very strong.
Distance and Price: The relationship between “Distance” and “Price” shows a more pronounced negative trend, indicating that prices tend to decrease as the distance from the city center increases. Multicollinearity concerns the relationship between the explanatory variables (Age and Distance), which does not look strong here; it is checked formally with VIF.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
A VIF close to 1 indicates that there is practically no multicollinearity between “Age” and “Distance”. As a rule of thumb, VIF should be below 5.
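The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors the two VIFs are identical, which is why both values above are 1.001845; this corresponds to an R² between Age and Distance of roughly 0.0018. A minimal base-R sketch on simulated data (variable names and seed are assumptions, not the Apartments data):

```r
# Simulated, nearly unrelated predictors (an assumption for illustration)
set.seed(3)
n <- 85
age <- runif(n, min = 1, max = 45)
distance <- runif(n, min = 1, max = 45)

# VIF for each predictor: regress it on the other one
r2_age <- summary(lm(age ~ distance))$r.squared
r2_distance <- summary(lm(distance ~ age))$r.squared
vif_age <- 1 / (1 - r2_age)
vif_distance <- 1 / (1 - r2_distance)
c(Age = vif_age, Distance = vif_distance)

# R-squared between Age and Distance implied by the reported VIF of 1.001845
implied_r2 <- 1 - 1 / 1.001845
implied_r2
```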
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Standardized residuals greater than 3 or less than -3 are considered potential outliers. There are no such values.
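Besides reading the histogram, the ±3 rule can be checked numerically. The following sketch, on simulated data with hypothetical variable names, lists the rows whose standardized residuals fall outside the interval from -3 to 3.

```r
# Simulated regression data (an assumption for illustration)
set.seed(4)
x <- runif(85, min = 1, max = 45)
y <- 2400 - 20 * x + rnorm(85, sd = 280)
fit <- lm(y ~ x)

# Standardized residuals and the rows that break the +/-3 rule
std_resid <- rstandard(fit)
outlier_rows <- which(abs(std_resid) > 3)

range(std_resid)      # smallest and largest standardized residual
length(outlier_rows)  # number of potential outliers
```

An empty result from which() corresponds to the conclusion "there are no such values".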
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The histogram shows a gap between 0.15 and 0.30, so the units with a Cook's distance between 0.30 and 0.35 stand apart from the rest and should be removed.
head(mydata[order(-mydata$CooksD),], 5)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
We should remove apartment ID 38 (price 2180 EUR per m2), the only unit with a Cook's distance above 0.30.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(Price != 2180) # drops every apartment priced at 2180; here that is only ID 38
fit2 <- lm(Price ~ Age + Distance,
           data = mydata) # refit the model on the remaining 84 observations
mydata$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The residuals appear homoscedastic: their spread looks roughly constant across the fitted values.
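One point to watch: the StdResid and CooksD columns were stored before the influential unit was removed, so after refitting they still describe the 85-observation model. If the diagnostics are meant to describe the refitted model, they may need to be recomputed. A minimal sketch on simulated data (the data frame apartments, the seed and the helper names are assumptions) that removes the most influential row by position and recomputes both columns from the new fit:

```r
# Simulated stand-in for the apartments data (an assumption for illustration)
set.seed(5)
n <- 85
apartments <- data.frame(Age = round(runif(n, 1, 45)),
                         Distance = round(runif(n, 1, 45)))
apartments$Price <- 2450 - 7 * apartments$Age -
  21 * apartments$Distance + rnorm(n, sd = 280)

# Fit on all rows and locate the row with the largest Cook's distance
fit_full <- lm(Price ~ Age + Distance, data = apartments)
worst <- which.max(cooks.distance(fit_full))

# Remove that row by position (not by price) and refit
reduced <- apartments[-worst, ]
fit_reduced <- lm(Price ~ Age + Distance, data = reduced)

# Recompute the diagnostics from the refitted model
reduced$StdResid <- rstandard(fit_reduced)
reduced$StdFitted <- as.numeric(scale(fitted(fit_reduced)))
nrow(reduced)
```

Removing by row position avoids accidentally dropping other apartments that happen to share the same price.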
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
The p-value (Prob > Chi2) is 0.087. At the 5% level we do not reject H0, so there is no significant evidence of heteroskedasticity.
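The idea behind the test can be sketched in base R: regress the squared residuals on the fitted values and compare n × R² of that auxiliary regression with a chi-square distribution with 1 degree of freedom. This is the studentized (Koenker) form, shown here as an assumption-laden illustration on simulated data; it need not match the statistic printed by ols_test_breusch_pagan exactly. The last lines only re-derive the p-value from the reported Chi2 value.

```r
# Simulated regression data (an assumption for illustration)
set.seed(6)
n <- 84
x <- runif(n, min = 1, max = 45)
y <- 2400 - 20 * x + rnorm(n, sd = 280)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on fitted values
e2 <- residuals(fit)^2
yhat <- fitted(fit)
aux <- lm(e2 ~ yhat)

# Studentized (Koenker) statistic and its chi-square p-value
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p.value = p_value)

# p-value implied by the Chi2 value reported in the output above
p_reported <- pchisq(2.927455, df = 1, lower.tail = FALSE)
p_reported
```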
hist(mydata$StdResid,
     main = "Distribution of standardized residuals",
     xlab = "Standardized residuals",
     ylab = "Frequency",
     breaks = seq(from = -5, to = 5, by = 1))
Visually, the standardized residuals look approximately normally distributed.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.94879, p-value = 0.002187
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
Formally, we reject H0 at p-value < 0.003. However, with 84 observation units and a histogram that looks approximately normal, I proceed under the assumption that the standardized residuals are normally distributed.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Price per m2 = 2456.076 - 6.464 × Age - 22.955 × Distance
Intercept (2456.076): This is the estimated price when the age of the apartment is zero and the distance to the city center is also 0. It provides a baseline price of 2456.076 EUR for new apartments (age = 0 years, distance = 0 km).
Age (-6.464): The coefficient for Age is -6.464, meaning that for each additional year, the price of the apartment decreases by about 6.464 EUR, holding Distance constant. The p-value for Age is 0.044, suggesting that the effect of Age on Price is statistically significant.
H0: β1 = 0 (Age has no effect on price per m2)
H1: β1 ≠ 0 (Age has an effect on price per m2)
We reject H0 at p-value < 0.05; the effect is statistically significant.
Distance (-22.955): The coefficient for Distance is -22.955, implying that for every additional kilometer from the center, the price decreases by 22.955 EUR, holding Age constant. This coefficient is also highly significant (p-value = 2.52e-12), showing a strong inverse relationship between Distance and Price.
H0: β2 = 0 (Distance has no effect on price per m2)
H1: β2 ≠ 0 (Distance has an effect on price per m2)
We reject H0 at p-value < 0.001; the effect is statistically significant.
Multiple R-squared (0.4838): About 48.38% of the variability in apartment prices is explained by the linear effect of the age of the apartment and the distance to the city center. This is a moderately high value: the model explains almost half of the variation in prices.
Multiple correlation coefficient (sqrt(0.4838) ≈ 0.696): This is the square root of R-squared, i.e. the correlation between the observed and the fitted prices. It is non-negative by construction, so it carries no sign; the direction of each effect is read from the negative coefficients of Age and Distance.
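In a model with several explanatory variables, the square root of R-squared equals the correlation between the observed and the fitted values of the dependent variable (the multiple correlation coefficient). The sketch below checks this on simulated data (variable names and seed are assumptions, not the Apartments data) and evaluates the square root of the reported R-squared of 0.4838.

```r
# Simulated data with two predictors (an assumption for illustration)
set.seed(7)
n <- 84
age <- runif(n, min = 1, max = 45)
distance <- runif(n, min = 1, max = 45)
price <- 2450 - 6.5 * age - 23 * distance + rnorm(n, sd = 276)

fit <- lm(price ~ age + distance)

# Two routes to the multiple correlation coefficient
multiple_r <- sqrt(summary(fit)$r.squared)
cor_obs_fitted <- cor(price, fitted(fit))
c(sqrt_r2 = multiple_r, cor_obs_fitted = cor_obs_fitted)

# Square root of the R-squared reported for the model above
reported_multiple_r <- sqrt(0.4838)
reported_multiple_r
```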
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Δρ² = 0 (both models are equally good)
H1: Δρ² > 0
We reject H0 at p-value < 0.04.
The p-value (0.03051) is significant at the 5% level, suggesting that adding Parking and Balcony to the model significantly improves the fit compared to just using Age and Distance.
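The F statistic in the table can be reproduced from the two residual sums of squares: F = ((RSS1 - RSS2) / number of added terms) / (RSS2 / residual df of the larger model). The sketch below plugs in the values from the anova output; the variable names are illustrative.

```r
# Values copied from the anova table above
rss_reduced <- 6176767   # Model 1: Price ~ Age + Distance
rss_full <- 5654480      # Model 2: adds Parking and Balcony
df_added <- 2            # number of added coefficients
df_resid_full <- 79      # residual degrees of freedom of Model 2

# Partial F statistic and its p-value
f_stat <- ((rss_reduced - rss_full) / df_added) / (rss_full / df_resid_full)
p_value <- pf(f_stat, df1 = df_added, df2 = df_resid_full, lower.tail = FALSE)

round(f_stat, 4)
round(p_value, 5)
```

The results should agree with the F and Pr(>F) columns of the table up to rounding.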
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
ParkingYes (167.531): This coefficient indicates that having parking is associated with an average price increase of about 167.531 EUR, holding all other variables constant. The effect is statistically significant (p-value = 0.00933).
BalconyYes (-15.207): This coefficient suggests that having a balcony is associated with a decrease in the apartment price of about 15.207 EUR, with all other factors held constant. However, this effect is not statistically significant (p-value = 0.79795).
H0: None of the predictors (Age, Distance, ParkingYes, BalconyYes) have an effect on the dependent variable Price.
H1: At least one of the predictors has a non-zero effect on the dependent variable Price.
We reject H0 at p-value < 0.001 (F = 22.04 on 4 and 79 degrees of freedom).
fitted_values <- fitted(fit3)       # fitted prices for all apartments
model_residuals <- residuals(fit3)  # renamed so the residuals() function is not masked
fitted_value_ID2 <- fitted_values[2]
residual_ID2 <- model_residuals[2]
print(paste("Fitted Value for Apartment ID2:", fitted_value_ID2))
## [1] "Fitted Value for Apartment ID2: 2372.19707577646"
print(paste("Residual for Apartment ID2:", residual_ID2))
## [1] "Residual for Apartment ID2: 427.802924223543"
Source (print(paste)): https://www.geeksforgeeks.org/printing-output-of-an-r-program/
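The fitted value and residual for apartment ID2 can also be verified by hand from the rounded coefficients of fit3 and the values in the second row of the data (Age = 18, Distance = 1, Parking = Yes, Balcony = No, Price = 2800). The names in the sketch are illustrative; small differences from the output above come from rounding the coefficients.

```r
# Rounded coefficients from summary(fit3)
coefs <- c(intercept = 2329.724, age = -5.821, distance = -20.279,
           parking_yes = 167.531, balcony_yes = -15.207)

# Apartment ID2: Age 18, Distance 1, Parking Yes (1), Balcony No (0)
apartment_2 <- c(intercept = 1, age = 18, distance = 1,
                 parking_yes = 1, balcony_yes = 0)

# Fitted value = sum of coefficient times value; residual = actual - fitted
fitted_by_hand <- sum(coefs * apartment_2)
residual_by_hand <- 2800 - fitted_by_hand

c(fitted = fitted_by_hand, residual = residual_by_hand)
```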