Nam Anh Le
library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
head(mydata)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
A possible research question would be whether Parking and Price related to each other?
mydatanew <- mydata
mydatanew$Parking <- factor(mydata$Parking,
levels = c(0,1),
labels = c("No","Yes"))
mydatanew$Balcony <- factor(mydata$Balcony,
levels = c(0,1),
labels = c("No","Yes"))
head(mydatanew)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
\(H_0\): \(\mu_{\text{Price}}\) = 1900 euro
\(H_1\): \(\mu_{\text{Price}}\) \(\ne\) 1900 euro
Assumptions:
Normality test:
library(ggplot2)
ggplot(mydatanew, aes(x = Price)) +
geom_histogram(binwidth = 150, colour = "black") +
ylab("Frequency") +
xlab("Price in EUR")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
shapiro.test(mydatanew$Price)
##
## Shapiro-Wilk normality test
##
## data: mydatanew$Price
## W = 0.94017, p-value = 0.0006513
Based on the Shapiro Test, we reject \(H_0\) at p-value = 0.001
Non Parametric Test:
wilcox.test(mydatanew$Price,
mu = 1900,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydatanew$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
Based on the Wilcox test, we reject \(H_0\) at p-value = 0.029
Conclusion: The median Price per \(m^2\) is not 1900 euro
fit1 <- lm(Price ~ Age,
data = mydatanew)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydatanew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydatanew$Price,mydatanew$Age)
## [1] -0.230255
Regression Coefficient: b = -8.975
Explanation: If Age
increases by 1 year, the Price per \(m^2\) of the Apartment on average reduces
by 8.975 euros.
Coefficient of Correlation: -0.23
Explanation: The linear
relationship between Price and Age is negative and weak.
Coefficient of Determination: 0.053
Explanation: 5.3% of
the variability of Price is explained by the linear effect of Age.
library(car)
## Loading required package: carData
scatterplotMatrix(mydatanew[,-c(4,5)],smooth = FALSE)
There is no clear trend between Age and Distance, we can see
that the line is nearly flat
Thus there is no multicolinearity
issues
fit2 <- lm(Price ~ Age + Distance,
data = mydatanew)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydatanew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Based on VIF Statistic, There is weak multicolinearity between Age and Distance
mydatanew$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydatanew$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
head(mydatanew[order(mydatanew$StdResid),], 3) #Three units with lowest value of stand. residuals
## # A tibble: 3 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 No Yes -2.15 0.066
## 2 12 14 1650 No Yes -1.50 0.013
## 3 12 14 1650 No No -1.50 0.013
head(mydatanew[order(-mydatanew$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydatanew <- mydatanew %>%
filter(!CooksD == 0.320) %>%
filter(!StdResid == -2.152)
fit2 <- lm(Price ~ Age + Distance,
data = mydatanew)
mydatanew$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydatanew$StdResid, x = mydatanew$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatterplot, we see for different levels of the
explanatory variables, the variance of errors is relativelyconstant
hist(mydatanew$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydatanew$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydatanew$StdResid
## W = 0.93368, p-value = 0.0003444
Normality of standardized Residuals:
Based on the Shapiro Test, we reject \(H_0\) at p-value = 0.001
fit2 <- lm(Price ~ Age + Distance,
data = mydatanew)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydatanew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.51 -223.89 -62.78 202.78 575.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2489.617 73.524 33.861 < 2e-16 ***
## Age -7.350 3.103 -2.368 0.0203 *
## Distance -23.636 2.731 -8.654 4.21e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.1 on 80 degrees of freedom
## Multiple R-squared: 0.513, Adjusted R-squared: 0.5008
## F-statistic: 42.13 on 2 and 80 DF, p-value: 3.177e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.7162245
Regression Coefficient:
Explanation:
Multiple Correlation Coefficient: 0.716
Explanation: The
linear relationship between dependent and all explanatory variable is
strong.
Coefficient of Determination: 0.513
Explanation: 51.3% of
the variability of Price is explained by the linear effect of Age and
Distance.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydatanew)
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5795128
## 2 78 5410469 2 384659 2.7727 0.06866 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no significant difference between model fit3 and model
fit2 since p-value = 0.069
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydatanew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.58 -198.68 -44.44 229.33 529.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2367.282 93.781 25.243 < 2e-16 ***
## Age -6.605 3.054 -2.162 0.0337 *
## Distance -21.140 2.878 -7.345 1.73e-10 ***
## ParkingYes 147.508 62.799 2.349 0.0214 *
## BalconyYes -3.122 58.635 -0.053 0.9577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.4 on 78 degrees of freedom
## Multiple R-squared: 0.5453, Adjusted R-squared: 0.522
## F-statistic: 23.39 on 4 and 78 DF, p-value: 9.972e-13
\(H_0\): \(p^2\) = 0
\(H_1\): \(p^2\) > 0
mydatanew$Fitted <- round(fit3$fitted.values, 2)
mydatanew$Residual <- mydatanew$Price - mydatanew$Fitted
head(mydatanew)
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony StdResid CooksD StdFitted[,1] Fitted Residual
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 No Yes -0.665 0.007 -0.893 1726. -86.0
## 2 18 1 2800 Yes No 1.78 0.03 1.15 2375. 425.
## 3 7 28 1660 No No -0.594 0.006 -0.893 1729. -69.1
## 4 28 29 1850 No Yes 0.754 0.008 -1.55 1566. 284.
## 5 18 18 1640 Yes Yes -1.07 0.005 -0.323 2012. -372.
## 6 28 12 1770 No Yes -0.778 0.005 -0.0731 1926. -156.
The residual for Apartment ID2 is 425.23