library(readxl)
Apartments <- read_xlsx("Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
Description: the data set contains 85 apartments with five variables: Age, Distance, Price, Parking and Balcony.
Apartments$Parking <- factor(Apartments$Parking,
levels = c("0", "1"),
labels = c("no", "yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c("0", "1"),
labels = c("yes", "no"))
t.test(Apartments$Price,
mu = 1990,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 0.70618, df = 84, p-value = 0.482
## alternative hypothesis: true mean is not equal to 1990
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Because the p-value (0.482) is larger than alpha = 5%, we cannot reject the null hypothesis that the mean price equals 1990, so the null hypothesis is retained.
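As a quick programmatic check (a minimal sketch reusing the same call), the p-value and confidence interval can be pulled directly from the returned htest object:
tt <- t.test(Apartments$Price, mu = 1990, alternative = "two.sided")
tt$p.value   # 0.482, larger than 0.05, so H0 is not rejected
tt$conf.int  # the 95% CI shown above contains 1990, consistent with this conclusion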
library(car)
## Loading required package: carData
scatterplot(Distance ~ Price,
data = Apartments,
main = "Scatterplot: Distance vs Price",
xlab = "Price",
ylab = "Distance",
pch = 19,
col = "blue",
smooth=FALSE)
I don’t think there is a problem with multicollinearity; the variables do not show a strong linear relationship.
library(car)
scatterplot(Age ~ Price,
data = Apartments,
main = "Scatterplot: Age vs Price",
xlab = "Price",
ylab = "Age",
pch = 19,
col = "blue",
smooth=FALSE)
I don’t think there is a problem with multicollinearity; the variables do not show a strong linear relationship.
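To complement the scatterplots, the pairwise linear associations can also be checked numerically; a small sketch (output not reproduced here):
cor(Apartments$Age, Apartments$Distance)          # correlation between the two predictors (relevant for multicollinearity)
cor(Apartments[, c("Age", "Distance", "Price")])  # full correlation matrix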
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
A VIF close to 1 means that almost none of a predictor's variance is explained by the other predictors (the coefficient's variance is barely inflated), so there is no multicollinearity.
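As an illustrative sketch of where these numbers come from: with two predictors, the VIF of Age equals 1 / (1 - R^2) from regressing Age on Distance, so the value reported above can be reproduced by hand:
r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared
1 / (1 - r2_age)  # should match the VIF of Age reported above (about 1.0018)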
r_std <- rstandard(fit2)
cooks <- cooks.distance(fit2)
head(r_std)
## 1 2 3 4 5 6
## -0.6653487 1.7832876 -0.5937629 0.7543794 -1.0733987 -0.7775190
which(abs(r_std) > 2)
## 33 38 53
## 33 38 53
Observations with a standardized residual larger than 2 in absolute value are treated as outliers; here these are observations 33, 38 and 53.
which(abs(r_std) > 3)
## named integer(0)
Observations with a standardized residual larger than 3 in absolute value would be strong outliers; there are none in this data set.
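The Cook's distances computed above can be screened in a similar way; a sketch using two common rule-of-thumb cutoffs (4/n and 1), output not reproduced here:
n <- nrow(Apartments)
which(cooks > 4 / n)  # observations above the 4/n rule of thumb (potentially influential)
which(cooks > 1)      # observations above 1 would be clearly influential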
r_std <- rstandard(fit2)
std_fit <- scale(fitted(fit2))
plot(std_fit, r_std,
main = "Standardized Residuals vs Standardized Fitted Values",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "blue")
abline(h = 0, lty = 2, col = "red")
This looks like homoskedasticity; heteroskedasticity is not present. Homoskedasticity means that the residuals have roughly constant variance across the fitted values, and here they are fairly evenly scattered around the horizontal axis.
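A formal complement to the visual check could be the score test for non-constant error variance from the car package (output not shown); a large p-value would support homoskedasticity:
ncvTest(fit2)  # Breusch-Pagan-type score test for non-constant error variance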
hist(r_std,
breaks = 20,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
col = "orange")
The standardized residuals do not appear to be normally distributed.
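To back up this visual impression, a Q-Q plot and a Shapiro-Wilk test could be used as a sketch (results not reproduced here):
qqnorm(r_std, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(r_std, col = "red")
shapiro.test(r_std)  # a small p-value would indicate a departure from normality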
str(Apartments)
## 'data.frame': 85 obs. of 5 variables:
## $ Age : num 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "yes","no": 2 1 1 2 2 2 2 2 1 1 ...
Apartments$Parking <- as.factor(Apartments$Parking) # already a factor, kept as a safeguard
Apartments$Balcony <- as.factor(Apartments$Balcony) # already a factor, kept as a safeguard
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parkingyes 196.168 62.868 3.120 0.00251 **
## Balconyno 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis: Parking and Balcony have no effect, i.e. their coefficients are zero and both models explain Price equally well. Alternative hypothesis: adding Parking and Balcony improves the fit in explaining Price.
The p-value (0.01007) is smaller than 0.05, so we reject the null hypothesis in favour of the alternative: fit3 is statistically better than fit2.
Including the variables Parking and Balcony significantly improves the fit of the regression model for predicting Price. Model 2 (fit3) provides a statistically better explanation of the variability in Price than Model 1 (fit2).
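As a sanity check, the partial F statistic can be reproduced from the residual sums of squares in the ANOVA table above:
rss_fit2 <- 6720983  # RSS of Model 1 (Price ~ Age + Distance)
rss_fit3 <- 5991088  # RSS of Model 2 (with Parking and Balcony)
((rss_fit2 - rss_fit3) / 2) / (rss_fit3 / 80)  # about 4.87, matching the F statistic above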
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parkingyes 196.168 62.868 3.120 0.00251 **
## Balconyno 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Explanation: using the ceteris paribus principle, for an apartment with Age = 0, Distance = 0 and the reference categories (no parking, with balcony), the expected price would be approximately 2301.7.
Holding everything else constant, each additional year of age decreases the price by approximately 6.8 units.
Each additional unit of distance decreases the price by about 18 units, all other variables held constant.
Apartments with parking are about 196 units more expensive, all else being equal.
The effect of Balcony is not statistically significant, with a p-value of 0.97436.
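To illustrate how these coefficients combine, here is a sketch of a prediction for a hypothetical apartment (the values Age = 10 and Distance = 5 are made up for illustration):
new_apartment <- data.frame(Age = 10, Distance = 5, Parking = "yes", Balcony = "yes")  # hypothetical apartment
predict(fit3, newdata = new_apartment)
# manual check with the coefficients above (Balcony = "yes" is the reference level):
2301.667 - 6.799 * 10 - 18.045 * 5 + 196.168  # about 2340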
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
fitted_vals <- fitted(fit3)
residual_vals <- resid(fit3)
fitted_ID2 <- fitted_vals[2]
residual_ID2 <- residual_vals[2]
fitted_ID2
## 2
## 2357.411
residual_ID2
## 2
## 442.5889
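A quick consistency check: the fitted value and residual of apartment 2 should add up to its observed price (2800, see the data above):
fitted_ID2 + residual_ID2  # 2357.411 + 442.589 = 2800
Apartments$Price[2]        # observed price of apartment 2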