library(readxl)
dt <- read_excel("/Users/vincentbecker/Library/Mobile Documents/com~apple~CloudDocs/WU/Courses/Applied data analysis in business with r/Data/Homework3/Apartments.xlsx")
Description:
dt$Parking <- factor(dt$Parking,
levels = c(0,1),
labels = c("No", "Yes"))
dt$Balcony <- factor(dt$Balcony,
levels = c(0,1),
labels = c("No", "Yes"))
library(ggplot2)
ggplot(dt, aes(x=Price))+
geom_histogram(binwidth = 200)
shapiro.test(dt$Price)
##
## Shapiro-Wilk normality test
##
## data: dt$Price
## W = 0.94017, p-value = 0.0006513
The hypothesis \(H_0:\mu_{price}=1900\) concerns the population arithmetic mean. To test it, a t-test (or a non-parametric alternative) would be used. The t-test assumes that the variable is normally distributed, which in this case means that price should be normally distributed. I checked this graphically with a histogram and formally with a Shapiro-Wilk test of normality. Both checks show a clear deviation from normality.
With 85 observations the sample is rather large, so relaxing the normality assumption could be considered. However, to obtain a more robust result I decided to apply a Wilcoxon signed rank test.
wilcox.test(dt$Price,
mu = 1900,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: dt$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
We can reject \(H_0\) with \(p = 0.028\).
We can also run the parametric alternative, the t-test, to check whether we obtain the same result.
t.test(dt$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: dt$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can see that \(H_0\) is rejected at \(p=0.005\).
Generally, we can conclude that the true mean of price differs from 1900.
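The comparison of the two tests can also be tried out on simulated data. The following is only a minimal sketch: `price_sim` is a hypothetical right-skewed sample that stands in for `dt$Price`, it is not the Apartments data.

```r
# Hypothetical right-skewed sample standing in for dt$Price (not the real data)
set.seed(1)
price_sim <- 1500 + rgamma(85, shape = 2, scale = 250)

# Check normality first; a small p-value speaks against normality
sw <- shapiro.test(price_sim)

# Non-parametric test of location and parametric test of the mean
wt <- wilcox.test(price_sim, mu = 1900, correct = FALSE)
tt <- t.test(price_sim, mu = 1900, alternative = "two.sided")

# Collect the three p-values side by side
round(c(shapiro = sw$p.value, wilcoxon = wt$p.value, t_test = tt$p.value), 4)
```

If both tests lead to the same decision, the conclusion does not hinge on the normality assumption.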
The simple OLS regression model is estimated with the following code:
fit1 <- lm(Price ~ Age, dt)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The F-test in the output above shows that the regression explains a significant proportion of variance \((R^2=0.05, F(1, 83)= 4.65, p = 0.034)\). We see a negative regression coefficient estimate for age \((b = -8.98, t(83) = -2.16, p = 0.034)\). This coefficient suggests that a one-unit increase in age (one year) is associated with a decrease in price per square meter of 8.98, on average. We furthermore see a significant intercept \((b=2185.46, t(83)=25.11, p<0.001)\), which suggests that a new apartment (zero years of age) has a price of 2185.46 per \(m^2\), on average.
The coefficient of determination, \(R^2\), corresponds to the goodness of fit of the model. An \(R^2\) of 0.053 means that 5.3% of the variance in the dependent variable is explained by the model. Furthermore, in a simple regression the coefficient of correlation, \(R\), calculated by \(R=\sqrt{R^2}\), corresponds in absolute value to the Pearson correlation coefficient. In this case the coefficient of correlation is -0.231. The 0.231 is the result of \(\sqrt{R^2}\), and the negative sign comes from the fact that the single predictor variable, age, has a negative coefficient. Checking the effect size of this Pearson correlation, we can see that 0.231 lies between 0.1 and 0.3. Therefore, we can say that the correlation is negative and weak.
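The relation between \(R^2\), the sign of the slope and the Pearson correlation can be checked with a small sketch. The data below are simulated and the names `age_sim` and `price_sim` are hypothetical; only the identity itself carries over to the model above.

```r
# Simulated data for a simple regression (hypothetical, not the Apartments data)
set.seed(42)
age_sim <- runif(85, min = 1, max = 45)
price_sim <- 2200 - 9 * age_sim + rnorm(85, sd = 370)

fit_sim <- lm(price_sim ~ age_sim)

# R from R^2: take the square root and attach the sign of the slope
r2 <- summary(fit_sim)$r.squared
r_from_r2 <- unname(sign(coef(fit_sim)["age_sim"]) * sqrt(r2))

# Pearson correlation computed directly
r_direct <- cor(age_sim, price_sim)

all.equal(r_from_r2, r_direct)
```

This identity only holds with a single predictor; with several predictors \(\sqrt{R^2}\) is the multiple correlation and has no sign.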
The scatterplot matrix can be called with the following command of the package car:
library(car)
## Loading required package: carData
scatterplotMatrix(dt[,1:3], smooth = FALSE)
From the plots of Age against Distance we can see that multicollinearity should not be an issue, since the slope of the regression line is very flat. The slopes in the plots of Age against Price and of Distance against Price are not flat, but that is no problem (it is actually desirable), since Price is the dependent variable. Therefore, the graphical analysis gives no indication that multicollinearity is a problem.
fit2 <- lm(Price ~ Age + Distance, dt)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
We can see that none of the VIFs of the independent variables exceeds 5. Furthermore, the average VIF is close to 1. Hence, there is no reason to assume multicollinearity.
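To make clear what the VIF measures, it can be computed by hand as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. The sketch below uses simulated predictors; the data frame `sim` is hypothetical.

```r
# Simulated predictors (hypothetical, not the Apartments data)
set.seed(7)
sim <- data.frame(Age = runif(85, 1, 45), Distance = runif(85, 1, 45))

# R_j^2 from regressing each predictor on the other one
r2_age <- summary(lm(Age ~ Distance, data = sim))$r.squared
r2_distance <- summary(lm(Distance ~ Age, data = sim))$r.squared

# VIF_j = 1 / (1 - R_j^2)
vif_manual <- c(Age = 1 / (1 - r2_age), Distance = 1 / (1 - r2_distance))
vif_manual
```

With exactly two predictors both auxiliary regressions have the same \(R^2\) (the squared correlation between the two predictors), which is why the two VIFs in the output above are identical.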
dt$StdResid <-round(rstandard(fit2), 3)
head(dt[order(dt$StdResid),],3)
## # A tibble: 3 × 6
## Age Distance Price Parking Balcony StdResid
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 2 1760 No Yes -2.15
## 2 12 14 1650 No Yes -1.50
## 3 12 14 1650 No No -1.50
head(dt[order(-dt$StdResid),],3)
## # A tibble: 3 × 6
## Age Distance Price Parking Balcony StdResid
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 5 45 2180 Yes Yes 2.58
## 2 2 11 2790 Yes No 2.05
## 3 18 1 2800 Yes No 1.78
With the above code the standardised residuals are calculated and the three lowest and three highest residuals are displayed. None of the standardised residuals exceeds 3 in absolute value, meaning that no point lies more than three standard deviations away from its fitted value. Therefore, we do not need to exclude any outliers.
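Instead of inspecting the sorted residuals, the same check can be written as a direct filter. This is a minimal sketch on simulated data; `sim` and `fit_sim` are hypothetical stand-ins for `dt` and `fit2`.

```r
# Simulated data with the same structure as the model above (hypothetical)
set.seed(3)
sim <- data.frame(Age = runif(85, 1, 45), Distance = runif(85, 1, 45))
sim$Price <- 2460 - 8 * sim$Age - 21 * sim$Distance + rnorm(85, sd = 290)

fit_sim <- lm(Price ~ Age + Distance, data = sim)

# Standardised residuals and the rows beyond +/- 3
sim$StdResid <- rstandard(fit_sim)
outliers <- sim[abs(sim$StdResid) > 3, ]

# Number of flagged observations
nrow(outliers)
```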
dt$CooksD <- round(cooks.distance(fit2), 3)
ggplot(dt, aes(x=CooksD))+
geom_histogram(bins = 60)+
xlim(NA, 0.35)
head(dt[order(-dt$CooksD),],12)
## # A tibble: 12 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
## 7 8 2 2820 Yes No 1.66 0.037
## 8 8 26 2300 Yes Yes 1.57 0.034
## 9 10 1 2810 No No 1.60 0.032
## 10 18 1 2800 Yes No 1.78 0.03
## 11 45 21 1910 No Yes 0.889 0.03
## 12 18 1 2800 Yes Yes 1.78 0.03
Cook’s distance is used to identify highly influential points. We want the distribution of Cook’s distances to be continuous, that is, without isolated values separated from the rest. To check this, we draw a histogram of the distances. Here we can see three gaps in the distribution. To get deeper insight into those gaps, we can display the data points in descending order of Cook’s distance.
Judging from the plot and the displayed data points, I decided to exclude all values with a Cook’s distance above 0.06. The resulting histogram can be seen below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dt <- dt %>%
filter(CooksD < 0.06)
ggplot(dt, aes(x=CooksD))+
geom_histogram(bins = 60)+
xlim(NA, 0.35)
head(dt[order(-dt$CooksD),],10)
## # A tibble: 10 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 40 2 2400 No Yes 1.09 0.038
## 2 8 2 2820 Yes No 1.66 0.037
## 3 8 26 2300 Yes Yes 1.57 0.034
## 4 10 1 2810 No No 1.60 0.032
## 5 18 1 2800 Yes No 1.78 0.03
## 6 45 21 1910 No Yes 0.889 0.03
## 7 18 1 2800 Yes Yes 1.78 0.03
## 8 16 1 2750 Yes No 1.55 0.023
## 9 38 13 1610 No Yes -1.01 0.022
## 10 38 13 1610 No No -1.01 0.022
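The exclusion step can be sketched on simulated data as follows. The names `sim`, `fit_full` and `fit_reduced` are hypothetical; only the cut-off of 0.06 is taken from the text. In this sketch the standardised residuals are recomputed after refitting, so that residuals and fitted values come from the same model.

```r
# Simulated data with the same structure as the model above (hypothetical)
set.seed(11)
sim <- data.frame(Age = runif(85, 1, 45), Distance = runif(85, 1, 45))
sim$Price <- 2460 - 8 * sim$Age - 21 * sim$Distance + rnorm(85, sd = 290)

# Cook's distances from the model fitted on all observations
fit_full <- lm(Price ~ Age + Distance, data = sim)
sim$CooksD <- cooks.distance(fit_full)

# Keep only observations below the cut-off used in the text
cutoff <- 0.06
sim_reduced <- sim[sim$CooksD < cutoff, ]

# Refit and recompute the diagnostics from the refitted model
fit_reduced <- lm(Price ~ Age + Distance, data = sim_reduced)
sim_reduced$StdResid <- rstandard(fit_reduced)
sim_reduced$StdFitted <- as.numeric(scale(fitted(fit_reduced)))

# Number of observations before and after the exclusion
c(before = nrow(sim), after = nrow(sim_reduced))
```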
fit2 <- lm(Price ~ Age + Distance, dt)
dt$StdFitted <- scale(fit2$fitted.values)
scatterplot(y = dt$StdResid, x = dt$StdFitted,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted Values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
With this plot we can check for homoskedasticity. There seems to be a slight difference in the variance of the standardised residuals between negative and positive standardised fitted values. This means that there could be heteroskedasticity. However, since a visual check is quite error prone, we should employ a formal test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
The test employed here is the Breusch-Pagan test, which has constant variance as \(H_0\). The test fails to reject \(H_0\) at \(p = 0.187\). Therefore, homoskedasticity can be assumed.
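The idea behind the test can be sketched by hand: the squared residuals are regressed on the fitted values, and a large \(R^2\) in this auxiliary regression speaks against constant variance. The sketch below uses simulated data and the studentised (Koenker) variant, \(n \cdot R^2_{aux}\); this is an assumption of the sketch, and the statistic may differ from the one `ols_test_breusch_pagan()` reports.

```r
# Simulated homoskedastic data (hypothetical, not the Apartments data)
set.seed(5)
sim <- data.frame(Age = runif(85, 1, 45), Distance = runif(85, 1, 45))
sim$Price <- 2460 - 8 * sim$Age - 21 * sim$Distance + rnorm(85, sd = 290)

fit_sim <- lm(Price ~ Age + Distance, data = sim)

# Auxiliary regression: squared residuals on fitted values
aux_data <- data.frame(u2 = resid(fit_sim)^2, yhat = fitted(fit_sim))
aux <- lm(u2 ~ yhat, data = aux_data)

# Studentised LM statistic, chi-squared with 1 degree of freedom under H0
lm_stat <- nrow(aux_data) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

round(c(Chi2 = lm_stat, p = p_value), 4)
```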
A histogram of standardised residuals can be created the following way.
hist(dt$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals")
shapiro.test(dt$StdResid)
##
## Shapiro-Wilk normality test
##
## data: dt$StdResid
## W = 0.93418, p-value = 0.0004761
The histogram appears right skewed and hence not normally distributed. Furthermore, there seems to be a second peak in the tail of the distribution.
To formally test normality, a Shapiro-Wilk normality test is employed. This test has normality of the distribution as \(H_0\). Since the resulting \(p<0.001\), we can reject the null hypothesis. Therefore, we cannot assume that the standardised residuals are normally distributed.
However, this does not mean that we cannot use a standard regression. With 80 observations remaining after the exclusion, the sample is large enough to relax the assumption of normality of errors. Nevertheless, the results of such a regression should be taken with a grain of salt. If any p-values are close to the alpha level, one should consider running the regression again with a larger sample to get a more robust result.
dt <- read_excel("/Users/vincentbecker/Library/Mobile Documents/com~apple~CloudDocs/WU/Courses/Applied data analysis in business with r/Data/Homework3/Apartments.xlsx")
dt$Parking <- factor(dt$Parking,
levels = c(0,1),
labels = c("No", "Yes"))
dt$Balcony <- factor(dt$Balcony,
levels = c(0,1),
labels = c("No", "Yes"))
fit2 <- lm(Price ~ Age + Distance, dt)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Here the dataset is loaded again, categorical values are factored, and the regression is estimated.
We again see a regression that significantly explains variation in the dependent variable \((R^2=0.44, F(2, 82)= 32.16, p < 0.001)\). With an \(R^2\) of 0.44, we can say that the model explains about 44% of the variation. Furthermore, we can observe a moderately strong multiple correlation \((R=\sqrt{0.4396}=0.663)\) between the explanatory variables and the dependent variable.
The intercept of the model \((b = 2460.1, t(82) = 32.1, p < 0.001)\) suggests that a brand new apartment in the center of the city (0 km from the center) costs on average 2460.1 per \(m^2\). The coefficient of age \((b = -7.93, t(82) = -2.46, p = 0.016)\) suggests that, all else equal, each additional year decreases the price per \(m^2\) by 7.93, on average. The coefficient of distance \((b = -20.67, t(82) = -7.52, p < 0.001)\) suggests that, all else equal, each additional km of distance from the center decreases the price per \(m^2\) by 20.67, on average.
All coefficients are significant at an alpha level of 5%.
The regression is estimated using the following function.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, dt)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of the analysis of variance \((F(2, 80)=4.87, p = 0.01)\) show a significant difference in explanatory power between the two models. Hence, together with the increase of about 6 percentage points in \(R^2\), we can conclude that fit3 fits the data better.
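The F statistic of this model comparison can be reproduced from the residual sums of squares printed in the table. The sketch below only re-uses the numbers from the `anova()` output; the variable names are chosen for illustration.

```r
# Residual sums of squares and residual degrees of freedom from the anova() table
rss_small <- 6720983  # Model 1: Price ~ Age + Distance
df_small <- 82
rss_large <- 5991088  # Model 2: adds Parking and Balcony
df_large <- 80

# F = (reduction in RSS per added parameter) / (residual variance of the larger model)
df_diff <- df_small - df_large
f_stat <- ((rss_small - rss_large) / df_diff) / (rss_large / df_large)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = df_diff, df2 = df_large, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

Up to rounding of the printed sums of squares, this should agree with the values reported by `anova(fit2, fit3)`.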
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
The coefficient of the categorical Parking variable \((b = 196.17, t(80) = 3.12, p = 0.003)\) is significant and shows that, everything else equal, an apartment with a parking space has a price per \(m^2\) that is 196.17 higher, on average, than an apartment without a parking space.
The coefficient of the categorical Balcony variable \((b = 1.94, t(80) = 0.03, p = 0.974)\) suggests that, all else equal, a balcony increases the price per \(m^2\) on average by 1.94, compared to an apartment without a balcony. However, since this coefficient is far from significant, we cannot conclude that a balcony influences price in the population; the estimate may well have been obtained by chance.
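The names ParkingYes and BalconyYes come from the dummy coding that `lm()` applies to factors: the first level ("No") is the reference category and the coefficient belongs to an indicator for "Yes". A minimal sketch with a hypothetical four-element factor `parking`:

```r
# Hypothetical factor coded like dt$Parking
parking <- factor(c(0, 1, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))

# Design matrix that lm() would build for this factor
mm <- model.matrix(~ parking)
mm
```

The column `parkingYes` is 1 for apartments with a parking space and 0 otherwise, so its coefficient is the average difference to the reference category "No".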
The hypotheses for the F-test are the following:
\(H_0:\rho^2=0\)
\(H_A:\rho^2>0\)
This F-test has the null hypothesis that the explanatory variables explain none of the variance in the dependent variable. The alternative hypothesis is that they have explanatory power. In essence, it tests whether at least one coefficient differs from 0.
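The F statistic of this test can be recovered from \(R^2\) alone. The sketch below uses the values printed in `summary(fit3)` (\(R^2 = 0.5004\), 4 predictors, 85 observations); the variable names are chosen for illustration.

```r
# Values taken from the summary(fit3) output
r2 <- 0.5004
k <- 4    # number of predictors
n <- 85   # number of observations

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability of the F distribution with k and n - k - 1 df
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(f_stat, 2)
```

Because \(R^2\) is rounded in the printed output, the result agrees with the reported F-statistic only up to rounding.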
dt$FittedValues <- fitted.values(fit3)
dt[2,]
## # A tibble: 1 × 6
## Age Distance Price Parking Balcony FittedValues
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 18 1 2800 Yes No 2357.
We can see that the fitted value for the apartment with ID 2 is 2357 and that the actual value is 2800. Calculating the difference between those two values gives a residual of 443. If we look at the saved residuals, we can see that this is exactly the value that was calculated.
dt$Residuals <- residuals(fit3)
dt[2,]
## # A tibble: 1 × 7
## Age Distance Price Parking Balcony FittedValues Residuals
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 18 1 2800 Yes No 2357. 443.
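The fitted value itself can also be reproduced by hand from the coefficients in `summary(fit3)`. The following sketch plugs in the values of the apartment with ID 2 (Age 18, Distance 1, Parking Yes, Balcony No); the vector names are chosen for illustration.

```r
# Coefficients as printed in summary(fit3)
b <- c(Intercept = 2301.667, Age = -6.799, Distance = -18.045,
       ParkingYes = 196.168, BalconyYes = 1.935)

# Apartment with ID 2: Age 18, Distance 1, Parking Yes (1), Balcony No (0)
x <- c(1, 18, 1, 1, 0)

# Fitted value = sum of coefficient times regressor value
fitted_by_hand <- sum(b * x)

# Residual = observed price minus fitted value
residual_by_hand <- 2800 - fitted_by_hand

round(c(fitted = fitted_by_hand, residual = residual_by_hand))
```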