Vincent Becker

1. Import the dataset Apartments.xlsx

library(readxl)

dt <- read_excel("/Users/vincentbecker/Library/Mobile Documents/com~apple~CloudDocs/WU/Courses/Applied data analysis in business with r/Data/Homework3/Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

dt$Parking <- factor(dt$Parking,
             levels = c(0,1),
             labels = c("No", "Yes"))

dt$Balcony <- factor(dt$Balcony,
             levels = c(0,1),
             labels = c("No", "Yes"))

3. Test the hypothesis \(H_0:\mu_{Price}=1900\) EUR. What can you conclude? (1 p)

library(ggplot2)

ggplot(dt, aes(x=Price))+
  geom_histogram(binwidth = 200)

shapiro.test(dt$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  dt$Price
## W = 0.94017, p-value = 0.0006513

The hypothesis \(H_0:\mu_{price}=1900\) concerns the population arithmetic mean. To test it, a t-test (or a non-parametric alternative) is used. The underlying assumption of the t-test is normality of the distribution, which in this case means that Price should be normally distributed. I checked this graphically with a histogram and formally with a Shapiro-Wilk test of normality. Both checks show a clear deviation from normality.

The sample size of 85 is rather large, so the normality assumption could arguably be relaxed. However, to obtain a more robust result I decided to apply a Wilcoxon signed rank test.

wilcox.test(dt$Price,
            mu = 1900,
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  dt$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

It can be observed that we can reject \(H_0\) with \(p = 0.028\).

We can also run the parametric test, i.e., the t-test, to check whether we obtain the same result.

t.test(dt$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  dt$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can see that \(H_0\) is also rejected, at \(p=0.005\).

Overall, we can conclude that the true mean price per \(m^2\) differs from 1900 EUR.

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

The simple OLS regression model is estimated with the following code:

fit1 <- lm(Price ~ Age, dt)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = dt)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

From the model summary we see that the regression explains a significant proportion of variance \((R^2=0.05, F(1, 83)= 4.65, p = 0.034)\). The regression coefficient of Age on Price is negative \((b = -8.98, t(83) = -2.16, p = 0.034)\). It suggests that a one-unit increase in age (one year) is associated with a decrease in price per \(m^2\) of 8.98, on average. We furthermore see a significant intercept \((b=2185.46, t(83)=25.11, p<0.001)\), which suggests that a new apartment (zero years of age) has an average price of 2185.46 per \(m^2\).
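To illustrate this interpretation, a minimal sketch (using hypothetical ages of 0, 1 and 10 years) can compute the predicted prices; the predictions should decrease by the Age coefficient per additional year.

# Predicted price per m2 for hypothetical apartments aged 0, 1 and 10 years
predict(fit1, newdata = data.frame(Age = c(0, 1, 10)))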

The coefficient of determination, \(R^2\), corresponds to the goodness of fit of the model. An \(R^2\) of 0.053 means that 5.3% of the variance in the dependent variable is explained by the model. The coefficient of correlation, \(R\), calculated as \(R=\sqrt{R^2}\), corresponds to the Pearson correlation coefficient. In this case it is approximately -0.23: the magnitude is \(\sqrt{R^2}\), and the negative sign comes from the fact that the single predictor, Age, has a negative coefficient. Judging the effect size of this Pearson correlation, 0.23 lies between 0.1 and 0.3, so the correlation is negative and weak.
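A short sketch of this calculation (not required by the task): the magnitude comes from the square root of \(R^2\), the sign from the slope of Age, and the result should match the Pearson correlation computed directly.

# Coefficient of correlation: magnitude from sqrt(R^2), sign from the Age slope
sign(coef(fit1)["Age"]) * sqrt(summary(fit1)$r.squared)

# Cross-check: Pearson correlation computed directly
cor(dt$Price, dt$Age)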

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

The scatterplot matrix can be produced with the following command from the car package:

library(car)
## Loading required package: carData
scatterplotMatrix(dt[,1:3], smooth = FALSE)

From the plot of Age against Distance we can see that multicollinearity should not be an issue, since the slope of the fitted correlation line is very flat. The slopes in the plots of Age against Price and of Distance against Price are not flat, but that is no problem (it is actually desirable), since Price is the dependent variable. Therefore, from the graphical analysis it does not look as if multicollinearity is a problem.
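As a quick numeric cross-check (not part of the task), the pairwise correlation between the two explanatory variables can be computed; a value close to 0 supports the graphical impression.

# Correlation between the explanatory variables Age and Distance
cor(dt$Age, dt$Distance)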

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, dt)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

7. Check the multicollinearity with VIF statistics. Explain the findings. (1 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

We can see that none of the VIFs of the independent variables exceeds 5. Furthermore, the average VIF is close to 1. Hence, there is no reason to assume multicollinearity.

8. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

Standardized residuals
dt$StdResid <-round(rstandard(fit2), 3)

head(dt[order(dt$StdResid),],3)
## # A tibble: 3 × 6
##     Age Distance Price Parking Balcony StdResid
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>
## 1     7        2  1760 No      Yes        -2.15
## 2    12       14  1650 No      Yes        -1.50
## 3    12       14  1650 No      No         -1.50
head(dt[order(-dt$StdResid),],3)
## # A tibble: 3 × 6
##     Age Distance Price Parking Balcony StdResid
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>
## 1     5       45  2180 Yes     Yes         2.58
## 2     2       11  2790 Yes     No          2.05
## 3    18        1  2800 Yes     No          1.78

With the above code the standardized residuals are calculated and the three highest and three lowest residuals are displayed. None of the standardized residuals exceeds 3 in absolute value, meaning that no point lies more than three standard deviations away from the fitted values. Therefore, we do not need to exclude any outliers.
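A quick programmatic check of the same criterion (a sketch, not required by the task):

# Number of observations with |standardized residual| > 3 (expected: 0)
sum(abs(dt$StdResid) > 3)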

Cook's distance
dt$CooksD <- round(cooks.distance(fit2), 3)

ggplot(dt, aes(x=CooksD))+
  geom_histogram(bins = 60)+
  xlim(NA, 0.35)

head(dt[order(-dt$CooksD),],12)
## # A tibble: 12 × 7
##      Age Distance Price Parking Balcony StdResid CooksD
##    <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
##  1     5       45  2180 Yes     Yes        2.58   0.32 
##  2    43       37  1740 No      No         1.44   0.104
##  3     2       11  2790 Yes     No         2.05   0.069
##  4     7        2  1760 No      Yes       -2.15   0.066
##  5    37        3  2540 Yes     Yes        1.58   0.061
##  6    40        2  2400 No      Yes        1.09   0.038
##  7     8        2  2820 Yes     No         1.66   0.037
##  8     8       26  2300 Yes     Yes        1.57   0.034
##  9    10        1  2810 No      No         1.60   0.032
## 10    18        1  2800 Yes     No         1.78   0.03 
## 11    45       21  1910 No      Yes        0.889  0.03 
## 12    18        1  2800 Yes     Yes        1.78   0.03

Cook’s distance is used to identify highly influential points. We want the Cook’s distance values to form a reasonably continuous distribution without large gaps. To check this, we draw a histogram of the distances. Here we can see three gaps in the distribution. To get deeper insight into those gaps we can display the data points in descending order of Cook’s distance.
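As a rough reference point (an assumption on my side, not required by the task), the common 4/n rule of thumb for flagging large Cook's distances can also be computed:

# Common rule-of-thumb cutoff for Cook's distance: 4/n
4 / nrow(dt)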

Judging from the plot and the displayed data points, I decided to exclude all units with a Cook’s distance above 0.06. The resulting histogram can be seen below.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dt <- dt %>% 
  filter(CooksD < 0.06)

ggplot(dt, aes(x=CooksD))+
  geom_histogram(bins = 60)+
  xlim(NA, 0.35)

head(dt[order(-dt$CooksD),],10)
## # A tibble: 10 × 7
##      Age Distance Price Parking Balcony StdResid CooksD
##    <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
##  1    40        2  2400 No      Yes        1.09   0.038
##  2     8        2  2820 Yes     No         1.66   0.037
##  3     8       26  2300 Yes     Yes        1.57   0.034
##  4    10        1  2810 No      No         1.60   0.032
##  5    18        1  2800 Yes     No         1.78   0.03 
##  6    45       21  1910 No      Yes        0.889  0.03 
##  7    18        1  2800 Yes     Yes        1.78   0.03 
##  8    16        1  2750 Yes     No         1.55   0.023
##  9    38       13  1610 No      Yes       -1.01   0.022
## 10    38       13  1610 No      No        -1.01   0.022

9. Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings. (1 p)

fit2 <- lm(Price ~ Age + Distance, dt)

dt$StdFitted <- scale(fit2$fitted.values)

scatterplot(y = dt$StdResid, x = dt$StdFitted,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted Values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

With this plot we can visually check for homoskedasticity. It seems that the variance of the standardized residuals differs slightly between negative and positive standardized fitted values, which could indicate heteroskedasticity. However, since the visual check is quite error-prone, we should employ a formal test.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

The test employed here is the Breusch-Pagan test, which has constant variance as \(H_0\). The test fails to reject \(H_0\) at \(p = 0.187\). Therefore, homoskedasticity can be assumed.
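As a cross-check, the Breusch-Pagan test is also available as bptest() in the lmtest package (assuming the package is installed); note that by default it uses the model's regressors rather than the fitted values, so the statistic may differ slightly from the olsrr output.

library(lmtest)
# Breusch-Pagan test based on the regressors of fit2
bptest(fit2)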

10. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (1 p)

A histogram of the standardized residuals can be created as follows.

hist(dt$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of Standardized Residuals")

shapiro.test(dt$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  dt$StdResid
## W = 0.93418, p-value = 0.0004761

The histogram appears right-skewed and hence not normally distributed. Furthermore, there seems to be a second peak in the right tail of the distribution.

To formally test normality, a Shapiro-Wilk normality test is employed. This test has normality of the distribution as \(H_0\). Since the resulting \(p<0.001\), we reject the null hypothesis. Therefore, we cannot assume a normal distribution of the standardized residuals.

However, this does not mean that we cannot use a standard regression. With a sample size of 84, the sample is large enough to relax the assumption of normality of errors. Nevertheless, the results of such a regression should be taken with a grain of salt. If any p-values are close to the alpha level, one should consider running the regression again on a larger sample to get a more robust result.
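As an additional graphical check (not required by the task), a normal Q-Q plot of the standardized residuals could be drawn; points deviating from the line in the tails would confirm the skewness seen in the histogram.

# Q-Q plot of standardized residuals against the normal distribution
qqnorm(dt$StdResid)
qqline(dt$StdResid)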

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

dt <- read_excel("/Users/vincentbecker/Library/Mobile Documents/com~apple~CloudDocs/WU/Courses/Applied data analysis in business with r/Data/Homework3/Apartments.xlsx")

dt$Parking <- factor(dt$Parking,
             levels = c(0,1),
             labels = c("No", "Yes"))

dt$Balcony <- factor(dt$Balcony,
             levels = c(0,1),
             labels = c("No", "Yes"))

fit2 <- lm(Price ~ Age + Distance, dt)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Here the dataset is loaded again, the categorical variables are converted to factors, and the regression is estimated.

We again see a regression that significantly explains variation in the dependent variable \((R^2=0.44, F(2, 82)= 32.16, p < 0.001)\). With an \(R^2\) of 0.44, the model explains about 44% of the variation. Furthermore, we can observe a moderately strong \((R=\sqrt{0.44}\approx 0.66)\) correlation between the explanatory variables and the dependent variable.
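A short sketch of an equivalent calculation (not part of the task): in a multiple regression, \(R\) equals the correlation between the observed and the fitted values of the dependent variable.

# Multiple correlation coefficient: correlation between observed and fitted Price
cor(dt$Price, fitted(fit2))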

The intercept of the model \((b = 2460.1, t(82) = 32.1, p < 0.001)\) suggests that a brand new apartment in the city center (0 km from the center) costs on average 2460.1 per \(m^2\). The coefficient of Age \((b = -7.93, t(82) = -2.46, p = 0.016)\) suggests that, all else equal, each additional year of age decreases the price per \(m^2\) by 7.93, on average. The coefficient of Distance \((b = -20.67, t(82) = -7.52, p < 0.001)\) suggests that each additional km of distance from the center decreases the price per \(m^2\) by 20.67, on average.

All coefficients are significant at an alpha level of 5%.

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

The regression is estimated using the following function.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, dt)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance \((F(2, 80)=4.87, p = 0.01)\) shows a significant difference in explanatory power between the two models. Hence, based on the increase of about 6 percentage points in \(R^2\), we can conclude that fit3 fits the data better.

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

The coefficient of the categorical Parking variable \((b = 196.17, t(80) = 3.12, p = 0.003)\) is significant and shows that, everything else equal, an apartment with a parking space has a \(m^2\) price that is higher by 196.17, on average, than that of an apartment without one.

The coefficient of the categorical Balcony variable \((b = 1.94, t(80) = 0.03, p = 0.97)\) suggests that, all else equal, a balcony increases the \(m^2\) price by 1.94 on average, compared to an apartment without a balcony. However, since the p-value of this coefficient is far from significance, there is most likely no effect of Balcony on Price in the population, and the estimated difference is probably due to chance.
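To illustrate the dummy coefficients, a minimal sketch with two hypothetical apartments that differ only in Parking (the values of 10 years and 5 km are arbitrary assumptions) can be used; the difference in predictions should equal the ParkingYes coefficient.

# Two hypothetical apartments, identical except for Parking
newdata <- data.frame(Age = c(10, 10),
                      Distance = c(5, 5),
                      Parking = factor(c("No", "Yes"), levels = c("No", "Yes")),
                      Balcony = factor(c("No", "No"), levels = c("No", "Yes")))
# Difference in predicted price per m2 (should equal the ParkingYes coefficient)
diff(predict(fit3, newdata))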

The hypotheses for the F-test are the following:

\(H_0:\rho^2=0\)

\(H_A:\rho^2>0\)

This F-test has the null hypothesis that the explanatory variables jointly have no power to explain variance in the dependent variable. The alternative hypothesis is that they do have explanatory power; in other words, it tests whether at least one regression coefficient differs from 0.
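A short sketch of the equivalent model comparison (fit0 is a name introduced here for the intercept-only model): the overall F-test corresponds to comparing fit3 with a model that contains no predictors.

# Intercept-only (null) model for comparison
fit0 <- lm(Price ~ 1, dt)
# Comparing the null model with fit3 reproduces the overall F-test of fit3
anova(fit0, fit3)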

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

dt$FittedValues <- fitted.values(fit3)

dt[2,]
## # A tibble: 1 × 6
##     Age Distance Price Parking Balcony FittedValues
##   <dbl>    <dbl> <dbl> <fct>   <fct>          <dbl>
## 1    18        1  2800 Yes     No             2357.

We can see that the fitted value for the apartment with ID 2 is 2357 and that the actual value is 2800. Calculating the difference between those two values gives a residual of about 443. If we look at the saved residuals, we can see that this is exactly the value that was computed.
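A minimal manual check of this calculation:

# Residual for apartment ID 2: observed price minus fitted value
dt$Price[2] - dt$FittedValues[2]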

dt$Residuals <- residuals(fit3)

dt[2,]
## # A tibble: 1 × 7
##     Age Distance Price Parking Balcony FittedValues Residuals
##   <dbl>    <dbl> <dbl> <fct>   <fct>          <dbl>     <dbl>
## 1    18        1  2800 Yes     No             2357.      443.