Gerelchuluun Amarsanaa (h12300075)

1. Import the dataset Apartments.xlsx

library(readxl)
library(ggplot2)
library(car)
## Loading required package: carData
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- read_excel("Apartments.xlsx")
head(mydata)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No Parking", "Parking"))

mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0, 1),
                          labels = c("No Balcony", "Balcony"))

3. Test the hypothesis H0: Mu_Price = 1900 EUR. What can you conclude? (1 p)

To test whether the arithmetic mean of the apartment price equals 1900 EUR, a one-sample t-test is the natural choice (a sketch of the call is shown after the assumption list). Its assumptions are:

  • price variable is numeric
  • price variable is normally distributed
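
If both assumptions held, the t-test would be carried out as follows (a reference sketch only; its output is omitted because, as the normality check below shows, the test is not ultimately applicable):

# One-sample t-test of H0: mu = 1900 (for reference; see the normality check below)
t.test(mydata$Price, mu = 1900)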

To test the normality, Shapiro-Wilk normality test is used:

  • H0: price variable is normally distributed
  • H1: price variable is not normally distributed.
# Testing normality of price variable
shapiro.test(mydata$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Price
## W = 0.94017, p-value = 0.0006513

From the result above, the null hypothesis is rejected (p = 0.0007 < 0.05). Since the normality assumption is not met, the Wilcoxon signed-rank test is used instead:

  • H0: the location of the price distribution equals 1900 EUR.
  • H1: the location of the price distribution differs from 1900 EUR.
wilcox.test(mydata$Price,
            mu = 1900,
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  mydata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

From the result, the null hypothesis cannot be rejected at the strict significance level used in this course (p = 0.028 > 0.001). There is therefore not enough evidence that the location of the price distribution differs from 1900 EUR. (Note that at the conventional 5% level the conclusion would be reversed, since p < 0.05.)

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of the regression coefficient, the coefficient of correlation, and the coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age, data=mydata)
# Absolute value of r = sqrt(R^2); the sign must be read from the slope
corr_coef <- sqrt(summary(fit1)$r.squared)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
corr_coef
## [1] 0.230255

Regression coefficient: if the age of an apartment increases by one year, the price per m2 decreases on average by 8.975 EUR (p < 0.05). Note: by the stricter criterion used in this course (p < 0.001), this coefficient is not statistically significant.

Coefficient of correlation: the square root of R-squared is 0.23; since the slope is negative, the Pearson correlation coefficient is -0.23. There is a weak negative linear relationship between the age of an apartment and its price per m2.
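
Since sqrt(summary(fit1)$r.squared) drops the sign, the signed correlation can be confirmed directly:

# Pearson correlation computed directly; the sign matches the negative slope
cor(mydata$Price, mydata$Age)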

Coefficient of determination: R-squared = 0.05302. The proportion of the total variability of price that can be explained by the linear effect of age is 5.3%.

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

scatterplotMatrix(mydata[c("Price", "Age", "Distance")], smooth=FALSE )

As the scatterplot matrix shows no strong linear relationship between the explanatory variables Age and Distance, there should be no problem with multicollinearity.
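
The visual impression can be backed up with the correlation matrix (added here as a cross-check):

# Pairwise Pearson correlations between the three variables
round(cor(mydata[c("Price", "Age", "Distance")]), 3)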

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data=mydata)

7. Check the multicollinearity with VIF statistics. Explain the findings. (1 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Both explanatory variables have VIF values well below 5, and the mean VIF is very close to 1. Therefore, there is no problem with multicollinearity.
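
To see what VIF measures, it can be reproduced by hand: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing explanatory variable j on the remaining explanatory variables. A minimal sketch for Age:

# Manual VIF for Age: regress Age on the other explanatory variable
r2_age <- summary(lm(Age ~ Distance, data = mydata))$r.squared
1 / (1 - r2_age)  # should match vif(fit2)["Age"]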

8. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

mydata$StdResid <- round(rstandard(fit2), 3)

hist( mydata$StdResid, 
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

Since all standardized residuals in the histogram lie between -3 and +3, there are no problematic outliers.
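
The same check can be done programmatically (a small sketch):

# Count units whose standardized residual falls outside +/- 3
sum(abs(mydata$StdResid) > 3)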

mydata$CookD <- round(cooks.distance(fit2), 3)

hist( mydata$CookD, 
     xlab = "Cook distance",
     ylab = "Frequency",
     main = "Histogram of Cook distance")

The histogram of Cook's distances shows a clear gap between 0.15 and 0.30. The unit with high influence can therefore be identified by sorting the distances and inspecting the largest values with the head() function:

# Adding ID variable
mydata$ID <- seq(1, nrow(mydata))

head(mydata[order(-mydata$CookD), c("ID", "CookD")], 6)
## # A tibble: 6 × 2
##      ID CookD
##   <int> <dbl>
## 1    38 0.32 
## 2    55 0.104
## 3    33 0.069
## 4    53 0.066
## 5    22 0.061
## 6    39 0.038
mydata <- mydata %>%
  filter(!ID %in% c(38, 55, 33, 53, 22))

After removing the unit with ID 38, four more units stood out as creating gaps in the distribution of Cook's distances, so they were removed as well. Plotting the histogram again shows that no units with high influence remain:

hist( mydata$CookD, 
     xlab = "Cook distance",
     ylab = "Frequency",
     main = "Histogram of Cook distance")

9. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (1 p)

fit2 <- lm(Price ~ Age + Distance, data=mydata)
# Note: mydata$StdResid still holds the standardized residuals computed in step 8,
# before the influential units were removed; strictly, they should be recomputed
# here with rstandard(fit2) so that they refer to the refitted model.
mydata$StdFittedValues <- scale(fit2$fitted.values)

scatterplot( y = mydata$StdResid, x = mydata$StdFittedValues, 
             ylab = "Standardized residuals",
             xlab = "Standardized fitted values",
             boxplot = FALSE,
             regLine = FALSE,
             smooth = FALSE)

The graph hints at heteroskedasticity: the spread of the standardized residuals is narrower for standardized fitted values between -3 and -2 and widens as x increases. To check formally, the Breusch-Pagan test is used:

  • H0: Variance of errors is constant
  • H1: Variance of errors is not constant.
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

From the result above, the p-value is high (0.187 > 0.05), so the null hypothesis cannot be rejected: there is no evidence of strong heteroskedasticity.
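
The same conclusion can be reached with bptest() from the lmtest package (note that by default bptest() regresses on all explanatory variables rather than on the fitted values, so the statistic may differ slightly from the olsrr output above):

# Breusch-Pagan test via lmtest (studentized by default)
library(lmtest)
bptest(fit2)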

10. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (1 p)

hist( mydata$StdResid, 
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

The histogram shows that the distribution of the standardized residuals is slightly skewed to the right.
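
A normal Q-Q plot (added as a complementary check) makes the departure from normality easier to judge:

# Q-Q plot of standardized residuals against theoretical normal quantiles
qqnorm(mydata$StdResid)
qqline(mydata$StdResid)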

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.93418, p-value = 0.0004761

The Shapiro-Wilk test gives a p-value below 0.05, so the null hypothesis is rejected: the distribution of the standardized residuals is not normal (p < 0.05). However, this is not a serious issue here, as the sample size is large enough.

11. Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients. (2 p)

fit2 <- lm(Price ~ Age + Distance, data=mydata)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Coefficient for age:

Assuming the distance variable remains unchanged, for each additional year of age the price per m2 of an apartment decreases on average by 8.674 EUR (p < 0.01).

Coefficient for distance:

Assuming the age variable remains unchanged, for each additional km from the city center the price per m2 of an apartment decreases on average by 24.063 EUR (p < 0.001).
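
Confidence intervals express the same estimates in interval form (added as a small sketch):

# 95% confidence intervals for the coefficients of fit2
confint(fit2)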

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data=mydata)

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

  • H0: the additional variables do not increase the population coefficient of determination (ρ²); the difference between the two models' ρ² is 0.
  • H1: the additional variables increase ρ²; the difference is greater than 0.
anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

As the p-value is 0.114 > 0.05, the null hypothesis cannot be rejected. Therefore, model fit3 does not fit the data significantly better than fit2.
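
The F statistic in the table can be reproduced by hand from the two residual sums of squares:

# F = ((RSS_fit2 - RSS_fit3) / extra df) / (RSS_fit3 / residual df of fit3)
((5077362 - 4791128) / 2) / (4791128 / 75)  # ~2.2403, matching anova()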

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2393.316     93.930  25.480  < 2e-16 ***
## Age               -7.970      3.191  -2.498   0.0147 *  
## Distance         -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFParking  128.700     60.801   2.117   0.0376 *  
## BalconyFBalcony    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Explanation for parking:

Assuming all other variables remain unchanged, apartments with a parking space are on average 128.7 EUR per m2 more expensive than apartments without one (p < 0.05).

Explanation for balcony:

The coefficient for balcony is not statistically significant (p = 0.917), so no effect of a balcony on the price can be established.

The hypothesis for F-statistics:

  • H0: ρ² = 0; equivalently, all partial regression coefficients equal 0 and the regression is not significant.
  • H1: ρ² > 0; equivalently, at least one partial regression coefficient differs from 0.
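
The reported p-value of the F-statistic can be reproduced from the F distribution:

# Upper-tail probability of F(4, 75) at the observed statistic
pf(24.08, df1 = 4, df2 = 75, lower.tail = FALSE)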

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

mydata$fittedValues3 <- fit3$fitted.values

ID2 <- mydata[mydata$ID == 2, ]

fitted_value <- ID2$fittedValues3
observed_value <- ID2$Price

residual <- observed_value - fitted_value

residual
##        2 
## 443.4026
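
Equivalently, since ID 2 is still present in the filtered data, the residual can be read directly from the model object (a small sketch):

# Residuals of fit3 are aligned row-by-row with mydata
fit3$residuals[mydata$ID == 2]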