HW 3

Yaqi Hu 11919233

1. Import the dataset Apartments.xlsx

library(readxl)
mydata<- read_xlsx("./Apartments.xlsx")
head(mydata)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

mydata$Parking <- factor(mydata$Parking,
                         levels = c(0, 1), labels = c("No", "Yes"))
mydata$Balcony <- factor(mydata$Balcony,
                         levels = c(0, 1), labels = c("No", "Yes"))

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Assumptions: Variable is numeric, standardised residuals of the population is normally distributed.

The t-test checks if the arithmetic mean of the sample is equal with the given price.

The p value (<0.05) indicates the mean of the sample is not equal to 1900. H0 hypothesis is rejected, the mean of the sample equals $2018.94. The price significantly differs from 1900, the t-test shows that the price deviation cannot be described by random error.

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[c("Price", "Age")]))

##       Price   Age
## Price  1.00 -0.23
## Age   -0.23  1.00
## 
## n= 85 
## 
## 
## P
##       Price Age  
## Price       0.034
## Age   0.034

Price= 2185.455 + Age* -8.98

Estimate of regression coefficient: Given all other variables, the results of the fit1 model show that the price of an apartment per m2 decreases by EUR 8.98 on average with each additional year (p < 0.05).

Coefficient of correlation: Age and Price are negatively correlated (-0.23) with a weak effect

Coefficient of determination: % proportion of price that can be explained by the linear effect of explanatory variables (age) in the regression model. Age only explains 5 % of the price (R^2 = 0.05302)

5. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(mydata[c("Price", "Age", "Distance")],
smooth = FALSE)

No variables are highly correlated there are probably no problem with multicolinearity.

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Price= 2460.101 + Age* -7.934+ Distance*-20.667

7. Check the multicolinearity with VIF statistics. Explain the findings. (1 p)

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

This function checks the degree of multicollinearity. All variables with a VIF statistic above 5 should be removed. Ideally the average VIF statistic should be close to 1 as possible. In our example, all VIF statistics are below 5. The average VIF is approximately at 1.00.

8. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")

The histogram shows some gaps between 0.15 and 0.30

head(mydata[order(-mydata$CooksD), c("Price", "CooksD")], 10)

## # A tibble: 10 × 2
##    Price CooksD
##    <dbl>  <dbl>
##  1  2180  0.32 
##  2  1740  0.104
##  3  2790  0.069
##  4  1760  0.066
##  5  2540  0.061
##  6  2400  0.038
##  7  2820  0.037
##  8  2300  0.034
##  9  2810  0.032
## 10  2800  0.03

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
filter(!CooksD %in% c(0.320, 0.104))

Removing the outlier effect sizes 0.320 & 0.104. (I have chosen the effect, I can’t select price since there might be multiple apartments with the same price. A better solution would probably be if I add an ID for each apartment)

9. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (1 p)

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    1 
##  Chi2          =    0.968106 
##  Prob > Chi2   =    0.325153

Since p is above 0.05. H0 can not be rejected. Therefore we assume homoskedasticity for the standardised residuals.

10. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (1 p)

# Reassign the function, since we removed two observations with high impact
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95952, p-value = 0.01044

The histogram shows that the the s. residuals could be normally distributed. Shapiro-Wilk normality test, however suggests that they are not normally distributed (pvalue<0.05). Thus, the null hypothesis that the standardized residuals are normally distributed is rejected.

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

mydata<- read_xlsx("./Apartments.xlsx")
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Price= 2460.101+ -7.93* Age- 20.67

Given the Distance from the city center in km, the results of the fit2 model show that the price of an apartment per m2 decreases by EUR 7.93 Euro on average with each additional year of an apartment (p < 0.05)

Given the Age of the apartment in years, the results of the fit2 model show that the price of an apartment per m2 decreases by EUR 20.67 Euro on average with each km the apartment is from the city center (p < 0.05)

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydata)

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Anova suggests that fit3 fits data better than model fit2, with an (p-value=0.01)

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking      196.168     62.868   3.120  0.00251 ** 
## Balcony        1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

The fit3 model show that the price of an apartment per m2 is on average EUR 196.17 Euro more expensive if the apartment has parking in comparison to apartments without parking (p < 0.05).

The fit3 model show that there is on average no significant differences in the price of an apartment per m2 between apartment with and without balcony (p=0.97).

Possible explanation for these results could be that the value of parking is higher than balcony. People often rent a parking slot if the apartment doesn’t include one, while balcony is a nice-to-have.

The f-statistics tests the significance of the regression model. In other words, if one or more of the coefficient are significant, the regression model is significant.

H0 = R^2=0 𝛽_Age = 𝛽_Distance = 𝛽_Parking = 𝛽_Balcony = 0

HA= R^2 > 0 if at least one 𝛽 does not equal zero

R^2….R squared, gives % impact of the variables on the price 𝛽…..Coefficient in regression model

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

mydata$FittedValues <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)
head(mydata[ , colnames(mydata) %in% c("Price", "FittedValues",
"Residuals")])

## # A tibble: 6 × 3
##   Price FittedValues Residuals
##   <dbl>        <dbl>     <dbl>
## 1  1640        1751.    -111. 
## 2  2800        2357.     443. 
## 3  1660        1749.     -88.8
## 4  1850        1590.     260. 
## 5  1640        2053.    -413. 
## 6  1770        1897.    -127.

The results show that apartment ID2, costs 2800 Euro. Based on our model this apartment would be estimated 2357.41 Euro. Therefore the residuals of ID2 equals 442.59 Euro.