Hyejin Roh


1. Import the dataset Apartments.xlsx

library(readxl)

mydata <- read_excel("./Apartments.xlsx")

mydata <- cbind(ID = 1:nrow(mydata), mydata)

head(mydata)
##   ID Age Distance Price Parking Balcony
## 1  1   7       28  1640       0       1
## 2  2  18        1  2800       1       0
## 3  3   7       28  1660       0       0
## 4  4  28       29  1850       0       1
## 5  5  18       18  1640       1       1
## 6  6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2 in EUR
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

  • How do different factors (such as age, distance, parking, and balcony) influence the price per \(m^2\) of an apartment?

3. Change categorical variables into factors. (0.5 p)

mydata$ParkingF <- factor(mydata$Parking,
                          levels = c("0", "1"),
                          labels = c("No", "Yes"))

mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c("0", "1"),
                          labels = c("No", "Yes"))

4. Test the hypothesis H0: Mu_Price = 1900 EUR. What can you conclude? (1 p)

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
  • \(H_0\): \(\mu_{Price} = 1900\)
  • \(H_1\): \(\mu_{Price} \neq 1900\)
  • Estimated average price per \(m^2\): 2018.94 EUR
  • Reject \(H_0\) (\(p = 0.005\))
  • The average price per \(m^2\) is significantly different from 1900 EUR; a manual check of the test statistic follows below.
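
As a sanity check, the t statistic can be reproduced from its definition \(t = (\bar{x} - \mu_0)/(s/\sqrt{n})\); a minimal sketch in base R:

t_check <- (mean(mydata$Price) - 1900) / (sd(mydata$Price) / sqrt(length(mydata$Price)))
t_check                                               # should match t = 2.9022 from t.test()
2 * pt(-abs(t_check), df = length(mydata$Price) - 1)  # two-sided p-value, about 0.0047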

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age,
           data = mydata)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
  • regression coefficient: If the age of the apartment increases by 1 year, the price per \(m^2\) decreases on average by 8.975 EUR (\(p = 0.034\)). There are no other explanatory variables in this simple regression, so no ceteris paribus clause is needed.
  • coefficient of correlation: \(r = -\sqrt{0.05302} = -0.2303\), negative because the slope is negative. The linear relationship between the price per \(m^2\) and the age of the apartment is weak; a quick check with cor() follows below.
  • coefficient of determination: 5.3% of the variability of the price per \(m^2\) can be explained by the linear impact of the age of the apartment.
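
The sign and size of the correlation can be verified directly; a quick check in base R:

r <- cor(mydata$Price, mydata$Age)
r    # about -0.23; negative, consistent with the negative slope of fit1
r^2  # recovers Multiple R-squared of fit1, about 0.053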

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Loading required package: carData
scatterplotMatrix(mydata[c(4, 2, 3)],
                  smooth = FALSE)

  • Based on the scatterplots between Price, Age and Distance, there is no strong linear relationship between the explanatory variables (Age and Distance).
  • So there is no potential problem with multicollinearity; the correlation matrix below confirms this numerically.
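
As a numeric complement to the scatterplot matrix, the correlation matrix can be inspected directly; a minimal check in base R:

round(cor(mydata[c("Price", "Age", "Distance")]), 3)  # the Age-Distance correlation should be near 0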

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in an object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata)

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
  • All VIF statistics are below 5, and the average VIF statistic is 1.00.
  • In model fit2 there is no problem with too strong multicollinearity; a manual VIF computation follows below.
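
The VIF can also be computed from its definition \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing one explanatory variable on the remaining ones; a minimal sketch:

aux <- lm(Age ~ Distance, data = mydata)  # auxiliary regression for the Age predictor
1 / (1 - summary(aux)$r.squared)          # should match vif(fit2) for Age, about 1.0018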

9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$CooksD,
     breaks = 20,
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distance")

head(mydata[order(-mydata$CooksD), ], 6)
##    ID Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## 38 38   5       45  2180       1       1      Yes      Yes    2.577  0.320
## 55 55  43       37  1740       0       0       No       No    1.445  0.104
## 33 33   2       11  2790       1       0      Yes       No    2.051  0.069
## 53 53   7        2  1760       0       1       No      Yes   -2.152  0.066
## 22 22  37        3  2540       1       1      Yes      Yes    1.576  0.061
## 39 39  40        2  2400       0       1       No      Yes    1.091  0.038
mydataC <- mydata[mydata$CooksD < 0.04, ]

hist(mydataC$CooksD,
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distance")

  • Based on the histogram of Cook's distances, a cutoff value of 0.04 is used to identify potentially problematic units (with high influence); the five units above it (IDs 38, 55, 33, 53 and 22) are removed. A common rule-of-thumb cutoff is checked below.
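
For comparison, one common rule of thumb flags units with \(D_i > 4/n\); a quick check:

4 / nrow(mydata)  # about 0.047, close to the visual cutoff of 0.04 chosen from the histogram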

10. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance,
           data = mydataC)

mydataC$StdFitted <- scale(fit2$fitted.values)

library(car)
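# Note: mydataC$StdResid below still holds the standardized residuals computed from
# fit2 before the influential units were removed; for full consistency they could be
# recomputed on the re-estimated model, e.g. mydataC$StdResid <- rstandard(fit2)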
scatterplot(y = mydataC$StdResid, x = mydataC$StdFitted,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted Values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174
  • The scatterplot of the standardized residuals against the standardized fitted values suggests that both homoskedasticity and linearity are fulfilled.
  • The Breusch-Pagan test:
    • \(H_0\): The variance of errors is constant (homoskedasticity)
    • \(H_1\): The variance of errors is not constant (heteroskedasticity)
  • We cannot reject \(H_0\) (\(p = 0.187\)), so we assume homoskedasticity. An equivalent test from another package is sketched below.
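
The same hypothesis can also be tested with bptest() from the lmtest package (assuming it is installed); note that bptest() uses the regressors rather than the fitted values by default, so its statistic can differ slightly from the olsrr version:

library(lmtest)
bptest(fit2)  # studentized Breusch-Pagan test; H0: constant error variance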

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(mydataC$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydataC$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydataC$StdResid
## W = 0.93418, p-value = 0.0004761
  • The histogram of standardized residuals indicates that the errors (\(\epsilon\)) are most likely not normally distributed.
  • Shapiro-Wilk test:
    • \(H_0\): the errors (\(\epsilon\)) are normally distributed
    • \(H_1\): the errors (\(\epsilon\)) are not normally distributed
  • The null hypothesis is rejected at \(p < 0.001\), so the normality assumption is violated; a Q-Q plot below shows the same graphically.
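
A normal Q-Q plot is a standard graphical complement to the Shapiro-Wilk test; a minimal sketch in base R:

qqnorm(mydataC$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydataC$StdResid)  # points straying from the line indicate non-normality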

12. Estimate fit2 again without the potentially problematic units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance,
           data = mydataC)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
  • Both Age and Distance are significantly associated with the price per \(m^2\).
  • regression coefficients: With a 1 year increase in the age of the apartment, the price per \(m^2\) decreases on average by 8.67 EUR (\(p = 0.009\)), assuming that distance remains unchanged. With a 1 km increase in distance from the city center, the price per \(m^2\) decreases on average by 24.06 EUR (\(p < 0.001\)), assuming that age remains unchanged.
  • coefficient of (multiple) correlation: \(\sqrt{0.5361} = 0.7322\). The joint linear relationship between the price per \(m^2\) and the two explanatory variables is strong.
  • coefficient of determination: 53.61% of the variability of the price per \(m^2\) can be explained by the linear effect of the age and the distance from the city center. Confidence intervals for the coefficients are sketched below.
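
Interval estimates make the coefficient explanations more concrete; a minimal sketch in base R:

confint(fit2)  # 95 percent confidence intervals for the intercept, Age and Distance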

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
            data = mydataC)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135
  • \(H_0: \Delta \rho^2 = 0\), both models (fit2 and fit3) fit the data equally well.
  • \(H_1: \Delta \rho^2 > 0\), model fit3 fits the data better.
  • We cannot reject the null hypothesis (\(p = 0.114\)), indicating no significant improvement in model fit; the F statistic is reproduced by hand below.
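
The partial F statistic can be reproduced from the residual sums of squares in the ANOVA table; a quick arithmetic check:

((5077362 - 4791128) / 2) / (4791128 / 75)  # F = (extra SS / extra df) / (RSS / res. df), about 2.24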

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydataC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13
  • regression coefficients:
    • Parking: Given the values of the other explanatory variables, apartments with a parking space have an average price per \(m^2\) that is 128.70 EUR higher than apartments without one (\(p = 0.038\)).

    • Balcony: Given the values of the other explanatory variables, apartments with a balcony have an average price per \(m^2\) that is 6.03 EUR higher than apartments without one, but the difference is not statistically significant (\(p = 0.917\)).

  • F-statistic: tests whether the linear regression model fits the data better than a model that contains no explanatory variables.
    • \(H_0: \rho^2 = 0\) or \(\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\)
    • \(H_1: \rho^2 > 0\) or at least one \(\beta_j\) is different from 0
  • Reject the null hypothesis at \(p < 0.001\).
  • At least part of the variability of the dependent variable (price per \(m^2\)) can be explained by the linear effect of the explanatory variables.
  • The overall regression model is significant; the F statistic is recomputed from \(R^2\) below.
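
The overall F statistic can also be recovered from \(R^2\) alone, which makes the tested hypothesis concrete; a quick check with k = 4 explanatory variables and n = 80 units:

(0.5623 / 4) / ((1 - 0.5623) / 75)  # F = (R^2/k) / ((1-R^2)/(n-k-1)), about 24.1 as reported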

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

mydataC$FittedValues <- fit3$fitted.values

mydataC$Residuals <- resid(fit3)

head(mydataC[ , colnames(mydataC) %in% c("ID", "Price", "FittedValues", "Residuals")])
##   ID Price FittedValues  Residuals
## 1  1  1640     1728.641  -88.64095
## 2  2  2800     2356.597  443.40256
## 3  3  1660     1722.609  -62.60903
## 4  4  1850     1539.312  310.68782
## 5  5  1640     1989.286 -349.28625
## 6  6  1770     1912.655 -142.65528
  • \(\hat{Y}_{ID2} = \beta_0 + \beta_1 \cdot \text{Age}_{ID2} + \beta_2 \cdot \text{Distance}_{ID2} + \beta_3 \cdot \text{ParkingF}_{ID2} + \beta_4 \cdot \text{BalconyF}_{ID2}\)
  • \(\hat{Y}_{ID2} = 2393.316 - 7.970 \cdot 18 - 21.961 \cdot 1 + 128.700 \cdot 1 + 6.032 \cdot 0 = 2356.597\)
  • \(e_{ID2} = Y_{ID2} - \hat{Y}_{ID2} = 2800 - 2356.597 = 443.403\)
  • The difference between the actual and estimated price per \(m^2\) for apartment ID2 is 443.403 EUR; the same numbers can be pulled directly from the model, as sketched below.
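
A direct cross-check of the hand computation, using predict() and the saved residuals:

predict(fit3, newdata = mydataC[mydataC$ID == 2, ])  # fitted value, about 2356.6
mydataC$Residuals[mydataC$ID == 2]                   # residual, about 443.4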