HW 3

NAME SURNAME

Nam Anh Le

1. Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
head(mydata)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2.What could be a possible research question given the data you analyze? (1 p)

A possible research question would be whether Parking and Price related to each other?

3. Change categorical variables into factors. (0.5 p)

mydatanew <- mydata
mydatanew$Parking <- factor(mydata$Parking,
                            levels = c(0,1),
                            labels = c("No","Yes"))
mydatanew$Balcony <- factor(mydata$Balcony,
                            levels = c(0,1),
                            labels = c("No","Yes"))
head(mydatanew)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 No      Yes    
## 2    18        1  2800 Yes     No     
## 3     7       28  1660 No      No     
## 4    28       29  1850 No      Yes    
## 5    18       18  1640 Yes     Yes    
## 6    28       12  1770 No      Yes

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

\(H_0\): \(\mu_{\text{Price}}\) = 1900 euro
\(H_1\): \(\mu_{\text{Price}}\) \(\ne\) 1900 euro

Assumptions:

Variable is numeric
Normality - variable on the population is normally distributed

Normality test:

\(H_0\): Price is normally distributed on the population
\(H_1\): Price is not normally distributed on the population

library(ggplot2)
ggplot(mydatanew, aes(x = Price)) +
  geom_histogram(binwidth = 150, colour = "black") +
  ylab("Frequency") + 
  xlab("Price in EUR")

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

shapiro.test(mydatanew$Price)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydatanew$Price
## W = 0.94017, p-value = 0.0006513

Based on the Shapiro Test, we reject \(H_0\) at p-value = 0.001

Non Parametric Test:

\(H_0\): \(Me_{\text{Price}}\) = 1900 euro
\(H_1\): \(Me_{\text{Price}}\) \(\ne\) 1900 euro

wilcox.test(mydatanew$Price,
            mu = 1900,
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  mydatanew$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

Based on the Wilcox test, we reject \(H_0\) at p-value = 0.029

Conclusion: The median Price per \(m^2\) is not 1900 euro

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age,
           data = mydatanew)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydatanew)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(mydatanew$Price,mydatanew$Age)

## [1] -0.230255

Regression Coefficient: b = -8.975
Explanation: If Age increases by 1 year, the Price per \(m^2\) of the Apartment on average reduces by 8.975 euros.

Coefficient of Correlation: -0.23
Explanation: The linear relationship between Price and Age is negative and weak.

Coefficient of Determination: 0.053
Explanation: 5.3% of the variability of Price is explained by the linear effect of Age.

6. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(mydatanew[,-c(4,5)],smooth = FALSE)

There is no clear trend between Age and Distance, we can see that the line is nearly flat
Thus there is no multicolinearity issues

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydatanew)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Chech the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Based on VIF Statistic, There is weak multicolinearity between Age and Distance

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

mydatanew$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydatanew$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

head(mydatanew[order(mydatanew$StdResid),], 3) #Three units with lowest value of stand. residuals

## # A tibble: 3 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     7        2  1760 No      Yes        -2.15  0.066
## 2    12       14  1650 No      Yes        -1.50  0.013
## 3    12       14  1650 No      No         -1.50  0.013

head(mydatanew[order(-mydatanew$CooksD),], 6) #Six units with highest value of Cooks distance

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32 
## 2    43       37  1740 No      No          1.44  0.104
## 3     2       11  2790 Yes     No          2.05  0.069
## 4     7        2  1760 No      Yes        -2.15  0.066
## 5    37        3  2540 Yes     Yes         1.58  0.061
## 6    40        2  2400 No      Yes         1.09  0.038

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydatanew <- mydatanew %>%
  filter(!CooksD == 0.320) %>%
  filter(!StdResid == -2.152)

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance,
           data = mydatanew)

mydatanew$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydatanew$StdResid, x = mydatanew$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From the scatterplot, we see for different levels of the explanatory variables, the variance of errors is relativelyconstant

11. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(mydatanew$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydatanew$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydatanew$StdResid
## W = 0.93368, p-value = 0.0003444

Normality of standardized Residuals:

\(H_0\): Errors are normally distributed
\(H_1\): Errors are not normally distributed

Based on the Shapiro Test, we reject \(H_0\) at p-value = 0.001

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance,
           data = mydatanew)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.51 -223.89  -62.78  202.78  575.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2489.617     73.524  33.861  < 2e-16 ***
## Age           -7.350      3.103  -2.368   0.0203 *  
## Distance     -23.636      2.731  -8.654 4.21e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.1 on 80 degrees of freedom
## Multiple R-squared:  0.513,  Adjusted R-squared:  0.5008 
## F-statistic: 42.13 on 2 and 80 DF,  p-value: 3.177e-13

sqrt(summary(fit2)$r.squared)

## [1] 0.7162245

Regression Coefficient:

\(b_1\) = -7.35
\(b_2\) = -23.636

Explanation:

All other variables remain unchanged, if Age increases by 1 year, the Price per \(m^2\) of the Apartment on average decreases by 7.35 euros.
All other variables remain unchanged, if Distance from the city center increases by 1 km, the Price per \(m^2\) of the Apartment on average decreases by -23.636 euros.

Multiple Correlation Coefficient: 0.716
Explanation: The linear relationship between dependent and all explanatory variable is strong.

Coefficient of Determination: 0.513
Explanation: 51.3% of the variability of Price is explained by the linear effect of Age and Distance.

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = mydatanew)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5795128                              
## 2     78 5410469  2    384659 2.7727 0.06866 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is no significant difference between model fit3 and model fit2 since p-value = 0.069

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.58 -198.68  -44.44  229.33  529.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2367.282     93.781  25.243  < 2e-16 ***
## Age           -6.605      3.054  -2.162   0.0337 *  
## Distance     -21.140      2.878  -7.345 1.73e-10 ***
## ParkingYes   147.508     62.799   2.349   0.0214 *  
## BalconyYes    -3.122     58.635  -0.053   0.9577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263.4 on 78 degrees of freedom
## Multiple R-squared:  0.5453, Adjusted R-squared:  0.522 
## F-statistic: 23.39 on 4 and 78 DF,  p-value: 9.972e-13

\(H_0\): \(p^2\) = 0
\(H_1\): \(p^2\) > 0

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

mydatanew$Fitted <- round(fit3$fitted.values, 2)
mydatanew$Residual <- mydatanew$Price - mydatanew$Fitted

head(mydatanew)

## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony StdResid CooksD StdFitted[,1] Fitted Residual
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>         <dbl>  <dbl>    <dbl>
## 1     7       28  1640 No      Yes       -0.665  0.007       -0.893   1726.    -86.0
## 2    18        1  2800 Yes     No         1.78   0.03         1.15    2375.    425. 
## 3     7       28  1660 No      No        -0.594  0.006       -0.893   1729.    -69.1
## 4    28       29  1850 No      Yes        0.754  0.008       -1.55    1566.    284. 
## 5    18       18  1640 Yes     Yes       -1.07   0.005       -0.323   2012.   -372. 
## 6    28       12  1770 No      Yes       -0.778  0.005       -0.0731  1926.   -156.

The residual for Apartment ID2 is 425.23