HW 3

OLEH BURDUKOV

1. Import the dataset Apartments.xlsx

library(readxl)
apartments <- read_xlsx('./Apartments.xlsx') #importing the Excel-dataset

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

How price of an apartment per square meter is influenced by such factors as balcony, distance from the city centre, parking lot and age of the apartment ?

3. Change categorical variables into factors. (0.5 p)

apartments$Parking <- factor(apartments$Parking,
                             levels = c(0,1),
                             labels = c('No','Yes'))
apartments$Balcony <- factor(apartments$Balcony,
                             levels = c(0,1),
                             labels = c('No','Yes'))
str(apartments) #to make sure if every variable is now numeric or factor.

## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

#one sample two sided t-test
t.test(apartments$Price,
       mu = 1900,
       alternative = 'two.sided')

## 
##  One Sample t-test
## 
## data:  apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0: µ = 1900
H1: µ ≠ 1900

We reject the H0 at p = 0.005. We conclude that there is enough evidence for mean value of price to be different from 1900€.

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

# coefficient of correlation
cor(apartments$Price, apartments$Age,
    method = c('pearson'))

## [1] -0.230255

From the pearson correlation coefficient (because both variables are numeric), we can conclude that there is weak negative correlation between the price of an apartment per square meter and its age.

#
fit1 <- lm(Price ~ Age,
          data = apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Accordingly to the regression coefficient of age (b1 = -8.975), age of an apartment has a negative impact on its price per m^2. When the age of an apartment increases by 1 year, its price per 1 m^2 decreases in average by 8.975€ (= 8.98€) (p = 0.034)

!The model doesn’t consider the impact of other possible explanatory variables. So, we don’t have to mention the assumption, that all other explanatory variables remain unchanged

The coefficient of determination shows us how many percents of variability of the dependent variable is explained by linear effect of the explanatory variables. In our case it means, that the linear effect of apartment’s age explains approximately 5,3% of variability of the apartment’s price per 1 m^2.
Moreover, the next line of output is the test of significance of the regression model
- H0: ‘roh’^2 = 0.
- H1: ‘roh’^2 > 0.

Due to the p-value we see that the risk of the 1st type error is low enough, so we can reject H0. We conclude that H1 is true and ‘roh’^2, which is population multiple regression coefficient, is greater than 0. Which means that age does have impact on apartments price per m^2.

6. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(apartments[,-c(4,5)],
                  smooth = FALSE)

Considering the scatterplot matrix, we can make a conclusion that there is no multicolinearity between the age of apartment and its distance from the city centre.

# to make sure the conclusions are right
cor(apartments$Age, apartments$Distance)

## [1] 0.04290813

here, for example, we can observe semi-strong positive correlation, which seems graphically to be so. The third (most secure) method of checking the multicolinearity is going to be conducted further in the task 8

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = apartments)

8. Chech the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

In line with the output of the Variance Inflation Statistics, we find no violation of assumption about multicolinearity, due to the values of VIF near 1 (therefore these values are of course less than 5, which is our assumed upper margin for this parameter)

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

apartments$StanRes <- round(rstandard(fit2), 3) #creating the column with (rounded) standardized residuals
apartments$CooksDis <- round(cooks.distance(fit2), 3) #creating the column with (rounded) Cook's distances
head(apartments[order(-apartments$StanRes),],) #displaying the first 6 observations in descending order of standardized residuals

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StanRes CooksDis
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>    <dbl>
## 1     5       45  2180 Yes     Yes        2.58    0.32 
## 2     2       11  2790 Yes     No         2.05    0.069
## 3    18        1  2800 Yes     No         1.78    0.03 
## 4    18        1  2800 Yes     Yes        1.78    0.03 
## 5     8        2  2820 Yes     No         1.66    0.037
## 6    10        1  2810 No      No         1.60    0.032

head(apartments[order(apartments$StanRes),] ,) #displaying the first 6 observations in increasing order of standardized residuals

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StanRes CooksDis
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>    <dbl>
## 1     7        2  1760 No      Yes       -2.15    0.066
## 2    12       14  1650 No      Yes       -1.50    0.013
## 3    12       14  1650 No      No        -1.50    0.013
## 4    13        8  1800 No      No        -1.38    0.012
## 5    14       16  1660 No      Yes       -1.26    0.008
## 6    24        5  1830 Yes     No        -1.19    0.012

head(apartments[order(-apartments$CooksDis),]) #displaying the first 6 observations in descending order of Cook´s distance

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StanRes CooksDis
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>    <dbl>
## 1     5       45  2180 Yes     Yes        2.58    0.32 
## 2    43       37  1740 No      No         1.44    0.104
## 3     2       11  2790 Yes     No         2.05    0.069
## 4     7        2  1760 No      Yes       -2.15    0.066
## 5    37        3  2540 Yes     Yes        1.58    0.061
## 6    40        2  2400 No      Yes        1.09    0.038

Already in the first pair of tables, which relate to the standardized residuals, we easily can see that there are no outliers in the sample, as the highest absolute value is still below 3.

On the other hand, the third table indicates some values with high influence in the two upper rows.

At this point, the plot for the standardized residuals is going to be presented for additional visual survey. The cooks distance plot will follow for the same reason.

hist(apartments$StanRes,
     main = 'Standardized residuals',
     xlab = 'Standardized residuals',
     xlim = c(-3, 3),
     ylab = 'Frequency')

The results of histogram is equal to the informatzion presented in the data table. Therefore, there are no outliers found in the sample.

hist(apartments$CooksDis,
     main = 'Cook`s distances',
     ylab = 'Frequency',
     xlab = 'Cook`s distance',
     breaks = 'Sturges') # histogram of the Cook's distances

Correspondingly to the presented histogram of the Cook`s distances, we might suggest to delete the observation with the value of 0.32, as this one has considerable impact among all observations.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

apartments <- apartments %>% 
  filter(!CooksDis > 0.3) #deleting the observation with the highest value of Cooks distance

We now repeat the histogram to check the units with big impact again

hist(apartments$CooksDis,
     main = 'Cook`s distances',
     ylab = 'Frequency',
     xlab = 'Cook`s distance',
     breaks = 'Sturges') #repeating the histogram

The new histogram indicates again another unit with high influence (Cooks distance = 0.104 accordingly to the data table)

library(dplyr)
apartments <- apartments %>% 
  filter(!CooksDis > 0.1) #deleting the observation with the highest value of Cooks distance

hist(apartments$CooksDis,
     main = 'Cooks distance',
     ylab = 'Frequency',
     xlab = 'Cooks distance',
     breaks = 'Sturges') #repeating the histohram

head(apartments[order(-apartments$CooksDis),],) #displaying the first 6 observations in descending order of cooks distance.

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StanRes CooksDis
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>    <dbl>
## 1     2       11  2790 Yes     No         2.05    0.069
## 2     7        2  1760 No      Yes       -2.15    0.066
## 3    37        3  2540 Yes     Yes        1.58    0.061
## 4    40        2  2400 No      Yes        1.09    0.038
## 5     8        2  2820 Yes     No         1.66    0.037
## 6     8       26  2300 Yes     Yes        1.57    0.034

Since there are 3 observations at once with values of Cooks distance quite close to each other and greater then 0.6, we won’t exclude them, as they do not seem to be quite a problem, due to their close position to the next observations

fit2 <- lm(Price ~ Age + Distance,
           data = apartments)

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (0.5 p)

apartments$Fittedval <- scale(fit2$fitted.values)

library(ggplot2)
ggplot(apartments, aes(x = Fittedval, y = StanRes))+
  geom_point(colour = 'darkseagreen4')+
  labs(x = 'standardized fitted values',
       y = 'standardized residuals',
       title = 'The scatterplot between stan. fitted values and stan. residuals')+
  theme_grey()

Considering the scatterplot, we can conclude the heteroscedasticity of the errors, as the variance of errors doesn’t seem constant on different values of explanatory variables.

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

library(ggplot2)
ggplot(apartments, aes(x = StanRes))+
  geom_histogram(binwidth = 0.4, color = 'burlywood4', fill = 'cornflowerblue', linetype = 'longdash')+
  labs(x = 'Standardized residuals',
       y = 'Frequency',
       title = 'Distribution of standardized residuals')+
  theme_bw()

At this point the distribution of standardized residuals seems to be normal, though probably bimodal or even multimodal. In this case, I would assume to conduct the corresponding test, which is Shapiro-Wilk test.

shapiro.test(apartments$StanRes)

## 
##  Shapiro-Wilk normality test
## 
## data:  apartments$StanRes
## W = 0.94963, p-value = 0.002636

H0: the distribution of errors is linear
H1: the distribution of errors is not linear

We reject the H0 at p = 0.003. Apparently, there is not enough evidence for the distribution of errors to be normal. So we can assume the distribution of errors is not normal. But in fact, though these findings prove that the assumption of normality of the distribution isn’t met, it could be ignored since the number of observations is large enough. And so we can proceed the analysis of assumptions and requirements.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance,
           data = apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Age: Assuming all other explanatory variables remain unchanged, the price of 1 m^2 of apartment decreases on average by 7.85€, if the age of the apartment increases by 1 year.
Distance: Assuming all other explanatory variables remain unchanged, the price of 1 m^2 of apartment decreases on average by 23. 95€, if the distance to the city centre increases by 1 km
Coefficient of determination for sample (R^2): 49.68% of variability in the price for 1 m^2 of apartment can be explained by the linear effect of apartment’s age and its distance from the city centre.

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Balcony + Parking,
           data = apartments)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Balcony + Parking
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Accordingly to results of the ANOVA test, in which

H0: ∆‘roh’^2 = 0
H1: ∆‘roh’^2 > 0

we reject the H0 at p = 0.03. The population multiple coefficient of determination of the model fit3 is better than such coefficient of the model fit2. The model fit3 fits the data better.

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Balcony + Parking, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## BalconyYes    -6.985     58.745  -0.119  0.90566    
## ParkingYes   168.921     62.166   2.717  0.00811 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

Assuming all other explanatory variables remain unchanged, if an apartment has a balcony, the price of m^2 in the apartment decreases on average by 6.98€ in comparison to the apartments without a balcony.
Assuming all other explanatory variables remain unchanged, if an apartment has a parking, the price of square meter in the apartment increases on average by 168.92€ in comparison to the apartments without parking
H0: ‘roh’^2 = 0 [the model doesn’t explain anything]
H1: ‘roh’^2 > 0 [at least 1 explanatory variable explains the differences in the price per m^2 of an apartment]

We reject the H0 at p < 0.001. We find the linear relationship between the variables in the model. At least 1 explanatory variable explains the differences in the price per m^2 of an apartment.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

apartments$FitVal <- fitted.values(fit3)
apartments$Residuals <- residuals(fit3)

apartments$ID <- c(1: nrow(apartments))
head(apartments)

## # A tibble: 6 × 11
##     Age Distance Price Parking Balcony StanRes CooksDis Fittedval[,1] FitVal Residuals    ID
##   <dbl>    <dbl> <dbl> <fct>   <fct>     <dbl>    <dbl>         <dbl>  <dbl>     <dbl> <int>
## 1     7       28  1640 No      Yes      -0.665    0.007        -0.953  1707.     -66.8     1
## 2    18        1  2800 Yes     No        1.78     0.03          1.13   2377.     423.      2
## 3     7       28  1660 No      No       -0.594    0.006        -0.953  1714.     -53.8     3
## 4    28       29  1850 No      Yes       0.754    0.008        -1.66   1534.     316.      4
## 5    18       18  1640 Yes     Yes      -1.07     0.005        -0.382  2009.    -369.      5
## 6    28       12  1770 No      Yes      -0.778    0.005        -0.139  1896.    -126.      6

Y_fitted = 2358.9 - 7.197 * 18 - 21.241 * 1 - 6.985 * 0 + 168.921 * 1
print(Y_fitted) # the same value can be seen in the data table, namely in the column of the fitted values

## [1] 2377.034

# estimating the residual for ID = 2. Actual price for the m^2 - fitted value of the price for m^2
2800 - Y_fitted

## [1] 422.966

# checking if the calculation is right (though it is easy to indicate it in the data set)
library(dplyr)
apartments %>% 
  filter(ID == '2') %>% 
  pull(Residuals) #I had to use function 'pull()', as the function 'select()' somehow rounded up the results anytime I was knitting the output

##        2 
## 422.9572