HW 3

Polina Kamenchuk

1. Import the dataset Apartments.xlsx

library(readxl)

data <- read_xlsx("./Apartments.xlsx")

head(data)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

data$ParkingF <- factor(data$Parking, 
                               levels = c(0, 1), 
                               labels = c("No", "Yes")) 

data$BalconyF <- factor(data$Balcony, 
                               levels = c(0, 1), 
                               labels = c("No", "Yes")) 

head(data)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingF BalconyF
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 No       Yes     
## 2    18        1  2800       1       0 Yes      No      
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 No       Yes     
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 No       Yes

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

summary(data$Price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1400    1710    1950    2019    2290    2820

first of all, I suggest to make basic descriptive statistics of the variable that we will study. So the price is a numerical variable. as we can see the minimum and maximum values are 1400 and 2820 respectively. the mean in the study of which we are interested is 2019. the median is 1950, that is, half or less of the apartments have a price of 1950 EUR per square meter

to investigate this question we have to conduct a t-test. he has two assumptions. the first - the variable must be numerical. this is true in our case the second is normality: the price must be normally distributed. Let’s check the second

library(ggplot2)
ggplot(data, aes(x = Price)) +
  geom_histogram(binwidth = 150, colour = "black") + 
  ylab("Frequency") + 
  xlab("Price")

based on the graph we can suggest that price is not normally distributed as histogram is a bit right-skewed and maybe bimodal because we can see the first mode somewhere in 1700 in the second one somewhere in 2250

to be more sure let’s run Shapiro-Wilk test to check normality

shapiro.test(data$Price)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Price
## W = 0.94017, p-value = 0.0006513

Hypothesis for Shapiro-Wilk test:

H0: price is normally distributed

H1: price is not normally distributed

We reject H0 at p<0.001

Hence, Price variable is not normally distributed

This means we have to do non-parametric test which is wilcoxon signed rang test and here we will check medians but not means anymore

median(data$Price)

## [1] 1950

the median is 1950, that is, half or less of the apartments have a price of 1950 dollars per square meter

wilcox.test(data$Price,
            mu = 1900,
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  data$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

Set hypotesis for the wilcoxon signed rang test:

H0: Mu_Price = 1900

H1: Mu_Price =/ 1900

We reject H0 at p=0.03 Hence, the median price per sq meter is NOT equal to 1900 eur.

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age,     ## y = Price, x = Age
          data = data)

summary(fit1)    ## display summary

## 
## Call:
## lm(formula = Price ~ Age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor_coef <- sqrt(summary(fit1)$r.squared)    ##calculate coefficient of correlation, which is root square of coefficient of determination
cor_coef       ## display coefficient of correlation

## [1] 0.230255

the model will be the following

Price = 2185.455 -8.975*Age

Intercept (regression constant): if the apartment has no age, that is, the age will be zero, that is, the apartment will be newly built, its price will be on average 2185.45 euros per square meter
Age coefficient: if the age of an apartment increases by one year, then the price of the apartment will decrease by 8.975 euros per square meter on average. (at p = 0.034)

Coefficient of determination: Adjusted R-squared: 0.04161 the coefficient of determination means that with the help of this model we explained 4.16% of the reasons for the change in the price of apartments

coefficient of correlation = 0.23 this tells us that the correlation is negative and according to Pearson it is weak

5. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(data [  ,  c(1,2,3)],   ## this function to construct a matrix of scatterplots. we choose columns number 1 2 and 3 as a vector
                  smooth = FALSE)

we can analyze the slope of the regression line on scatterplots

as we can see from the scatterplot between age and distance, the line is almost perfectly horizontal, meaning there is no correlation. that is, for any change in the distance, the age remains unchanged on average. and vice versa, for any change in age, the distance remains unchanged on average

and if we talk about age and price, then here we can trace a negative weak correlation which can be seen from the way the points are located around the regression line

if we analyze the distance and the price, the correlation is negative and I would say of medium strength, which means that as the distance increases, the price will decrease and this effect has a moderate strength, as can be seen from the way the points are located around the regression line

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, 
          data = data)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Model will be the following

Price = 2460.101 -7.934 * Age -20.667 * Distance

Coefficient interpretation:

Intercept: if all explanatory variables are equal to zero, that is, the age of the apartment is 0 years and the apartment is newly built, and the distance is 0 km and it is located in the heart of the city, then the price of the apartment will be on average 2460.101 euros per square meter (at p < 0.001)
Age coef: if the age of the apartment increases by one year, then the price of the apartment will on average fall by 7.934 euros per square meter, ceteris paribus, that is, keeping the distance the same (at p = 0.016)
Distance coef: if the distance of the apartment from the city centre increases by one km, then the price of the apartment will on average fall by 20.667 euros per square meter, ceteris paribus, that is, keeping the age the same (at p <0.001)

7. Chech the multicolinearity with VIF statistics. Explain the findings. (1 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

he VIF (Variance Inflation Factor) statistic is used to check the strengths of multicollinearity

if the vif value is greater than five, it means that there is a very strong multicollinearity between the independent variables and one of them should be excluded

also the average vif should be close to one

as we can see in our case the vif statistic is very close to 1 for both independent variables

Therefore, multicollinearity is absent

8. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals

hist(data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

we can detect outliers by calculating standardized residuals

values of standardized residuals should be in the ragne -/+ 3. if there are any units with the value less than -3 or more than +3 this units should be considered as outliers and be removed

in our case as we can see the histogram is a bit right-skewed however it looks moral more or less normal

And, what is important in our case, there are no units outside with the -/+ 3 range

Hence, no outliers

data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

we can calculate Cooks distances in order to detect units with high impact

when we calculate Cooks distances and build the histogram, we can check if there are any gaps in the histogram. if there are some gaps then there are units with a high impact.

as we can see in our case there is such a gap so we have to remove this unit

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Create a new column named 'ID' containing the serial numbers
data <- data %>%
  mutate(ID = 1:n())

head(data)

## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD    ID
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl> <int>
## 1     7       28  1640       0       1 No       Yes        -0.665  0.007     1
## 2    18        1  2800       1       0 Yes      No          1.78   0.03      2
## 3     7       28  1660       0       0 No       No         -0.594  0.006     3
## 4    28       29  1850       0       1 No       Yes         0.754  0.008     4
## 5    18       18  1640       1       1 Yes      Yes        -1.07   0.005     5
## 6    28       12  1770       0       1 No       Yes        -0.778  0.005     6

here I created a new column called ID, it will contain the serial number of each unit and will help us identify and remove high impact units

head(data[order(data$StdResid),], 3) #Three units with lowest value of stand. residuals

## # A tibble: 3 × 10
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD    ID
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl> <int>
## 1     7        2  1760       0       1 No       Yes         -2.15  0.066    53
## 2    12       14  1650       0       1 No       Yes         -1.50  0.013    13
## 3    12       14  1650       0       0 No       No          -1.50  0.013    72

head(data[order(-data$CooksD),], 6) #Six units with highest value of Cooks distance

## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD    ID
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl> <int>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32     38
## 2    43       37  1740       0       0 No       No           1.44  0.104    55
## 3     2       11  2790       1       0 Yes      No           2.05  0.069    33
## 4     7        2  1760       0       1 No       Yes         -2.15  0.066    53
## 5    37        3  2540       1       1 Yes      Yes          1.58  0.061    22
## 6    40        2  2400       0       1 No       Yes          1.09  0.038    39

here we can see that the unit with the lowesr std resid has the serial number 53, std resid = 2.152, which is still within a -/+3 range

the unit with the hisgest Cooks distances is number 38, it must be removed

library(dplyr)
data <- data %>%
  filter(!ID == "38") #Removing unit number 38 from the dataset

fit2 <- lm(Price ~ Age + Distance, 
          data = data)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

fiting new model, without that unit with high impact

however, now we have to do analysis once again.

So,

data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals

hist(data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

so, standardized residuals are still within a range of -/+ 3

data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

however in the Cook’s distances histogram there is still a unit with the gap, this is, the unit with high impact, that has to be removed in the same way and with the same logic

head(data[order(-data$CooksD),], 6) #Six units with highest value of Cooks distance

## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD    ID
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl> <int>
## 1    43       37  1740       0       0 No       No           1.60  0.129    55
## 2     2       11  2790       1       0 Yes      No           2.22  0.084    33
## 3     7        2  1760       0       1 No       Yes         -2.24  0.072    53
## 4    37        3  2540       1       1 Yes      Yes          1.47  0.056    22
## 5     8       26  2300       1       1 Yes      Yes          1.82  0.052    25
## 6     8        2  2820       1       0 Yes      No           1.70  0.039    58

The unit to be removed is a unit with the ID number 55

library(dplyr)
data <- data %>%
  filter(!ID == "55") #Removing unit number 55 from the dataset

again fitting the very new modal

fit2 <- lm(Price ~ Age + Distance, 
          data = data)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

again checking residuals

data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals

hist(data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

they are still ok, within a -/+3 range

data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

as we can see here again there is a gap so again we have to repeat the whole same procedure with removing the high impact units, again building the regression model and again checking when the histogram no longer has gaps

with your permission I won’t describe it all again and again. here is the general logic. I will simply remove problematic units until all are gone

head(data[order(-data$CooksD),], 6)

## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD    ID
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl> <int>
## 1     2       11  2790       1       0 Yes      No           2.17  0.084    33
## 2     7        2  1760       0       1 No       Yes         -2.35  0.084    53
## 3    37        3  2540       1       1 Yes      Yes          1.57  0.065    22
## 4     8       26  2300       1       1 Yes      Yes          1.85  0.054    25
## 5    45       21  1910       0       1 No       Yes          1.07  0.05     31
## 6    40        2  2400       0       1 No       Yes          1.04  0.038    39

data <- data %>%
  filter(!ID == "33") 

data <- data %>%
  filter(!ID == "53")

here I will delete two units at once because their Cooks distances are absolutely identical, so they both cause problems

fit2 <- lm(Price ~ Age + Distance, 
          data = data)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -408.72 -217.93  -38.86  220.60  506.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2492.715     75.544  32.997  < 2e-16 ***
## Age           -7.496      3.169  -2.365   0.0205 *  
## Distance     -24.574      2.700  -9.100  6.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared:  0.5324, Adjusted R-squared:  0.5204 
## F-statistic:  44.4 on 2 and 78 DF,  p-value: 1.335e-13

data$StdResid <- round(rstandard(fit2), 3)  #Standardized residuals

hist(data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

finally the histogram of the cook distances does not contain any gaps hence there are no units with high impact anymore

9. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (1 p)

data$StdFitted <- scale(fit2$fitted.values)   ## create new column with fitted values

library(car)
scatterplot(y = data$StdResid, x = data$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

for me, it looks as if there is heteroscedasticity here, because in the left part of the plot, the unit dispersion is less than in the right part. the scatterplot looks as if it opens to the right, like a fan, which indicates heteroscedasticity. If there was homoscedasticity, then the data were would be distributed evenly and there would be no significant difference in dispersion on the right and left sides

10. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (1 p)

data$StdFitted <- scale(fit2$fitted.values)   ## create new column with fitted values

library(car)
scatterplot(y = data$StdResid, x = data$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.64762 
##  Prob > Chi2   =    0.1992832

now the formal test to check homoskedasticity for the regression model

lets set the hypothesis H0: the variance is constant measn there is a homoskedasticity H1: the variance is not constant means no homoskedasticity >> heteroskedasticity

we cannot reject H0 (p = 0.2, means p > 0.05)

Although the scatterplot analysis is more reminiscent of heteroskedasticity, the formal test did not confirm this. In addition, our sample is quite large (80). It’s not like it’s very big. but still we can assume that there is homoscedasticity here

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

fit2 <- lm(Price ~ Age + Distance, 
          data = data)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -408.72 -217.93  -38.86  220.60  506.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2492.715     75.544  32.997  < 2e-16 ***
## Age           -7.496      3.169  -2.365   0.0205 *  
## Distance     -24.574      2.700  -9.100  6.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared:  0.5324, Adjusted R-squared:  0.5204 
## F-statistic:  44.4 on 2 and 78 DF,  p-value: 1.335e-13

Model will be the following

Price = 2492.715 -7.496 * Age -24.574 * Distance

Coefficient interpretation:

Intercept: if all explanatory variables are equal to zero, that is, the age of the apartment is 0 years and the apartment is newly built, and the distance is 0 km and it is located in the heart of the city, then the price of the apartment will be on average 2492.715 euros per square meter (at p < 0.001)
Age coef: if the age of the apartment increases by one year, then the price of the apartment will on average fall by 7.496 euros per square meter, ceteris paribus, that is, keeping the distance the same (at p = 0.021)
Distance coef: if the distance of the apartment from the city centre increases by one km, then the price of the apartment will on average fall by 24.574 euros per square meter, ceteris paribus, that is, keeping the age the same (at p < 0.001)

Coefficient of determination = 0.5204, which means that the model 2 explains 52.04% of the variation in prices

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, 
          data = data)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403.16 -204.08  -41.24  239.52  528.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2372.693     93.344  25.419  < 2e-16 ***
## Age           -6.906      3.118  -2.215   0.0298 *  
## Distance     -22.254      2.839  -7.837 2.25e-11 ***
## ParkingFYes  137.479     60.854   2.259   0.0267 *  
## BalconyFYes   17.234     57.099   0.302   0.7636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared:  0.5621, Adjusted R-squared:  0.539 
## F-statistic: 24.38 on 4 and 76 DF,  p-value: 5.29e-13

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     78 5248925                              
## 2     76 4915944  2    332981 2.5739 0.08287 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lest set hypothesis

H0: delta ro square = 0 >> means both models are equally good

H1: delta ro square > 0 >> means the second model is better

we connot reject H0 (p>0.05), means with the help of the third model we were able to explain no more than with the help of the second

these are the results of the formal anova test to compare how good the models are

however, if we look at the R^2 of both models, we will see that for the third model it slightly increased. for the 2nd model it was 0.5204. for the 3rd one it became 0.539. this is only 1%, but still this identifies that with the help of the third model we explained a little more variability in the price

but again, the formal anova test showed that there is no difference in the quality of the models

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403.16 -204.08  -41.24  239.52  528.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2372.693     93.344  25.419  < 2e-16 ***
## Age           -6.906      3.118  -2.215   0.0298 *  
## Distance     -22.254      2.839  -7.837 2.25e-11 ***
## ParkingFYes  137.479     60.854   2.259   0.0267 *  
## BalconyFYes   17.234     57.099   0.302   0.7636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared:  0.5621, Adjusted R-squared:  0.539 
## F-statistic: 24.38 on 4 and 76 DF,  p-value: 5.29e-13

Model will be the following

Price = 2372.693 -6.906 * Age -22.254 * Distance + 137.479 * ParkingFYes + 17.234 * BalconyFYes

Coefficient interpretation:

Parking: ceteris paribus, keeping all other variables controlled and constant, switching from apartments without parking to those with parking, the price of an apartment will on average increase by 137.479 euros per square meter (at p = 0.027)
Balcony: is not statistically significant, hence, cannot be interpreted

. F-statictics is a test of significance of the model as a whole

Hypothesis to be tested:

H0: ro^2 = 0 H1: ro^2 > 0

“ro^2” -is the coeffition of determination but for population

we reject H0 at p<0.001, which means that we were able to explaine something with out model

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

data$Fitted <- round(fit3$fitted.values, 2)   ##to save fitted values 

data$Residual <- round(data$Price - data$Fitted, 2)     ## to calculate residuals

head(data[ 2 , c("ID", "Price", "Fitted", "Residual")])    ## to call for the second raw (apartment with ID 2), and vector of necessary columns "ID", "Price", "Fitted", "Residual"

## # A tibble: 1 × 4
##      ID Price Fitted Residual
##   <int> <dbl>  <dbl>    <dbl>
## 1     2  2800  2364.     436.