Take home exam AJDA ROBIDA

Task 1

data()
data(package = .packages(all.available = TRUE))

library(boot)
mydata <- force(melanoma)

head(mydata)

##   time status sex age year thickness ulcer
## 1   10      3   1  76 1972      6.76     1
## 2   30      3   1  56 1968      0.65     0
## 3   35      2   1  41 1977      1.34     0
## 4   99      3   0  71 1968      2.90     0
## 5  185      1   1  52 1965     12.08     1
## 6  204      1   1  28 1971      4.84     1

Explanation of variables: - time: Survival time in days since the operation - status: The patients status at the end of the study. 1 indicates that they had died from melanoma, 2 indicates that they were still alive and 3 indicates that they had died from causes unrelated to their melanoma. - sex: The patients sex; 1=male, 0=female. - age: Age in years at the time of the operation. - year: Year of operation. - thickness: Tumour thickness in mm. - ulcer: Indicator of ulceration; 1=present, 0=absent

colnames(mydata)[3] <- "gender"

mydata$GenderF <- factor(mydata$gender,
                      levels = c(0, 1),
                      labels = c("F", "M"))
head (mydata)

##   time status gender age year thickness ulcer GenderF
## 1   10      3      1  76 1972      6.76     1       M
## 2   30      3      1  56 1968      0.65     0       M
## 3   35      2      1  41 1977      1.34     0       M
## 4   99      3      0  71 1968      2.90     0       F
## 5  185      1      1  52 1965     12.08     1       M
## 6  204      1      1  28 1971      4.84     1       M

I renamed the variable sex into gender and change 0, 1 into M, F to be more logical when analysing.

summary(mydata)

##       time          status         gender            age             year     
##  Min.   :  10   Min.   :1.00   Min.   :0.0000   Min.   : 4.00   Min.   :1962  
##  1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1968  
##  Median :2005   Median :2.00   Median :0.0000   Median :54.00   Median :1970  
##  Mean   :2153   Mean   :1.79   Mean   :0.3854   Mean   :52.46   Mean   :1970  
##  3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:65.00   3rd Qu.:1972  
##  Max.   :5565   Max.   :3.00   Max.   :1.0000   Max.   :95.00   Max.   :1977  
##    thickness         ulcer       GenderF
##  Min.   : 0.10   Min.   :0.000   F:126  
##  1st Qu.: 0.97   1st Qu.:0.000   M: 79  
##  Median : 1.94   Median :0.000          
##  Mean   : 2.92   Mean   :0.439          
##  3rd Qu.: 3.56   3rd Qu.:1.000          
##  Max.   :17.42   Max.   :1.000

I see that I don’t have any missing data, so I won’t delete some units with drop-na function.

#install.packages("dplyr") 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- dplyr::mutate(mydata, ID = row_number())

head(mydata)

##   time status gender age year thickness ulcer GenderF ID
## 1   10      3      1  76 1972      6.76     1       M  1
## 2   30      3      1  56 1968      0.65     0       M  2
## 3   35      2      1  41 1977      1.34     0       M  3
## 4   99      3      0  71 1968      2.90     0       F  4
## 5  185      1      1  52 1965     12.08     1       M  5
## 6  204      1      1  28 1971      4.84     1       M  6

I added new variable, named ID.

tail ((mydata), 4)

##     time status gender age year thickness ulcer GenderF  ID
## 202 4668      2      0  40 1965      6.12     0       F 202
## 203 4688      2      0  42 1965      0.48     0       F 203
## 204 4926      2      0  50 1964      2.26     0       F 204
## 205 5565      2      0  41 1962      2.90     0       F 205

Showing only last 4 rows.

mydata <- mydata[c(8, 3, 4, 5, 6, 7, 1, 2)]
head(mydata)

##   GenderF gender age year thickness ulcer time status
## 1       M      1  76 1972      6.76     1   10      3
## 2       M      1  56 1968      0.65     0   30      3
## 3       M      1  41 1977      1.34     0   35      2
## 4       F      0  71 1968      2.90     0   99      3
## 5       M      1  52 1965     12.08     1  185      1
## 6       M      1  28 1971      4.84     1  204      1

I made difference in column order so I couls easily analyse.

mydata2 <- mydata %>% select(-c(1, 2, 4, 5, 6, 7))

mydata2 <- mydata[mydata$age >= 20 & mydata$status == 3, ]

I made new data set. Firstly, I deleted the variables which I won’t need, then I made the set with only age and status 3 (which means the died), so I will easier see the colaration between age of deaths from malenoma. Now I have only 14 obs. so it’s easier to analyse.

#install.packages("pastecs")
library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

summary(mydata[, -c(1, 2, 6, 7, 8)])

##       age             year        thickness    
##  Min.   : 4.00   Min.   :1962   Min.   : 0.10  
##  1st Qu.:42.00   1st Qu.:1968   1st Qu.: 0.97  
##  Median :54.00   Median :1970   Median : 1.94  
##  Mean   :52.46   Mean   :1970   Mean   : 2.92  
##  3rd Qu.:65.00   3rd Qu.:1972   3rd Qu.: 3.56  
##  Max.   :95.00   Max.   :1977   Max.   :17.42

I wanted observe only this 3 specific variables variables.The youngest patient was 4 years old, the oldest 95. Avarage age of patient is 52,46. Because the median was 54, half of the patients were older than that and half was younger. We can see that all the operations happened between year 1962 and 1977. The smallest tumor thickness was 0,10mm and the biggest 17,42mm.

summary(mydata[, c(1)])

##   F   M 
## 126  79

We have some category variables. We can see that there was more women than men having the operation.

round(stat.desc(mydata[ , 3]))

##      nbr.val     nbr.null       nbr.na          min          max        range 
##          205            0            0            4           95           91 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##        10755           54           52            1            2          278 
##      std.dev     coef.var 
##           17            0

Statistic only for age. The 95% CI for the mean of age is 2.

#install.packages("ggplot2")
library(ggplot2)

ggplot(mydata, aes(x = thickness))+
  theme_bw()+
  geom_histogram(binwidth = 1, colour = "hotpink", fill = "lavenderblush")+
  ggtitle("Distribution of tumor thickness")+
  ylab("Frequency")+
  xlab("Thickness of tumor (mm)")

On the histogram we can see that the thickness is positively or right side. Tumor above 8mm is quite rare. The most frequently the tumor is about 2mm.

#install.packages("car")
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:boot':
## 
##     logit

scatterplot(y = mydata$thickness, x = mydata$age,
            ylab = "Tumor thickness (mm)",
            xlab = "Age",
            boxplots = FALSE,
            smooth = FALSE)

We can see a positive correlation between age and thickness, what makes sense because usually the tumor gets the chance grow with age. We can see with the regression line on the scatter plot, the slope is positive.

Task 2

library(readxl)

Business_School <- read_excel("Task 2/Business School.xlsx")
head(Business_School)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

library(ggplot2)

ggplot(Business_School, aes(x = `Undergrad Degree`))+
  geom_bar()+
  labs(title = "University Degree Distribution", x = "Degree", y = "Count")

Business degree is most common.

summary(Business_School$`Annual Salary`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

library(pastecs)

round(stat.desc(Business_School$`Annual Salary`))

##      nbr.val     nbr.null       nbr.na          min          max        range 
##          100            0            0        20000       340000       320000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     10905800       103500       109058         4150         8235   1722373475 
##      std.dev     coef.var 
##        41501            0

library(scales)

ggplot(Business_School, aes(x = `Annual Salary`))+
  geom_histogram()+
  scale_x_continuous(labels = comma)+
  labs(title = "Annual Salary Distribution", x = "Annual Salary", y = "Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Graph is right skewed. Mediana of annual salary is 103 500 (that means for half is higher, for half is lower), average salary is 109 058, minimal salary is 20 000, maximum 340 000. Range is 320 000.

library(stats)

t.test(Business_School$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

With this t-test e researched two hypothesis. H0= average MBA grade is equal to 74, Ha = average MBA Grade is not equal to 74. Because we are checking equality, the mean of MBA grade can either be higher or lower than 74 which makes this test a two tailed test. Conducting the t-test we can deduct that at a 5% level of significance (a= 5%) we can reject null hypothesis. The p value for the t-test is 0,009 or 9% which also equals to the percentage of us making the mistake and the null hypothesis being correct. Therefore we can say that the mean MBA grade is not equal to 74, it is higher. The 95% CI also shows us that we can say with 95% certainity that the population mean of MBA Grade of current MBA students lies between 74.52% and 77.56, which does not include 74.

Task 3

Import the dataset Apartments.xlsx

library(readxl)

Apartments <- read_excel("Task 3/Apartments.xlsx")

head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$BalconyF<-factor(Apartments$Balcony,
                    levels=c(1,0),
                    labels=c("Yes","No"))
Apartments$ParkingF<-factor(Apartments$Parking,
                    levels=c(1,0),
                    labels=c("Yes","No"))
head(Apartments)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony BalconyF ParkingF
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 Yes      No      
## 2    18        1  2800       1       0 No       Yes     
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 Yes      No      
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 Yes      No

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price,
       mu=1900,
       alternative="two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0 : mu = 1900 Ha : mu is not = 1900 We have to check the p-value. Here it is very low, so we have to reject the hypothesis.The 95% CI for the apartment price is between 1937.443 and 2100.440, which means that we can say with 95% certainty that the true population means lies between those 2 values. Also, this interval does not include 1900, so we can again reject the null hypothesis.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

scatterplot(y = Apartments$Price, 
            x = Apartments$Age, 
            ylab = "Price", 
            xlab = "Age", 
            smooth = FALSE)

fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(Apartments$Price,Apartments$Age)

## [1] -0.230255

We can see that the p-value for both coeffcients is again under the 0.05 so we reject the hypothesis at each. The regression coeffient for age tells us, if age increases by 1. the price will on average decrease by 8.975, if averything else stay equal. Coefficient of determination is 0.05302, so the multiple r squared statistics tells us that age explains only around 5% of variability of price. The square root of R-squared is the corellation coefficient is -0.23, that means there is a linear relationship between price and age.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(Apartments[ , c(-4,-5,-6, -7)], 
                  smooth = FALSE)

data2 <- lm(Price ~ Age + Distance, 
           data = Apartments)
vif(data2)

##      Age Distance 
## 1.001845 1.001845

When analysing scatter plot we can deduct that there is a potentinal problem with multycolinearity, by looking at the regression line on the scatter plot between the explanatory variables, in this case Age and Distance. In our case there is not problem, because we can see that the slope of the line is close to 0 therefore meaning that there is no significant correlation between the two variables.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
          data = Apartments) 

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The VIF statistic checks for multicolinearity between the explanatory variables. In our case the explanatury variables are both less than 5 and the mean is close to 1 which tells us there is no problem with multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3) 
Apartments$CooksD <- round(cooks.distance(fit2), 3) 

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Apartments$StdResid <- round(rstandard(fit2), 3) 
head(Apartments[order(-Apartments$StdResid), ])

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32 
## 2     2       11  2790       1       0 No       Yes          2.05  0.069
## 3    18        1  2800       1       0 No       Yes          1.78  0.03 
## 4    18        1  2800       1       1 Yes      Yes          1.78  0.03 
## 5     8        2  2820       1       0 No       Yes          1.66  0.037
## 6    10        1  2810       0       0 No       No           1.60  0.032

Apartments$CooksD <- round(cooks.distance(fit2), 3) 
head(Apartments[order(-Apartments$CooksD), ])

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32 
## 2    43       37  1740       0       0 No       No           1.44  0.104
## 3     2       11  2790       1       0 No       Yes          2.05  0.069
## 4     7        2  1760       0       1 Yes      No          -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes          1.58  0.061
## 6    40        2  2400       0       1 Yes      No           1.09  0.038

tail(Apartments[order(-Apartments$StdResid), ])

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    24        5  1830       1       0 No       Yes         -1.19  0.012
## 2    14       16  1660       0       1 Yes      No          -1.26  0.008
## 3    13        8  1800       0       0 No       No          -1.38  0.012
## 4    12       14  1650       0       1 Yes      No          -1.50  0.013
## 5    12       14  1650       0       0 No       No          -1.50  0.013
## 6     7        2  1760       0       1 Yes      No          -2.15  0.066

hist(Apartments$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

If we look at the standardized residuals we can see that none are above 3 or below -3, which would automatically categorize them as outlines.However if we look at the cooks distance we can see that one observation has a significantly higher cooks distance than all others, so we can decide to remove it from data set.

Apartments <- Apartments[-38, ]
head(Apartments[order(-Apartments$CooksD),], 6)

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    43       37  1740       0       0 No       No           1.44  0.104
## 2     2       11  2790       1       0 No       Yes          2.05  0.069
## 3     7        2  1760       0       1 Yes      No          -2.15  0.066
## 4    37        3  2540       1       1 Yes      Yes          1.58  0.061
## 5    40        2  2400       0       1 Yes      No           1.09  0.038
## 6     8        2  2820       1       0 No       Yes          1.66  0.037

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = fit2$StdResid, x = fit2$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Heteroskedasticity occurs when the variance is not constant. On this scatter plot we can see that dots are relatively constant, so there is no heteroskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments$StdResid, 
     prob = TRUE,
     xlab = "Standardized residuals", 
     ylab = "Density", 
     main = "Histogram of standardized residuals")
curve(dnorm(x, mean = mean(Apartments$StdResid), sd = sd(Apartments$StdResid)), 
      add = TRUE,
      lwd = 2,
      col = "red")

shapiro.test(Apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.94879, p-value = 0.002187

Because of the shape of the distribution, we see that is not normal.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, 
           data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

Both age and distance have a negative correlation with a price of apartment. One year older apartment is on average 6,4 units cheaper than one year younger. And apartment with one additional unit of distance is on average 22,9 units cheaper than the one closer. R squared is 0,48 which means 48% of variability of price is explained by age and distance. F-statistic shows that this model is good and explains at least some of the variability.

sqrt(summary(fit2)$r.squared)

## [1] 0.6955609

The square root of R-squared is the coefficient of correlation and it tells us the absolute correlation between the observed variables and the predicted variables.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = Apartments)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that at a 5% level of significance we can conclude that the fit3 regression fits data better than fit2. This is because if the p value is lower than 0.05.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2482.048     79.263  31.314  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFNo  -167.531     62.864  -2.665  0.00933 ** 
## BalconyFNo    15.207     59.201   0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

On average an apartment with parking is 167.5 units more expensive. Apartment having balcony is statistically irrelevant as p-value is greater than 0.05. F-statistic shows that we can reject null hypotheses so this model is good and explains at least some of the variability.

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3) 
Apartments$Residuals <- residuals(fit3)

head(Apartments[ ,colnames(Apartments)%in% c("Price", "Fitted", "Residuals")])

## # A tibble: 6 × 3
##   Price Fitted Residuals
##   <dbl>  <dbl>     <dbl>
## 1  1640  1706.     -66.0
## 2  2800  2372.     428. 
## 3  1660  1721.     -61.2
## 4  1850  1563.     287. 
## 5  1640  2012.    -372. 
## 6  1770  1908.    -138.

The residual for apartment ID2 is equal to 427.8.