Jan Verovšek

R take-home exam

TASK 1

library(readr) #Load library
airstat <- read_csv("R data/airline_passenger_satisfaction.csv") #import file
## Rows: 129880 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Gender, Customer Type, Type of Travel, Class, Satisfaction
## dbl (19): ID, Age, Flight Distance, Departure Delay, Arrival Delay, Departur...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(airstat, 5) #show first 5 rows
## # A tibble: 5 × 24
##      ID Gender   Age `Customer Type` `Type of Travel` Class    `Flight Distance`
##   <dbl> <chr>  <dbl> <chr>           <chr>            <chr>                <dbl>
## 1     1 Male      48 First-time      Business         Business               821
## 2     2 Female    35 Returning       Business         Business               821
## 3     3 Male      41 Returning       Business         Business               853
## 4     4 Male      50 Returning       Business         Business              1905
## 5     5 Female    49 Returning       Business         Business              3470
## # ℹ 17 more variables: `Departure Delay` <dbl>, `Arrival Delay` <dbl>,
## #   `Departure and Arrival Time Convenience` <dbl>,
## #   `Ease of Online Booking` <dbl>, `Check-in Service` <dbl>,
## #   `Online Boarding` <dbl>, `Gate Location` <dbl>, `On-board Service` <dbl>,
## #   `Seat Comfort` <dbl>, `Leg Room Service` <dbl>, Cleanliness <dbl>,
## #   `Food and Drink` <dbl>, `In-flight Service` <dbl>,
## #   `In-flight Wifi Service` <dbl>, `In-flight Entertainment` <dbl>, …
airstat2 <- airstat[-c(31:129880), -c(4, 6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23)] #keep the first 30 rows and drop the unneeded columns

head(airstat2, 30) #show the first 30 rows of the narrowed-down dataset
## # A tibble: 30 × 9
##       ID Gender   Age `Type of Travel` `Flight Distance` `Departure Delay`
##    <dbl> <chr>  <dbl> <chr>                        <dbl>             <dbl>
##  1     1 Male      48 Business                       821                 2
##  2     2 Female    35 Business                       821                26
##  3     3 Male      41 Business                       853                 0
##  4     4 Male      50 Business                      1905                 0
##  5     5 Female    49 Business                      3470                 0
##  6     6 Male      43 Business                      3788                 0
##  7     7 Male      43 Business                      1963                 0
##  8     8 Female    60 Business                       853                 0
##  9     9 Male      50 Business                      2607                 0
## 10    10 Female    38 Business                      2822                13
## # ℹ 20 more rows
## # ℹ 3 more variables: `Arrival Delay` <dbl>, `Food and Drink` <dbl>,
## #   Satisfaction <chr>

Description

  • ID: Unique passenger identifier
  • Gender: Gender of the passenger (Female/Male)
  • Age: Age of the passenger (in years)
  • Type of Travel: Purpose of the flight (Business/Personal)
  • Flight Distance: Flight distance in miles
  • Departure Delay: Flight departure delay in minutes
  • Arrival Delay: Flight arrival delay in minutes
  • Food and Drink: Satisfaction level with the food and drinks on the airplane, from 1 (lowest) to 5 (highest)
  • Satisfaction: Overall satisfaction level with the airline (Satisfied/Neutral or unsatisfied)
airstat2$`Meal Satisfaction` <- cut(airstat2$`Food and Drink`, 
                                   breaks = c(-Inf, 2, 4, Inf), #(-Inf,2] = scores 1-2, (2,4] = scores 3-4, (4,Inf] = score 5
                                   labels = c("Unsatisfied", "OK", "Satisfied")) #Create 3 ordered classes and save them in the new Meal Satisfaction variable


airstat2 <- airstat2[,-8] #Delete Food and Drink variable

Description

I transformed the Food and Drink variable into a categorical variable called Meal Satisfaction with three ordered classes: Unsatisfied (scores 1–2), OK (scores 3–4), and Satisfied (score 5).
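
As a quick sanity check of the recoding (a sketch - the counts depend on the 30 conveniently chosen rows), the class sizes can be inspected with table():

table(airstat2$`Meal Satisfaction`) #Number of passengers in each of the three classes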

names(airstat2)[8] <- "Overall Satisfaction" #Change Satisfaction into Overall Satisfaction
library(pastecs) #Load library

round(stat.desc(airstat2$`Flight Distance`), 2) #Provide descriptive statistics rounded to 2 decimal points
##      nbr.val     nbr.null       nbr.na          min          max        range 
##        30.00         0.00         0.00       421.00      3788.00      3367.00 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     47782.00       853.00      1592.73       201.29       411.68   1215498.55 
##      std.dev     coef.var 
##      1102.50         0.69
median(airstat$Age) #Provide median for Age in original dataset
## [1] 40
median(airstat2$Age) #Provide median for Age in adjusted dataset
## [1] 48
min(airstat2$Age) #Minimum age of a passenger in the adjusted dataset
## [1] 9

Description

To obtain as many descriptive statistics as possible for the Flight Distance variable, I used stat.desc. The values are interpreted as follows (left to right in the output above):

  • Number of non-missing values: 30
  • Smallest flight distance observed: 421 miles
  • Largest flight distance observed: 3788 miles
  • Sum of the distances of all 30 flights: 47782 miles
  • Median (middle value) of the distances: 853 miles
  • Arithmetic average of all flight distances: 1592.73 miles
  • Standard deviation of the flight distances: 1102.5 miles

I calculated the median of Age for both the original and the adjusted dataset. The median age differs by 8 years, which reflects the change in sample size (the 30-row convenience subset is not representative of the full dataset).

Lastly, I calculated the minimum Age of a traveler (9 years).

boxplot(airstat2$Age,
        main = "Age distribution in dataset", # Title of the plot
         ylab = "Age",                    # Label for y-axis
        col = "gray",                      # Color of the box
        border = "black")                  # Border color

Explanation

The generated boxplot represents the distribution of the age of airline passengers from the second dataset. The box shows the middle 50% of the data (the lower and upper edges are the 1st and 3rd quartiles). The whiskers extend to the most extreme data points within 1.5× the interquartile range; points outside this range are outliers (four in total). The distribution is fairly balanced, with a few outliers.
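
The exact outlying values can be extracted with boxplot.stats(), which applies the same 1.5×IQR rule (a quick sketch):

boxplot.stats(airstat2$Age)$out #Ages flagged as outliers by the 1.5*IQR rule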

library(ggplot2) #Load library

hist(airstat2$`Flight Distance`, #Histogram from 2nd dataset for Flight distance
     ylab = "Number of passengers", #y-axis label
     xlab = "Distance traveled (miles)", #X-axis label
     main = "Flight distance", #Histogram title
     breaks = seq(from = 300, to = 3900, by = 300)) #Categories from 300 to 3900 with 300 width

hist(airstat$`Flight Distance`, #Histogram from original dataset for Flight distance
     ylab = "Number of passengers",
     xlab = "Distance traveled (miles)",
     main = "Flight distance - large dataset",
     breaks = seq(from = 0, to = 5100, by = 300), #Categories from 0 to 5100 with 300-mile width
     col = "green", #Fill color of the bars
     border = "red") #Outline colors of the bars

Interpretation of histograms

  • Flight distance: The majority of flight distances from the conveniently chosen smaller dataset (airstat2) are between 500 and 1000 miles. After that, the bars decrease in frequency as the distance increases.

  • Flight distance - large dataset: The majority of flights from the original dataset (airstat) are in the 0 to 1000 miles range, with a gradual decrease in frequency as the distance increases.

By comparing the two histograms from the two datasets (original and conveniently chosen smaller dataset), I aimed to determine how the chosen data differs from the original dataset. We can see that the original dataset shows a much wider range of distances and greater variance in flight distances (and passenger numbers).

It is clear that the histogram generated from the original dataset is positively skewed (to the right), which cannot be said with as much confidence for the histogram generated from the chosen data.
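
This impression can be quantified with a skewness coefficient. Below is a minimal sketch using the moment-based formula (the helper function is written by hand, not taken from a package); a clearly positive value indicates right skew:

skewness <- function(x) { x <- x[!is.na(x)]; mean((x - mean(x))^3) / sd(x)^3 } #Moment-based sample skewness
skewness(airstat2$`Flight Distance`) #Convenience subset
skewness(airstat$`Flight Distance`) #Original dataset - expected to be clearly positive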

library(ggplot2) #Load library

ggplot(airstat2, aes(y = `Arrival Delay`, x = `Departure Delay`)) + #Define x and y axis 
  geom_point() + #Add scatter points
  geom_smooth(method = "lm", se = FALSE, color = "blue") #Linear regression line without confidence intervals in blue colour 
## `geom_smooth()` using formula = 'y ~ x'

Interpretation

The plot shows a positive linear relationship between Departure Delay and Arrival Delay: as Departure Delay increases, Arrival Delay tends to increase as well (made clearer by the blue regression line). Most of the points cluster close to the line, although there are some outliers.
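
The strength of this relationship can be summarized with the Pearson correlation coefficient (a quick sketch; complete.obs drops any missing Arrival Delay values):

cor(airstat2$`Departure Delay`, airstat2$`Arrival Delay`, use = "complete.obs") #Values near 1 indicate a strong positive linear relationship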

TASK 2

library(readxl) #Load library
MBA <- read_excel("C:/Users/janve/Downloads/R Take Home Exam 2025/R Take Home Exam 2025/Task 2/Business School.xlsx") #Import data

head(MBA, 5) #Display first five rows of the data set
## # A tibble: 5 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

Description (in consecutive order)

  • Student ID: Numerical identifier for each student in the dataset.
  • Undergrad Degree: Field of study (or degree) of the student's undergraduate education.
  • Undergrad Grade: The student's grade during undergraduate studies (numerical).
  • MBA Grade: The student's grade during MBA studies (numerical).
  • Work Experience: Whether the student had work experience before attending the MBA (Yes/No).
  • Employability (Before): Score representing the student's employability before attending the MBA.
  • Employability (After): Score representing the student's employability after completing the MBA programme.
  • Status: Current employment status of the student (Placed - found employment).
  • Annual Salary: The student's annual salary after completing the MBA programme (in €).
library(ggplot2) #Load library

ggplot(MBA, aes(x = `Undergrad Degree`)) + #Initialize plot with Undergrad Degree on x
  geom_bar(fill="orange") + #Orange bars in bar chart
  labs(title = "Distribution of Undergraduate Degrees", #Title and axis labels
       x = "Undergraduate Degree",
       y = "Number of Students")+
  theme_minimal() #Minimal theme

Description

The most common degree prior to attending the MBA is a Business degree, followed by (in descending order) Computer Science and Finance degrees (with the same number of students), then Engineering, and lastly, with the fewest students, Art.
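
The ordering described above can be read directly from a sorted frequency table (a quick sketch):

sort(table(MBA$`Undergrad Degree`), decreasing = TRUE) #Degree counts in descending order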

library(pastecs)

round(stat.desc(MBA$`Annual Salary`), 0) #Descriptive statistics rounded to whole numbers
##      nbr.val     nbr.null       nbr.na          min          max        range 
##          100            0            0        20000       340000       320000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     10905800       103500       109058         4150         8235   1722373475 
##      std.dev     coef.var 
##        41501            0
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
ggplot(MBA, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 20000, colour = "pink", fill = "green", alpha = 0.5) +
  labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count") +
  scale_x_continuous(labels = scales::comma_format())

Descriptive statistics explanation

  • sum: The total of all annual salaries is 10905800€.
  • median: Half of the individuals earn below 103500€ and half above.
  • mean: The average annual salary is 109058€.
  • range: The difference between the minimum and maximum salary is 320000€.
  • std.dev: The standard deviation is 41501€, indicating that salaries are widely spread around the mean (the coef.var of 0 in the output is only a rounding artifact; 41501/109058 ≈ 0.38).

Histogram explanation

The histogram is positively skewed (to the right) with a higher concentration of salaries around 100000€. There are outliers pulling the mean to the right.
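
A simple numerical check of this skew: in right-skewed distributions the mean exceeds the median (a sketch):

mean(MBA$`Annual Salary`) - median(MBA$`Annual Salary`) #A positive difference (≈ 5558€) is consistent with right skew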

library(effectsize)

mean(MBA$`MBA Grade`) #MBA grade mean
## [1] 76.04055
sd(MBA$`MBA Grade`) #Standard deviation of MBA grade
## [1] 7.675114
t.test(MBA$`MBA Grade`, mu=74) #One sample t-test (Ho: mean MBA grade = 74)
## 
##  One Sample t-test
## 
## data:  MBA$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
cohens_d(MBA$`MBA Grade`, mu=74) #Effect size through Cohen's d
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

Explanation

I ran the t-test under the hypothesis that the average MBA grade of this year's students is equal to 74 (the same as the average grade of the previous year).

Null hypothesis (Ho): Average MBA grade is 74. Alternative hypothesis (H1): Average MBA grade differs from 74.

The t-test results are:

  • t-statistic: 2.66 - this year's average grade (76.04) is 2.66 standard errors away from the previous year's average of 74.

  • p-value: 0.00915 (p < 0.05) - we can reject the null hypothesis, as the difference between this year's average grade and 74 is statistically significant; we can confidently say the average grade differs from 74.

  • 95 percent confidence interval: 74.51764 to 77.56346 - the range in which we expect the true average MBA grade to lie. 74 is not in this range, which again supports the conclusion that the average grade differs from 74.

  • sample estimates: mean of x: 76.04055 - the average MBA grade this year was 76.04.

I also checked the effect size with Cohen's d, which shows whether the difference is practically meaningful and not just statistically significant. With Cohen's d = 0.27, the effect is small, indicating a statistically significant but only slight change in the average grade.
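
For a one-sample design, Cohen's d is simply the mean difference divided by the standard deviation, so the reported value can be reproduced by hand (a quick check):

(mean(MBA$`MBA Grade`) - 74) / sd(MBA$`MBA Grade`) #(76.04 - 74) / 7.68 ≈ 0.27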

TASK 3

Import the dataset Apartments.xlsx

library(readxl) #Load library
Apartments <- read_excel("R Take Home Exam 2025/Task 3/Apartments.xlsx")

head(Apartments, 5) #First 5 rows of table
## # A tibble: 5 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m² in €
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking, #Create factor
                             levels = c(0,1),
                             labels = c("NO", "YES"))

Apartments$Balcony <- factor(Apartments$Balcony,#Create factor
                             levels = c(0,1),
                             labels = c("NO", "YES"))

head(Apartments, 5) #First 5 rows of table
## # A tibble: 5 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 NO      YES    
## 2    18        1  2800 YES     NO     
## 3     7       28  1660 NO      NO     
## 4    28       29  1850 NO      YES    
## 5    18       18  1640 YES     YES

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price, mu=1900)
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Explanation

The p-value (0.004731) is well below 0.05, so we can reject the null hypothesis, and the 95% confidence interval [1937.443, 2100.440] does not include 1900. Both results support the conclusion that the average price is not 1900€. The calculated sample mean of Price is 2018.941€.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments) #Simple regression function to fit1

summary(fit1) 
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanation

Regression coefficient: The slope of -8.975 indicates that, on average, each additional year of apartment age decreases the price per m² by 8.975€.

Coefficient of correlation: Taking the square root of the Multiple R-squared value gives |r| ≈ 0.23; since the slope is negative, r ≈ -0.23, i.e. a weak negative linear relationship between Age and Price.

Coefficient of determination: About 5.3% of the variance in Price is explained by Age - only a small proportion of the variation in apartment prices is accounted for by Age.
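
Both the magnitude and the sign of r can be verified directly (a quick sketch):

cor(Apartments$Price, Apartments$Age) #Direct correlation, ≈ -0.23
-sqrt(summary(fit1)$r.squared) #Same magnitude recovered from R-squared, with the sign taken from the negative slope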

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

pairs(Apartments[, c("Price", "Age", "Distance")], #Scatterplot matrix for 3 variables
      main = "Scatterplot Matrix - Price, Age, Distance",
      pch = 20)

Explanation: The Age-Distance panel shows no clear linear relationship between the two explanatory variables, so there is no obvious problem with multicollinearity.
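
The visual check can be complemented with a correlation matrix (a quick sketch); a high Age-Distance correlation would signal multicollinearity:

cor(Apartments[, c("Price", "Age", "Distance")]) #Pairwise Pearson correlations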

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments) #Multiple regression function to fit2

Check the multicollinearity with VIF statistics. Explain the findings.

#install.packages("car") #install package (disabled after installation)
library(car) 
## Loading required package: carData
vif(fit2) #Multicolinearity check with VIF
##      Age Distance 
## 1.001845 1.001845

Explanation: The VIF values for Age and Distance are very close to 1, which indicates there is no significant issue with multicollinearity in this regression model.

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

stand.res.fit2 <- rstandard(fit2) #Calculate standardized residuals
cook.dis.fit2 <- cooks.distance(fit2) #Calculate Cook's distances

problematic <- which(abs(stand.res.fit2) > 2.5 | cook.dis.fit2 > 1) #Flag outliers (|standardized residual| > 2.5) or highly influential units (Cook's distance > 1)

print(problematic) #Print the units that could potentially be problematic
## 38 
## 38

Explanation: I calculated the standardized residuals and Cook's distances and flagged the apartments with an absolute standardized residual above 2.5 or a Cook's distance above 1. There is only one such unit - the 38th row of Apartments.
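
The influence measures can also be inspected graphically; base R's plot method for lm objects draws the Cook's distance chart directly (a quick sketch):

plot(fit2, which = 4) #Cook's distance per observation - unit 38 should stand out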

Apartments.1 <- Apartments[-problematic, ] #Delete the 38th apartment (index stored in problematic) and save the result as Apartments.1
fit2.1 <- lm(Price ~ Age + Distance, data = Apartments.1) #Re-estimate the model and save it as fit2.1

library(car) 

vif(fit2.1) #new VIF multicolinearity check
##      Age Distance 
## 1.008869 1.008869

Explanation: I deleted the 38th row (stored in problematic) and estimated a new model, fit2.1, which I again checked for multicollinearity (still none; both VIF values remain close to 1).

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

stand.res.fit2.1 <- rstandard(fit2.1) #Standardized residuals without the 38th apartment
stand.fit2.1 <- scale(fitted(fit2.1)) #Standardized fitted values without the 38th apartment

plot(stand.fit2.1, stand.res.fit2.1, #Create scatterplot
     xlab = "Standardized fitted values",
     ylab = "Standardized residuals",
     main = "Residuals vs Fitted values (heteroskedasticity check)",
     pch = 20, col = "blue")
abline(h = 0, col = "red", lwd = 2)

Explanation: The scatterplot shows that the points are spread fairly evenly around zero without a clear pattern. There is no funnel shape, which suggests the spread of the errors is roughly constant. This means the variance of the errors is approximately constant, with no clear signs of heteroskedasticity in the model.
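
The visual impression can be backed by a formal test. A sketch using the Breusch-Pagan test, assuming the lmtest package is installed:

#install.packages("lmtest") #install once if needed
library(lmtest)
bptest(fit2.1) #Ho: constant error variance; a large p-value supports homoskedasticity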

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(stand.res.fit2.1, #Histogram of standardized residuals from fit2.1
     breaks = 10,
     main = "Histogram of standardized residuals",
     xlab = "Standardized residuals",
     col = "navy")

qqnorm(stand.res.fit2.1, # Q-Q plot of standardized residuals from fit2.1
       main = "Q-Q Plot of standardized residuals")
qqline(stand.res.fit2.1, col = "red", lwd = 2)

shapiro.test(stand.res.fit2.1) #Formal test with Shapiro-Wilk test
## 
##  Shapiro-Wilk normality test
## 
## data:  stand.res.fit2.1
## W = 0.9565, p-value = 0.00636

Explanation

Histogram: Most residuals cluster around 0, but the histogram is not symmetric; it appears negatively skewed.

Q-Q Plot: The points are close to the middle line, but deviate at both ends. The distribution has heavier tails than a normal distribution.

Shapiro-Wilk Test: p is lower than 0.05, so we can reject Ho (Ho: the residuals are normally distributed). This confirms that the residuals deviate significantly from normality.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, data = Apartments) #Estimation of original fit2
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Explanation

  • Intercept: The estimated average apartment price per m² when Age and Distance are both 0 is 2460.101€.
  • Age: On average, each additional year of apartment age decreases the price by 7.934€ per m², holding Distance constant.
  • Distance: On average, each additional km of distance from the city center decreases the price by 20.67€ per m², holding Age constant.
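
A fitted model like this can also be used for prediction. A sketch with hypothetical inputs (a 10-year-old apartment 5 km from the center - both values are made up for illustration):

predict(fit2, newdata = data.frame(Age = 10, Distance = 5)) #≈ 2460.101 - 7.934*10 - 20.667*5 ≈ 2277€ per m²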

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments) #Fit3 linear regression

With the function anova, check if model fit3 fits the data better than model fit2.

anova(fit2, fit3) #Compare whether fit3 fits the data better than fit2, using an F-test
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation: The p-value is less than 0.05, which means we can reject Ho (Ho: adding the additional variables Parking and Balcony does not improve the model). This means fit3 fits the data better than fit2 (adding Parking and Balcony improves the model).
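
The F value in the anova table can be reproduced by hand from the two residual sums of squares (a quick check):

rss2 <- sum(resid(fit2)^2) #RSS of the smaller model (6720983 above)
rss3 <- sum(resid(fit3)^2) #RSS of the larger model (5991088 above)
((rss2 - rss3) / 2) / (rss3 / 80) #≈ 4.87, matching the F statistic in the anova output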

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output?

summary(fit3) #fit3 summary
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYES   196.168     62.868   3.120  0.00251 ** 
## BalconyYES     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Regression coefficient explanation:

  • Parking: Apartments with parking are priced on average 196.17€ per m² higher than apartments without, holding the other variables constant. The effect is statistically significant (p = 0.0025).

  • Balcony: Apartments with a balcony are priced on average 1.94€ per m² higher than apartments without. The p-value of 0.97436 (p > 0.05) indicates that, in this model, having a balcony does not have a significant effect on the price of the apartment.

Hypothesis: Ho: Age, Distance, Parking and Balcony together have no effect on apartment price (all regression coefficients are 0). H1: At least one of the coefficients differs from 0 (the model explains part of the variation in price).

Save fitted values and calculate the residual for apartment ID 2.

fit3.fitt <- fitted(fit3) #Fitted values for fit3
residuals <- resid(fit3) #Residuals for fit3

fit3.fitt[2] #Fitted value for ID2
##        2 
## 2357.411
residuals[2] #Residual value for ID2
##        2 
## 442.5889

Explanation: The model predicts that apartment 2 should cost €2357 per m². In reality, the apartment costs €2800 per m², which is €443 higher than predicted. This means the model underestimates the price of this apartment.
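
The same residual can be reproduced by hand as the observed price minus the fitted value (a quick check):

Apartments$Price[2] - fit3.fitt[2] #2800 - 2357.411 ≈ 442.59, identical to residuals[2]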