Task 1 - Student Performance

Task 1.1: Explain the data set

mydata1 <- read.csv("student_performance.csv",
                    header = TRUE)

The dataset contains 6 variables for a sample of students. Most variables are numeric and continuous, such as self-study hours, attendance, and scores. The grade variable is categorical. This structure allows us to perform both descriptive and graphical analysis, as well as to construct derived variables and filtered datasets.

set.seed(2025)

mydata1 <- mydata1[sample(nrow(mydata1), min(50, nrow(mydata1))), , drop = FALSE]

# Showing the first 6 rows of the dataset
head(mydata1)  #With this function we take a look at the first rows to check that we imported the right data.
##        student_id weekly_self_study_hours attendance_percentage
## 817612     817612                    10.4                  93.5
## 229274     229274                     7.2                  74.9
## 33047       33047                    23.6                  82.1
## 860347     860347                     6.8                 100.0
## 627149     627149                    12.7                  77.2
## 62412       62412                    14.3                  82.3
##        class_participation total_score grade
## 817612                 7.2        82.6     B
## 229274                 8.9        67.7     C
## 33047                  6.4       100.0     A
## 860347                 7.7        50.6     D
## 627149                 9.5        77.7     B
## 62412                  7.2        63.1     C

Comment: The data checks out, so we continue.

# Descriptive statistics for all variables
summary(mydata1)
##    student_id     weekly_self_study_hours attendance_percentage
##  Min.   : 13052   Min.   : 0.00           Min.   : 66.30       
##  1st Qu.:231991   1st Qu.:10.40           1st Qu.: 77.83       
##  Median :479707   Median :16.35           Median : 83.10       
##  Mean   :447598   Mean   :15.39           Mean   : 84.27       
##  3rd Qu.:639711   3rd Qu.:20.02           3rd Qu.: 90.08       
##  Max.   :977693   Max.   :33.60           Max.   :100.00       
##  class_participation  total_score        grade          
##  Min.   : 1.400      Min.   : 40.60   Length:50         
##  1st Qu.: 5.225      1st Qu.: 76.12   Class :character  
##  Median : 6.300      Median : 88.25   Mode  :character  
##  Mean   : 6.358      Mean   : 84.03                     
##  3rd Qu.: 7.575      3rd Qu.: 99.67                     
##  Max.   :10.000      Max.   :100.00

Description of the variables (student_performance.csv):

- student_id: unique identifier of each student
- weekly_self_study_hours: hours of self-study per week (numeric)
- attendance_percentage: attendance in percent (numeric)
- class_participation: class participation score (numeric)
- total_score: total score in points (numeric)
- grade: final grade (categorical, e.g. A-D)

Task 1.2: Data manipulation

#Renaming the variables in the dataset
colnames(mydata1) <- c("Student ID",
                       "Weekly Self Study Hours",
                       "Attendance Percentage",
                       "Class Participation",
                       "Total Score",
                       "Grade")

#Checking if the variables were renamed correctly
head(mydata1) #The column names now include spaces
##        Student ID Weekly Self Study Hours Attendance Percentage
## 817612     817612                    10.4                  93.5
## 229274     229274                     7.2                  74.9
## 33047       33047                    23.6                  82.1
## 860347     860347                     6.8                 100.0
## 627149     627149                    12.7                  77.2
## 62412       62412                    14.3                  82.3
##        Class Participation Total Score Grade
## 817612                 7.2        82.6     B
## 229274                 8.9        67.7     C
## 33047                  6.4       100.0     A
## 860347                 7.7        50.6     D
## 627149                 9.5        77.7     B
## 62412                  7.2        63.1     C
# We create a new variable that reflects how many total points a student earns per hour of self-study. Study Efficiency = Total Score per hour of self-study

mydata1$Study_Efficiency <- mydata1$`Total Score` / mydata1$`Weekly Self Study Hours`

#Quick check of the new variable
round(summary(mydata1$Study_Efficiency), 2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.98    4.52    5.44     Inf    7.51     Inf

If any student has 0 weekly hours, the division will produce Inf. We will handle that in the next step when we clean the data.
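
A quick way to see whether this actually occurs in our sample is to count the non-finite values (a minimal check on the variable created above):

#Count non-finite Study_Efficiency values (Inf from 0 study hours, or NA)
sum(!is.finite(mydata1$Study_Efficiency))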

Before analysis, we clean the data to ensure validity. Units with missing values or logical inconsistencies (e.g., 0 self-study hours) are excluded. We also compute a new variable Study Efficiency, which represents how many points a student earns per hour of self-study. This will help us later analyze performance efficiency.

#Cleaning: remove units with any missing values and impossible Attendance %
#We also remove rows with 0 Weekly hours to avoid Inf in Study Efficiency
library(dplyr)
library(tidyr)

# We clean the dataset by removing missing values and extreme/illogical values (e.g. attendance > 100%, 0 self-study).

mydata1 <- mydata1 %>%
  drop_na() %>%                                  #removing rows with any NA values
  filter(`Attendance Percentage` >= 0,
         `Attendance Percentage` <= 100,
         `Weekly Self Study Hours` > 0)          #avoid division by zero

#Check the dimensions after cleaning
nrow(mydata1); head(mydata1)
## [1] 49
##   Student ID Weekly Self Study Hours Attendance Percentage Class Participation
## 1     817612                    10.4                  93.5                 7.2
## 2     229274                     7.2                  74.9                 8.9
## 3      33047                    23.6                  82.1                 6.4
## 4     860347                     6.8                 100.0                 7.7
## 5     627149                    12.7                  77.2                 9.5
## 6      62412                    14.3                  82.3                 7.2
##   Total Score Grade Study_Efficiency
## 1        82.6     B         7.942308
## 2        67.7     C         9.402778
## 3       100.0     A         4.237288
## 4        50.6     D         7.441176
## 5        77.7     B         6.118110
## 6        63.1     C         4.412587
#Creating a new data.frame: students with strong engagement
#Criteria: Attendance >= 80% and Class Participation >= its sample median
med_part <- median(mydata1$`Class Participation`)

mydata_high <- mydata1 %>%
  filter(`Attendance Percentage` >= 80,
         `Class Participation` >= med_part)

#Show first rows and how many students qualify
head(mydata_high)
##   Student ID Weekly Self Study Hours Attendance Percentage Class Participation
## 1     817612                    10.4                  93.5                 7.2
## 2      33047                    23.6                  82.1                 6.4
## 3     860347                     6.8                 100.0                 7.7
## 4      62412                    14.3                  82.3                 7.2
## 5     286103                    15.6                  87.6                 8.1
## 6     309024                    24.8                  86.7                 6.7
##   Total Score Grade Study_Efficiency
## 1        82.6     B         7.942308
## 2       100.0     A         4.237288
## 3        50.6     D         7.441176
## 4        63.1     C         4.412587
## 5        86.1     A         5.519231
## 6       100.0     A         4.032258
nrow(mydata_high)
## [1] 20

Task 1.3: Descriptive statistics

The descriptive statistics help us understand the central tendency and variability in student performance. The mean represents the average score, the median is the midpoint of the distribution, and the standard deviation quantifies how much individual scores deviate from the mean. These statistics are useful for comparing subgroups or analyzing performance trends.

#Select variables to summarize
vars_sel <- c("Weekly Self Study Hours",
              "Attendance Percentage",
              "Class Participation",
              "Total Score")

#1) Compact overview (min, 1st Qu., median, mean, 3rd Qu., max)
summary(mydata1[ , vars_sel])  #overview of key descriptive statistics
##  Weekly Self Study Hours Attendance Percentage Class Participation
##  Min.   : 1.7            Min.   : 66.3         Min.   : 1.40      
##  1st Qu.:10.4            1st Qu.: 77.9         1st Qu.: 5.20      
##  Median :16.5            Median : 83.2         Median : 6.40      
##  Mean   :15.7            Mean   : 84.5         Mean   : 6.38      
##  3rd Qu.:20.1            3rd Qu.: 90.1         3rd Qu.: 7.60      
##  Max.   :33.6            Max.   :100.0         Max.   :10.00      
##   Total Score    
##  Min.   : 40.60  
##  1st Qu.: 76.80  
##  Median : 88.60  
##  Mean   : 84.33  
##  3rd Qu.:100.00  
##  Max.   :100.00
#2) Three key stats for one focal variable (Total Score)
mean(mydata1$`Total Score`, na.rm = TRUE)    #average (mean)
## [1] 84.32857
median(mydata1$`Total Score`, na.rm = TRUE)  #middle value (median)
## [1] 88.6
sd(mydata1$`Total Score`, na.rm = TRUE)      #typical spread (sd)
## [1] 15.57392

We summarize the dataset with summary(). For Total Score, the mean is the average score, the median is the middle score when ordered, and the standard deviation shows typical variation around the mean. We use these three statistics to describe central tendency and variability.
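
These statistics can also be computed per subgroup; a small dplyr sketch (one possible approach, not part of the original output) summarizes Total Score by Grade:

#Mean, median and standard deviation of Total Score within each Grade
mydata1 %>%
  group_by(Grade) %>%
  summarise(mean_score   = mean(`Total Score`),
            median_score = median(`Total Score`),
            sd_score     = sd(`Total Score`))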

Task 1.4: Graphs

library(ggplot2)  #for ggplot graphics

# Create a filtered version of the data just for plotting (to avoid warnings)
plot_data <- mydata1 %>%
  filter(!is.na(`Total Score`) & `Total Score` >= 50)


## 1) Histogram: distribution of Total Score
ggplot(plot_data, aes(x = `Total Score`)) +
  geom_histogram(binwidth = 5, colour = "black") +  #bars every 5 points
  scale_x_continuous(limits = c(50, max(plot_data$`Total Score`, na.rm = TRUE))) +
  xlab("Total Score (points)") +
  ylab("Frequency") +
  ggtitle("Distribution of Total Score")

Note: To improve the histogram, we filtered out students with missing or extremely low scores (below 50). This helps us focus on the meaningful part of the score distribution.

The histogram shows the distribution of total scores among students. The distribution is left-skewed (negatively skewed): most students achieved relatively high scores, with the peak frequency between roughly 85 and 90 points and a longer tail towards lower scores.

#2. Scatterplot with linear regression line
ggplot(plot_data, aes(x = `Weekly Self Study Hours`, y = `Total Score`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  #adds linear trend line (no CI band)
  xlab("Weekly Self Study Hours (hours)") +
  ylab("Total Score (points)") +
  ggtitle("Study Hours and Total Score")

The scatterplot reveals a positive association between weekly self-study hours and total score, suggesting that more study time is generally associated with better performance.

# 3) Boxplot: Total Score by Grade (Grade treated as a categorical variable)
if(!is.factor(plot_data$Grade)) plot_data$Grade <- as.factor(plot_data$Grade)  #convert in the plotting data used below

ggplot(plot_data, aes(x = Grade, y = `Total Score`)) +
  geom_boxplot() +
  xlab("Grade") +
  ylab("Total Score (points)") +
  ggtitle("Total Score by Grade")

The boxplot compares scores across grade categories.

We observe that students with grade A tend to have the highest median scores, while lower grades show wider score variability and more outliers.

Task 2 - 100 MBA Students

# Load the dataset

library(readxl)

mba_data <- read_excel("Business School.xlsx", sheet = "Sheet1")

# Preview the dataset
head(mba_data)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

Comment: The data checks out, so we continue.

Task 2.1: Distribution of undergrad degrees

library(ggplot2)

ggplot(mba_data, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "steelblue", color = "black") +
  xlab("Undergrad Degree") +
  ylab("Number of Students") +
  ggtitle("Distribution of Undergraduate Degrees") +
  theme_minimal()

Interpretation: The bar chart displays the distribution of undergraduate degrees among 100 MBA students. The most common undergraduate degree is Business, with over 30 students, followed by Finance and Computer Science, each with around 24–25 students.

Degrees such as Engineering and Art are less represented, with fewer than 10 students each. This indicates a strong dominance of business-related academic backgrounds among the current MBA cohort, which may reflect the program’s industry orientation or admission preferences.
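
The exact counts behind the bar chart can be verified with a simple frequency table (a quick sketch):

#Number of students per undergraduate degree, from most to least common
sort(table(mba_data$`Undergrad Degree`), decreasing = TRUE)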

Task 2.2: Descriptive statistics and histogram of annual salary

# Basic summary statistics
summary(mba_data$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
# Additional useful measures
mean(mba_data$`Annual Salary`, na.rm = TRUE)
## [1] 109058
median(mba_data$`Annual Salary`, na.rm = TRUE)
## [1] 103500
sd(mba_data$`Annual Salary`, na.rm = TRUE)
## [1] 41501.49

The annual salary of MBA students ranges widely, from $20,000 to $340,000, indicating a highly diverse income distribution.

The mean salary is $109,058, while the median is $103,500. Since the mean is higher than the median, the distribution is right-skewed — this suggests that a few individuals with very high salaries are pulling the average upward.

The standard deviation is approximately $41,501, indicating a large variability in salaries. This spread reinforces the observation that the group includes both low and high earners.

Overall, while many students earn between $87,000 and $124,000 (the interquartile range), the presence of outliers above $200,000 significantly influences the distribution.
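
The task also asks for a histogram of annual salary; a minimal ggplot2 sketch (the bin width of $20,000 is an arbitrary choice) would make the right skew described above directly visible:

ggplot(mba_data, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 20000, fill = "steelblue", colour = "black") +
  xlab("Annual Salary (USD)") +
  ylab("Number of Students") +
  ggtitle("Distribution of Annual Salary")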

Task 2.3: One-sample t-test for MBA grade

# One-sample t-test: is the mean MBA grade different from 74?
t.test(mba_data$`MBA Grade`, mu = 74)
## 
##  One Sample t-test
## 
## data:  mba_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
# Calculate Cohen's d
mean_grade <- mean(mba_data$`MBA Grade`, na.rm = TRUE)
sd_grade <- sd(mba_data$`MBA Grade`, na.rm = TRUE)

cohen_d <- (mean_grade - 74) / sd_grade
cohen_d
## [1] 0.2658658

A one-sample t-test was conducted to assess whether the average MBA grade differs from 74. The test result was statistically significant (p = 0.009), so we reject the null hypothesis. The average MBA grade is significantly higher than 74, with a sample mean of 76.04. The effect size, Cohen’s d = 0.27, represents a small practical difference.

Task 3 - Linear regression analysis

Task 3.1: Import the dataset Apartments

library(readxl)
apartments <- read_excel("Apartments.xlsx")

# Show the first 10 rows of the dataset
head(apartments, 10)  # Showing first 10 units
## # A tibble: 10 × 5
##      Age Distance Price Parking Balcony
##    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
##  1     7       28  1640       0       1
##  2    18        1  2800       1       0
##  3     7       28  1660       0       0
##  4    28       29  1850       0       1
##  5    18       18  1640       1       1
##  6    28       12  1770       0       1
##  7    14       20  1850       0       1
##  8    18        6  1970       1       1
##  9    22        7  2270       1       0
## 10    25        2  2570       1       0

Task 3.2: Change categorical variables into factors

apartments$Parking <- as.factor(apartments$Parking)
apartments$Balcony <- as.factor(apartments$Balcony)

# Show the first 10 rows to confirm the change
knitr::kable(head(apartments, 10), caption = "First 10 rows after converting Parking and Balcony to factors")
First 10 rows after converting Parking and Balcony to factors

 Age   Distance   Price   Parking   Balcony
   7         28    1640         0         1
  18          1    2800         1         0
   7         28    1660         0         0
  28         29    1850         0         1
  18         18    1640         1         1
  28         12    1770         0         1
  14         20    1850         0         1
  18          6    1970         1         1
  22          7    2270         1         0
  25          2    2570         1         0

Task 3.3: Test the hypothesis H0: Mu_Price = 1900 eur

H₀: The true average apartment price is €1900.
H₁: The true average apartment price is not €1900.

t.test(apartments$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The one-sample t-test indicates that the average apartment price is statistically significantly different from €1900 (p = 0.0047). Since the p-value is less than 0.05, we reject the null hypothesis. The 95% confidence interval for the mean price is between €1937.44 and €2100.44, suggesting that the true mean is significantly higher than €1900.

Task 3.4: Estimate simple regression function Price ~ Age

fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

We estimated a simple linear regression to evaluate the effect of apartment age on price. The results show that each additional year of age is associated with a decrease in price of approximately €8.98 per m². The effect is statistically significant (p = 0.034), indicating a negative relationship between age and price.

cor(apartments$Price, apartments$Age)
## [1] -0.230255

However, the model only explains about 5.3% of price variation, and the correlation between age and price is weak (r = -0.23). This suggests that while Age matters, other factors (e.g., distance, parking, balcony) likely play a bigger role in determining price.

Task 3.5: Show the scatter plot matrix and check multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(apartments[ , c("Price", "Age", "Distance")],
                  smooth = FALSE)

The scatterplot matrix illustrates the pairwise relationships between the variables Price, Age, and Distance.

A negative linear relationship is visible between Price and Age, indicating that older apartments tend to have lower prices.

A stronger negative relationship is observed between Price and Distance, suggesting that apartments located farther from the city center are generally cheaper.

There is no apparent relationship between Age and Distance. The points are widely dispersed without a clear pattern.

This lack of a strong relationship between Age and Distance implies that there is no multicollinearity problem between these two predictors, meaning they can be used together in a multiple regression model without concern.
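
The visual impression can be quantified with the pairwise correlation between the two predictors:

cor(apartments$Age, apartments$Distance)  #expected to be close to 0 (the VIF of about 1.002 in Task 3.7 implies |r| of roughly 0.04)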

Task 3.6: Estimate multiple regression Price ~ Age + Distance

fit2 <- lm(Price ~ Age + Distance, data = apartments)

# Show the results
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Intercept (2460.10): Represents the estimated price when both age and distance are zero. While not practically meaningful, it serves as a mathematical baseline.

Age (-7.93): Each additional year in apartment age is associated with a €7.93 decrease in price per m², holding distance constant.

Distance (-20.67): Each additional kilometer from the city center corresponds to a €20.67 decrease in price per m², holding age constant.

Both coefficients are statistically significant at the 5% level, with Distance showing a particularly strong effect (p < 0.001).

R² = 0.44: The model explains approximately 44% of the variation in apartment prices, which is considered moderate explanatory power in social sciences.

Adjusted R² = 0.43: The adjusted value remains close, indicating a good balance between model complexity and explanatory strength.

F-statistic (p < 0.001): The overall model is statistically significant, meaning that at least one of the predictors significantly explains price variation.

Conclusion: The analysis shows that both age and distance negatively influence apartment prices. Distance has a stronger marginal impact, while age also contributes meaningfully. The model provides a solid foundation for predicting price based on these characteristics.
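
To illustrate how the fitted equation is used, we can predict the price of a hypothetical apartment, say 10 years old and 5 km from the city center (illustrative values, not taken from the data):

#Predicted price for Age = 10, Distance = 5: about 2460.1 - 7.93*10 - 20.67*5, i.e. roughly 2277
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))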

Task 3.7: Check multicollinearity with VIF test

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Since both VIF values for Age and Distance are well below the commonly used threshold of 5, we conclude that there is no evidence of multicollinearity between the predictors.

This means that both Age and Distance can be safely included in the regression model without inflating the standard errors or distorting the coefficient estimates.
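
As a cross-check, the VIF for Age can be reproduced by hand from the R-squared of a regression of Age on Distance, since VIF = 1 / (1 - R²):

r2_age <- summary(lm(Age ~ Distance, data = apartments))$r.squared
1 / (1 - r2_age)  #should match the VIF of about 1.002 reported above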

Task 3.8: Identify outliers and influential points

# Calculate standardized residuals
standard_resid <- rstandard(fit2)

# Calculate Cook's distance
cook_d <- cooks.distance(fit2)

# Combine with apartment IDs (row numbers) for review
diagnostics <- data.frame(
  ID = 1:nrow(apartments),
  Standardized_Residual = standard_resid,
  Cooks_Distance = cook_d
)



# Calculate threshold for Cook's Distance
n <- nrow(apartments)
threshold_cook <- 4 / n  # Rule: Cook's D > 4/n → possible influential point

# Subset potential problematic observations
subset(diagnostics, abs(Standardized_Residual) > 2 | Cooks_Distance > threshold_cook)
##    ID Standardized_Residual Cooks_Distance
## 22 22              1.575982     0.06086868
## 33 33              2.050586     0.06913379
## 38 38              2.576772     0.31973058
## 53 53             -2.151787     0.06625775
## 55 55              1.444768     0.10420445
# Add standardized residuals and Cook's distance to the apartments data
apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)

# Histogram of standardized residuals
hist(apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "gray",
     border = "black")

# Histogram of Cook's distances
hist(apartments$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances",
     col = "gray",
     border = "black")

Based on the standardized residuals and Cook’s distances, we identify several observations that may present problems for our regression model:

Observation 38 stands out as a potential outlier and influential unit. Its standardized residual exceeds the conventional threshold of ±2, and its Cook's distance is above the common threshold of 4/n, indicating high influence on the model's estimates.

Observations 33 and 53 have standardized residuals slightly above 2 in absolute value, suggesting they may be mild outliers, although their Cook’s distances remain relatively low. These do not appear to have a strong influence on the regression model.

Following standard diagnostic guidelines:

We exclude observation 38 from further analysis due to both its high residual and its influence.

Observations 33 and 53 are milder cases with low influence; we nevertheless exclude them as well, so that the re-estimated model in Task 3.11 is based only on well-behaved observations (this matches the removal of rows 33, 38 and 53 in the code below).

This step improves the robustness of our regression analysis and ensures that the model is not disproportionately affected by extreme or unusual observations.

apartments_clean <- apartments[-c(33, 38, 53), ]

Task 3.9: Check for heteroskedasticity with scatterplot

# Calculate standardized fitted values
standard_fitted <- scale(fitted(fit2))

# Scatterplot of standardized residuals vs. standardized fitted values
plot(standard_fitted, standard_resid,
     main = "Standardized Residuals vs Fitted Values",
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     pch = 19, col = "steelblue")

abline(h = 0, col = "red", lty = 2)

The residuals are not randomly scattered around zero: their spread tends to increase with the fitted values, which points to possible heteroskedasticity. This violates the constant-variance assumption, meaning that standard errors and p-values may be unreliable.
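
A formal check can complement the visual inspection, for example the Breusch-Pagan test from the lmtest package (a sketch, assuming the package is installed; not part of the original analysis):

library(lmtest)
bptest(fit2)  #H0: constant error variance (homoskedasticity)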

Task 3.10: Check normality of residuals (histogram + Q-Q + Shapiro test)

# Histogram of standardized residuals
hist(standard_resid,
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "lightblue",
     breaks = 10)
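
The task also calls for a Q-Q plot; a minimal base-R sketch:

#Q-Q plot of standardized residuals against the normal distribution
qqnorm(standard_resid, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(standard_resid, col = "red")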

# Shapiro-Wilk test for normality
shapiro.test(standard_resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  standard_resid
## W = 0.95306, p-value = 0.00366

The Shapiro-Wilk test rejects normality (p = 0.004), so there is evidence of deviation from normality in the residuals. While minor deviations may not severely affect the regression model (OLS is fairly robust in larger samples), the non-normality should be kept in mind, especially for inference such as confidence intervals or hypothesis tests on the coefficients.

Task 3.11: Re-estimate model after removing outliers

# Remove problematic observations (by row number / ID)
clean_apartments <- apartments[-c(33, 38, 53), ]

# Re-estimate the model without those units
fit2_clean <- lm(Price ~ Age + Distance, data = clean_apartments)

# View the summary of the new model
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = clean_apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -404.0 -230.9  -51.4  190.6  504.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2455.768     73.296  33.505  < 2e-16 ***
## Age           -6.011      3.086  -1.948    0.055 .  
## Distance     -23.543      2.665  -8.834 2.05e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared:  0.5179, Adjusted R-squared:  0.5057 
## F-statistic: 42.44 on 2 and 79 DF,  p-value: 3.042e-13

After removing the outliers and influential points, the model fit has improved noticeably. The effect of Age is still negative but only marginally significant (p = 0.055), suggesting it might not be a strong predictor on its own. The effect of Distance remains clearly negative and highly significant. Overall, the model now explains more than 50% of the variability in apartment prices, which is quite solid for cross-sectional data.

Task 3.12: Full regression with categorical variables (Parking, Balcony)

# Estimate model with all predictors
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = clean_apartments)

# Show summary
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = clean_apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -410.55 -190.37  -35.23  205.41  545.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2339.781     92.756  25.225  < 2e-16 ***
## Age           -5.413      3.044  -1.778   0.0793 .  
## Distance     -21.224      2.818  -7.531 8.09e-11 ***
## Parking1     136.806     61.698   2.217   0.0295 *  
## Balcony1       8.546     57.679   0.148   0.8826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 257.9 on 77 degrees of freedom
## Multiple R-squared:  0.5469, Adjusted R-squared:  0.5233 
## F-statistic: 23.23 on 4 and 77 DF,  p-value: 1.282e-12

The model suggests that distance from city center and presence of parking are significant predictors of apartment prices. Age shows a weak negative effect, and balconies do not appear to influence price significantly in this sample.

Task 3.13: Compare fit2 vs fit3 using ANOVA

# Compare nested models: fit2_clean (Age + Distance) vs fit3 (Age + Distance + Parking + Balcony)
anova(fit2_clean, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq    F  Pr(>F)  
## 1     79 5447075                            
## 2     77 5119927  2    327148 2.46 0.09212 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We used an ANOVA to test whether the extended model (fit3), which includes the categorical variables Parking and Balcony, improves the model fit compared to the simpler model (fit2_clean) that only includes Age and Distance.

The null hypothesis is that the additional variables (Parking and Balcony) do not improve the model.

The p-value is 0.092, which is above the standard 0.05 threshold, but below 0.10, suggesting weak evidence in favor of the extended model.

Conclusion: Since p = 0.092, we do not reject the null hypothesis at the 5% level. However, at a 10% significance level, there is marginal evidence that adding Parking and Balcony may improve model fit.
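
For reference, the F statistic reported by anova() can be reproduced directly from the two residual sums of squares:

#F = ((RSS restricted - RSS full) / extra df) / (RSS full / residual df of full model)
((5447075 - 5119927) / 2) / (5119927 / 77)  #approx. 2.46, as in the ANOVA table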

Task 3.14: Get fitted value and residual for Apartment ID 2

# Save fitted values from fit3
fitted_values <- fitted(fit3)

# Save residuals
residuals_fit3 <- resid(fit3)

# Get fitted value and residual for apartment ID 2 (row 2)
fitted_values[2]
##        2 
## 2357.925
residuals_fit3[2]
##       2 
## 442.075

From the model fit3, which includes Age, Distance, Parking, and Balcony, we retrieve the following for Apartment ID 2 (the second row in the dataset): the predicted (fitted) price is €2,357.93 per m² and the residual is €442.08. This means the model underestimated the price of Apartment 2 by €442.08: the actual price was €2,800, while the model predicted only €2,357.93. The apartment is therefore priced well above what its observed characteristics would suggest, which could be due to unobserved features such as interior quality, renovation, floor level, or neighbourhood prestige.
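
As a consistency check, the residual is simply the observed price minus the fitted value:

clean_apartments$Price[2] - fitted_values[2]  #observed 2800 minus fitted 2357.93, about 442.08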

This means the model underestimated the price of Apartment 2 by €442.08. In other words, the actual price was €2800, while the model predicted only €2357.93. This residual reflects that this apartment performs significantly above expectations based on the observed characteristics in the model. This could be due to unobserved features such as: Interior quality, Renovation, Floor level and Neighborhood prestige.