The dataset contains 25 motorcycle models from various manufacturers.
It includes technical specifications (engine size, power, torque, etc.)
and a categorical variable for the type of bike.
Variables:

- Price: base-model retail price in €
- Weight: wet weight in kg with full fuel tank
- EngineCC: displacement in cubic centimeters (ccm)
- PowerHP: peak power in horsepower (hp)
- TorqueNm: peak torque in Newton meters (Nm)
- Cylinders: number of cylinders (integer)
- Category: type of motorcycle (Sport, Naked, Cruiser, Adventure, Touring)
motorbikes <- data.frame(
Model = c(
"Yamaha R6", "Honda CBR650R", "Kawasaki Ninja 1000SX", "Suzuki GSX-R750",
"BMW S1000RR", "Ducati Panigale V2", "KTM 890 Duke R", "Yamaha MT-07",
"Honda CB500F", "Kawasaki Z900", "Suzuki V-Strom 650", "BMW R1250GS",
"Harley-Davidson Iron 883", "Indian Scout Bobber", "Triumph Street Triple RS",
"Ducati Multistrada V4", "Honda Africa Twin", "Kawasaki Vulcan S",
"Yamaha XSR900", "KTM 1290 Super Adventure R", "Honda Gold Wing",
"Harley-Davidson Road Glide", "BMW F900XR", "Triumph Tiger 900", "Ducati Monster 937"
),
Price = c(12500, 9500, 13500, 11800, 19000, 17000, 12500, 8000, 6500, 9500,
9000, 19500, 11000, 13000, 12500, 21000, 15000, 9000, 11000, 18500,
30000, 28000, 12500, 13500, 14000),
Weight = c(190, 208, 235, 190, 197, 200, 184, 184, 189, 210, 213, 249, 256, 252,
188, 240, 226, 228, 193, 245, 360, 375, 219, 228, 188),
EngineCC = c(599, 649, 1043, 750, 999, 955, 890, 689, 471, 948, 645, 1254, 883,
1133, 765, 1158, 1084, 649, 889, 1301, 1833, 1868, 895, 888, 937),
PowerHP = c(118, 94, 142, 128, 207, 155, 121, 73, 47, 125, 70, 136, 49, 100,
121, 170, 101, 61, 117, 160, 125, 89, 105, 95, 111),
TorqueNm = c(61, 64, 111, 86, 113, 104, 99, 68, 43, 98, 62, 143, 74, 97, 79,
125, 105, 63, 93, 140, 170, 150, 92, 87, 95),
Cylinders = c(4, 4, 4, 4, 4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 3, 4, 2, 2, 3, 2, 6,
2, 2, 3, 2),
Category = c("Sport", "Sport", "Sport", "Sport", "Sport", "Sport", "Naked", "Naked",
"Naked", "Naked", "Adventure", "Adventure", "Cruiser", "Cruiser",
"Naked", "Adventure", "Adventure", "Cruiser", "Naked", "Adventure",
"Touring", "Cruiser", "Adventure", "Adventure", "Naked")
)
I created a few new variables, shown in the table below:

- Power-to-weight ratio (hp/kg)
- Torque-to-weight ratio (Nm/kg)
- Power-to-torque ratio (hp/Nm)

All three are rounded to three decimal places. I then removed the heaviest bikes (250 kg and over), renamed some variables to make the data easier to understand, and finally, because I like Naked bikes, created a subset containing only the Naked bikes.
library(dplyr)  # mutate(), filter(), rename(), and the %>% pipe
library(knitr)  # kable() tables

# New variables
motorbikes <- motorbikes %>%
mutate(
PowerToWeight = round(PowerHP / Weight, 3),
TorqueToWeight = round(TorqueNm / Weight, 3),
PowerToTorque = round(PowerHP / TorqueNm, 3)
)
# Remove heavier bikes (over 250 kg)
motorbikes_clean <- motorbikes %>%
filter(Weight < 250)
# Rename variables
motorbikes_clean <- motorbikes_clean %>%
rename(Displacement_ccm = EngineCC,
Power_HP = PowerHP,
Torque_Nm = TorqueNm)
# Only Naked bikes
naked_bikes <- motorbikes_clean %>%
filter(Category == "Naked")
kable(naked_bikes, caption = "Final dataset: Naked bikes only")
| Model | Price | Weight | Displacement_ccm | Power_HP | Torque_Nm | Cylinders | Category | PowerToWeight | TorqueToWeight | PowerToTorque |
|---|---|---|---|---|---|---|---|---|---|---|
| KTM 890 Duke R | 12500 | 184 | 890 | 121 | 99 | 2 | Naked | 0.658 | 0.538 | 1.222 |
| Yamaha MT-07 | 8000 | 184 | 689 | 73 | 68 | 2 | Naked | 0.397 | 0.370 | 1.074 |
| Honda CB500F | 6500 | 189 | 471 | 47 | 43 | 2 | Naked | 0.249 | 0.228 | 1.093 |
| Kawasaki Z900 | 9500 | 210 | 948 | 125 | 98 | 4 | Naked | 0.595 | 0.467 | 1.276 |
| Triumph Street Triple RS | 12500 | 188 | 765 | 121 | 79 | 3 | Naked | 0.644 | 0.420 | 1.532 |
| Yamaha XSR900 | 11000 | 193 | 889 | 117 | 93 | 3 | Naked | 0.606 | 0.482 | 1.258 |
| Ducati Monster 937 | 14000 | 188 | 937 | 111 | 95 | 2 | Naked | 0.590 | 0.505 | 1.168 |
Below is a summary of the cleaned dataset (motorbikes_clean) across all bike
categories. The cleaned dataset differs from the original by including the
newly calculated variables and the renamed columns.
# Summary stats of cleaned dataset
kable(summary(motorbikes_clean[, c("Price", "Weight", "Displacement_ccm", "Power_HP", "Torque_Nm")]),
caption = "Summary statistics of cleaned dataset")
| Price | Weight | Displacement_ccm | Power_HP | Torque_Nm |
|---|---|---|---|---|
| Min. : 6500 | Min. :184.0 | Min. : 471 | Min. : 47 | Min. : 43.00 |
| 1st Qu.: 9500 | 1st Qu.:190.0 | 1st Qu.: 689 | 1st Qu.: 95 | 1st Qu.: 68.00 |
| Median :12500 | Median :208.0 | Median : 890 | Median :118 | Median : 93.00 |
| Mean :13110 | Mean :209.7 | Mean : 879 | Mean :117 | Mean : 91.95 |
| 3rd Qu.:15000 | 3rd Qu.:228.0 | 3rd Qu.: 999 | 3rd Qu.:136 | 3rd Qu.:105.00 |
| Max. :21000 | Max. :249.0 | Max. :1301 | Max. :207 | Max. :143.00 |
For the graphical analysis I opted for a histogram, a scatterplot, and a
boxplot to show different motorcycle characteristics.

- The histogram shows the distribution of engine power.
- The scatterplot shows the relationship between engine power and price.
- The boxplot shows how price is distributed across categories.
# Histogram of engine power
ggplot(motorbikes_clean, aes(x = Power_HP)) +
geom_histogram(binwidth = 20, fill = "darkgreen", color = "black") +
labs(title = "Distribution of Engine Power", x = "Engine Power (HP)", y = "Count")
# Scatterplot: Power vs Price
ggplot(motorbikes_clean, aes(x = Price, y = Power_HP, color = Category)) +
geom_point(size=3) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Engine Power vs Price", x = "Price (€)", y = "Power (HP)")
## `geom_smooth()` using formula = 'y ~ x'
# Boxplot: Price by Category
ggplot(motorbikes_clean, aes(x = Category, y = Price, fill = Category)) +
geom_boxplot() +
labs(title = "Price Distribution by Category", x = "Category", y = "Price (€)")
The power histogram shows that most bikes fall in the 100–150 HP range, with a few powerful outliers such as the BMW S1000RR (207 HP), while models like the Honda CB500F sit at the lower end of the power scale.
The scatterplot indicates a clear positive relationship between price and engine power, which was expected. Sport bikes stand out as the most powerful category on average, while Naked and Cruiser bikes tend to cluster at lower values. We can also observe that the trend lines for the Naked and Adventure categories are practically identical, which was unexpected.
The boxplot shows that Adventure bikes are consistently the most expensive, with the widest price dispersion. Naked bikes are the cheapest on average, while Sport bikes and Naked bikes show similar variability in prices. Because there is only one Cruiser bike left after cleaning, its boxplot is not representative.
library(readxl)
library(ggplot2)
library(dplyr)
library(knitr)
mba <- read_excel("Business School.xlsx")
# Distribution of undergrad degrees
ggplot(mba, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
geom_bar() +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the graph we can see that the most common undergraduate degree is Business.
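The exact counts behind the bar chart can be tabulated as a quick check (a minimal sketch; output omitted here):

# Frequency of each undergraduate degree, most common first
sort(table(mba$`Undergrad Degree`), decreasing = TRUE)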
# Descriptive statistics of Annual Salary
salary_summary <- t(unclass(summary(mba$`Annual Salary`)))
kable(salary_summary, caption = "Summary of Annual Salary")
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 20000 | 87125 | 103500 | 109058 | 124000 | 340000 |
# Histogram of salary
ggplot(mba, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, fill = "darkgreen", color = "black") +
labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count")
The summary table shows the mean, median, minimum, maximum, and
quartiles of Annual Salary.
The histogram indicates that most graduates earn between 90,000 and
120,000, while a few earn much higher salaries.
This creates a positively skewed distribution, with a long tail
extending to the right.
As a result, the mean salary (109,058) is higher than the median
(103,500), since the few very high salaries pull the average upwards.
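This mean–median relationship can be checked directly in R (a minimal sketch using only the variables already loaded above):

# For a positively skewed variable the mean exceeds the median
mean(mba$`Annual Salary`) > median(mba$`Annual Salary`)  # TRUE (109058 > 103500)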
# t-test against population mean = 74
t_result <- t.test(mba$`MBA Grade`, mu = 74)
t_result
##
## One Sample t-test
##
## data: mba$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
# Effect size (Cohen's d)
mean_grade <- mean(mba$`MBA Grade`)
sd_grade <- sd(mba$`MBA Grade`)
mu0 <- 74
cohen_d <- (mean_grade - mu0) / sd_grade
cohen_d
## [1] 0.2658658
The sample mean MBA grade is 76.04, which is higher than the
hypothesized value of 74.
The t-test result is t(99) = 2.66, p = 0.009. Since the
p-value is less than 0.05, we reject the null hypothesis.
The 95% confidence interval for the mean (74.52 – 77.56) does not
include 74, which further confirms the result.
The effect size, Cohen’s d = 0.27, indicates a
small effect.
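For reference, Cohen's (1988) rule-of-thumb cutoffs can be wrapped in a small helper; interpret_d below is a hypothetical convenience function, not part of the original analysis:

# Classify |d| by Cohen's conventional cutoffs: 0.2 small, 0.5 medium, 0.8 large
interpret_d <- function(d) {
  cut(abs(d),
      breaks = c(0, 0.2, 0.5, 0.8, Inf),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE)
}
interpret_d(cohen_d)  # "small" for d = 0.27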
# Load dataset
apartments <- read_excel("Apartments.xlsx")
# Show first rows
kable(head(apartments), caption = "Apartments dataset")
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | 0 | 1 |
| 18 | 1 | 2800 | 1 | 0 |
| 7 | 28 | 1660 | 0 | 0 |
| 28 | 29 | 1850 | 0 | 1 |
| 18 | 18 | 1640 | 1 | 1 |
| 28 | 12 | 1770 | 0 | 1 |
Description: the Parking and Balcony columns are binary indicators (0 = No, 1 = Yes), so I convert them to labelled factors.
apartments$Parking <- factor(apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
apartments$Balcony <- factor(apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
# Updated dataset
knitr::kable(head(apartments), caption = "Apartments dataset with categorical variables as factors")
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | No | Yes |
| 18 | 1 | 2800 | Yes | No |
| 7 | 28 | 1660 | No | No |
| 28 | 29 | 1850 | No | Yes |
| 18 | 18 | 1640 | Yes | Yes |
| 28 | 12 | 1770 | No | Yes |
# One-sample t-test
t_price <- t.test(apartments$Price, mu = 1900)
t_price
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
# Effect size (Cohen's d)
mean_price <- mean(apartments$Price)
sd_price <- sd(apartments$Price)
mu0 <- 1900
cohen_d <- (mean_price - mu0) / sd_price
cohen_d
## [1] 0.314791
The hypothesis is that the average apartment price per m² is 1900 €.
The sample mean turned out to be 2018.94 €/m², which is higher than 1900 €. The test result was t(84) = 2.90, p = 0.0047. Since the p-value is below 0.05, the difference is statistically significant. The 95% confidence interval, 1937.44–2100.44 €/m², does not include 1900, which backs up this conclusion.
The effect size, Cohen's d = 0.31, is a small-to-medium effect: the difference is real, but not extremely large.
Apartments in Ljubljana are on average priced noticeably above 1900 €/m².
# Simple regression: Price = f(Age)
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
# Correlation
cor_age_price <- cor(apartments$Age, apartments$Price)
cor_age_price
## [1] -0.230255
I estimated a simple regression function: Price = f(Age).
Regression coefficient (slope): The slope is –8.98, meaning that for every additional year of building age, the average apartment price decreases by about 9 €/m². Since the p-value = 0.034 (< 0.05), this effect is statistically significant.
Intercept: The intercept is 2185.46 €/m², the estimated price per m² for a new building (Age = 0).
Correlation coefficient (r): The correlation between Age and Price is –0.23, showing a weak negative relationship: as apartments get older, prices tend to decrease.
Coefficient of determination (R²): The R² value is 0.053, meaning that about 5.3% of the variation in apartment prices can be explained by age alone. The remaining variation depends on other factors.
Older apartments tend to be cheaper, but age alone explains only a small portion of price differences.
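To make this fit visible, the regression line can be overlaid on a scatterplot (a sketch in the same ggplot2 style used earlier; this plot was not part of the original output):

# Scatterplot of Price vs Age with the fitted regression line
ggplot(apartments, aes(x = Age, y = Price)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Price vs Age", x = "Age (years)", y = "Price (€/m²)")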
library(GGally)
# Scatterplot matrix
ggpairs(apartments[, c("Price", "Age", "Distance")])
- Price vs Age: weak negative relationship (correlation = –0.23); older apartments are slightly cheaper.
- Price vs Distance: stronger negative relationship (correlation = –0.63); apartments farther from the city center are cheaper.
- Age vs Distance: almost no correlation (correlation = 0.04).
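The same pairwise correlations can be printed directly, reproducing the values quoted above (a minimal check):

# Correlation matrix of the numeric variables
round(cor(apartments[, c("Price", "Age", "Distance")]), 2)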
# Multiple regression function
fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
# VIF for fit2
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF values for both Age (1.00) and
Distance (1.00) are very close to 1.
This means that the two predictors are practically uncorrelated, and
there is no problem with multicollinearity.
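With only two predictors this can be verified by hand, since VIF = 1 / (1 − r²), where r is the correlation between the predictors (a quick sanity check, not part of the original analysis):

# With two predictors, VIF = 1 / (1 - r^2); r = cor(Age, Distance) ≈ 0.04
r <- cor(apartments$Age, apartments$Distance)
1 / (1 - r^2)  # ~1.0018, matching vif(fit2)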
# Standardized residuals
std_resid <- rstandard(fit2)
# Cook's distances
cooks_d <- cooks.distance(fit2)
# Combine into a data frame for inspection
diagnostics <- data.frame(std_resid, cooks_d)
knitr::kable(head(diagnostics), caption = "First rows of diagnostics: standardized residuals & Cook's D")
| std_resid | cooks_d |
|---|---|
| -0.6653487 | 0.0073866 |
| 1.7832876 | 0.0303654 |
| -0.5937629 | 0.0058826 |
| 0.7543794 | 0.0082992 |
| -1.0733987 | 0.0051126 |
| -0.7775190 | 0.0049009 |
# Identify problematic units
#
# |standardized residual| > 2 (possible outlier)
# Cook's D > 4/n (influential point)
n <- nrow(apartments)
outliers <- which(abs(std_resid) > 2 | cooks_d > (4/n))
outliers
## 22 33 38 53 55
## 22 33 38 53 55
apartments[outliers, ]
## # A tibble: 5 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 37 3 2540 Yes Yes
## 2 2 11 2790 Yes No
## 3 5 45 2180 Yes Yes
## 4 7 2 1760 No Yes
## 5 43 37 1740 No No
# Standardized residuals and fitted values
std_resid <- rstandard(fit2)
std_fitted <- scale(fit2$fitted.values)
# Scatterplot
plot(std_fitted, std_resid,
main = "Residuals vs Fitted Values",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)
The residuals are scattered around zero without a clear pattern,
although the spread widens somewhat at higher fitted values.
This suggests that heteroskedasticity is not a serious problem, though
the mild increase in variance at higher fitted values is worth keeping
in mind.
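A formal test can complement the visual check, for example the Breusch–Pagan test (a sketch assuming the lmtest package is installed; its output is not shown in this report):

# Breusch-Pagan test: H0 = constant error variance (homoskedasticity)
library(lmtest)
bptest(fit2)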
# Standardized residuals
std_resid <- rstandard(fit2)
# Histogram
hist(std_resid, breaks = 10, col = "darkgreen", border = "black",
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals")
# Q-Q plot
qqnorm(std_resid, main = "Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "blue", lwd = 2)
# Formal test of normality (Shapiro-Wilk test)
shapiro_test <- shapiro.test(std_resid)
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: std_resid
## W = 0.95306, p-value = 0.00366
The histogram of standardized residuals looks roughly bell-shaped,
and the Q-Q plot shows that most points follow the straight line, with
only small deviations at the tails.
The Shapiro–Wilk test, however, rejects normality (W = 0.953,
p = 0.004). With n = 85 this test is sensitive to even minor
departures, so although the formal test flags non-normality, the
graphical diagnostics suggest the deviation is mild and not a serious
practical concern for the regression.
# Remove problematic units (identified earlier)
apartments_clean <- apartments[-outliers, ]
# Re-estimate regression: Price ~ Age + Distance
fit2_clean <- lm(Price ~ Age + Distance, data = apartments_clean)
# Show results
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Intercept (2502.47): Estimated price per m² for a
brand-new apartment located right in the city center (Age = 0, Distance
= 0). This is the baseline value.
Age (–8.67): For every additional year of apartment
age, the price decreases on average by about 9 €/m²,
keeping distance constant.
Distance (–24.06): For every additional kilometer away
from the city center, the price decreases on average by about 24
€/m², keeping age constant.
R² = 0.54: About 54% of the variation in
apartment prices can be explained by Age and Distance together,
which indicates a fairly strong model fit.
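Confidence intervals for these coefficients would make the precision of the estimates explicit (a one-line addition; output omitted here):

# 95% confidence intervals for the coefficients of the cleaned model
confint(fit2_clean)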
# Linear regression function Price = f(Age, Distance, Parking and Balcony)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
#### Fit2 and fit3 comparison using ANOVA
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
The comparison is not statistically significant (F(2, 75) = 2.24, p = 0.114): adding Parking and Balcony does not significantly improve on the model with Age and Distance alone.

#### Fit3 results
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Hypotheses tested with the F-statistic:
Null hypothesis (H₀): All slope coefficients are
equal to 0.
→ Age, Distance, Parking, and Balcony have no effect on Price.
Alternative hypothesis (H₁): At least one slope
coefficient is different from 0.
→ At least one of the predictors significantly affects Price.
F(4, 75) = 24.08, p < 0.001 → we reject H₀. The model as a whole is statistically significant, meaning that Age, Distance, Parking, and Balcony explain a significant share of apartment price variation.
# Fitted values and residuals from fit3
fitted_values <- fitted(fit3)
residuals <- resid(fit3)
# Data frame with row numbers as IDs
results <- data.frame(
Row = 1:length(fitted_values),
Fitted = fitted_values,
Residual = residuals
)
# Show only apartment ID2 (2nd row in dataset)
results[results$Row == 2, ]
## Row Fitted Residual
## 2 2 2356.597 443.4026

For apartment 2 the model predicts 2356.60 €/m², while its actual price is 2800 €/m² (see the data table above), giving a positive residual of 443.40 €/m²: this apartment is more expensive than the model predicts.
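In the same spirit, predict() can score a new, hypothetical apartment; the input values below are illustrative and not taken from the dataset:

# Predicted price for a hypothetical apartment: 10 years old, 5 km from the
# center, with parking and without a balcony (illustrative inputs only)
predict(fit3, newdata = data.frame(Age = 10, Distance = 5,
                                   Parking = "Yes", Balcony = "No"))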