Task 1 - Cost of living index in Europe

mydata <- read.table("./Cost_of_Living_Index_Europe.csv",
                     header = TRUE, 
                     sep = ",", 
                     dec = ".", 
                     check.names = FALSE)

mydata <- mydata[ , c("Rank", 
                      "Country",
                      "Cost of Living Index",
                      "Rent Index",
                      "Groceries Index",
                      "Local Purchasing Power Index")]

knitr::kable(head(mydata, 10), caption = "Cost of living in Europe")
Cost of living in Europe

| Rank | Country | Cost of Living Index | Rent Index | Groceries Index | Local Purchasing Power Index |
|------|---------|----------------------|------------|-----------------|------------------------------|
| 2 | Albania | 35.50 | 8.47 | 29.32 | 30.19 |
| 5 | Armenia | 33.89 | 11.61 | 27.59 | 28.86 |
| 7 | Austria | 71.04 | 27.13 | 65.88 | 77.25 |
| 13 | Belarus | 30.89 | 9.81 | 27.24 | 31.78 |
| 14 | Belgium | 72.61 | 25.79 | 63.32 | 79.72 |
| 18 | Bosnia And Herzegovina | 36.12 | 6.82 | 31.14 | 44.10 |
| 21 | Bulgaria | 38.38 | 9.07 | 34.02 | 45.96 |
| 29 | Croatia | 48.94 | 13.38 | 41.98 | 47.55 |
| 31 | Cyprus | 59.03 | 22.36 | 49.20 | 57.31 |
| 32 | Czech Republic | 48.24 | 18.40 | 42.76 | 66.47 |

For my analysis, I chose a dataset containing information about different indexes related to the cost of living. The original dataset consisted of all the countries in the world, but I selected only the European countries for my analysis. The original dataset included six indexes, and I focused on four of them.

The ranking may appear unusual because it is still calculated in relation to all the countries in the world.

The dataset consists of 44 units (countries) and 6 variables, of which 1 is categorical (Country) and 5 are numerical (Rank, Cost of Living Index, Rent Index, Groceries Index and Local Purchasing Power Index).
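
As a quick check of these dimensions (a small sketch; its output is omitted here):

dim(mydata)  # should return 44 (countries) and 6 (variables)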

#install.packages("tidyr")
library("tidyr")

mydata <- drop_na(mydata)

I wrote this to remove any rows with missing values. As it turns out, there are none: the dataset keeps all 44 rows.
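
A more direct way to verify this (a sketch I did not include in the original chunk):

sum(!complete.cases(mydata))  # 0 means no row contains a missing value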

mydata$`Ratio COL/Rent` <- round(
  mydata$`Cost of Living Index` / mydata$`Rent Index`, 2)

I introduced a new variable, the ratio, which compares the overall Cost of Living Index to the Rent Index. The ratio is rounded to two decimal places.

The higher the value of the ratio, the cheaper the rent is compared to the overall cost of living (rent takes up a smaller share of expenses). If the ratio has a lower value, the rent is relatively expensive compared to the overall cost of living (housing is a heavier burden).
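
As a worked example, using Albania's values from the table above:

round(35.50 / 8.47, 2)  # = 4.19, i.e. rent is a comparatively small share of living costs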

mydata$Country <- as.character(mydata$Country)
mydata$Country[mydata$Country == "Bosnia And Herzegovina"] <- "BIH"
mydata <- mydata[order(-mydata$`Ratio COL/Rent`), ]
mydata$Country <- factor(mydata$Country, levels = mydata$Country)

With this chunk I renamed Bosnia And Herzegovina to BIH so the label fits better in the graph, sorted the countries from the highest to the lowest ratio, and fixed the factor levels of Country so the bars keep that order in the plot.

#install.packages("ggplot2")
library(ggplot2)

ggplot(mydata, aes(x = Country, y = `Ratio COL/Rent`)) +
  geom_col(fill = "pink") +
  xlab("Country") +
  ylab("Cost of Living / Rent (ratio)") +
  ggtitle("Cost of Living to Rent Ratio — Europe") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 7))

This graph shows the ranking of European countries based on the ratio, calculated as the Cost of Living Index divided by the Rent Index, ordered from the highest to the lowest ratio.

The country with the highest ratio is BIH (Bosnia and Herzegovina), meaning that rent makes up only a small share of the overall cost of living compared to other countries. At the other end, Luxembourg has the lowest ratio, where rent represents a larger proportion of living costs, making housing comparatively more expensive.

summary(mydata[ , c("Cost of Living Index", "Rent Index", "Groceries Index",
                    "Local Purchasing Power Index", "Ratio COL/Rent")])
##  Cost of Living Index   Rent Index    Groceries Index  Local Purchasing Power Index Ratio COL/Rent 
##  Min.   : 27.05       Min.   : 6.06   Min.   : 22.64   Min.   : 28.86               Min.   :1.340  
##  1st Qu.: 35.97       1st Qu.:10.47   1st Qu.: 30.82   1st Qu.: 39.26               1st Qu.:2.600  
##  Median : 48.70       Median :16.24   Median : 42.13   Median : 56.01               Median :2.915  
##  Mean   : 54.69       Mean   :20.31   Mean   : 47.83   Mean   : 60.38               Mean   :3.078  
##  3rd Qu.: 71.22       3rd Qu.:26.25   3rd Qu.: 62.41   3rd Qu.: 80.36               3rd Qu.:3.425  
##  Max.   :123.35       Max.   :60.09   Max.   :128.13   Max.   :118.44               Max.   :5.300
library(knitr)

num_vars <- c("Cost of Living Index", 
              "Rent Index", 
              "Groceries Index", 
              "Local Purchasing Power Index")

stats_all <- sapply(mydata[, num_vars], function(x) {
  c(Mean   = round(mean(x, na.rm = TRUE), 2),
    Median = round(median(x, na.rm = TRUE), 2),
    SD     = round(sd(x, na.rm = TRUE), 2))})

stats_all <- t(stats_all)

stats_all <- as.data.frame(stats_all)

kable(stats_all, caption = "Descriptive Statistics for Selected Variables")
Descriptive Statistics for Selected Variables

| Variable | Mean | Median | SD |
|----------|------|--------|----|
| Cost of Living Index | 54.69 | 48.70 | 22.12 |
| Rent Index | 20.31 | 16.24 | 12.69 |
| Groceries Index | 47.83 | 42.13 | 22.29 |
| Local Purchasing Power Index | 60.38 | 56.01 | 24.06 |

The Cost of Living Index has a mean of 54.69 and a median of 48.70; since the mean exceeds the median, this suggests a mild right skew in the distribution.
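
This can be checked with Pearson's second skewness coefficient, 3 × (mean − median) / SD; a quick sketch using the values from the table:

3 * (54.69 - 48.70) / 22.12  # ≈ 0.81 > 0, consistent with a right skew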

The Rent Index has the lowest average (20.31) and also the lowest variability in absolute terms (SD = 12.69), which means that rent levels differ less across countries than the other indices do.

The Groceries Index has an average of 47.83, close to the Cost of Living Index, with a similar spread (SD = 22.29).

The Local Purchasing Power Index has the highest average (60.38) and a large spread (SD = 24.06), indicating that purchasing power varies greatly among countries.

library(ggplot2)

ggplot(mydata, aes(x = `Local Purchasing Power Index`, y = `Cost of Living Index`)) +
  geom_point(shape = 21, color = "darkolivegreen3", fill = NA, size = 2, stroke = 1) + 
  scale_x_continuous(breaks = seq(0, max(mydata$`Local Purchasing Power Index`), by = 15), 
                     position = "bottom") + 
  geom_smooth(method = "lm", se = FALSE, color = "violetred3", linetype = "dashed") +  
  xlab("Local Purchasing Power Index") +
  ylab("Cost of Living Index") +
  ggtitle("Link Between Purchasing Power and Cost of Living") +
  theme_bw()

The scatterplot shows the relationship between the Local Purchasing Power Index (x-axis) and the Cost of Living Index (y-axis), where each point represents one country.

The points form an upward trend, suggesting that higher purchasing power is associated with higher costs of living. Although people earn more, goods and services are also more expensive.

If the points were to form a downward trend, it would suggest that countries with higher local purchasing power tend to have a lower cost of living, meaning people in those countries can afford more with their income.

From this graph, we can conclude that purchasing power and the cost of living are positively related.
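
To quantify the strength of this relationship, one could also compute the correlation coefficient (a sketch; its value is not part of my original output):

cor(mydata$`Local Purchasing Power Index`, mydata$`Cost of Living Index`)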

library(ggplot2)

mydata$LPP_group <- cut(
  mydata$`Local Purchasing Power Index`,
  breaks = c(-Inf, 40, 70, Inf),  
  labels = c("Low", "Medium", "High"))

ggplot(mydata, aes(x = LPP_group, y = `Rent Index`)) +
  geom_boxplot(fill = "steelblue4") +
  xlab("LPP group") +
  ylab("Rent Index") +
  scale_y_continuous(breaks = seq(0, max(mydata$`Rent Index`, na.rm = TRUE), by = 10)) +
  theme_light()

I then created a new categorical variable, LPP group, which divides countries into three groups based on their Local Purchasing Power Index: Low (values up to 40), Medium (values between 40 and 70), and High (values above 70).

In the next step, I plotted a boxplot of the Rent Index for each of these groups.

For each group (Low, Medium, High), the box shows the middle 50% of the Rent Index values. The line inside the box represents the median Rent Index for that group, while the “whiskers” show the range of most Rent Index values. Any dots outside the “whiskers” are outliers, meaning that rents in those countries are very different from the rest of their group.

I wanted to see whether there is any association between a country's purchasing power and its rents. Groups with higher purchasing power also show a higher median Rent Index, which means that countries with higher purchasing power tend to have higher rents.

Conversely, the Low LPP group has the lowest median Rent Index, which means that rents are cheaper where people have less purchasing power.
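
These group medians can also be computed directly (a sketch; its output is not shown in the original report):

tapply(mydata$`Rent Index`, mydata$LPP_group, median, na.rm = TRUE)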

Because the spread of each box (its height) reflects the variability of rents within that group, we can conclude that rents vary more in countries with higher purchasing power (and higher rents) than in those with lower rents.

Even though some outliers highlight “exceptional” countries, the overall pattern across the LPP groups still holds.

Task 2 - Business school

library(readxl)
library(knitr)

mydata <- read_xlsx("~/Documents/Magisterij /IMB /Bootcamp /Statistika/Take_home_exam/R Take Home Exam 2025-2/Task 2/Business School.xlsx")

mydata <- mydata[, c("Undergrad Degree", "Annual Salary", "MBA Grade")]

kable(head(mydata, 10), caption = "Business School data (first 10 rows)")
Business School data (first 10 rows)

| Undergrad Degree | Annual Salary | MBA Grade |
|------------------|---------------|-----------|
| Business | 111000 | 90.2 |
| Computer Science | 107000 | 68.7 |
| Finance | 109000 | 83.3 |
| Business | 148000 | 88.7 |
| Finance | 255500 | 75.4 |
| Computer Science | 103500 | 82.1 |
| Engineering | 114500 | 66.9 |
| Engineering | 124000 | 76.8 |
| Business | 132500 | 72.3 |
| Finance | 99000 | 72.4 |

I imported the data set and selected the three variables relevant for this task: Undergrad Degree, Annual Salary, and MBA Grade.

The table above displays the first 10 rows of the data set to provide an overview of the data that will be analyzed later.

The Undergrad Degree column identifies the students’ field of undergraduate study.

The Annual Salary column reports their annual income (in USD).

The MBA Grade column shows their performance in the MBA program.

library(ggplot2)

ggplot(mydata, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "magenta4", color = "black", width = 0.6) +  
  xlab("Undergraduate Degree") +
  ylab("Count of Students") +
  ggtitle("Distribution of Undergraduate Degrees (MBA Cohort)") +
  scale_y_continuous(breaks = seq(0, 35, 5)) +  
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart shows the distribution of undergraduate degrees among MBA students. Business is the most common background, followed by Computer Science and Finance, while Art and Engineering are less represented.

library(knitr)

deg_counts <- sort(table(mydata$`Undergrad Degree`), decreasing = TRUE)

kable(
  as.data.frame(deg_counts),
  col.names = c("Undergraduate Degree", "Number of Students"),
  caption = "Counts of Undergraduate Degrees (sorted)")
Counts of Undergraduate Degrees (sorted)

| Undergraduate Degree | Number of Students |
|----------------------|--------------------|
| Business | 35 |
| Computer Science | 25 |
| Finance | 25 |
| Engineering | 9 |
| Art | 6 |

The table shows the distribution of undergraduate degrees among the MBA students and confirms what the graph showed. The most common background is Business (35 students), followed by Computer Science and Finance (25 each). Engineering (9) and Art (6) are less frequent, which indicates that students who pursue an MBA most commonly have a business background.

salary <- mydata$`Annual Salary`

mean_sal   <- round(mean(salary), 2)
median_sal <- round(median(salary), 2)
sd_sal     <- round(sd(salary), 2)
min_sal    <- round(min(salary), 2)
max_sal    <- round(max(salary), 2)
q1_sal     <- round(quantile(salary, 0.25), 2)
q3_sal     <- round(quantile(salary, 0.75), 2)

kable(data.frame(
  Statistic = c("Mean","Median","SD","Min","Q1","Q3","Max"),
  Value = c(mean_sal, median_sal, sd_sal, min_sal, q1_sal, q3_sal, max_sal)),
  caption = "Annual Salary — Descriptive Statistics")
Annual Salary — Descriptive Statistics

| Statistic | Value |
|-----------|-------|
| Mean | 109058.00 |
| Median | 103500.00 |
| SD | 41501.49 |
| Min | 20000.00 |
| Q1 | 87125.00 |
| Q3 | 124000.00 |
| Max | 340000.00 |

The average annual salary in the dataset is 109,058, with a median of 103,500, suggesting that salaries are slightly right-skewed (since the mean is higher than the median).

The salaries range from a minimum of 20,000 to a maximum of 340,000, indicating a wide variation.

The interquartile range (from Q1 = 87,125 to Q3 = 124,000) shows that the middle 50% of students earn within this range, while the relatively large standard deviation (41,501) confirms that there is large variability in salaries among students.

ggplot(mydata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = diff(range(salary))/12, fill = "cornflowerblue", color = "navyblue") +
  xlab("Annual Salary") + ylab("Frequency") +
  ggtitle("Distribution of Annual Salary") +
  theme_minimal()

The histogram shows the distribution of annual salaries among MBA students. Most students earn between 80,000 and 130,000, with a peak around 100,000, while a few students earn much higher salaries, creating a right-skewed distribution.

This suggests that although the majority of salaries are clustered in the lower range, there are outliers with significantly higher incomes.
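
Those high-income outliers can be listed directly (a sketch; its output is not part of my original report):

head(sort(salary, decreasing = TRUE), 5)  # the five highest annual salaries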

On the x-axis, salaries are displayed in scientific notation, where 1e+05 represents 1×10^5 = 100,000, 2e+05 is 200,000, and so on.
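
If full numbers are preferred on the axis, the labels can be reformatted; a sketch of the same histogram, assuming the scales package is installed:

library(scales)

ggplot(mydata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = diff(range(salary))/12, fill = "cornflowerblue", color = "navyblue") +
  scale_x_continuous(labels = label_comma()) +  # e.g. 100,000 instead of 1e+05
  xlab("Annual Salary") + ylab("Frequency") +
  ggtitle("Distribution of Annual Salary") +
  theme_minimal()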

library(broom)
library(knitr)
library(dplyr)

tt <- t.test(mydata$`MBA Grade`, mu = 74, alternative = "two.sided")

tt_tidy <- tidy(tt) %>%
  rename(
    Estimate = estimate,
    Statistic = statistic,
    `P-value` = p.value,
    `Degrees of Freedom` = parameter,
    `CI Lower` = conf.low,
    `CI Upper` = conf.high,
    Method = method,
    Alternative = alternative)

tt_tidy$Alternative <- "Two-sided"

kable(tt_tidy, caption = "T-test results for MBA Grade (H₀: μ = 74)")
T-test results for MBA Grade (H₀: μ = 74)

| Estimate | Statistic | P-value | Degrees of Freedom | CI Lower | CI Upper | Method | Alternative |
|----------|-----------|---------|--------------------|----------|----------|--------|-------------|
| 76.04055 | 2.658658 | 0.0091499 | 99 | 74.51764 | 77.56346 | One Sample t-test | Two-sided |

The one-sample t-test shows that the average MBA grade in the sample (76.04) is significantly higher than the hypothesized mean of 74.

The 95% confidence interval for the mean ranges from 74.52 to 77.56, which does not include 74, supporting the conclusion that the true mean differs from the hypothesized value.
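
The interval can also be reproduced by hand from the t distribution (a sketch):

x <- mydata$`MBA Grade`
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))  # ≈ [74.518, 77.563]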

grade <- mydata$`MBA Grade`

mean_grade <- round(mean(grade, na.rm = TRUE), 3)
sd_grade   <- round(sd(grade, na.rm = TRUE), 3)
t_stat     <- round(unname(tt$statistic), 3)
df_val     <- unname(tt$parameter)
p_val      <- round(tt$p.value, 3)
ci_low     <- round(tt$conf.int[1], 3)
ci_high    <- round(tt$conf.int[2], 3)
cohen_d    <- round((mean(grade, na.rm = TRUE) - 74) / sd(grade, na.rm = TRUE), 3)

summary_table <- data.frame(
  Statistic = c("Mean (MBA Grade)", "SD (MBA Grade)", "t statistic", "df",
                "p-value", "95% CI Lower", "95% CI Upper", "Cohen's d"),
  Value = c(mean_grade, sd_grade, t_stat, df_val, p_val, ci_low, ci_high, cohen_d)
)

kable(summary_table,
      caption = "Summary statistics and effect size for MBA Grade (vs μ = 74)")
Summary statistics and effect size for MBA Grade (vs μ = 74)

| Statistic | Value |
|-----------|-------|
| Mean (MBA Grade) | 76.041 |
| SD (MBA Grade) | 7.675 |
| t statistic | 2.659 |
| df | 99.000 |
| p-value | 0.009 |
| 95% CI Lower | 74.518 |
| 95% CI Upper | 77.563 |
| Cohen’s d | 0.266 |

The average MBA grade in the sample is 76.041 (SD = 7.675), which is higher than the hypothesized mean of 74. The t-test result (t(99)=2.659, p=0.009) indicates that this difference is statistically significant at the 1% level.

The 95% confidence interval [74.518, 77.563] does not include 74, supporting the conclusion.

However, the effect size (Cohen’s d = 0.266) suggests that the difference is small in practical terms.

Task 3 - Apartments in Ljubljana

Import the dataset Apartments.xlsx

library(readxl)
library(knitr)
library(kableExtra)

apartments <- read_excel("~/Documents/Magisterij /IMB /Bootcamp /Statistika/Take_home_exam/R Take Home Exam 2025-2/Task 3/Apartments.xlsx")

kable(head(apartments, 10), caption = "Overview of the Apartments Dataset") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Overview of the Apartments Dataset

| Age | Distance | Price | Parking | Balcony |
|-----|----------|-------|---------|---------|
| 7 | 28 | 1640 | 0 | 1 |
| 18 | 1 | 2800 | 1 | 0 |
| 7 | 28 | 1660 | 0 | 0 |
| 28 | 29 | 1850 | 0 | 1 |
| 18 | 18 | 1640 | 1 | 1 |
| 28 | 12 | 1770 | 0 | 1 |
| 14 | 20 | 1850 | 0 | 1 |
| 18 | 6 | 1970 | 1 | 1 |
| 22 | 7 | 2270 | 1 | 0 |
| 25 | 2 | 2570 | 1 | 0 |

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m²
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

library(knitr)

apartments$Parking <- factor(apartments$Parking,
                             levels = c("0", "1"),
                             labels = c("No", "Yes"))

apartments$Balcony <- factor(apartments$Balcony,
                             levels = c("0", "1"),
                             labels = c("No", "Yes"))

kable(head(apartments, 10), caption = "Apartments Dataset with Factors")
Apartments Dataset with Factors

| Age | Distance | Price | Parking | Balcony |
|-----|----------|-------|---------|---------|
| 7 | 28 | 1640 | No | Yes |
| 18 | 1 | 2800 | Yes | No |
| 7 | 28 | 1660 | No | No |
| 28 | 29 | 1850 | No | Yes |
| 18 | 18 | 1640 | Yes | Yes |
| 28 | 12 | 1770 | No | Yes |
| 14 | 20 | 1850 | No | Yes |
| 18 | 6 | 1970 | Yes | Yes |
| 22 | 7 | 2270 | Yes | No |
| 25 | 2 | 2570 | Yes | No |

Test the hypothesis H₀: μ_Price = 1900 EUR. What can you conclude?

library(broom)
library(knitr)
library(dplyr)

price <- apartments$Price  

tt_price <- t.test(price, mu = 1900, alternative = "two.sided")

mean_price <- round(mean(price, na.rm = TRUE), 3)
sd_price   <- round(sd(price, na.rm = TRUE), 3)
t_stat     <- round(unname(tt_price$statistic), 3)
df_val     <- unname(tt_price$parameter)
p_val      <- round(tt_price$p.value, 3)
ci_low     <- round(tt_price$conf.int[1], 3)
ci_high    <- round(tt_price$conf.int[2], 3)
cohen_d    <- round((mean(price, na.rm = TRUE) - 1900) / sd(price, na.rm = TRUE), 3)

summary_table <- data.frame(
  Statistic = c("Mean (Price)", "SD (Price)", "t statistic", "df", "p-value", 
                "95% CI Lower", "95% CI Upper", "Cohen's d"),
  Value = c(mean_price, sd_price, t_stat, df_val, p_val, ci_low, ci_high, cohen_d))

tt_price_tidy <- tidy(tt_price) %>%
  rename(
    Estimate = estimate,
    Statistic = statistic,
    `P-value` = p.value,
    `Degrees of Freedom` = parameter,
    `CI Lower` = conf.low,
    `CI Upper` = conf.high,
    Method = method,
    Alternative = alternative)

tt_price_tidy$Alternative <- "Two-sided"

kable(tt_price_tidy, caption = "T-test results for Apartment Price (H₀: μ = 1900 EUR)")
T-test results for Apartment Price (H₀: μ = 1900 EUR)

| Estimate | Statistic | P-value | Degrees of Freedom | CI Lower | CI Upper | Method | Alternative |
|----------|-----------|---------|--------------------|----------|----------|--------|-------------|
| 2018.941 | 2.90223 | 0.0047306 | 84 | 1937.443 | 2100.44 | One Sample t-test | Two-sided |
kable(summary_table, caption = "Summary statistics and effect size for Apartment Price (vs μ = 1900 EUR)")
Summary statistics and effect size for Apartment Price (vs μ = 1900 EUR)

| Statistic | Value |
|-----------|-------|
| Mean (Price) | 2018.941 |
| SD (Price) | 377.842 |
| t statistic | 2.902 |
| df | 84.000 |
| p-value | 0.005 |
| 95% CI Lower | 1937.443 |
| 95% CI Upper | 2100.440 |
| Cohen’s d | 0.315 |

The one-sample t-test shows that the mean apartment price per m² (2018.94€) is significantly higher than the hypothesized value of 1900€ (t(84) = 2.90, p = 0.005).

The 95% confidence interval [1937.44, 2100.44] does not include 1900, meaning we can reject the null hypothesis.

The effect size (Cohen’s d = 0.315) indicates a small-to-moderate practical difference.

To conclude, apartments in Ljubljana are, on average, priced noticeably higher than 1900€ per m².

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = apartments)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
reg_coef <- coef(fit1)[2]                   
cor_val  <- cor(apartments$Price, apartments$Age) 
r_squared <- summary(fit1)$r.squared       

reg_coef
##       Age 
## -8.975058
cor_val
## [1] -0.230255
r_squared
## [1] 0.05301737
library(knitr)

summary_table <- data.frame(
  Statistic = c("Regression Coefficient (Age)", "Correlation (Price, Age)", "R-squared"),
  Value = c(reg_coef, cor_val, r_squared))

kable(summary_table, caption = "Regression results: Price as a function of Age")
Regression results: Price as a function of Age

| Statistic | Value |
|-----------|-------|
| Regression Coefficient (Age) | -8.9750576 |
| Correlation (Price, Age) | -0.2302550 |
| R-squared | 0.0530174 |

Regression Coefficient (Age) = −8.975. This means that for every additional year of apartment age, the price per m² decreases by about 9 EUR, on average. The negative sign shows that older apartments are generally cheaper.

Coefficient of Correlation (r) = –0.230. The correlation between Age and Price is negative. This confirms that older apartments have a lower price, but since the value is close to 0, the relationship is not strong.

Coefficient of Determination (R²) = 0.053. About 5.3% of the variation in apartment prices can be explained by the age of the apartment. This is a relatively small proportion, meaning other factors (distance, parking, balcony, etc.) likely explain price much better.
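
Note that in simple regression R² is just the squared correlation coefficient; a quick check using the objects computed above:

cor_val^2  # (-0.230255)^2 ≈ 0.053, equal to r_squared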

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(GGally)

ggpairs(apartments[, c("Price", "Age", "Distance")],
        title = "Scatterplot Matrix: Price, Age, and Distance")

Price vs Age: correlation = –0.230. This suggests that as apartments get older, the price per m² decreases slightly, but the relationship is weak.

Price vs Distance: correlation = –0.631. This clear relationship shows that apartments further from the city center tend to be noticeably cheaper.

Age vs Distance: correlation = 0.043. There is no meaningful relationship between age and distance, because the value is close to 0.

Multicollinearity is only a problem if the independent variables (Age and Distance) are strongly correlated with each other. Here, their correlation is only 0.043, so there is no sign of a multicollinearity problem.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apartments)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
library(knitr)
library(broom)

fit2_tidy <- tidy(fit2)
fit2_glance <- glance(fit2)

summary_table <- data.frame(
  Term = fit2_tidy$term,
  Estimate = round(fit2_tidy$estimate, 3),
  `Std. Error` = round(fit2_tidy$std.error, 3),
  `t value` = round(fit2_tidy$statistic, 3),
  `p value` = round(fit2_tidy$p.value, 3),
  check.names = FALSE)  # keep readable column names (avoids "Std..Error" in the table)

kable(summary_table, caption = "Multiple Regression Results: Price ~ Age + Distance")
Multiple Regression Results: Price ~ Age + Distance

| Term | Estimate | Std. Error | t value | p value |
|------|----------|------------|---------|---------|
| (Intercept) | 2460.101 | 76.632 | 32.103 | 0.000 |
| Age | -7.934 | 3.225 | -2.460 | 0.016 |
| Distance | -20.667 | 2.748 | -7.520 | 0.000 |

I created this table for a clearer presentation of the results.

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif_values <- vif(fit2)

library(knitr)
kable(as.data.frame(vif_values), 
      caption = "Variance Inflation Factor (VIF) for predictors in fit2")
Variance Inflation Factor (VIF) for predictors in fit2

| Predictor | VIF |
|-----------|-----|
| Age | 1.001845 |
| Distance | 1.001845 |

The VIF values for both Age and Distance are approximately 1 (1.0018), which indicates that the predictors are not correlated with each other. Since VIF values below 5 (and especially those close to 1) suggest no multicollinearity, we conclude that there is no evidence of multicollinearity in this regression model.
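
For reference, a VIF can be reproduced from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors; a sketch:

r2_age <- summary(lm(Age ~ Distance, data = apartments))$r.squared
1 / (1 - r2_age)  # ≈ 1.0018, matching vif(fit2) for Age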

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

library(car)

std_res <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)

thr_res  <- 3                    # rule of thumb: |standardized residual| > 3 flags an outlier
thr_cook <- 4 / length(std_res)  # rule of thumb: Cook's distance > 4/n flags high influence

problematic_idx <- which(abs(std_res) > thr_res | cooks_d > thr_cook)

apartments_clean <- if (length(problematic_idx)) apartments[-problematic_idx, ] else apartments

fit2_clean <- lm(Price ~ Age + Distance, data = apartments_clean)

cat("Removed", length(problematic_idx), "observations:",
    if (length(problematic_idx)) paste(problematic_idx, collapse = ", ") else "(none)", "\n")
## Removed 5 observations: 22, 33, 38, 53, 55
cat("Cook's D threshold:", round(thr_cook, 5), "\n\n")
## Cook's D threshold: 0.04706
cat("Original model (fit2):\n")
## Original model (fit2):
print(summary(fit2))
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
cat("\nCleaned model (fit2_clean):\n")
## 
## Cleaned model (fit2_clean):
print(summary(fit2_clean))
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

std_resid <- rstandard(fit2_clean)
std_fitted <- scale(fitted(fit2_clean))

plot(std_fitted, std_resid,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Heteroskedasticity")
abline(h = 0, col = "hotpink2", lwd = 2)

lines(lowess(std_fitted, std_resid), col = "lightskyblue", lwd = 2)

The scatterplot of standardized residuals versus standardized fitted values shows a curved and slightly spread-out pattern. This means the variation in residuals is not the same across all fitted values. In other words, there are signs of heteroskedasticity.

This is important because heteroskedasticity can make the estimated standard errors unreliable, which may lead to incorrect conclusions about which predictors are significant.
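
A formal check would be the Breusch-Pagan test; a sketch assuming the lmtest package is installed (its result is not part of my original output):

library(lmtest)

bptest(fit2_clean)  # H0: constant error variance; a small p-value indicates heteroskedasticity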

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

std_resid <- rstandard(fit2_clean)

par(mfrow = c(1, 2))  
hist(std_resid,
     main = "Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "plum1", border = "violetred",
     ylim = c(0, max(hist(std_resid, plot = FALSE)$counts) + 2))

qqnorm(std_resid, main = "Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "red", lwd = 2)

shapiro_test <- shapiro.test(std_resid)

library(knitr)
shapiro_table <- data.frame(
  Statistic = c("W Statistic", "p-value"),
  Value = c(round(unname(shapiro_test$statistic), 3),  # unname() drops the leaked "W" row label
            round(shapiro_test$p.value, 3)))

kable(shapiro_table, caption = "Shapiro-Wilk Test for Normality of Standardized Residuals")
Shapiro-Wilk Test for Normality of Standardized Residuals

| Statistic | Value |
|-----------|-------|
| W Statistic | 0.942 |
| p-value | 0.001 |

To check if the standardized residuals are normally distributed, I looked at both the graphs and the formal test.

The histogram looks roughly bell-shaped, but it is not perfectly symmetric.

The Q–Q plot shows most points close to the red line, but the points at the ends deviate, which means the distribution is not perfectly normal.

The Shapiro–Wilk test gave a p-value of 0.001, which is below 0.05. This means we reject the assumption of normality.

We can conclude that the residuals are not normally distributed: they fit reasonably well in the middle, but the tails deviate, and the formal test confirms this.

Estimate fit2 again without the potentially problematic (excluded) units and show the summary of the model. Explain all coefficients.

fit2_summary <- summary(fit2_clean)

coef_table <- as.data.frame(fit2_summary$coefficients)
coef_table$Variable <- rownames(coef_table)
rownames(coef_table) <- NULL


coef_table <- coef_table[, c("Variable", "Estimate", "Std. Error", "t value", "Pr(>|t|)")]

library(knitr)
kable(coef_table, caption = paste("Regression results for Price ~ Age + Distance"))
Regression results for Price ~ Age + Distance

| Variable | Estimate | Std. Error | t value | Pr(>\|t\|) |
|----------|----------|------------|---------|------------|
| (Intercept) | 2502.466701 | 75.023970 | 33.355562 | 0.0000000 |
| Age | -8.673848 | 3.220960 | -2.692939 | 0.0086876 |
| Distance | -24.063024 | 2.691783 | -8.939438 | 0.0000000 |

The regression table shows the estimates, their standard errors, t-values, and p-values.

Among the estimated coefficients, the intercept (2502.47) means that a hypothetical apartment with age = 0 and distance = 0 would have a predicted price of about 2502 EUR per m². The Age coefficient (-8.67) shows that with every additional year, the price decreases by about 9 EUR per m², holding distance constant. The Distance coefficient (-24.06) shows that for every extra kilometer from the city center, the price drops by about 24 EUR per m², holding age constant.

The standard errors are relatively small, which means the estimates are quite precise.

The t-value is calculated as the estimate divided by its standard error, and it shows how strongly the coefficient differs from zero. Larger absolute t-values mean stronger evidence. In this case, Age has a t-value of -2.69 and Distance has -8.94, both of which are far from zero, so together with the low p-values this confirms that both predictors significantly affect apartment price.
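
As a quick arithmetic check using the values from the table:

-24.063 / 2.692  # ≈ -8.94, the reported t-value for Distance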

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

With the function anova(), check whether model fit3 fits the data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit3 fits the data better than fit2: the ANOVA (partial F-test) gave a p-value of about 0.01, which is below 0.05. This means that including Parking and Balcony explains significantly more of the variation in Price, so fit3 is the preferred model.
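
The F-statistic in the ANOVA table can be reproduced from the two residual sums of squares (a sketch):

((6720983 - 5991088) / 2) / (5991088 / 80)  # ≈ 4.87, matching the output above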

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output?

library(broom)
library(knitr)

fit3_table <- tidy(fit3)

fit3_table$estimate <- round(fit3_table$estimate, 3)
fit3_table$std.error <- round(fit3_table$std.error, 3)
fit3_table$statistic <- round(fit3_table$statistic, 3)
fit3_table$p.value <- round(fit3_table$p.value, 4)

kable(fit3_table, caption = "Regression Results for Model fit3")
Regression Results for Model fit3

| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 2301.667 | 94.271 | 24.415 | 0.0000 |
| Age | -6.799 | 3.110 | -2.186 | 0.0317 |
| Distance | -18.045 | 2.758 | -6.543 | 0.0000 |
| ParkingYes | 196.168 | 62.868 | 3.120 | 0.0025 |
| BalconyYes | 1.935 | 60.014 | 0.032 | 0.9744 |

First categorical variable, parking: apartments with parking cost on average about 196 EUR per m² more than those without parking. This effect is statistically significant (p = 0.0025).

Second categorical variable, balcony: apartments with a balcony cost on average about 2 EUR per m² more than those without one. However, this effect is not significant (p = 0.97), meaning balconies do not meaningfully affect price.

Hypothesis tested by the F-statistic: H₀ (null hypothesis): none of the predictors (Age, Distance, Parking, Balcony) influence apartment prices, i.e. all regression coefficients equal zero. H₁ (alternative hypothesis): at least one predictor has an effect on apartment prices.
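
The overall F-statistic and its p-value can also be extracted from the model object (a sketch):

fs <- summary(fit3)$fstatistic  # F = 20.03 on 4 and 80 degrees of freedom
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  # p-value ≈ 1.85e-11, as reported above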

Save fitted values and calculate the residual for apartment ID2.

fitted_values <- fitted(fit3)
residuals_values <- resid(fit3)

ID2_fitted <- fitted_values[2]
ID2_residual <- residuals_values[2]

library(knitr)
ID2_results <- data.frame(
  Apartment = "ID2",
  Fitted_Value = round(unname(ID2_fitted), 2),   # unname() drops the leaked "2" row label
  Residual = round(unname(ID2_residual), 2))

kable(ID2_results, caption = "Fitted Value and Residual for Apartment ID2")
Fitted Value and Residual for Apartment ID2

| Apartment | Fitted_Value | Residual |
|-----------|--------------|----------|
| ID2 | 2357.41 | 442.59 |

For apartment ID2, the regression model predicts a fitted value of 2357.41 EUR. The actual price of this apartment is 2800 EUR, which is higher than the predicted value. The residual for ID2 is therefore 442.59 EUR, meaning that the model underestimates the price of this apartment by about 443 EUR.
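
As a quick check of this arithmetic:

apartments$Price[2] - ID2_fitted  # 2800 - 2357.41 = 442.59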

This difference (residual) shows how far the observed value is from the predicted regression line. In this case, the positive residual indicates that the apartment is more expensive than what the model expects, given its age, distance, parking, and balcony.