Ana Stajnko

Task 1

A)

Data set consists of 418 units at the start and then 331 after removing those with missing data.

Population are all passengers that were on Titanic and sample is those 418 passenger selected for analysis.

One unit is one passenger.

Variables:

  • 12 variables and 5 after filtering them.

  • Numerical ratio variables: “Age”, “SibSp”, “Fare”

  • Categorical nominal variable: “Survived”, “Name”,

Variable Meaning:

  • SurvivedF: Indicates if the passenger survived (1) or died (0) during the Titanic disaster.

  • Name: Full name of the passenger.

  • Age: of the passenger in years.

  • SibSp: Number of siblings or spouses traveling with the passenger. We use it to calculate Family Size.

  • Parch: Number of parents or children traveling with the passenger. We use it to calculate FamilySize.

  • Fare: Price of the passenger’s ticket in pounds.

B)

mydata <- read.table("~/R/Bootcamp2025/R Take Home Exam 2025/tested.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")

mydata$Fare <- round(mydata$Fare, 2)
mydata$Age <- round(mydata$Age, 0)

mydata$SurvivedF <- factor (mydata$Survived,
                          levels = c(0, 1),
                          labels = c("No", "Yes")) 

#install.packages("tidyr")
library(tidyr)
mydata <- drop_na(mydata)

mydata$FamilySize <- mydata$SibSp + mydata$Parch + 1 

mydata2 <- mydata[, c("Name", "Age", "FamilySize","Fare", "SurvivedF")]

head(mydata2)
##                                           Name Age FamilySize  Fare SurvivedF
## 1                             Kelly, Mr. James  34          1  7.83        No
## 2             Wilkes, Mrs. James (Ellen Needs)  47          2  7.00       Yes
## 3                    Myles, Mr. Thomas Francis  62          1  9.69        No
## 4                             Wirz, Mr. Albert  27          1  8.66        No
## 5 Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22          3 12.29       Yes
## 6                   Svensson, Mr. Johan Cervin  14          1  9.22        No

C) DESCRIPTIVE STATISTICS

library(psych)
describe(mydata2[ , -c(1, 3,5)])
##      vars   n  mean    sd median trimmed   mad min    max  range skew kurtosis
## Age     1 331 30.16 14.11     27   29.54 11.86   0  76.00  76.00 0.45     0.07
## Fare    2 331 40.98 61.23     16   25.63 12.41   0 512.33 512.33 3.26    13.84
##        se
## Age  0.78
## Fare 3.37
summary(mydata2$SurvivedF)
##  No Yes 
## 204 127

Number of people in our sample that survived the Titanic disaster is 127.

50 % of people in our sample are 27 years old or younger, the other 50 % are older.

The maximum fare that was paid for the Titanic was 512.33 punds.

On average people in our sample were 30.16 years old.

D) GRAPHS

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata2, aes(x = Age)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "white") +
  geom_text(stat = "bin", bins = 10, aes(label = after_stat(count)),
            vjust = -0.5, size = 3.5) +
  geom_vline(xintercept = mean(mydata2$Age, na.rm = TRUE),
             color = "red", linetype = "dashed", linewidth = 1) +          #color of mean
  geom_vline(xintercept = median(mydata2$Age, na.rm = TRUE),
             color = "darkgreen", linetype = "dotdash", linewidth = 1) +   #color of medain
  labs(title = "Histogram of Passenger Age",
       x = "Age (years)",
       y = "Count") +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Histogram shows that distribution is slightly asymetrical to the right.The most cmmon age group is between 20 and 30 years. Number of people that are younger than 10 years in our sample is 22. Number of people above 60 is 10.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
medians <- mydata %>%          #first i calculated medians for each group
  group_by(SurvivedF) %>%
  summarise(med = median(Age, na.rm = TRUE))

ggplot(mydata, aes(x = SurvivedF, y = Age, fill = SurvivedF)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2,
               alpha = 0.7, width = 0.6) +
  geom_segment(data = medians,                         #here i added median points
               aes(x = as.numeric(SurvivedF) - 0.25,
                   xend = as.numeric(SurvivedF) + 0.25,
                   y = med, yend = med),
               inherit.aes = FALSE, color = "black", linewidth = 1.2) +
  geom_text(data = medians,                                   
            aes(x = SurvivedF, y = med, label = round(med, 1)),
            vjust = -1, fontface = "bold", color = "black") +
  scale_fill_manual(values = c("violet", "lightgreen")) +
  scale_y_continuous(breaks = seq(0, max(mydata$Age, na.rm = TRUE), by = 10)) +
  labs(title = "Age by Survival Status",
       x = "Survived",
       y = "Age") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold") )

With this boxplot we can see that those who survived and those who did not had similar range of age. The difference that is the most obvious one is outlier. Outlier of those who did survive Titanic is above the age of 70, while from the ones who did not surve is between the age of 65 and 70.

library(ggplot2)

ggplot(mydata, aes(x = Fare, y = Age, color = SurvivedF)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, size = 1.2) +
  scale_color_brewer(palette = "Pastel1") +
  labs(x = "Fare", y = "Age", color = "Survived") +
  theme_minimal(base_size = 14) +
  theme( panel.grid.major = element_line(color = "grey90"),
    panel.grid.minor = element_blank(),
    legend.position = "top",
    plot.title = element_text(hjust = 0.5, face = "bold"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot shows a weak positive relationship between fare and age, with most passengers clustered at low fares. Survivors are more frequent among those who paid higher fares, suggesting wealthier passengers had a greater chance of survival.

Task 2

library(readxl)

mydata <- read_excel("R Take Home Exam 2025/Task 2/Business School.xlsx")
head(mydata)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

A) GGPLOT

The most common degree is Business degree.

library(ggplot2)

ggplot(mydata, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
  geom_bar(color = "white") +
  geom_text(stat = "count", 
    aes(label = ..count..), 
    vjust = -0.3, 
    size = 4, 
    color = "black") +
  scale_fill_brewer(palette = "Pastel1") +     #so that every bar is its own colour
  labs(title = "Distribution of Undergraduate Degrees",
    x = "Undergraduate Degree",
    y = "Count") +
  theme_minimal(base_size = 14) +
  theme( plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_text(angle = 30, hjust = 1, size = 12),
    panel.grid.major.x = element_blank(),  #whit this and next one I removed major grid lines
    panel.grid.minor = element_blank(),
    legend.position = "none")  #i hid the legend 
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

B) Descriptive statistics of annual salary

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
options(scipen = TRUE)   #scientific notation
round(stat.desc(mydata$`Annual Salary`), 2)
##       nbr.val      nbr.null        nbr.na           min           max 
##        100.00          0.00          0.00      20000.00     340000.00 
##         range           sum        median          mean       SE.mean 
##     320000.00   10905800.00     103500.00     109058.00       4150.15 
##  CI.mean.0.95           var       std.dev      coef.var 
##       8234.80 1722373474.75      41501.49          0.38
library(scales)
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
library(ggplot2)

ggplot(mydata, aes(x = `Annual Salary`)) +
  geom_histogram( binwidth = 50000,
    aes(fill = ..count..),
    color = "white",
    alpha = 0.9 ) +
  geom_text( stat = "bin",
    binwidth = 50000,
    aes(label = ..count..),
    vjust = -0.5,
    color = "black",
    size = 3.5 ) +
  scale_fill_gradient(low = "lightgoldenrod1", high = "darkorange") +     #the highest is different colour
  scale_x_continuous(labels = comma_format()) +
  theme_minimal(base_size = 13) +
  theme( panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16)) +
  labs(title = "Distribution of Annual Salary",
    x = "Annual Salary",
    y = "Count")

Distribution of Annual Salary is slightly asymetrical to the right. The majority of MBA Students have Annual salary between 75.000EUR and 125.000 EUR. Just 1 has over 300.000EUR.

C) Testing of the Hypothesis

t.test(mydata$`MBA Grade`, mu = 74)
## 
##  One Sample t-test
## 
## data:  mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

At the p= 0.00915, which means p<0.05 we reject H0, meaning that MBA grade of the current generation significantly different from the previous year’s average of 74.

#install.packages("effectsize")

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(mydata$`MBA Grade`, mu = 74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

The effect size was small (Cohen’s d = 0.27), suggesting that the difference, while statistically reliable, represents a modest deviation in practical terms.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_excel("~/R/Bootcamp2025/R Take Home Exam 2025/Task 3/Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata$ParkingF <- factor (mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes")) 
mydata$BalconyF <- factor (mydata$Balcony,
                           levels = c(0, 1),
                           labels = c("No", "Yes"))
mydata2 <- mydata[, -c(4, 5)]
head(mydata2)
## # A tibble: 6 × 5
##     Age Distance Price ParkingF BalconyF
##   <dbl>    <dbl> <dbl> <fct>    <fct>   
## 1     7       28  1640 No       Yes     
## 2    18        1  2800 Yes      No      
## 3     7       28  1660 No       No      
## 4    28       29  1850 No       Yes     
## 5    18       18  1640 Yes      Yes     
## 6    28       12  1770 No       Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

At the p=0.004731, which means p<0.05, we reject H0 and conclude that average price is different from 1900EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, 
            data = mydata2)

options(scipen = TRUE)   #scientific notation
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata2$Age, mydata2$Price)
## [1] -0.230255

Explanation:

  • If the age of an apartment increases by 1 year, price per m2 will on average decrease by 8.975. (Regression coef)

  • 5,3 % of variability of the price of an apartmen per m2 is described by linear effect of an age of an apartment.(coef. of determination)

  • Relationship between age and price is negative and weak. (Coeff. of corelation)

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(mydata2 [  , c(3, 2, 1)],
                  smooth = FALSE)

Based on scatterplot matrix I would not say that there is a problem with multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, 
           data = mydata2)
options(scipen = TRUE)   #scientific notation

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)
max(vif(fit2))  #i added max so it only shows one number and not two identicall
## [1] 1.001845

Based on the VIS statistics, which is <5, we conclude there is not a problem with multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

std_resid <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)

outliers <- which(abs(std_resid) > 3 | cooks_d > 4/length(cooks_d))

mydata3 <- mydata2[-outliers, ]
summary(mydata3)
##       Age           Distance         Price      ParkingF BalconyF
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400   No :40   No :46  
##  1st Qu.:12.00   1st Qu.: 4.75   1st Qu.:1702   Yes:40   Yes:34  
##  Median :18.00   Median :12.50   Median :1930                    
##  Mean   :18.54   Mean   :13.89   Mean   :2008                    
##  3rd Qu.:24.00   3rd Qu.:19.25   3rd Qu.:2275                    
##  Max.   :45.00   Max.   :40.00   Max.   :2820
hist(cooks_d,
     main = "Cooks distance")

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

std_resid <- rstandard(fit2)
std_fitted <- scale(fitted(fit2))

plot(std_fitted, std_resid,
     main = "Standardized Residuals vs. Standardized Fitted Values",
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)

The scatterplot shows residuals randomly spread around zero with no clear pattern or funnel shape, suggesting no evidence of heteroskedasticity.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit2) #I decided to also formaly test it
## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 4.4439, df = 2, p-value = 0.1084
  • H0: The variance is constant

  • H1: The variance is not constant

Even formal testing of our model shows (p>0.05) that we fail to reject the null hypothesis. There is no statistical evidence of heteroskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

fit2_no <- lm(Price ~ Age + Distance,
              data = mydata3)

residuals_no <- rstandard (fit2_no)

hist(residuals_no,
     main = "Histogram of Standardised Residuals")

shapiro.test(residuals(fit2_no))  
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit2_no)
## W = 0.94289, p-value = 0.00138
  • H0: Errors are normally distributed

  • H1: Errors are not normally distributed

At the p<0.05 we reject H0 and conclude that errors are not normally distributed.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2_no)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

If the age of an apartment increase by 1 year, the price of an apartment on average falls for 8.674 per m2. If the distance to the center increases by 1 km, the price of an apartment will on average decrease by 24.063 per m2.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, 
           data = mydata3)  #from the data without outliers

With function anova check if model fit3 fits data better than model fit2.

anova(fit2_no, fit3)  #fit2_no is fit2 without outliers, so that both models do not have them
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135
  • H0: Model fit3 does not significantly improve the fit compared to Model 1

  • H1: Model fit3 provides a significantly better fit than Model 1

Based on p>0,05 we can not reject null hypothesis and conclude that fit3 fits significantly better than fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13
options(scipen = TRUE)   #scientific notation
  • Apartments with parking are on average for 196.168 more expensive per m2 than apartments without parking, holding all other variables constant.

  • Apartments with a balcony are on average for 1.935 more expensive per m2 than apartments without a balcony, holding all other variables constant.

F-statistics>:

  • H0: All regression coefficients are equal 0

  • H1: All regression coefficients are not equal to 0

Save fitted values and claculate the residual for apartment ID2.

mydata3$residuals <- residuals(fit3)    #I toolk mydata3, because it does not include outliers and influential units
mydata3$fitted <- fitted.values(fit3)

apt2_fitted <- mydata3$fitted[2]
apt2_residual <- mydata3$residuals[2]

 #Apartment2
round(apt2_fitted, 2)   
##      2 
## 2356.6
round(apt2_residual, 2)
##     2 
## 443.4