Ana Stajnko

Task 1

A)

The data set consists of 418 units at the start and 331 after removing those with missing data.

The population is all passengers who were on the Titanic; the sample is the 418 passengers selected for analysis.

One unit is one passenger.

Variables:

  • 12 variables in the original data set; 5 after filtering.

  • Numerical ratio variables: “Age”, “SibSp”, “Fare”

  • Categorical nominal variables: “Survived”, “Name”

Variable Meaning:

  • SurvivedF: Indicates if the passenger survived (1) or died (0) during the Titanic disaster.

  • Name: Full name of the passenger.

  • Age: Age of the passenger in years.

  • SibSp: Number of siblings or spouses traveling with the passenger. We use it to calculate FamilySize.

  • Parch: Number of parents or children traveling with the passenger. We use it to calculate FamilySize.

  • Fare: Price of the passenger’s ticket in pounds.

B)

mydata <- read.table("~/R/Bootcamp2025/R Take Home Exam 2025/tested.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")

mydata$Fare <- round(mydata$Fare, 2)
mydata$Age <- round(mydata$Age, 0)

mydata$SurvivedF <- factor(mydata$Survived,
                           levels = c(0, 1),
                           labels = c("No", "Yes"))

#install.packages("tidyr")
library(tidyr)
mydata <- drop_na(mydata)

mydata$FamilySize <- mydata$SibSp + mydata$Parch + 1 

mydata2 <- mydata[, c("Name", "Age", "FamilySize", "Fare", "SurvivedF")]

head(mydata2)
##                                           Name Age FamilySize  Fare SurvivedF
## 1                             Kelly, Mr. James  34          1  7.83        No
## 2             Wilkes, Mrs. James (Ellen Needs)  47          2  7.00       Yes
## 3                    Myles, Mr. Thomas Francis  62          1  9.69        No
## 4                             Wirz, Mr. Albert  27          1  8.66        No
## 5 Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22          3 12.29       Yes
## 6                   Svensson, Mr. Johan Cervin  14          1  9.22        No

C) DESCRIPTIVE STATISTICS

library(psych)
describe(mydata2[, -c(1, 3, 5)])  # describe only the numeric variables Age and Fare
##      vars   n  mean    sd median trimmed   mad min    max  range skew kurtosis
## Age     1 331 30.16 14.11     27   29.54 11.86   0  76.00  76.00 0.45     0.07
## Fare    2 331 40.98 61.23     16   25.63 12.41   0 512.33 512.33 3.26    13.84
##        se
## Age  0.78
## Fare 3.37
summary(mydata2$SurvivedF)
##  No Yes 
## 204 127

The number of people in our sample who survived the Titanic disaster is 127.

50 % of people in our sample are 27 years old or younger; the other 50 % are older.

The maximum fare paid was 512.33 pounds.

On average, people in our sample were 30.16 years old.

D) GRAPHS

hist_age <- hist(mydata2$Age,
                 breaks = 10,
                 col = "lightblue",
                 main = "Histogram of Passenger Age",
                 xlab = "Age in years",
                 ylab = "Count",
                 ylim = c(0, max(hist(mydata2$Age, breaks = 10, plot = FALSE)$counts) * 1.1))  # headroom for count labels
text(x = hist_age$mids,         # bar midpoints
     y = hist_age$counts,       # bar heights
     labels = hist_age$counts,
     pos = 3,                   # place labels above the bars
     cex = 1,
     col = "black")

The histogram shows that the distribution is slightly right-skewed. The most common age group is between 20 and 30 years. The number of people younger than 10 years in our sample is 22, and the number older than 60 is 10.
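
These counts can be verified directly from the data; a quick sketch, assuming the same cut-offs of 10 and 60 years:

sum(mydata2$Age < 10)  # passengers younger than 10
sum(mydata2$Age > 60)  # passengers older than 60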

boxplot(Age ~ SurvivedF, data = mydata,
        main = "Age by survived",
        xlab = "SurvivedF",
        ylab = "Age",
        col = c("orange", "green"),
        yaxt = "n")   # suppress the default axis; custom ticks are added below
axis(side = 2, at = seq(0, max(mydata$Age, na.rm = TRUE), by = 10))

With this boxplot we can see that those who survived and those who did not had a similar age range. The most obvious difference is in the outliers: the outlier among the survivors is above the age of 70, while the outlier among those who did not survive is between the ages of 65 and 70.
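
To read off the outlier ages exactly rather than eyeballing them, boxplot.stats() applies the same 1.5 x IQR rule as the plot; a sketch:

tapply(mydata$Age, mydata$SurvivedF,
       function(x) boxplot.stats(x)$out)  # outlier ages per survival group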

#install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(Age ~ Fare | SurvivedF, 
            ylab = "Age", 
            xlab = "Fare", 
            smooth = FALSE, 
            data = mydata)

The scatterplot shows a weak positive relationship between fare and age, with most passengers clustered at low fares. Survivors are more frequent among those who paid higher fares, suggesting wealthier passengers had a greater chance of survival.
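
The strength of this relationship can be quantified with the correlation coefficient; a quick check (the value itself is not reported here):

cor(mydata$Fare, mydata$Age)  # linear correlation between fare and age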

Task 2

library(readxl)

mydata <- read_excel("R Take Home Exam 2025/Task 2/Business School.xlsx")
head(mydata)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

A) GGPLOT

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "brown") +
  geom_text(stat = "count", aes(label = after_stat(count)),
            vjust = -0.3, size = 4) +
  theme_linedraw() +
  labs(title = "Distribution of Undergraduate Degrees",
       x = "Undergraduate Degree",
       y = "Count")

The most common undergraduate degree is Business.
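
A quick tabulation confirms which degree is the most frequent:

sort(table(mydata$`Undergrad Degree`), decreasing = TRUE)  # counts per degree, largest first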

B) Descriptive statistics of annual salary

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
round(stat.desc(mydata$`Annual Salary`), 2)
##      nbr.val     nbr.null       nbr.na          min          max        range 
## 1.000000e+02 0.000000e+00 0.000000e+00 2.000000e+04 3.400000e+05 3.200000e+05 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 1.090580e+07 1.035000e+05 1.090580e+05 4.150150e+03 8.234800e+03 1.722373e+09 
##      std.dev     coef.var 
## 4.150149e+04 3.800000e-01
library(scales)
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
library(ggplot2)
ggplot(mydata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 50000, fill = "gold", color = "black") +
  scale_x_continuous(labels = scales::comma_format()) +  # avoid scientific notation on the x-axis
  stat_bin(binwidth = 50000, geom = "text", aes(label = after_stat(count)),
           vjust = -0.3, color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary",
       y = "Frequency")

The distribution of Annual Salary is slightly right-skewed. The majority of MBA students have an annual salary between 75,000 EUR and 125,000 EUR. Only one student earns more than 300,000 EUR.
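
These claims can be verified with simple counts; a sketch (geom_histogram may assign the bin edges slightly differently):

sum(mydata$`Annual Salary` >= 75000 & mydata$`Annual Salary` <= 125000)  # students in the 75,000-125,000 range
sum(mydata$`Annual Salary` > 300000)                                     # students above 300,000 EUR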

C) Testing of the Hypothesis

t.test(mydata$`MBA Grade`, mu = 74)
## 
##  One Sample t-test
## 
## data:  mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

At p = 0.00915 (p < 0.05) we reject H0 and conclude that the mean MBA grade of the current generation is significantly different from the previous year’s average of 74. The 95 % confidence interval (74.52, 77.56) does not contain 74, which agrees with this conclusion.

# effect size r from the t statistic and degrees of freedom: r = sqrt(t^2 / (t^2 + df))
R2 <- 2.6587^2 / (2.6587^2 + 99)

r <- sqrt(R2)
print(r)
## [1] 0.2581521

The effect size (r = 0.258) shows that the difference between this year’s and last year’s average MBA grade is small to moderate. This means the improvement is real and statistically significant, but in practical terms the increase is modest rather than large.
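
The same effect size can be computed from the stored test object, which avoids retyping the t value by hand; a minimal sketch:

tt <- t.test(mydata$`MBA Grade`, mu = 74)
unname(sqrt(tt$statistic^2 / (tt$statistic^2 + tt$parameter)))  # r from t and df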

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_excel("~/R/Bootcamp2025/R Take Home Exam 2025/Task 3/Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
mydata2 <- mydata[, -c(4, 5)]
head(mydata2)
## # A tibble: 6 × 5
##     Age Distance Price ParkingF BalconyF
##   <dbl>    <dbl> <dbl> <fct>    <fct>   
## 1     7       28  1640 No       Yes     
## 2    18        1  2800 Yes      No      
## 3     7       28  1660 No       No      
## 4    28       29  1850 No       Yes     
## 5    18       18  1640 Yes      Yes     
## 6    28       12  1770 No       Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

At p = 0.004731 (p < 0.05) we reject H0 and conclude that the average price differs from 1900 EUR; since the 95 % confidence interval (1937.44, 2100.44) lies entirely above 1900, the average price per m2 is significantly higher.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, 
            data = mydata2)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata2$Age, mydata2$Price)
## [1] -0.230255

Explanation:

  • If the age of an apartment increases by 1 year, the price per m2 will on average decrease by 8.975 EUR. (regression coefficient)

  • 5.3 % of the variability of the price of an apartment per m2 is explained by the linear effect of the age of the apartment; see the check below. (coefficient of determination)

  • The relationship between age and price is negative and weak (r = -0.230). (coefficient of correlation)
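
In a simple regression the coefficient of determination is just the squared coefficient of correlation; a one-line check:

cor(mydata2$Age, mydata2$Price)^2  # (-0.230255)^2 = 0.053, matching Multiple R-squared of fit1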

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(car)
scatterplotMatrix(mydata2[, c(3, 2, 1)],  # Price, Distance, Age
                  smooth = FALSE)

Based on the scatterplot matrix, I would not say there is a problem with multicollinearity: the explanatory variables Age and Distance show no clear linear relationship with each other.
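
The pairwise correlation between the two explanatory variables backs this up; a quick check (the value is not reported here):

cor(mydata2$Age, mydata2$Distance)  # correlation between the explanatory variables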

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, 
           data = mydata2)
options(scipen = 999)   # suppress scientific notation in the output

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with the VIF statistic. Explain the findings.

library(car)
max(vif(fit2))  # max() shows a single value instead of two identical ones
## [1] 1.001845

Based on the VIF statistic, which is < 5, we conclude that there is no problem with multicollinearity.
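
For intuition, the VIF of a predictor equals 1 / (1 - R^2) from regressing that predictor on the remaining predictors; a sketch reproducing the value for Age:

1 / (1 - summary(lm(Age ~ Distance, data = mydata2))$r.squared)  # VIF for Age by hand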

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

std_resid <- rstandard(fit2)     # standardized residuals
cooks_d <- cooks.distance(fit2)  # Cook's distances

# flag units with |standardized residual| > 3 or Cook's distance above the 4/n rule of thumb
outliers <- which(abs(std_resid) > 3 | cooks_d > 4/length(cooks_d))

mydata3 <- mydata2[-outliers, ]
summary(mydata3)
##       Age           Distance         Price      ParkingF BalconyF
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400   No :40   No :46  
##  1st Qu.:12.00   1st Qu.: 4.75   1st Qu.:1702   Yes:40   Yes:34  
##  Median :18.00   Median :12.50   Median :1930                    
##  Mean   :18.54   Mean   :13.89   Mean   :2008                    
##  3rd Qu.:24.00   3rd Qu.:19.25   3rd Qu.:2275                    
##  Max.   :45.00   Max.   :40.00   Max.   :2820
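
A quick count of the flagged units (the group sizes in the summary above imply that 80 of the original 85 units remain, i.e. 5 were removed):

length(outliers)               # number of flagged units
nrow(mydata2) - nrow(mydata3)  # units removed, should match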

Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings.

std_resid <- rstandard(fit2)
std_fitted <- scale(fitted(fit2))

plot(std_fitted, std_resid,
     main = "Standardized Residuals vs. Standardized Fitted Values",
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     pch = 19, col = "blue")
abline(h = 0, col = "purple", lwd = 2)

The scatterplot shows residuals randomly spread around zero with no clear pattern or funnel shape, suggesting no evidence of heteroskedasticity.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit2)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 4.4439, df = 2, p-value = 0.1084

Formal testing agrees: at p = 0.1084 (p > 0.05) we fail to reject the null hypothesis, so there is no statistical evidence of heteroskedasticity.

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

# histogram of the standardized residuals of fit2
hist(std_resid, col = "darkgreen",
     main = "Histogram of Standardised Residuals")

shapiro.test(residuals(fit2))  
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit2)
## W = 0.95366, p-value = 0.00398

Since p = 0.00398 (p < 0.05), we reject H0 and conclude that the errors are not normally distributed.
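
Besides the histogram, a normal Q-Q plot is the usual graphical check; a minimal sketch:

qqnorm(std_resid, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "red")  # points deviating from this line indicate non-normality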

Estimate fit2 again without the potentially problematic units and show the summary of the model. Explain all coefficients.

fit2b <- lm(Price ~ Age + Distance,
            data = mydata3)  # refit without the excluded units; a new name keeps fit2 (full data) available for the anova below
summary(fit2b)

If the age of an apartment increases by 1 year, the price per m2 on average falls by 8.674 EUR. If the distance to the city center increases by 1 km, the price per m2 on average decreases by 24.063 EUR.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, 
           data = mydata2)

With the anova function, check if model fit3 fits the data better than model fit2.

anova(fit2, fit3)  
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on p = 0.01007 (p < 0.05) we reject the null hypothesis and conclude that fit3 fits the data significantly better than fit2, so ParkingF and BalconyF together add meaningful explanatory power.

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis that is being tested with the F-statistic shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFYes  196.168     62.868   3.120  0.00251 ** 
## BalconyFYes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11
  • Apartments with parking are on average 196.168 EUR per m2 more expensive than apartments without parking, holding all other variables constant (p = 0.00251, statistically significant).

  • Apartments with a balcony are on average 1.935 EUR per m2 more expensive than apartments without a balcony, holding all other variables constant; this effect is not statistically significant (p = 0.97436).

F-statistic (reproduced with anova below):

  • H0: All regression coefficients are equal to 0.

  • H1: At least one regression coefficient is different from 0.
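
The overall F-test compares fit3 against an intercept-only model; a minimal sketch that should reproduce the F value of 20.03 from the summary output:

fit0 <- lm(Price ~ 1, data = mydata2)  # intercept-only (null) model
anova(fit0, fit3)                      # overall F-test of fit3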

Save fitted values and calculate the residual for the apartment with ID 2.

mydata2$residuals <- residuals(fit3)   # save residuals of fit3
mydata2$fitted <- fitted.values(fit3)  # save fitted values of fit3

apt2_fitted <- mydata2$fitted[2]
apt2_residual <- mydata2$residuals[2]

# apartment with ID 2
round(apt2_fitted, 2)
##       2 
## 2357.41
round(apt2_residual, 2)
##      2 
## 442.59
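
As a sanity check, the fitted value plus the residual should reproduce the observed price of apartment 2 (2800, see the head of the data above):

round(apt2_fitted + apt2_residual, 2)  # 2357.41 + 442.59 = 2800, the observed Price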