Data set consists of 418 units at the start and then 331 after removing those with missing data.
Population are all passengers that were on Titanic and sample is those 418 passenger selected for analysis.
One unit is one passenger.
Variables:
12 variables and 5 after filtering them.
Numerical ratio variables: “Age”, “SibSp”, “Fare”
Categorical nominal variable: “Survived”, “Name”,
Variable Meaning:
SurvivedF: Indicates if the passenger survived (1) or died (0) during the Titanic disaster.
Name: Full name of the passenger.
Age: of the passenger in years.
SibSp: Number of siblings or spouses traveling with the passenger. We use it to calculate Family Size.
Parch: Number of parents or children traveling with the passenger. We use it to calculate FamilySize.
Fare: Price of the passenger’s ticket in pounds.
mydata <- read.table("~/R/Bootcamp2025/R Take Home Exam 2025/tested.csv",
header = TRUE,
sep = ",",
dec = ".")
mydata$Fare <- round(mydata$Fare, 2)
mydata$Age <- round(mydata$Age, 0)
mydata$SurvivedF <- factor (mydata$Survived,
levels = c(0, 1),
labels = c("No", "Yes"))
#install.packages("tidyr")
library(tidyr)
mydata <- drop_na(mydata)
mydata$FamilySize <- mydata$SibSp + mydata$Parch + 1
mydata2 <- mydata[, c("Name", "Age", "FamilySize","Fare", "SurvivedF")]
head(mydata2)
## Name Age FamilySize Fare SurvivedF
## 1 Kelly, Mr. James 34 1 7.83 No
## 2 Wilkes, Mrs. James (Ellen Needs) 47 2 7.00 Yes
## 3 Myles, Mr. Thomas Francis 62 1 9.69 No
## 4 Wirz, Mr. Albert 27 1 8.66 No
## 5 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 22 3 12.29 Yes
## 6 Svensson, Mr. Johan Cervin 14 1 9.22 No
library(psych)
describe(mydata2[ , -c(1, 3,5)])
## vars n mean sd median trimmed mad min max range skew kurtosis
## Age 1 331 30.16 14.11 27 29.54 11.86 0 76.00 76.00 0.45 0.07
## Fare 2 331 40.98 61.23 16 25.63 12.41 0 512.33 512.33 3.26 13.84
## se
## Age 0.78
## Fare 3.37
summary(mydata2$SurvivedF)
## No Yes
## 204 127
Number of people in our sample that survived the Titanic disaster is 127.
50 % of people in our sample are 27 years old or younger, the other 50 % are older.
The maximum fare that was paid for the Titanic was 512.33 punds.
On average people in our sample were 30.16 years old.
hist_fare <- hist(mydata2$Age,
breaks = 10,
col = "lightblue",
main = "Histogram of Passenger Age",
xlab = "Age in years",
ylab = "Count",
ylim = c(0, max(hist(mydata2$Fare, plot=FALSE)$counts) * 1.1))
text(x = hist_fare$mids,
y = hist_fare$counts,
labels = hist_fare$counts,
pos = 3,
cex = 1,
col = "black")
Histogram shows that distribution is slightly asymetrical to the right.The most cmmon age group is between 20 and 30 years. Number of people that are younger than 10 years in our sample is 22. Number of people above 60 is 10.
boxplot(Age ~ SurvivedF, data = mydata,
main = "Age by survived",
xlab = "SurvivedF",
ylab = "Age",
col = c("orange","green"))
axis(side = 2, at = seq(0, max(mydata$Age, na.rm=TRUE), by = 10))
With this boxplot we can see that those who survived and those who did not had similar range of age. The difference that is the most obvious one is outlier. Outlier of those who did survive Titanic is above the age of 70, while from the ones who did not surve is between the age of 65 and 70.
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(Age ~ Fare | SurvivedF,
ylab = "Age",
xlab = "Fare",
smooth = FALSE,
data = mydata)
The scatterplot shows a weak positive relationship between fare and age, with most passengers clustered at low fares. Survivors are more frequent among those who paid higher fares, suggesting wealthier passengers had a greater chance of survival.
library(readxl)
mydata <- read_excel("R Take Home Exam 2025/Task 2/Business School.xlsx")
head(mydata)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "brown") +
geom_text(stat = "count", aes(label = ..count..),
vjust = -0.3, size = 4) +
theme_linedraw() +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree",
y = "Count")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The most common degree is Business degree.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(mydata$`Annual Salary`), 2)
## nbr.val nbr.null nbr.na min max range
## 1.000000e+02 0.000000e+00 0.000000e+00 2.000000e+04 3.400000e+05 3.200000e+05
## sum median mean SE.mean CI.mean.0.95 var
## 1.090580e+07 1.035000e+05 1.090580e+05 4.150150e+03 8.234800e+03 1.722373e+09
## std.dev coef.var
## 4.150149e+04 3.800000e-01
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
library(ggplot2)
ggplot(mydata, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 50000, fill = "gold", color = "black") +
scale_x_continuous(labels = scales::comma_format()) + #i used it so that we don't have scientific notation of numbers
stat_bin(binwidth = 50000, geom = "text", aes(label = ..count..),
vjust = -0.3, color = "black") +
theme_minimal() +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary",
y = "Frequency")
Distribution of Annual Salary is slightly asymetrical to the right. The majority of MBA Students have Annual salary between 75.000EUR and 125.000 EUR. Just 1 has over 300.000EUR.
t.test(mydata$`MBA Grade`, mu = 74)
##
## One Sample t-test
##
## data: mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
At the p= 0.00915, which means p<0.05 we reject H0, meaning that MBA grade of the current generation significantly different from the previous year’s average of 74.
R2 <- 2.6587^2 / (2.6587^2 + 99)
r <- sqrt(R2)
print(r)
## [1] 0.2581521
The effect size (r = 0.258) shows that the difference between this year’s and last year’s average MBA grade is small to moderate. This means the improvement is real and statistically significant, but in practical terms the increase is modest rather than large.
library(readxl)
mydata <- read_excel("~/R/Bootcamp2025/R Take Home Exam 2025/Task 3/Apartments.xlsx")
Description:
mydata$ParkingF <- factor (mydata$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata$BalconyF <- factor (mydata$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata2 <- mydata[, -c(4, 5)]
head(mydata2)
## # A tibble: 6 × 5
## Age Distance Price ParkingF BalconyF
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
t.test(mydata$Price, mu = 1900)
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
At the p=0.004731, which means p<0.05, we reject H0 and conclude that avergae price is different from 1900EUR.
fit1 <- lm(Price ~ Age,
data = mydata2)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata2$Age, mydata2$Price)
## [1] -0.230255
Explanation:
If the age of an apartment increases by 1 year, price per m2 will on average decrease by 8.975. (Regression coef)
5,3 % of variability of the price of an apartmen per m2 is described by linear effect of an age of an apartment.(coef. of determination)
Relationship between age and price is negative and weak. (Coeff. of corelation)
library(car)
scatterplotMatrix(mydata2 [ , c(3, 2, 1)],
smooth = FALSE)
Based on scatterplot matrix I would not say that there is a problem with multicolinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydata2)
options(scipen = TRUE) #scientific notation
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
max(vif(fit2)) #i added max so it only shows one number and not two identicall
## [1] 1.001845
Based on the VIS statistics, which is <5, we conclude there is not a problem with multicolinearity.
std_resid <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)
outliers <- which(abs(std_resid) > 3 | cooks_d > 4/length(cooks_d))
mydata3 <- mydata2[-outliers, ]
summary(mydata3)
## Age Distance Price ParkingF BalconyF
## Min. : 1.00 Min. : 1.00 Min. :1400 No :40 No :46
## 1st Qu.:12.00 1st Qu.: 4.75 1st Qu.:1702 Yes:40 Yes:34
## Median :18.00 Median :12.50 Median :1930
## Mean :18.54 Mean :13.89 Mean :2008
## 3rd Qu.:24.00 3rd Qu.:19.25 3rd Qu.:2275
## Max. :45.00 Max. :40.00 Max. :2820
std_resid <- rstandard(fit2)
std_fitted <- scale(fitted(fit2))
plot(std_fitted, std_resid,
main = "Standardized Residuals vs. Standardized Fitted Values",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "blue")
abline(h = 0, col = "purple", lwd = 2)
The scatterplot shows residuals randomly spread around zero with no clear pattern or funnel shape, suggesting no evidence of heteroskedasticity.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit2)
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 4.4439, df = 2, p-value = 0.1084
Even formal testing of our model shows (p>0.05) that we fail to reject the null hypothesis. There is no statistical evidence of heteroskedasticity.
#residuals for fit2
hist(std_resid, col= "darkgreen",
main = "Histogram of Standardised Residuals")
shapiro.test(residuals(fit2))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit2)
## W = 0.95366, p-value = 0.00398
At the p<0.05 we reject H0 and conclude that errors are not normally distributed.
fit2 <- lm(Price ~ Age + Distance,
data = mydata2)
options(scipen = TRUE) #scientific notation
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
If the age of an apartment increase by 1 year, the price of an apartment on average falls for 8.674 per m2. If the distance to the center increases by 1 km, the price of an apartment will on average decrease by 24.063 per m2.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = mydata2)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on p<0,05 we reject null hypothesis and conclude that fit3 fits significantly better than fit2 (so ParkingF and BalconyF together add meaningful explanatory power).
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFYes 196.168 62.868 3.120 0.00251 **
## BalconyFYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
options(scipen = TRUE) #scientific notation
Apartments with parking are on average for 196.168 more expensive per m2 than apartments without parking, holding all other variables constant.
Apartments with a balcony are on average for 1.935 more expensive per m2 than apartments without a balcony, holding all other variables constant.
F-statistics>:
H0: All regression coefficients are equal 0
H1: All regression coefficients are not equal to 0
mydata2$residuals <- residuals(fit3)
mydata2$fitted <- fitted.values(fit3)
apt2_fitted <- mydata2$fitted[2]
apt2_residual <- mydata2$residuals[2]
#Apartment2
round(apt2_fitted, 2)
## 2
## 2357.41
round(apt2_residual, 2)
## 2
## 442.59