library(readr) #Load library
airstat <- read_csv("R data/airline_passenger_satisfaction.csv") #import file
## Rows: 129880 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Customer Type, Type of Travel, Class, Satisfaction
## dbl (19): ID, Age, Flight Distance, Departure Delay, Arrival Delay, Departur...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(airstat, 5) #show first 5 rows
## # A tibble: 5 × 24
## ID Gender Age `Customer Type` `Type of Travel` Class `Flight Distance`
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Male 48 First-time Business Business 821
## 2 2 Female 35 Returning Business Business 821
## 3 3 Male 41 Returning Business Business 853
## 4 4 Male 50 Returning Business Business 1905
## 5 5 Female 49 Returning Business Business 3470
## # ℹ 17 more variables: `Departure Delay` <dbl>, `Arrival Delay` <dbl>,
## # `Departure and Arrival Time Convenience` <dbl>,
## # `Ease of Online Booking` <dbl>, `Check-in Service` <dbl>,
## # `Online Boarding` <dbl>, `Gate Location` <dbl>, `On-board Service` <dbl>,
## # `Seat Comfort` <dbl>, `Leg Room Service` <dbl>, Cleanliness <dbl>,
## # `Food and Drink` <dbl>, `In-flight Service` <dbl>,
## # `In-flight Wifi Service` <dbl>, `In-flight Entertainment` <dbl>, …
airstat2 <- airstat[-c(31:129880), -c(4, 6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23)] #Keep the first 30 rows and drop 15 columns
head(airstat2, 30) #Show the first 30 rows of the narrowed-down dataset
## # A tibble: 30 × 9
## ID Gender Age `Type of Travel` `Flight Distance` `Departure Delay`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 Male 48 Business 821 2
## 2 2 Female 35 Business 821 26
## 3 3 Male 41 Business 853 0
## 4 4 Male 50 Business 1905 0
## 5 5 Female 49 Business 3470 0
## 6 6 Male 43 Business 3788 0
## 7 7 Male 43 Business 1963 0
## 8 8 Female 60 Business 853 0
## 9 9 Male 50 Business 2607 0
## 10 10 Female 38 Business 2822 13
## # ℹ 20 more rows
## # ℹ 3 more variables: `Arrival Delay` <dbl>, `Food and Drink` <dbl>,
## # Satisfaction <chr>
airstat2$"Meal Satisfaction" <- cut(airstat2$`Food and Drink`,
breaks = c(-Inf,2,5, Inf),
labels = c("Unsatisfied", "OK", "Satisfied"),
right = FALSE) #Create 3 order classes and create Meal Satisfaction variable from Food and Drink variable
airstat2 <- airstat2[,-8] #Delete Food and Drink variable
Description
I transformed the Food and Drink variable into a categorical variable called Meal Satisfaction with three ordered classes: Unsatisfied (scores 1–2), OK (scores 3–4), and Satisfied (score 5).
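To confirm the recoding assigns each score to the intended class, a quick cross-tabulation can be run before the original column is deleted; this is a minimal sketch assuming the objects above are still in the workspace:
table(airstat2$`Food and Drink`, airstat2$`Meal Satisfaction`) #Each score should fall in exactly one class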
names(airstat2)[8] <- "Overall Satisfaction" #Change Satisfaction into Overall Satisfaction
library(pastecs) #Load library
round(stat.desc(airstat2$`Flight Distance`), 2) #Provide descriptive statistics rounded to 2 decimal places
## nbr.val nbr.null nbr.na min max range
## 30.00 0.00 0.00 421.00 3788.00 3367.00
## sum median mean SE.mean CI.mean.0.95 var
## 47782.00 853.00 1592.73 201.29 411.68 1215498.55
## std.dev coef.var
## 1102.50 0.69
median(airstat$Age) #Provide median for Age in original dataset
## [1] 40
median(airstat2$Age) #Provide median for Age in adjusted dataset
## [1] 48
min(airstat2$Age) #Minimum Age in adjusted dataset
## [1] 9
Description
Because I wanted as many descriptive statistics as possible to help me understand the Flight Distance variable, I used stat.desc. Reading the output left to right, the values are: the number of observations (30), the number of zero values (0), the number of missing values (0), the minimum (421), the maximum (3788), the range (3367), the sum (47782), the median (853), the mean (1592.73), the standard error of the mean (201.29), the half-width of the 95% confidence interval of the mean (411.68), the variance (1215498.55), the standard deviation (1102.50) and the coefficient of variation (0.69).
I calculated the median Age in both the original and the adjusted dataset. The medians differ by 8 years, which shows that the 30-row subset is not representative of the full sample.
Lastly, I calculated the minimum Age of a traveler (9 years).
boxplot(airstat2$Age,
main = "Age distribution in dataset", # Title of the plot
ylab = "Age", # Label for y-axis
col = "gray", # Color of the box
border = "black") # Border color
Explanation
The generated boxplot represents the distribution of age of airline passengers in the second dataset. The box shows the middle 50% of the data (its lower and upper edges are the 1st and 3rd quartiles). The whiskers extend to data points within 1.5 times the interquartile range of the box; points outside this range are outliers (four in total). The distribution is fairly balanced, with a few outliers.
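The outlier count read off the plot can be verified numerically; a minimal sketch using base R's boxplot.stats, assuming airstat2 from above:
length(boxplot.stats(airstat2$Age)$out) #Number of points beyond the 1.5*IQR whiskers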
hist(airstat2$`Flight Distance`, #Histogram from 2nd dataset for Flight Distance (base graphics, no extra library needed)
     ylab = "Number of passengers", #y-axis label
     xlab = "Distance traveled (miles)", #x-axis label
     main = "Flight distance", #Histogram title
     breaks = seq(from = 300, to = 3900, by = 300)) #Bins from 300 to 3900 with width 300
hist(airstat$`Flight Distance`, #Histogram from original dataset for Flight Distance
     ylab = "Number of passengers",
     xlab = "Distance traveled (miles)",
     main = "Flight distance - large dataset",
     breaks = seq(from = 0, to = 5100, by = 300), #Bins from 0 to 5100 with width 300
     col = "green", #Fill color of the bars
     border = "red") #Outline color of the bars
Interpretation of histograms
Flight distance: most flight distances in the conveniently chosen smaller dataset (airstat2) fall between 500 and 1000 miles; the bars then decrease in frequency as distance increases.
Flight distance - large dataset: most flights in the original dataset (airstat) fall in the 0 to 1000 mile range, with a gradual decrease in frequency as distance increases.
By comparing the two histograms, I aimed to determine how the convenience sample differs from the original dataset. The original dataset shows a much wider range of distances and greater variability in flight distances (and passenger counts).
The histogram generated from the original dataset is clearly positively skewed (to the right), which cannot be said with as much confidence for the histogram generated from the chosen data.
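A simple numeric check of this skew is to compare the mean with the median: in right-skewed data the mean lies well above the median. A minimal sketch, assuming both datasets from above:
mean(airstat$`Flight Distance`) - median(airstat$`Flight Distance`) #Original dataset
mean(airstat2$`Flight Distance`) - median(airstat2$`Flight Distance`) #30-row subset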
library(ggplot2) #Load library
ggplot(airstat2, aes(y = `Arrival Delay`, x = `Departure Delay`)) + #Define x and y axis
geom_point() + #Add scatter points
geom_smooth(method = "lm", se = FALSE, color = "blue") #Linear regression line without confidence intervals in blue colour
## `geom_smooth()` using formula = 'y ~ x'
Interpretation
The plot shows a positive relationship between Departure Delay and Arrival Delay: an increase in Departure Delay is associated with an increase in Arrival Delay (shown most clearly by the blue line). Most of the points cluster close to the line, although there are some outliers.
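The strength of this linear relationship can be quantified with the correlation coefficient; a minimal sketch, with use = "complete.obs" added as a precaution in case Arrival Delay contains missing values:
cor(airstat2$`Departure Delay`, airstat2$`Arrival Delay`, use = "complete.obs") #Close to 1 = strong positive linear relationship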
library(readxl) #Load library
MBA <- read_excel("C:/Users/janve/Downloads/R Take Home Exam 2025/R Take Home Exam 2025/Task 2/Business School.xlsx") #Import data
head(MBA, 5) #Display first five rows of the data set
## # A tibble: 5 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
library(ggplot2) #Load library
ggplot(MBA, aes(x = `Undergrad Degree`)) + #Initialize plot with Undergrad Degree on x
geom_bar(fill="orange") + #Orange bars in bar chart
labs(title = "Distribution of Undergraduate Degrees", #Title and axis lables
x = "Undergraduate Degree",
y = "Number of Students")+
theme_minimal() #Minimal theme
Description
The most common degree prior to attending the MBA is a Business degree, followed by Computer Science and Finance (tied with the same number of students), then Engineering, and lastly Art with the fewest students.
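The exact counts behind the bar chart can be read off with a frequency table; a minimal sketch assuming the MBA data from above:
sort(table(MBA$`Undergrad Degree`), decreasing = TRUE) #Degree counts in descending order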
library(pastecs)
round(stat.desc(MBA$`Annual Salary`), 0) #Descriptive statistics rounded to whole numbers
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(MBA, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 20000, colour = "pink", fill = "green", alpha = 0.5) +
labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count") +
scale_x_continuous(labels = scales::comma_format())
Descriptive statistics explanation
The sample contains 100 salaries with no zero or missing values, ranging from €20,000 to €340,000. The mean (€109,058) lies above the median (€103,500), which already hints at a right-skewed distribution. The standard deviation is €41,501; the coefficient of variation (std.dev / mean ≈ 0.38) displays as 0 only because the output is rounded to whole numbers.
Histogram explanation
The histogram is positively skewed (to the right), with the highest concentration of salaries around €100,000. Outliers in the upper tail pull the mean to the right of the median.
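The visual impression of skew can be backed with a numeric skewness estimate; a minimal sketch, assuming the e1071 package is installed (it is not used elsewhere in this analysis):
library(e1071) #Assumed to be installed; provides a sample skewness function
skewness(MBA$`Annual Salary`) #A clearly positive value confirms the right skew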
library(effectsize)
mean(MBA$`MBA Grade`) #MBA grade mean
## [1] 76.04055
sd(MBA$`MBA Grade`) #Standard deviation of MBA grade
## [1] 7.675114
t.test(MBA$`MBA Grade`, mu=74) #One sample t-test (Ho: mean MBA grade = 74)
##
## One Sample t-test
##
## data: MBA$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
cohens_d(MBA$`MBA Grade`, mu=74) #Effect size through Cohen's d
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
Explanation
I ran the t-test with the hypothesis that the average MBA grade of this year's students equals 74 (the average grade of the previous year).
Null hypothesis (Ho): the average MBA grade is 74. Alternative hypothesis (H1): the average MBA grade differs from 74.
The t-test results are:
t-statistic: 2.66 - this year's average grade (76.04) lies 2.66 standard errors above the previous year's average of 74.
p-value: 0.00915 (p < 0.05) - the difference between this year's average grade and 74 is statistically significant, so we reject the null hypothesis.
95 percent confidence interval: 74.51764 to 77.56346 - the range in which we believe the true average MBA grade lies. Since 74 is not in this range, we can confidently say the average grade differs from 74.
sample estimates: mean of x: 76.04055 - the average MBA grade this year was 76.04.
I also checked the effect size with Cohen's d, which shows whether the difference is practically significant and not just statistically significant. With Cohen's d = 0.27, the effect is small: the change in the average grade is statistically significant but slight.
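As a cross-check, Cohen's d for a one-sample comparison is the standardized distance between the sample mean and the reference value; a minimal sketch using the values already computed above:
(mean(MBA$`MBA Grade`) - 74) / sd(MBA$`MBA Grade`) #≈ 0.27, matching cohens_d()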
library(readxl) #Load library
Apartments <- read_excel("R Take Home Exam 2025/Task 3/Apartments.xlsx")
head(Apartments, 5) #First 5 rows of table
## # A tibble: 5 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
Description: I converted the Parking and Balcony variables from 0/1 codes into factors with the labels NO and YES, so that R treats them as categorical variables in the models below.
Apartments$Parking <- factor(Apartments$Parking, #Create factor
levels = c(0,1),
labels = c("NO", "YES"))
Apartments$Balcony <- factor(Apartments$Balcony,#Create factor
levels = c(0,1),
labels = c("NO", "YES"))
head(Apartments, 5) #First 5 rows of table
## # A tibble: 5 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 NO YES
## 2 18 1 2800 YES NO
## 3 7 28 1660 NO NO
## 4 28 29 1850 NO YES
## 5 18 18 1640 YES YES
t.test(Apartments$Price, mu=1900) #One sample t-test (Ho: mean price = 1900)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Explanation
The p-value (0.004731) is well below 0.05, so we can reject the null hypothesis, and the confidence interval [1937.44, 2100.44] does not include 1900. Both results support the conclusion that the average price is not €1900. The sample mean price is €2018.94.
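The reported confidence interval can be reproduced by hand from the sample mean, its standard error and the t quantile; a minimal sketch assuming the Apartments data from above (m and se are local helper names):
m  <- mean(Apartments$Price) #Sample mean
se <- sd(Apartments$Price) / sqrt(length(Apartments$Price)) #Standard error of the mean
m + c(-1, 1) * qt(0.975, df = length(Apartments$Price) - 1) * se #≈ [1937.44, 2100.44]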
fit1 <- lm(Price ~ Age, data = Apartments) #Simple linear regression saved in fit1
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanation
Regression coefficient: the slope of -8.975 indicates that, on average, each additional year of Age decreases the price of a m² of an apartment by €8.975.
Coefficient of correlation: taking the square root of the Multiple R-squared value gives r ≈ 0.23 (negative, since the slope is negative), indicating a weak linear relationship between Age and Price.
Coefficient of determination: only 5.3% of the variance in Price is explained by Age, so Age accounts for a small proportion of the variation in apartment prices.
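The correlation can also be computed directly, which confirms both its size and its sign; a minimal sketch assuming fit1 and Apartments from above:
cor(Apartments$Price, Apartments$Age) #≈ -0.23; the sign matches the negative slope
sqrt(summary(fit1)$r.squared) #≈ 0.23; the magnitude matches sqrt(R-squared)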
pairs(Apartments[, c("Price", "Age", "Distance")], #Scatterplot matrix for 3 variables
main = "Scatterplot Matrix - Price, Age, Distance",
pch = 20)
Explanation: The scatterplot matrix shows no strong linear relationship between Age and Distance, so there is no obvious multicollinearity problem.
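A correlation matrix gives a numeric complement to the visual check; a minimal sketch on the same three variables:
cor(Apartments[, c("Price", "Age", "Distance")]) #Pairwise correlations; the Age-Distance entry is the one relevant to multicollinearity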
fit2 <- lm(Price ~ Age + Distance, data = Apartments) #Multiple regression saved in fit2
#install.packages("car") #install package (disabled after installation)
library(car)
## Loading required package: carData
vif(fit2) #Multicollinearity check with VIF
## Age Distance
## 1.001845 1.001845
Explanation: The VIF values for Age and Distance are very close to 1, which indicates no significant multicollinearity issue in this regression model.
stand.res.fit2 <- rstandard(fit2) #Calculate standardized residuals
cook.dis.fit2 <- cooks.distance(fit2) #Calculate Cook's distances
problematic <- which(abs(stand.res.fit2) > 2.5 | cook.dis.fit2 > 1) #Flag outliers (|standardized residual| > 2.5) or highly influential units (Cook's distance > 1)
print(problematic) #Print the units that could potentially be problematic
## 38
## 38
Explanation: I calculated the standardized residuals and Cook's distances and flagged apartments with an absolute standardized residual above 2.5 or a Cook's distance above 1. There is only one such unit - the 38th row of Apartments.
Apartments.1 <- Apartments[-problematic, ] #Delete the 38th apartment (index stored in problematic) and save as Apartments.1
fit2.1 <- lm(Price ~ Age + Distance, data = Apartments.1) #New model without the flagged unit, saved in fit2.1
library(car)
vif(fit2.1) #New VIF multicollinearity check
## Age Distance
## 1.008869 1.008869
Explanation: I deleted the 38th row (indexed by problematic) and fitted a new model, fit2.1, which I again checked for multicollinearity (the VIF values remain close to 1, so still no issue).
stand.res.fit2.1 <- rstandard(fit2.1) #Standardized residuals without the 38th apartment
stand.fit2.1 <- scale(fitted(fit2.1)) #Standardized fitted values without the 38th apartment
plot(stand.fit2.1, stand.res.fit2.1, #Create scatterplot
xlab = "Standardized fitted values",
ylab = "Standardized residuals",
main = "Residuals vs Fitted values (heteroskedasticity check)",
pch = 20, col = "blue")
abline(h = 0, col = "red", lwd = 2)
Explanation: The scatterplot shows that the points are spread fairly evenly around zero without a clear pattern. There is no funnel shape, which suggests that the spread of the errors is roughly constant. This means the variance of the errors is approximately constant, with no clear signs of heteroskedasticity in the model.
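A formal complement to this visual check is the Breusch-Pagan test; a minimal sketch, assuming the lmtest package is installed (it is not used elsewhere in this analysis):
library(lmtest) #Assumed to be installed
bptest(fit2.1) #Breusch-Pagan test; a p-value above 0.05 is consistent with constant error variance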
hist(stand.res.fit2.1, #Histogram of standardized residuals from fit2.1
breaks = 10,
main = "Histogram of standardized residuals",
xlab = "Standardized residuals",
col = "navy")
qqnorm(stand.res.fit2.1, # Q-Q plot of standardized residuals from fit2.1
main = "Q-Q Plot of standardized residuals")
qqline(stand.res.fit2.1, col = "red", lwd = 2)
shapiro.test(stand.res.fit2.1) #Formal test with Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: stand.res.fit2.1
## W = 0.9565, p-value = 0.00636
Explanation
Histogram: most residuals cluster around 0, but the histogram is not symmetric; it appears negatively skewed.
Q-Q Plot: the points follow the reference line in the middle but deviate at both ends, so the distribution has heavier tails than a normal distribution.
Shapiro-Wilk test: p = 0.00636 is below 0.05, so we reject Ho (Ho: the residuals are normally distributed). This confirms that the residuals deviate significantly from normality.
fit2 <- lm(Price ~ Age + Distance, data = Apartments) #Re-estimate the original fit2 on the full dataset
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Explanation
Both Age and Distance have statistically significant negative effects on Price (p = 0.016 and p < 0.001, respectively). Together they explain about 44% of the variance in apartment prices (Multiple R-squared = 0.4396), a large improvement over the simple model with Age alone (R-squared = 0.053).
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments) #Extended regression saved in fit3
anova(fit2, fit3) #Test whether fit3 fits the data better than fit2 with an ANOVA F-test
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation: The p-value (0.01007) is below 0.05, so we can reject Ho (Ho: adding the variables Parking and Balcony has no effect on the model). This means that fit3 fits the data better than fit2 - adding Parking and Balcony improves the model.
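An alternative way to compare the two models, not part of the original analysis, is an information criterion; a minimal sketch:
AIC(fit2, fit3) #Lower AIC indicates the better trade-off between fit and model complexity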
summary(fit3) #fit3 summary
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYES 196.168 62.868 3.120 0.00251 **
## BalconyYES 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Regression coefficient explanation:
Parking: apartments with parking are priced on average €196.17 per m² higher than apartments without it. The effect is statistically significant (p = 0.0025).
Balcony: apartments with a balcony are priced only €1.935 per m² higher than apartments without one. The p-value of 0.97436 (p > 0.05) indicates that, in this model, having a balcony does not have a significant effect on the price of an apartment.
Hypothesis: Ho: Age, Distance, Parking and Balcony together have no effect on apartment price. H1: at least one of Age, Distance, Parking and Balcony has an effect on apartment price. With F = 20.03 and p = 1.849e-11 (p < 0.05), we reject Ho: the model as a whole is statistically significant.
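The fitted coefficients can also be used for prediction; a minimal sketch for a hypothetical apartment (new_apt and its input values are chosen purely for illustration):
new_apt <- data.frame(Age = 10, Distance = 5, #Hypothetical apartment: Age 10, Distance 5
                      Parking = factor("YES", levels = c("NO", "YES")),
                      Balcony = factor("NO", levels = c("NO", "YES")))
predict(fit3, newdata = new_apt) #≈ 2339.6 per m², from the printed coefficients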
fit3.fitt <- fitted(fit3) #Fitted values for fit3
residuals <- resid(fit3) #Residuals for fit3
fit3.fitt[2] #Fitted value for ID2
## 2
## 2357.411
residuals[2] #Residual value for ID2
## 2
## 442.5889
Explanation: The model predicts that apartment 2 should cost €2357 per m². In reality, the apartment costs €2800 per m², which is €443 higher than predicted. This means the model underestimates the price of this apartment.
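The residual can be verified directly as the difference between the observed and the fitted value; a minimal sketch using the objects above:
Apartments$Price[2] - fit3.fitt[2] #≈ 442.59, matching residuals[2]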