Support session

1- Import the dataset and display the first 6 rows. What is the first thing we do?

employees_data <- read.csv("employee_performance_dataset.csv")

head(employees_data)

##   experience_years training_hours salary performance_score remote_days_per_week
## 1               28             18  43126                 9                    1
## 2               14             19 112194                 9                    4
## 3                7             56  51504                 2                    1
## 4               20             17 102945                 3                    4
## 5               18             46  97363                 5                    0
## 6               22            104  42887                 5                    0
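
A quick check after importing is to confirm the dimensions and column types before doing any analysis. The sketch below is only an illustration: it rebuilds the six rows printed above as stand-in data and writes them to a temporary file, so `demo`, `csv_path`, and `imported` are assumed names and not part of the original session.

```r
# Stand-in data: the six rows printed by head() above, not the full dataset.
demo <- data.frame(
  experience_years     = c(28, 14, 7, 20, 18, 22),
  training_hours       = c(18, 19, 56, 17, 46, 104),
  salary               = c(43126, 112194, 51504, 102945, 97363, 42887),
  performance_score    = c(9, 9, 2, 3, 5, 5),
  remote_days_per_week = c(1, 4, 1, 4, 0, 0)
)

# Write to a temporary CSV so the example does not depend on the real file.
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Import, then inspect size, column types, and the first rows.
imported <- read.csv(csv_path)
dim(imported)
str(imported)
head(imported)
```

With the real file, the same three inspection calls would be applied to `employees_data`.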

2- Descriptive statistics: get the mean, median, sd, min and max for training_hours and salary. Which variable is more dispersed?

colnames(employees_data)

## [1] "experience_years"     "training_hours"       "salary"              
## [4] "performance_score"    "remote_days_per_week"

attach(employees_data)
mean(training_hours)

## [1] 58.186

median(training_hours)

## [1] 59

sd(training_hours)

## [1] 35.58759

min(training_hours)

## [1] 0

max(training_hours)

## [1] 119

mean(salary)

## [1] 86160.5

median(salary)

## [1] 86126.5

sd(salary)

## [1] 18936.99

min(salary)

## [1] 20772

max(salary)

## [1] 142140

# Coefficient of variation (sd / mean): unit-free, so the two variables can be compared
cv.training_hours <- sd(training_hours)/mean(training_hours)
cv.salary <- sd(salary)/mean(salary)

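Training hours are more dispersed than salary: the coefficient of variation (sd/mean) is about 0.61 for training hours (35.59/58.19) versus about 0.22 for salary (18936.99/86160.5). The raw standard deviations cannot be compared directly because the two variables are measured in different units.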
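
The same statistics can be collected in one table instead of calling each function separately. This is a minimal sketch, assuming a hypothetical helper `describe_var` and using only the six rows shown by `head()` as stand-in data, so the numbers it produces will not match the full-dataset output above.

```r
# Stand-in data: the six rows printed by head(), not the full dataset.
demo <- data.frame(
  training_hours = c(18, 19, 56, 17, 46, 104),
  salary         = c(43126, 112194, 51504, 102945, 97363, 42887)
)

# Hypothetical helper: the five requested statistics plus the
# coefficient of variation (sd / mean).
describe_var <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    min = min(x), max = max(x), cv = sd(x) / mean(x))
}

# One column per variable, one row per statistic.
stats <- sapply(demo, describe_var)
round(stats, 3)
```

Comparing the `cv` row across columns answers the dispersion question directly, since the coefficient of variation does not depend on the measurement unit.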

3- Compute the correlation between experience years and salary (Pearson), and the correlation between training hours and performance score (Spearman). Include the p-value in both cases.

cor.test(experience_years, salary)

## 
##  Pearson's product-moment correlation
## 
## data:  experience_years and salary
## t = -0.21634, df = 498, p-value = 0.8288
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09730177  0.07806305
## sample estimates:
##          cor 
## -0.009693899

cor.test(training_hours, performance_score, method= "spearman")

## Warning in cor.test.default(training_hours, performance_score, method =
## "spearman"): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  training_hours and performance_score
## S = 20926274, p-value = 0.9207
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.00446519
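
Both p-values are far above 0.05 and both coefficients are close to zero, so neither test gives evidence of an association. When only the coefficient and p-value are needed, they can be pulled out of the test object. The sketch below is an illustration on the six rows from `head()`, so its values will differ from the output above; it uses the formula interface with `data =` as an alternative to `attach()`, and `exact = FALSE` to request the approximate Spearman p-value directly when ties are present.

```r
# Stand-in data: the six rows printed by head(), not the full dataset.
demo <- data.frame(
  experience_years  = c(28, 14, 7, 20, 18, 22),
  training_hours    = c(18, 19, 56, 17, 46, 104),
  salary            = c(43126, 112194, 51504, 102945, 97363, 42887),
  performance_score = c(9, 9, 2, 3, 5, 5)
)

# Pearson correlation via the formula interface (no attach() needed).
pearson <- cor.test(~ experience_years + salary, data = demo)

# Spearman correlation; exact = FALSE asks for the approximate p-value,
# which is what R falls back to anyway when the data contain ties.
spearman <- cor.test(~ training_hours + performance_score, data = demo,
                     method = "spearman", exact = FALSE)

# Keep only the coefficient and the p-value from each test.
c(r = unname(pearson$estimate), p = pearson$p.value)
c(rho = unname(spearman$estimate), p = spearman$p.value)
```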

4- Create a correlation matrix including the following variables: experience_years, training_hours, performance_score, and salary

a- First, we subset the dataset to the four variables.

cor_vars <- employees_data[ ,c("experience_years", "training_hours", "salary", "performance_score")]

b- Second, we create the correlation matrix (Spearman method).

cor.matrix <- cor(cor_vars, method="spearman")
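
The step above stores the matrix without displaying it. A minimal sketch of building and printing a rounded version follows, again on the six rows from `head()` as stand-in data; `vars` and `cor_mat` are assumed names chosen for this illustration.

```r
# Stand-in data: the six rows printed by head(), not the full dataset.
demo <- data.frame(
  experience_years  = c(28, 14, 7, 20, 18, 22),
  training_hours    = c(18, 19, 56, 17, 46, 104),
  salary            = c(43126, 112194, 51504, 102945, 97363, 42887),
  performance_score = c(9, 9, 2, 3, 5, 5)
)

# Select the four variables by name, then compute Spearman correlations.
vars <- c("experience_years", "training_hours", "salary", "performance_score")
cor_mat <- cor(demo[, vars], method = "spearman")

# Rounding to two decimals makes the matrix easier to read in a report.
round(cor_mat, 2)
```

The result is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.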

5- Identify whether there are any outliers in the variable salary

boxplot(salary)

Yes, salary has outliers: four points fall beyond the boxplot whiskers, that is, below the lower bound or above the upper bound.

To identify which rows are outliers, we compute the bounds with the 1.5 × IQR rule:

Q1 <- quantile(salary,0.25)
Q3 <- quantile(salary, 0.75)
IQR_value <- IQR(salary)
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outliers <- which(salary < lower_bound | salary > upper_bound)
outliers

## [1]  29 102 458 472
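
The 1.5 × IQR rule used above can be wrapped in a small function and cross-checked against `boxplot.stats()`, which returns the points a boxplot would draw as outliers. This is a sketch on a made-up vector, not the salary data; `iqr_outliers` and `x` are hypothetical names. Note that `boxplot.stats()` computes its bounds from hinges instead of `quantile()`, so the two approaches can disagree slightly on borderline values.

```r
# Hypothetical helper: indices of values outside Q1 - k*IQR and Q3 + k*IQR.
iqr_outliers <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  spread <- q3 - q1
  which(x < q1 - k * spread | x > q3 + k * spread)
}

# Made-up values with one obvious extreme point (illustration only).
x <- c(50, 52, 55, 58, 60, 61, 63, 200)

idx <- iqr_outliers(x)
idx       # positions of the outliers
x[idx]    # the outlying values themselves

# Cross-check: the values a boxplot would flag.
boxplot.stats(x)$out
```

With the real data, `employees_data[iqr_outliers(employees_data$salary), ]` would display the full rows for the flagged positions.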

6- Knit your Markdown into HTML
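
Besides the Knit button in RStudio, the document can be rendered from the console with `rmarkdown::render()`. The sketch below assumes the rmarkdown package is installed and uses a hypothetical file name, `support_session.Rmd`; replace it with the actual path to the .Rmd file.

```r
# Hypothetical file name; replace with the path to your own .Rmd file.
rmd_file <- "support_session.Rmd"

# Render only if the package is available and the file exists.
if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Install rmarkdown and check the path to the .Rmd file before rendering.")
}
```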

Mohamad Makki

2026-01-23