library(tidyverse)
library(readr)
library(readxl)
library(lubridate)
library(ggplot2)
library(corrplot)
library(car)
library(psych)
Employee performance management remains a critical challenge within the oil and gas servicing industry due to increasing operational demands, workforce productivity expectations, and rising training investments. This study investigates the relationship between employee performance and selected HR factors including tenure, training hours, attendance rate, and departmental structure.
The dataset used for this analysis consisted of 500 anonymised employee records collected from internal HR operational reports covering the 2024 business year. Exploratory and inferential analytical techniques were applied to identify patterns, relationships, and statistically significant drivers of performance outcomes.
The findings did not support the expected relationships. Performance scores varied very little across the workforce (mean 40.10, standard deviation 0.53), and none of tenure, training hours, or attendance rate showed a statistically significant association with performance. A one-way ANOVA found no significant departmental differences (p = 0.366), the correlations with performance were all close to zero, and the regression model explained less than 1% of the variance in performance scores.
Based on these findings, the study recommends reviewing how employee performance is measured and recorded before further investment in employee development programmes, attendance monitoring systems, or department-specific HR interventions is justified on performance grounds.
I currently work within the HR and administrative function of an oil and gas servicing organisation. My responsibilities involve workforce coordination, employee documentation, training administration, and operational HR support activities.
EDA is operationally relevant because HR datasets often contain missing values, inconsistent records, and outliers that can affect workforce reporting accuracy and management decisions.
Visualisation supports HR reporting by transforming workforce data into understandable insights for managers and executives during performance review discussions and workforce planning meetings.
Hypothesis testing assists management in determining whether observed differences in employee performance across departments are statistically meaningful before implementing policy decisions.
Correlation analysis helps HR teams understand relationships between variables such as training investment, attendance consistency, and employee performance outcomes.
Regression analysis supports predictive workforce planning by estimating how HR variables collectively influence employee performance levels.
The dataset used in this study was extracted from internal HR operational records within the organisation.
Employee identities were anonymised to maintain confidentiality and ethical compliance.
hr_data <- read_xlsx("HR_Data.xlsx")
head(hr_data)
# A tibble: 6 × 7
Employee_ID Department Tenure_Years Training_Hours Attendance_Rate
<chr> <chr> <dbl> <dbl> <dbl>
1 EMP001 HSE 5.9 23.9 93.6
2 EMP002 Operations 10.8 45.1 89.7
3 EMP003 HR 3.1 29.3 87.4
4 EMP004 HR 4.7 NA 86.9
5 EMP005 HSE 4.7 28 81.3
6 EMP006 Operations 8 41 89.3
# ℹ 2 more variables: Performance_Score <dbl>, Observation_Date <chr>
str(hr_data)
tibble [500 × 7] (S3: tbl_df/tbl/data.frame)
$ Employee_ID : chr [1:500] "EMP001" "EMP002" "EMP003" "EMP004" ...
$ Department : chr [1:500] "HSE" "Operations" "HR" "HR" ...
$ Tenure_Years : num [1:500] 5.9 10.8 3.1 4.7 4.7 8 9.3 6.9 1 8 ...
$ Training_Hours : num [1:500] 23.9 45.1 29.3 NA 28 41 29 28.6 32.7 45.9 ...
$ Attendance_Rate : num [1:500] 93.6 89.7 87.4 86.9 81.3 89.3 96.7 91.9 93.9 85.4 ...
$ Performance_Score: num [1:500] 40 40 40 40 40 40 40 40 40 40 ...
$ Observation_Date : chr [1:500] "2024-02-27" "2024-05-20" "2024-04-24" "2024-02-22" ...
summary(hr_data)
 Employee_ID       Department        Tenure_Years      Training_Hours
 Length   :500     Length   :500     Min.   : 0.500    Min.   : 8.40
 N.unique :500     N.unique :  6     1st Qu.: 4.300    1st Qu.:29.00
 N.blank  :  0     N.blank  :  0     Median : 7.550    Median :35.60
 Min.nchar:  6     Min.nchar:  2     Mean   : 7.477    Mean   :35.92
 Max.nchar:  6     Max.nchar: 11     3rd Qu.:10.600    3rd Qu.:42.85
                                     Max.   :20.500    Max.   :65.20
                                                       NAs    :5
 Attendance_Rate   Performance_Score Observation_Date
 Min.   : 79.60    Min.   :38.4      Length   :500
 1st Qu.: 89.55    1st Qu.:39.9      N.unique :276
 Median : 92.40    Median :40.0      N.blank  :  0
 Mean   : 92.56    Mean   :40.1      Min.nchar: 10
 3rd Qu.: 96.40    3rd Qu.:40.3      Max.nchar: 10
 Max.   :100.00    Max.   :44.6
 NAs    :5
Exploratory Data Analysis (EDA) is used to understand the structure, quality, and distribution of data before conducting advanced statistical analysis.
EDA helps HR managers identify inconsistencies, missing records, and unusual employee performance patterns that may influence organisational decisions.
colSums(is.na(hr_data))
      Employee_ID        Department      Tenure_Years    Training_Hours
                0                 0                 0                 5
  Attendance_Rate Performance_Score  Observation_Date
                5                 0                 0
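Beyond counting missing values, a few further checks are often worth running at this stage. The sketch below is illustrative only: it uses a small made-up data frame (`toy`) that borrows three column names from `hr_data`, and the helper name `audit_hr_data` and the 0 to 100 attendance bounds are assumptions rather than part of the original analysis.

```r
# Hypothetical helper: summarise common data-quality problems in an HR table.
audit_hr_data <- function(df) {
  # An explicit format makes unparseable strings return NA instead of an error.
  parsed_dates <- as.Date(df$Observation_Date, format = "%Y-%m-%d")
  list(
    missing_per_column = colSums(is.na(df)),
    duplicate_ids = sum(duplicated(df$Employee_ID)),
    attendance_out_of_range = sum(
      df$Attendance_Rate < 0 | df$Attendance_Rate > 100,
      na.rm = TRUE
    ),
    unparseable_dates = sum(is.na(parsed_dates) & !is.na(df$Observation_Date))
  )
}

# Made-up records with one duplicate ID, one missing value,
# one impossible attendance rate, and one malformed date.
toy <- data.frame(
  Employee_ID = c("EMP001", "EMP002", "EMP002", "EMP004"),
  Attendance_Rate = c(93.6, NA, 104.0, 86.9),
  Observation_Date = c("2024-02-27", "2024-05-20", "not a date", "2024-02-22"),
  stringsAsFactors = FALSE
)

audit <- audit_hr_data(toy)
audit
```

Running the same helper on `hr_data` before imputation would show whether any problems exist beyond the missing values already counted.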
# Replace missing values with the column median
hr_data$Training_Hours[is.na(hr_data$Training_Hours)] <- median(
  hr_data$Training_Hours,
  na.rm = TRUE
)
hr_data$Attendance_Rate[is.na(hr_data$Attendance_Rate)] <- median(
  hr_data$Attendance_Rate,
  na.rm = TRUE
)
# Descriptive statistics for the numeric variables (psych::describe)
describe(hr_data[, c(
  "Tenure_Years",
  "Training_Hours",
  "Attendance_Rate",
  "Performance_Score"
)])
                  vars   n  mean    sd median trimmed   mad  min   max range
Tenure_Years         1 500  7.48  4.13   7.55    7.41  4.67  0.5  20.5  20.0
Training_Hours       2 500 35.91 10.06  35.60   35.80 10.16  8.4  65.2  56.8
Attendance_Rate      3 500 92.56  4.68  92.40   92.71  5.04 79.6 100.0  20.4
Performance_Score    4 500 40.10  0.53  40.00   40.08  0.30 38.4  44.6   6.2
                   skew kurtosis   se
Tenure_Years       0.15    -0.79 0.18
Training_Hours     0.10    -0.13 0.45
Attendance_Rate   -0.25    -0.50 0.21
Performance_Score  2.78    21.50 0.02
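boxplot(
  hr_data$Attendance_Rate,
  main = "Attendance Rate Outliers",
  col = "#A8DADC",
  border = "#1D3557"
)
EDA identified five missing values in each of Training_Hours and Attendance_Rate, which were replaced with the column median so that all 500 records could be retained. The box plot screens Attendance_Rate (range 79.6 to 100.0) for outliers. The descriptive statistics also show that Performance_Score is tightly clustered around 40 (standard deviation 0.53) and strongly right-skewed (skew 2.78, kurtosis 21.50).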
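The two cleaning steps described here, median imputation and outlier screening, can be sketched as small reusable functions. This is a minimal illustration on made-up attendance values, not the HR data. The function names are hypothetical, and the 1.5 × IQR rule uses `quantile()`, which is close to but not identical to the hinges that `boxplot()` draws.

```r
# Replace missing values with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Flag values lying more than 1.5 * IQR beyond the first or third quartile.
flag_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Made-up attendance rates with one missing value and one extreme value.
attendance <- c(93.6, 89.7, 87.4, NA, 81.3, 89.3, 96.7, 55.0)

attendance_filled <- impute_median(attendance)
outlier_flags <- flag_iqr_outliers(attendance_filled)

attendance_filled[outlier_flags]  # values that would need manual review
```

Flagged values are best reviewed rather than deleted automatically, since an unusually low attendance rate may be a genuine record.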
Data visualisation transforms numerical information into visual patterns that support decision-making and business storytelling.
HR management requires visual performance dashboards to identify trends, department-level variations, and workforce productivity patterns.
ggplot(hr_data, aes(x = Performance_Score)) +
geom_histogram(
fill = "#2C3E50",
color = "white",
bins = 15
) +
theme_classic(base_size = 14) +
labs(
title = "Distribution of Employee Performance Scores",
subtitle = "Employee performance across the organisation",
x = "Performance Score",
y = "Frequency"
)
ggplot(
hr_data,
aes(
x = Department,
y = Performance_Score,
fill = Department
)
) +
geom_boxplot(alpha = 0.8) +
theme_classic(base_size = 14) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(
title = "Performance Score by Department",
subtitle = "Departmental comparison of employee performance",
x = "Department",
y = "Performance Score"
)
ggplot(
hr_data,
aes(
x = Training_Hours,
y = Performance_Score
)
) +
geom_point(
color = "#1B9E77",
alpha = 0.7,
size = 3
) +
geom_smooth(
method = "lm",
se = TRUE,
color = "black"
) +
theme_classic(base_size = 14) +
labs(
title = "Training Hours vs Employee Performance",
subtitle = "Relationship between training investment and productivity",
x = "Training Hours",
y = "Performance Score"
)
ggplot(
hr_data,
aes(
x = Attendance_Rate,
y = Performance_Score
)
) +
geom_point(
color = "#6A4C93",
alpha = 0.7,
size = 3
) +
geom_smooth(
method = "lm",
se = TRUE,
color = "black"
) +
theme_classic(base_size = 14) +
labs(
title = "Attendance Rate vs Employee Performance",
subtitle = "Relationship between attendance consistency and productivity",
x = "Attendance Rate",
y = "Performance Score"
)
hr_data$Observation_Date <- as.Date(hr_data$Observation_Date)
monthly_perf <- hr_data %>%
mutate(Month = floor_date(Observation_Date, "month")) %>%
group_by(Month) %>%
summarise(
Avg_Performance = mean(Performance_Score)
)
ggplot(
monthly_perf,
aes(
x = Month,
y = Avg_Performance
)
) +
geom_line(
color = "#D62828",
linewidth = 1.5
) +
geom_point(
color = "#003049",
size = 3
) +
theme_classic(base_size = 14) +
labs(
title = "Monthly Average Performance Trend",
subtitle = "Average employee performance over time",
x = "Month",
y = "Average Performance"
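)
The scatter plots should be read with caution: the fitted lines for training hours and attendance rate are close to flat, consistent with correlations with performance of about -0.03 and 0.01 respectively. The visualisations therefore do not show that higher attendance or more training is accompanied by stronger performance scores.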
Hypothesis testing evaluates whether observed differences or relationships within a dataset are statistically significant.
Management requires evidence-based validation before implementing HR interventions or departmental policy changes.
Null hypothesis (H0): Employee performance does not differ significantly across departments.
Alternative hypothesis (H1): Employee performance differs significantly across departments.
anova_model <- aov(
Performance_Score ~ Department,
data = hr_data
)
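summary(anova_model)
             Df Sum Sq Mean Sq F value Pr(>F)
Department    5   1.54  0.3088   1.087  0.366
Residuals   494 140.28  0.2840
With F(5, 494) = 1.087 and p = 0.366, the null hypothesis is not rejected at the 5% level: the data give no evidence that mean performance scores differ between departments.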
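The F value and p-value in the ANOVA table can be recomputed from its mean squares, and the sums of squares also give an effect size. The sketch below works only from the rounded figures printed in the table, so results agree to about three decimal places. Eta squared is an addition for illustration and was not reported in the original analysis.

```r
# Figures copied from the one-way ANOVA table (rounded as printed).
ss_between <- 1.54     # Sum Sq, Department
ss_within  <- 140.28   # Sum Sq, Residuals
ms_between <- 0.3088   # Mean Sq, Department
ms_within  <- 0.2840   # Mean Sq, Residuals
df_between <- 5
df_within  <- 494

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation attributable to department.
eta_squared <- ss_between / (ss_between + ss_within)

round(c(F = f_value, p = p_value, eta_sq = eta_squared), 3)
```

An eta squared of roughly 0.011 means department accounts for about 1% of the variation in performance scores, which agrees with the non-significant F test.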
Null hypothesis (H0): Training hours are not significantly correlated with employee performance.
Alternative hypothesis (H1): Training hours are significantly correlated with employee performance.
cor.test(
hr_data$Training_Hours,
hr_data$Performance_Score,
method = "pearson"
)
Pearson's product-moment correlation
data: hr_data$Training_Hours and hr_data$Performance_Score
t = -0.75006, df = 498, p-value = 0.4536
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1209263 0.0542585
sample estimates:
cor
-0.03359191
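The correlation test does not reject the null hypothesis. The correlation between training hours and performance score is r = -0.034 (t = -0.750, df = 498, p = 0.454), and its 95% confidence interval (-0.121 to 0.054) includes zero, so the data provide no evidence of a linear relationship that management could act on.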
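The quantities reported by `cor.test()` follow directly from the sample correlation and the sample size. As a worked check, the sketch below recomputes the t statistic, the p-value, and the 95% confidence interval (via the Fisher z transformation) from the printed value of r and n = 500. The variable names are illustrative.

```r
r <- -0.03359191  # sample correlation printed by cor.test()
n <- 500          # number of employee records

# t statistic for H0: true correlation equals 0, with n - 2 degrees of freedom.
deg_free <- n - 2
t_stat <- r * sqrt(deg_free / (1 - r^2))
p_value <- 2 * pt(abs(t_stat), deg_free, lower.tail = FALSE)

# 95% confidence interval using the Fisher z transformation.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

Because the interval spans zero and is narrow, the data are consistent only with a very weak relationship in either direction.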
Correlation analysis measures the strength and direction of relationships between variables.
Correlation analysis helps HR teams identify workforce factors most strongly associated with employee productivity.
cor_matrix <- cor(
hr_data[, c(
"Tenure_Years",
"Training_Hours",
"Attendance_Rate",
"Performance_Score"
)]
)
cor_matrix
                  Tenure_Years Training_Hours Attendance_Rate Performance_Score
Tenure_Years       1.000000000   -0.007424008    -0.026488349       0.031437668
Training_Hours    -0.007424008    1.000000000     0.003063037      -0.033591912
Attendance_Rate   -0.026488349    0.003063037     1.000000000       0.006015039
Performance_Score  0.031437668   -0.033591912     0.006015039       1.000000000
corrplot(
cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
number.cex = 0.7
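)
No strong relationships were identified. All pairwise correlations have absolute values below 0.04, and Performance_Score is essentially uncorrelated with tenure (r = 0.031), training hours (r = -0.034), and attendance rate (r = 0.006).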
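Because Pearson correlations can be sensitive to skewed variables, and Performance_Score has a skew of 2.78, a rank-based Spearman matrix is a common robustness check. The sketch below is an assumption-laden illustration on simulated data (`sim`), not the HR records: the distributions and sample size are invented, and the score is generated independently of the predictors.

```r
set.seed(42)  # reproducible illustration
n <- 200

# Simulated stand-ins for the HR variables (not the real data).
sim <- data.frame(
  Training_Hours  = rnorm(n, mean = 36, sd = 10),
  Attendance_Rate = runif(n, min = 80, max = 100)
)
# Right-skewed score, unrelated to the predictors by construction.
sim$Performance_Score <- 40 + rexp(n, rate = 2)

# Compare the linear (Pearson) and rank-based (Spearman) matrices.
pearson  <- cor(sim, method = "pearson")
spearman <- cor(sim, method = "spearman")

round(spearman, 3)
```

On the real data the same comparison would use `cor(hr_data[, c(...)], method = "spearman")` with the four numeric columns. Similar Pearson and Spearman values would suggest the near-zero correlations are not an artefact of the skewed scores.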
Regression analysis estimates the impact of predictor variables on an outcome variable.
Regression modelling helps management understand which workforce factors most strongly influence employee performance outcomes.
reg_model <- lm(
Performance_Score ~
Tenure_Years +
Training_Hours +
Attendance_Rate,
data = hr_data
)
summary(reg_model)
Call:
lm(formula = Performance_Score ~ Tenure_Years + Training_Hours +
Attendance_Rate, data = hr_data)
Residuals:
Min 1Q Median 3Q Max
-1.6528 -0.1622 -0.0848 0.2217 4.5107
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.0573421 0.4840524 82.754 <2e-16 ***
Tenure_Years 0.0040547 0.0057989 0.699 0.485
Training_Hours -0.0017681 0.0023759 -0.744 0.457
Attendance_Rate 0.0007912 0.0051093 0.155 0.877
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5342 on 496 degrees of freedom
Multiple R-squared: 0.002149, Adjusted R-squared: -0.003886
F-statistic: 0.3561 on 3 and 496 DF, p-value: 0.7847
par(mfrow = c(2,2))
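plot(reg_model)
The regression model estimates the combined influence of tenure, training hours, and attendance rate on performance scores. None of the three coefficients is statistically significant (p = 0.485, 0.457, and 0.877 respectively), and the model as a whole explains about 0.2% of the variance in performance (R-squared = 0.002, F(3, 496) = 0.356, p = 0.785). The residuals range from -1.65 to 4.51, which points to a long right tail that the diagnostic plots can be used to inspect.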
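The adjusted R-squared and overall F-statistic in the regression summary are both functions of the multiple R-squared, the sample size, and the number of predictors. As a worked check, the sketch below recomputes them from the printed R-squared, so small rounding differences are expected. The variable names are illustrative.

```r
r_squared <- 0.002149  # Multiple R-squared from summary(reg_model)
n <- 500               # observations
k <- 3                 # predictors: tenure, training hours, attendance rate

# Adjusted R-squared penalises the fit for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_value <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

round(c(adj_R2 = adj_r_squared, F = f_stat, p = p_value), 4)
```

A negative adjusted R-squared indicates that the three predictors together fit the performance scores no better than the mean alone would.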
Taken together, the analyses do not show that performance scores are associated with attendance consistency or training hours in this dataset.
EDA identified missing values in the training and attendance variables, which were imputed before analysis. The ANOVA found no statistically significant differences in performance between departments, and the correlation and regression analyses found no significant relationship between performance and tenure, training hours, or attendance rate. The very narrow spread of performance scores (standard deviation 0.53) may limit what any of these methods can detect.
Future studies could incorporate:
- Predictive HR analytics
- Employee attrition modelling
- Multi-year workforce datasets
- Machine learning techniques
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
R Core Team. (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
citation("ggplot2")
To cite ggplot2 in publications, please use
H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York, 2016.
A BibTeX entry for LaTeX users is
@Book{,
author = {Hadley Wickham},
title = {ggplot2: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag New York},
year = {2016},
isbn = {978-3-319-24277-4},
url = {https://ggplot2.tidyverse.org},
}
citation("corrplot")
To cite corrplot in publications use:
Taiyun Wei and Viliam Simko (2024). R package 'corrplot':
Visualization of a Correlation Matrix (Version 0.95). Available from
https://github.com/taiyun/corrplot
A BibTeX entry for LaTeX users is
@Manual{corrplot2024,
title = {R package 'corrplot': Visualization of a Correlation Matrix},
author = {Taiyun Wei and Viliam Simko},
year = {2024},
note = {(Version 0.95)},
url = {https://github.com/taiyun/corrplot},
}
citation("psych")
To cite package 'psych' in publications use:
William Revelle (2026). _psych: Procedures for Psychological,
Psychometric, and Personality Research_. Northwestern University,
Evanston, Illinois. R package version 2.6.3,
<https://CRAN.R-project.org/package=psych>.
A BibTeX entry for LaTeX users is
@Manual{,
title = {psych: Procedures for Psychological, Psychometric, and Personality Research},
author = {{William Revelle}},
organization = {Northwestern University},
address = {Evanston, Illinois},
year = {2026},
note = {R package version 2.6.3},
url = {https://CRAN.R-project.org/package=psych},
}
AI-assisted tools were used to support code generation, formatting, and interpretation drafting. Independent analytical judgement was applied throughout the analytical process.