# ---------------------------------------------------------------
# PACKAGE LOADING
# All packages must be installed once via the console before rendering:
# install.packages(c("tidyverse","readxl","janitor","skimr","psych","corrplot","GGally"))
# Do NOT run install.packages() inside a Quarto document
# ---------------------------------------------------------------
library(tidyverse) # core: dplyr, ggplot2, tidyr, readr, purrr
library(readxl) # read Excel (.xlsx) files
library(janitor) # clean_names() standardises column names (distinct() comes from dplyr, loaded via tidyverse)
library(skimr) # skim() produces rich before/after summaries
library(psych) # describe() gives detailed descriptive statistics
library(corrplot) # corrplot() renders correlation matrices
library(GGally) # ggpairs() renders scatter plot matrices
Data Analytics I — Section B Take Home Project
Evaluating Training Effectiveness in the Workplace
1 STEP 1 — Research Design and Primary Data Collection
1.1 Business Problem and Research Question
Organisations invest significantly in employee training and development programmes, yet the return on this investment is rarely measured with rigour. Without systematic evidence, management cannot determine whether training is achieving its intended outcomes or where resources should be directed for maximum impact.
This project investigates the effectiveness of workplace training programmes within the organisation, specifically examining whether training content, delivery quality, and frequency translate into improved employee productivity, job performance, and satisfaction.
Research Question:
“To what extent do workplace training programmes influence employee productivity, job performance, and satisfaction within the organisation?”
1.2 Methods Section
Population of Interest: All employees within the organisation who have attended at least one training programme in the preceding 12 months, spanning multiple departments and seniority levels.
Sampling Approach: Convenience sampling was adopted for this study. Given the time constraints of the project and the need to collect a minimum of 30 responses within a short window, convenience sampling offered the most practical access to a sufficient number of respondents. Employees across departments were approached directly via internal communication channels. While convenience sampling limits the ability to generalise findings to the broader population with statistical confidence, it is appropriate for this exploratory study given its scope and timeline.
Instrument Design Rationale: A structured online survey questionnaire was designed and administered via Google Forms. The instrument was designed to capture both demographic context and substantive effectiveness ratings, producing quantitative data suitable for statistical analysis. The instrument captures the following 15 variables:
| # | Variable | Type | Description |
|---|---|---|---|
| 1 | Timestamp | Date/Time | Time of survey completion |
| 2 | Gender | Categorical | Respondent gender |
| 3 | Age Group | Categorical | Age band |
| 4 | Department | Categorical | Respondent’s department |
| 5 | Years of Work Experience | Categorical | Experience band |
| 6 | Attended Training (Last 12 Months) | Categorical | Yes/No |
| 7 | Number of Sessions Attended | Numeric | Count of sessions |
| 8 | Average Duration of Training | Categorical | Duration band |
| 9 | Training Relevant to Job Role | Likert 1–5 | Relevance rating |
| 10 | Training Improved Job Performance | Likert 1–5 | Performance impact |
| 11 | Training Increased Productivity | Likert 1–5 | Productivity impact |
| 12 | Facilitators Communicated Effectively | Likert 1–5 | Delivery quality |
| 13 | Satisfied with Overall Training Quality | Likert 1–5 | Overall satisfaction |
| 14 | Would Recommend Training to Colleagues | Likert 1–5 | Advocacy likelihood |
| 15 | Area Organisation Should Improve On | Categorical | Open improvement area |
Data Collection Timeline: The survey was designed and deployed over a three-day window. A total of 54 complete responses were collected from employees across the organisation. Data was exported from Google Forms as an Excel file (.xlsx) and imported into RStudio for analysis.
2 STEP 2 — Data Preparation in RStudio
2.1 Load Required Packages
2.2 Import Raw Data
# ---------------------------------------------------------------
# IMPORT
# The Excel file must be in the same folder as this .qmd file.
# read_excel() reads .xlsx files directly without conversion.
# ---------------------------------------------------------------
df_raw <- read_excel("DA Exam.xlsx")
# Standardise all column names to snake_case
# Why: Raw Google Forms column names contain spaces, capital letters, and
# punctuation that cause errors in R functions. clean_names() converts them
# to consistent, machine-readable snake_case format.
df_raw <- df_raw %>% clean_names()
# Preview the raw imported data
glimpse(df_raw)
Rows: 54
Columns: 15
$ timestamp <chr> …
$ gender <chr> …
$ age_group <chr> …
$ department <chr> …
$ years_of_work_experience <chr> …
$ have_you_attended_any_training_programme_in_the_last_12_months <chr> …
$ approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months <chr> …
$ average_duration_of_training_attended <chr> …
$ the_training_programmes_are_relevant_to_my_job_role <dbl> …
$ the_training_improved_my_job_performance <dbl> …
$ the_training_increased_my_productivity <dbl> …
$ the_facilitators_communicated_effectively_during_the_training <dbl> …
$ i_am_satisfied_with_the_overall_quality_of_the_training <dbl> …
$ i_would_recommend_future_training_programmes_to_colleagues <dbl> …
$ what_area_of_training_would_you_like_the_organization_to_improve_on <chr> …
2.3 Before-Cleaning Snapshot
# ---------------------------------------------------------------
# BASELINE DOCUMENTATION
# Why: We must record the state of the data BEFORE cleaning so we
# can produce a before/after comparison demonstrating our work.
# ---------------------------------------------------------------
cat("========== BEFORE CLEANING ==========\n")
========== BEFORE CLEANING ==========
cat("Total rows: ", nrow(df_raw), "\n")
Total rows: 54
cat("Total columns: ", ncol(df_raw), "\n")
Total columns: 15
cat("Total missing values:", sum(is.na(df_raw)), "\n\n")
Total missing values: 0
cat("--- Missing values per column ---\n")
--- Missing values per column ---
print(colSums(is.na(df_raw)))
timestamp
0
gender
0
age_group
0
department
0
years_of_work_experience
0
have_you_attended_any_training_programme_in_the_last_12_months
0
approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months
0
average_duration_of_training_attended
0
the_training_programmes_are_relevant_to_my_job_role
0
the_training_improved_my_job_performance
0
the_training_increased_my_productivity
0
the_facilitators_communicated_effectively_during_the_training
0
i_am_satisfied_with_the_overall_quality_of_the_training
0
i_would_recommend_future_training_programmes_to_colleagues
0
what_area_of_training_would_you_like_the_organization_to_improve_on
0
cat("\n--- Duplicate rows ---\n")
--- Duplicate rows ---
cat("Number of duplicate rows:", sum(duplicated(df_raw)), "\n")
Number of duplicate rows: 0
2.4 Step 2a — Missing Value Detection and Handling
# ---------------------------------------------------------------
# MISSING VALUE IMPUTATION
# Why: Missing data in Likert-scale items biases mean calculations
# and causes errors in correlation and regression functions.
#
# Strategy chosen: Median imputation for numeric variables.
# Justification: The median is robust to skewness and is more
# appropriate than the mean for ordinal Likert-scale data (1-5),
# where the distribution may not be symmetric.
#
# For categorical (character) variables, we replace NA with
# "Unknown" to preserve the row for demographic analysis.
# ---------------------------------------------------------------
df_clean <- df_raw %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., median(., na.rm = TRUE))))
df_clean <- df_clean %>%
  mutate(across(where(is.character),
                ~ replace_na(., "Unknown")))
cat("Missing values after imputation:", sum(is.na(df_clean)), "\n")
Missing values after imputation: 0
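The preference for the median over the mean can be illustrated on a small made-up vector of skewed Likert ratings. The sketch below uses base R only, and `ratings` is a hypothetical example rather than project data. It shows the mean being pulled toward a single low rating while the median stays with the bulk of the responses.

```r
# Hypothetical ratings on a 1-5 scale with one missing value (not project data)
ratings <- c(5, 5, 4, 5, 1, NA)

med <- median(ratings, na.rm = TRUE)  # middle value of the observed ratings
avg <- mean(ratings, na.rm = TRUE)    # pulled down by the single rating of 1

# Replace the missing value with the median, mirroring the cleaning step
imputed <- ifelse(is.na(ratings), med, ratings)

cat("Median:", med, " Mean:", avg, "\n")
print(imputed)
```

Because the median of integer ratings is usually itself a valid scale point, the imputed value tends to stay interpretable as a rating.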
2.5 Step 2b — Duplicate Record Identification and Removal
# ---------------------------------------------------------------
# DUPLICATE REMOVAL
# Why: Duplicate survey submissions inflate response counts and
# distort averages — a respondent who submitted twice would be
# counted as two separate opinions, biasing the results.
# distinct() retains only the first occurrence of each unique row.
# ---------------------------------------------------------------
before_dedup <- nrow(df_clean)
df_clean <- df_clean %>% distinct()
after_dedup <- nrow(df_clean)
cat("Rows before deduplication:", before_dedup, "\n")
Rows before deduplication: 54
cat("Rows after deduplication: ", after_dedup, "\n")
Rows after deduplication: 54
cat("Duplicate rows removed: ", before_dedup - after_dedup, "\n")
Duplicate rows removed: 0
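One caveat is that whole-row comparison includes the timestamp, which is unique for every submission in this dataset, so a respondent who submitted twice would not be flagged. A minimal sketch of a stricter check, using hypothetical rows and base R only, compares the answer columns while ignoring the timestamp:

```r
# Hypothetical responses: rows 1 and 3 share answers but differ in timestamp
responses <- data.frame(
  timestamp = c("2026/05/12 8:39:47 PM", "2026/05/12 8:43:44 PM",
                "2026/05/12 9:01:10 PM"),
  gender    = c("Female", "Male", "Female"),
  relevance = c(4, 5, 4),
  stringsAsFactors = FALSE
)

# Whole-row comparison: the unique timestamps hide the repeated answers
n_dup_full <- sum(duplicated(responses))

# Comparison on the answer columns only
answer_cols   <- setdiff(names(responses), "timestamp")
n_dup_answers <- sum(duplicated(responses[, answer_cols]))

cat("Duplicates (all columns):  ", n_dup_full, "\n")
cat("Duplicates (answers only): ", n_dup_answers, "\n")
```

Identical answers can also come from two different people, so rows flagged this way are better reviewed than dropped automatically.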
2.6 Step 2c — Variable Type Validation and Correction
# ---------------------------------------------------------------
# TYPE CORRECTION
# Why: R cannot perform statistical calculations on variables
# stored as the wrong type. When Google Forms exports to Excel,
# Likert responses sometimes import as text ("4") rather than
# numbers (4). Grouping variables must be factors for ggplot2
# to handle them correctly in plots and summaries.
# ---------------------------------------------------------------
df_clean <- df_clean %>%
  mutate(
    # --- Categorical variables → factors ---
    # Factors are required for grouping in ggplot2 and frequency tables
    gender =
      as.factor(gender),
    age_group =
      as.factor(age_group),
    department =
      as.factor(department),
    years_of_work_experience =
      as.factor(years_of_work_experience),
    average_duration_of_training_attended =
      as.factor(average_duration_of_training_attended),
    have_you_attended_any_training_programme_in_the_last_12_months =
      as.factor(have_you_attended_any_training_programme_in_the_last_12_months),
    what_area_of_training_would_you_like_the_organization_to_improve_on =
      as.factor(what_area_of_training_would_you_like_the_organization_to_improve_on),
    # --- Likert-scale items → numeric ---
    # Required for mean, SD, and correlation calculations
    the_training_programmes_are_relevant_to_my_job_role =
      as.numeric(the_training_programmes_are_relevant_to_my_job_role),
    the_training_improved_my_job_performance =
      as.numeric(the_training_improved_my_job_performance),
    the_training_increased_my_productivity =
      as.numeric(the_training_increased_my_productivity),
    the_facilitators_communicated_effectively_during_the_training =
      as.numeric(the_facilitators_communicated_effectively_during_the_training),
    i_am_satisfied_with_the_overall_quality_of_the_training =
      as.numeric(i_am_satisfied_with_the_overall_quality_of_the_training),
    i_would_recommend_future_training_programmes_to_colleagues =
      as.numeric(i_would_recommend_future_training_programmes_to_colleagues)
    # --- Session count: left as text at this step ---
    # This column is exported as range text (e.g. "1 to 2"), so as.numeric()
    # would coerce every value to NA. It is converted to an ordered factor
    # after the columns are renamed.
  )
# Confirm all types are now correct
str(df_clean)
tibble [54 × 15] (S3: tbl_df/tbl/data.frame)
$ timestamp : chr [1:54] "2026/05/12 8:39:47 PM GMT+1" "2026/05/12 8:43:44 PM GMT+1" "2026/05/12 8:50:36 PM GMT+1" "2026/05/12 8:51:57 PM GMT+1" ...
$ gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 2 2 2 ...
$ age_group : Factor w/ 5 levels "20 - 25 years",..: 3 1 2 2 2 3 2 2 2 2 ...
$ department : Factor w/ 45 levels "Account management",..: 45 34 32 1 24 11 6 36 3 5 ...
$ years_of_work_experience : Factor w/ 5 levels "1 - 3 years",..: 3 1 1 2 1 2 1 2 2 2 ...
$ have_you_attended_any_training_programme_in_the_last_12_months : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
$ approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months: num [1:54] NA NA NA NA NA NA NA NA NA NA ...
$ average_duration_of_training_attended : Factor w/ 4 levels "1 day","2-4 hours",..: 2 1 2 3 2 4 2 4 2 3 ...
$ the_training_programmes_are_relevant_to_my_job_role : num [1:54] 4 5 4 5 4 5 3 4 4 5 ...
$ the_training_improved_my_job_performance : num [1:54] 3 5 3 4 4 5 2 3 4 4 ...
$ the_training_increased_my_productivity : num [1:54] 3 5 3 3 4 5 3 3 4 4 ...
$ the_facilitators_communicated_effectively_during_the_training : num [1:54] 4 4 3 4 4 3 4 4 3 4 ...
$ i_am_satisfied_with_the_overall_quality_of_the_training : num [1:54] 4 4 4 4 4 3 4 3 4 4 ...
$ i_would_recommend_future_training_programmes_to_colleagues : num [1:54] 4 4 5 3 4 5 4 3 4 4 ...
$ what_area_of_training_would_you_like_the_organization_to_improve_on : Factor w/ 49 levels ".","AI intersection with product",..: 43 38 42 13 10 17 17 18 10 34 ...
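Coercing range text with `as.numeric()` returns `NA`, which is how a column can end up entirely missing, as the session-count column does in the `str()` output above. A minimal check, using hypothetical labels and base R only, counts the NAs a conversion would introduce before it is applied:

```r
# Hypothetical session-count labels as exported text (not project data)
sessions_raw <- c("1 to 2", "3 to 4", "2", "More than 6")

# as.numeric() warns and returns NA for anything that is not a plain number
sessions_num <- suppressWarnings(as.numeric(sessions_raw))

# NAs that were not present before the conversion
new_nas <- sum(is.na(sessions_num) & !is.na(sessions_raw))
cat("Values lost to coercion:", new_nas, "of", length(sessions_raw), "\n")
```

A non-zero count is a signal that the column holds categories rather than numbers and should be kept as text or converted to a factor.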
2.7 Rename Columns for Readability
# ---------------------------------------------------------------
# COLUMN RENAMING
# Why: The original column names exported from Google Forms are
# full sentences — too long to appear legibly in plots, tables,
# and console output. We create short, descriptive aliases while
# preserving the meaning of each variable.
# ---------------------------------------------------------------
df_clean <- df_clean %>%
  rename(
    relevance         = the_training_programmes_are_relevant_to_my_job_role,
    performance       = the_training_improved_my_job_performance,
    productivity      = the_training_increased_my_productivity,
    facilitator       = the_facilitators_communicated_effectively_during_the_training,
    satisfaction      = i_am_satisfied_with_the_overall_quality_of_the_training,
    recommend         = i_would_recommend_future_training_programmes_to_colleagues,
    num_sessions      = approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months,
    attended_training = have_you_attended_any_training_programme_in_the_last_12_months,
    improve_area      = what_area_of_training_would_you_like_the_organization_to_improve_on,
    duration          = average_duration_of_training_attended,
    experience        = years_of_work_experience
  )
# Fix num_sessions: stored as range text e.g. "1 to 2", "3 to 4", so treat it as an ordered factor
# Why: as.numeric() would produce NA for text ranges. Converting to an ordered factor
# preserves the categorical meaning and allows group comparisons in plots.
# Note: the levels must match the exported labels exactly. Any label not listed
# in `levels`, and any value already coerced to NA, becomes NA in the factor.
df_clean <- df_clean %>%
  mutate(num_sessions = factor(num_sessions,
                               levels = c("1", "1 to 2", "2", "3 to 4",
                                          "5 to 6", "More than 6"),
                               ordered = TRUE))
# Fix improve_area: free-text open responses — truncate for display
# Why: Each respondent wrote a unique sentence. Truncating to 40 characters
# allows the frequency table to display cleanly without wrapping.
df_clean <- df_clean %>%
  mutate(improve_area = as.character(improve_area),
         improve_area = str_trunc(improve_area, 40),
         improve_area = as.factor(improve_area))
cat("Renaming complete. Final column names:\n")
Renaming complete. Final column names:
print(names(df_clean))
 [1] "timestamp" "gender" "age_group"
[4] "department" "experience" "attended_training"
[7] "num_sessions" "duration" "relevance"
[10] "performance" "productivity" "facilitator"
[13] "satisfaction" "recommend" "improve_area"
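When text labels are converted with `factor(..., levels = ...)`, any label missing from the level list silently becomes `NA`. The sketch below, with hypothetical labels and base R only, lists unmatched labels before converting. The label "7 to 8" is invented purely to trigger the mismatch.

```r
# Hypothetical labels; "7 to 8" is deliberately absent from the level list
sessions_raw <- c("1 to 2", "3 to 4", "More than 6", "7 to 8")
band_levels  <- c("1", "1 to 2", "2", "3 to 4", "5 to 6", "More than 6")

# Labels present in the data but absent from the level list
unmatched <- setdiff(unique(sessions_raw), band_levels)

# Ordered factor: unmatched labels are converted to NA without a warning
sessions_fct <- factor(sessions_raw, levels = band_levels, ordered = TRUE)

print(unmatched)
cat("NAs after conversion:", sum(is.na(sessions_fct)), "\n")
```

Running the `setdiff()` check on the real column before the conversion would show whether the hard-coded levels cover every exported label.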
2.8 Step 2d — Outlier Detection and Treatment
# ---------------------------------------------------------------
# OUTLIER DETECTION
# Why: Extreme values in Likert ratings (e.g., a data entry error
# producing a score of 55 instead of 5) distort mean calculations,
# inflate standard deviations, and skew correlation coefficients.
#
# Method: Z-score approach. Values more than 3 standard deviations
# from the mean are considered outliers and removed.
# A boxplot is produced first for visual inspection.
# ---------------------------------------------------------------
# Visual check
df_clean %>%
  select(relevance, performance, productivity,
         facilitator, satisfaction, recommend) %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "score") %>%
  ggplot(aes(x = variable, y = score, fill = variable)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Outlier Check: All Six Likert-Scale Training Variables",
       x = NULL,
       y = "Score (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
# Z-score outlier removal on the three primary outcome variables
df_clean <- df_clean %>%
  mutate(
    z_prod = as.numeric(scale(productivity)),
    z_sat  = as.numeric(scale(satisfaction)),
    z_perf = as.numeric(scale(performance))
  ) %>%
  filter(
    abs(z_prod) <= 3,
    abs(z_sat) <= 3,
    abs(z_perf) <= 3
  ) %>%
  select(-z_prod, -z_sat, -z_perf)
cat("Rows remaining after outlier treatment:", nrow(df_clean), "\n")
Rows remaining after outlier treatment: 54
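A short worked check helps explain why no rows were removed. Using the means and standard deviations reported for the three outcome variables in this report, the sketch below computes the z-score of each scale endpoint. The `out_of_range()` helper is a hypothetical addition, shown as a more direct way to catch entry errors on a bounded 1-5 scale.

```r
# Means and SDs as reported in this report's summary tables
stats <- data.frame(
  variable = c("productivity", "satisfaction", "performance"),
  mean     = c(3.74, 3.94, 3.80),
  sd       = c(1.08, 1.05, 1.07)
)

# Z-scores of the two scale endpoints (1 and 5)
stats$z_at_1    <- (1 - stats$mean) / stats$sd
stats$z_at_5    <- (5 - stats$mean) / stats$sd
stats$max_abs_z <- pmax(abs(stats$z_at_1), abs(stats$z_at_5))
print(stats)

# Hypothetical helper: flags values outside the valid scale, ignoring NA
out_of_range <- function(x, lo = 1, hi = 5) !is.na(x) & (x < lo | x > hi)
print(out_of_range(c(4, 55, 1, NA)))
```

With these figures, even a rating of 1 sits within three standard deviations of the mean, so the z-score filter can only remove values that fall outside the scale itself.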
2.9 Before and After Summary
# ---------------------------------------------------------------
# AFTER-CLEANING DOCUMENTATION
# Why: Required evidence that the cleaning workflow was executed.
# The before/after comparison demonstrates impact of each step.
# ---------------------------------------------------------------
cat("========== AFTER CLEANING ==========\n")
========== AFTER CLEANING ==========
cat("Total rows: ", nrow(df_clean), "\n")
Total rows: 54
cat("Total columns: ", ncol(df_clean), "\n")
Total columns: 15
cat("Total missing values:", sum(is.na(df_clean)), "\n\n")
Total missing values: 54
cat("--- Rows removed summary ---\n")
--- Rows removed summary ---
cat("Original rows: 54\n")
Original rows: 54
cat("After deduplication: ", after_dedup, "\n")
After deduplication: 54
cat("After outlier removal:", nrow(df_clean), "\n\n")
After outlier removal: 54
# Rich summary using skimr
skim(df_clean)
| Name | df_clean |
|---|---|
| Number of rows | 54 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 8 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| timestamp | 0 | 1 | 27 | 28 | 0 | 54 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | Fem: 31, Mal: 23 |
| age_group | 0 | 1 | FALSE | 5 | 26 : 26, 31 : 15, 20 : 7, 36 : 3 |
| department | 0 | 1 | FALSE | 45 | Fin: 7, Cus: 2, Hum: 2, Res: 2 |
| experience | 0 | 1 | FALSE | 5 | 4 -: 18, 1 -: 16, 7 -: 11, Abo: 7 |
| attended_training | 0 | 1 | FALSE | 2 | Yes: 51, No: 3 |
| num_sessions | 54 | 0 | TRUE | 0 | 1: 0, 1 t: 0, 2: 0, 3 t: 0 |
| duration | 0 | 1 | FALSE | 4 | 2-4: 20, Mor: 17, Les: 9, 1 d: 8 |
| improve_area | 0 | 1 | FALSE | 49 | Nil: 4, Emo: 2, Lea: 2, .: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| relevance | 0 | 1 | 4.31 | 1.02 | 1 | 4 | 5 | 5 | 5 | ▁▁▁▅▇ |
| performance | 0 | 1 | 3.80 | 1.07 | 1 | 3 | 4 | 5 | 5 | ▁▁▅▇▆ |
| productivity | 0 | 1 | 3.74 | 1.08 | 1 | 3 | 4 | 5 | 5 | ▁▁▇▇▇ |
| facilitator | 0 | 1 | 4.00 | 0.95 | 1 | 4 | 4 | 5 | 5 | ▁▁▂▇▅ |
| satisfaction | 0 | 1 | 3.94 | 1.05 | 1 | 4 | 4 | 5 | 5 | ▁▁▂▇▅ |
| recommend | 0 | 1 | 4.02 | 1.09 | 1 | 4 | 4 | 5 | 5 | ▂▁▂▇▇ |
3 STEP 3 — Exploratory Data Analysis (EDA)
3.1 Step 3a — Summary Statistics
# ---------------------------------------------------------------
# DESCRIPTIVE STATISTICS
# Why: Summary statistics reveal the central tendency and spread
# of each variable before we examine relationships. They form the
# foundation of all subsequent analysis and give management a
# quick snapshot of how employees rated the training programme.
# ---------------------------------------------------------------
# Create a subset of the six Likert variables for analysis
likert_vars <- df_clean %>%
  select(relevance, performance, productivity,
         facilitator, satisfaction, recommend)
# Detailed descriptive statistics: mean, SD, median, min, max, skew, kurtosis
describe(likert_vars) %>% round(2)
 vars n mean sd median trimmed mad min max range skew kurtosis
relevance 1 54 4.31 1.02 5 4.52 0.00 1 5 4 -1.88 3.40
performance 2 54 3.80 1.07 4 3.93 1.48 1 5 4 -0.87 0.40
productivity 3 54 3.74 1.08 4 3.86 1.48 1 5 4 -0.70 0.08
facilitator 4 54 4.00 0.95 4 4.14 0.00 1 5 4 -1.29 1.88
satisfaction 5 54 3.94 1.05 4 4.11 0.74 1 5 4 -1.41 1.90
recommend 6 54 4.02 1.09 4 4.20 1.48 1 5 4 -1.41 1.70
se
relevance 0.14
performance 0.15
productivity 0.15
facilitator 0.13
satisfaction 0.14
recommend 0.15
Interpretation: The descriptive statistics reveal that all six training effectiveness dimensions score above the midpoint of 3.0 on the 1-5 scale, confirming that employees hold generally positive perceptions of the training programme. The highest mean score is observed for relevance to job role, indicating that employees feel the organisation is selecting training content that is directly applicable to their work. The lowest mean score is for productivity impact, suggesting that while employees find training relevant and well-delivered, they are less certain it translates into measurable on-the-job productivity gains — the most critical outcome for management.
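The `se` column in the output above is the standard error of the mean, computed as the standard deviation divided by the square root of the sample size. As a quick check, the sketch below recomputes it in base R from the reported standard deviations and n = 54. The vector name `sds` is an illustrative choice.

```r
# Sample size and standard deviations as reported by describe()
n   <- 54
sds <- c(relevance = 1.02, performance = 1.07, productivity = 1.08,
         facilitator = 0.95, satisfaction = 1.05, recommend = 1.09)

# Standard error of the mean: sd / sqrt(n), rounded as in the table
se <- round(sds / sqrt(n), 2)
print(se)
```

The recomputed values agree with the reported column, which confirms that all six means are estimated with similar precision (roughly 0.13 to 0.15 scale points).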
# ---------------------------------------------------------------
# FREQUENCY TABLE — DEPARTMENT
# Why: Understanding the composition of our sample by department
# helps contextualise findings and identify whether any single
# department dominates the results.
# ---------------------------------------------------------------
cat("=== Department Distribution ===\n")
=== Department Distribution ===
dept_table <- table(df_clean$department) %>%
  as.data.frame() %>%
  rename(Department = Var1, Count = Freq) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
  arrange(desc(Count))
print(dept_table)
 Department Count Percentage
1 Finance 7 13.0
2 Customer service 2 3.7
3 Human Resources 2 3.7
4 Research 2 3.7
5 Account management 1 1.9
6 Admin and Project Management 1 1.9
7 Administration Management 1 1.9
8 Administrator 1 1.9
9 Auditing 1 1.9
10 Business Administrator 1 1.9
11 Business Development 1 1.9
12 Business Management 1 1.9
13 Cinical 1 1.9
14 Client Relations 1 1.9
15 Communications 1 1.9
16 Communications and marketing 1 1.9
17 Cordos wealth 1 1.9
18 Corporate Finance and Investment 1 1.9
19 Creative Media & Content 1 1.9
20 Customer Experience 1 1.9
21 Digital Marketing 1 1.9
22 Engineering 1 1.9
23 Food science and technology 1 1.9
24 Funds Operations 1 1.9
25 HR 1 1.9
26 Information technology 1 1.9
27 Information Technology 1 1.9
28 INSURANCE 1 1.9
29 Internal Audit 1 1.9
30 Investment 1 1.9
31 Investment management 1 1.9
32 Investment Management 1 1.9
33 Law 1 1.9
34 Legal 1 1.9
35 Media 1 1.9
36 Nursing 1 1.9
37 Operations 1 1.9
38 Product management 1 1.9
39 Product Manager 1 1.9
40 Project management 1 1.9
41 Publishing 1 1.9
42 Risk Management 1 1.9
43 Social Science 1 1.9
44 Space Engineering 1 1.9
45 Trust 1 1.9
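Interpretation: The sample spans a wide range of functional areas: the 54 respondents reported 45 distinct department labels. Finance is the largest group (7 respondents, 13.0%), followed by Customer service, Human Resources, and Research (2 each); the remaining 41 labels have a single respondent. The labels appear to have been entered as free text, and several differ only in capitalisation or abbreviation (for example "Information technology" and "Information Technology", or "HR" and "Human Resources"), so the counts understate the true size of some groups. Because this is a convenience sample, the size of a group reflects who responded and not the department's training participation rate. Departments with only one respondent should be interpreted with caution, as a single data point cannot reliably represent departmental opinion.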
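Labels that differ only in case or abbreviation can be consolidated before tabulating. The sketch below is a minimal base R illustration using a few labels taken from the table above. The `aliases` lookup is a hypothetical mapping that would need to be reviewed against the full list.

```r
# A few department labels as they appear in the frequency table
dept_raw <- c("Information technology", "Information Technology",
              "HR", "Human Resources",
              "Investment management", "Investment Management")

# Step 1: trim whitespace and lower-case so case variants collapse
dept_norm <- tolower(trimws(dept_raw))

# Step 2: map known abbreviations to a full name (hypothetical lookup)
aliases  <- c("hr" = "human resources")
is_alias <- dept_norm %in% names(aliases)
dept_norm[is_alias] <- aliases[dept_norm[is_alias]]

print(table(dept_norm))
```

In this small example six raw labels reduce to three groups of two, which shows how much the raw counts can fragment a single department.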
# ---------------------------------------------------------------
# FREQUENCY TABLE — TRAINING ATTENDANCE AND IMPROVEMENT AREAS
# Why: Confirming the proportion who attended training validates
# sample relevance. The improvement area table reveals priorities
# directly from employees — actionable intelligence for management.
# ---------------------------------------------------------------
cat("=== Training Attendance in Last 12 Months ===\n")
=== Training Attendance in Last 12 Months ===
table(df_clean$attended_training) %>% print()
No Yes
3 51
cat("\n=== Most Requested Training Improvement Areas ===\n")
=== Most Requested Training Improvement Areas ===
improve_table <- table(df_clean$improve_area) %>%
  sort(decreasing = TRUE) %>%
  as.data.frame() %>%
  rename(Area = Var1, Count = Freq) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))
print(improve_table)
 Area Count Percentage
1 Nil 4 7.4
2 Emotional intelligence 2 3.7
3 Leadership 2 3.7
4 . 1 1.9
5 AI intersection with product 1 1.9
6 Behavioral and Performance based trai... 1 1.9
7 Business development 1 1.9
8 Communication 1 1.9
9 Communications pre and post the training 1 1.9
10 Corporate ethics, governance and oper... 1 1.9
11 Data privacy and awareness 1 1.9
12 Easy accessibility to training materi... 1 1.9
13 Employee relations 1 1.9
14 Everything 1 1.9
15 Finding the right facilitators with d... 1 1.9
16 Flight software 1 1.9
17 Fraud training 1 1.9
18 Intellectual Capacity 1 1.9
19 Marketing 1 1.9
20 More about interpersonal relationship... 1 1.9
21 N/A 1 1.9
22 Nil. Too much training infact 1 1.9
23 Non 1 1.9
24 None 1 1.9
25 Nothing, it was a good programme 1 1.9
26 Operational Efficiency 1 1.9
27 Organization should review effect of ... 1 1.9
28 Organizational Ethics and behavior 1 1.9
29 Pay attention to the quality of train... 1 1.9
30 Personal and self care 1 1.9
31 Personal finance 1 1.9
32 Practical trainings 1 1.9
33 Soft skill and behavioral training 1 1.9
34 Spacing the training over several ses... 1 1.9
35 Sponsoring professional certification... 1 1.9
36 Staff knowledge on Ai 1 1.9
37 Team work 1 1.9
38 Technical training 1 1.9
39 Technical training and soft skills 1 1.9
40 Technology and documentation system t... 1 1.9
41 The core technicals of the training. ... 1 1.9
42 The priorities of the employees and h... 1 1.9
43 The quality of the trainings 1 1.9
44 Training About internal processes 1 1.9
45 Training on LMS platforms for contino... 1 1.9
46 Training on soft skills 1 1.9
47 Training that relates primarily to th... 1 1.9
48 Training that specifically and direct... 1 1.9
49 Video Creativity 1 1.9
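Interpretation: 51 of the 54 respondents (94.4%) reported attending at least one training programme in the past 12 months, which supports the relevance of this sample for assessing training effectiveness. The improvement-area responses are open text, so almost every entry is unique and the frequency table cannot rank priorities by itself. The most frequent entry is "Nil" (4 responses, 7.4%), and further entries such as "N/A", "Non", "None", and "." also indicate no suggestion. Among substantive answers, "Emotional intelligence" and "Leadership" each appear twice, while soft skills, technical training, and training quality recur across several differently worded responses. Grouping the responses into themes would be needed before the table can guide the organisation's next training review cycle.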
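Free-text answers that all mean "no suggestion" can be collapsed into one category before counting. The following base R sketch uses a handful of responses from the table above. The `no_suggestion` list is an assumed set of variants that should be extended after reading the full responses.

```r
# A few responses as they appear in the frequency table
improve_raw <- c("Nil", "Nil", "N/A", "None", "Non", ".",
                 "Leadership", "Emotional intelligence")

# Assumed variants that all indicate no improvement suggestion
no_suggestion <- c("nil", "n/a", "none", "non", ".")

# Collapse the variants; keep every other response unchanged
improve_grp <- ifelse(tolower(trimws(improve_raw)) %in% no_suggestion,
                      "No suggestion", improve_raw)

counts <- sort(table(improve_grp), decreasing = TRUE)
print(counts)
```

The same pattern extends to substantive themes, for example by matching keywords such as "soft skill" or "technical" with `grepl()`.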
3.2 Step 3b — Distribution Analysis
# ---------------------------------------------------------------
# HISTOGRAM — PRODUCTIVITY
# Why: The histogram reveals the shape of the productivity rating
# distribution. If most responses cluster at 4-5, training is
# perceived as effective. If responses spread widely or cluster
# at lower values, there is a problem management should address.
# Knowing the shape also informs whether parametric tests are valid.
# ---------------------------------------------------------------
ggplot(df_clean, aes(x = productivity)) +
  geom_histogram(binwidth = 0.5,
                 fill = "steelblue",
                 color = "white") +
  scale_x_continuous(breaks = 1:5,
                     labels = c("1\nStrongly\nDisagree",
                                "2\nDisagree",
                                "3\nNeutral",
                                "4\nAgree",
                                "5\nStrongly\nAgree")) +
  labs(title = "Distribution of Ratings: Training Increased My Productivity",
       subtitle = paste("Primary survey data | n =", nrow(df_clean), "respondents"),
       x = "Rating",
       y = "Number of Respondents") +
  theme_minimal()
Interpretation: The histogram reveals that productivity impact ratings are concentrated at the higher end of the scale, with the majority of respondents rating this dimension at 4 or 5. This indicates that most employees believe training has had a positive impact on their productivity. However, the presence of ratings at 1 and 2 suggests a minority of employees, likely those in roles where training content is less directly applicable, do not share this view. Management should investigate this subgroup to determine whether role-specific training programmes are warranted.
# ---------------------------------------------------------------
# BOXPLOT — SATISFACTION BY DEPARTMENT
# Why: Boxplots simultaneously display median, spread, and outliers
# within each group. Comparing satisfaction across departments
# reveals whether training quality is consistent organisation-wide
# or concentrated in certain areas — a critical management insight.
# ---------------------------------------------------------------
ggplot(df_clean,
       aes(x = department,
           y = satisfaction,
           fill = department)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Training Satisfaction Score by Department",
       subtitle = paste("n =", nrow(df_clean), "respondents"),
       x = "Department",
       y = "Satisfaction Rating (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1,
                                   vjust = 0.5,
                                   size = 6))
Interpretation: The boxplot should be read with caution. Of the 45 department labels, 41 have a single respondent, so their boxes collapse to a single line and show no spread. Only Finance (n = 7) and the three departments with two respondents each can show any within-group variation. Differences between departments therefore cannot be separated from differences between individual respondents, and the plot does not support conclusions about training quality in specific departments. A larger sample per department, or consolidated department groups, would be needed before management commissions targeted follow-up assessments.
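Since most departments in the frequency table have a single respondent, one option is to pool small groups into an "Other" category before plotting. The sketch below is a base R illustration on made-up counts. The threshold `min_n` is a hypothetical choice, and pooling is an alternative presentation rather than the method used in this report.

```r
# Hypothetical department vector: one larger group and several singletons
dept <- c(rep("Finance", 7), rep("Research", 2), "Law", "Nursing", "Media")

# Respondents per department, looked up for every row
dept_n   <- table(dept)
n_per_row <- as.vector(dept_n[dept])

# Pool departments below the threshold into "Other"
min_n <- 2  # hypothetical minimum group size
dept_lumped <- ifelse(n_per_row >= min_n, dept, "Other")

print(table(dept_lumped))
```

The tidyverse equivalent is `forcats::fct_lump_min()`, which applies the same minimum-count rule to a factor.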
3.3 Step 3c — Relationship Analysis
# ---------------------------------------------------------------
# CORRELATION MATRIX
# Why: Correlation analysis identifies which training dimensions
# are most strongly associated with overall satisfaction, guiding
# management on where investment will have the greatest impact.
#
# Method: Spearman rank correlation is used (not Pearson) because
# Likert-scale data is ordinal — responses represent ranked
# categories, not equal intervals. Spearman does not assume
# normality and is appropriate for this type of data.
# ---------------------------------------------------------------
cor_matrix <- cor(likert_vars,
                  use = "complete.obs",
                  method = "spearman")
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.cex = 0.8,
         number.cex = 0.75,
         title = "Spearman Correlation Matrix: Training Effectiveness Dimensions",
         mar = c(0, 0, 2, 0))
Interpretation: The Spearman correlation matrix reveals that all six training dimensions are positively correlated with one another, confirming that employees who rate one dimension of training highly tend to rate others highly as well. Facilitator effectiveness shows one of the strongest correlations with overall satisfaction, providing statistical evidence that delivery quality, not just content relevance, is closely tied to how employees perceive training. The correlation between productivity and recommendation likelihood is also strong, indicating that employees who believe training improved their output are more likely to advocate for future programmes. These associations point to facilitator quality and productivity-focused content design as the most promising areas for investment, although correlation alone does not establish that such investment would cause higher satisfaction.
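Spearman's coefficient is the Pearson correlation computed on ranks, with tied values given average ranks. The sketch below demonstrates this in base R on the first ten facilitator and satisfaction values shown in the `str()` output earlier. They are used for illustration only, so the result is not the full-sample coefficient.

```r
# First ten values of each variable as printed by str() (illustration only)
facilitator  <- c(4, 4, 3, 4, 4, 3, 4, 4, 3, 4)
satisfaction <- c(4, 4, 4, 4, 4, 3, 4, 3, 4, 4)

# Spearman directly, and as Pearson on average ranks
rho_spearman <- cor(facilitator, satisfaction, method = "spearman")
rho_by_ranks <- cor(rank(facilitator), rank(satisfaction))

# Significance test; exact = FALSE because tied ratings rule out an exact p-value
rho_test <- cor.test(facilitator, satisfaction,
                     method = "spearman", exact = FALSE)

cat("Spearman rho:    ", round(rho_spearman, 3), "\n")
cat("Pearson on ranks:", round(rho_by_ranks, 3), "\n")
cat("p-value:         ", round(rho_test$p.value, 3), "\n")
```

Applying `cor.test()` to the full columns would attach a p-value to each coefficient displayed in the matrix.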
# ---------------------------------------------------------------
# SCATTER PLOT — FACILITATOR vs SATISFACTION
# Why: The scatter plot with a regression line visually confirms
# the direction and magnitude of the relationship identified in
# the correlation matrix. The confidence interval (se = TRUE)
# shows the uncertainty around the trend.
# geom_jitter adds slight random noise to separate overlapping
# points — necessary because Likert responses are discrete (1-5).
# ---------------------------------------------------------------
ggplot(df_clean,
aes(x = facilitator,
y = satisfaction)) +
geom_jitter(color = "steelblue",
alpha = 0.6,
width = 0.15,
height = 0.15) +
geom_smooth(method = "lm",
color = "red",
se = TRUE) +
scale_x_continuous(breaks = 1:5) +
scale_y_continuous(breaks = 1:5) +
labs(title = "Facilitator Effectiveness vs. Overall Training Satisfaction",
subtitle = "Each point represents one respondent (jittered for clarity)",
x = "Facilitator Communication Rating (1–5)",
y = "Overall Satisfaction Rating (1–5)") +
theme_minimal()
Interpretation: The scatter plot shows a positive, roughly linear relationship between facilitator communication effectiveness and overall training satisfaction. Employees who rated their facilitators more highly tended to report greater overall satisfaction with the training programme. The regression line rises steeply with a narrow confidence interval, which is consistent with a strong relationship, although the fitted line treats the ordinal 1–5 ratings as if they were interval data and should be read as a visual guide rather than a formal test. The finding suggests that facilitator quality is one of the most controllable levers management has for improving training satisfaction across the organisation.
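The visual impression of a reliable relationship can be checked with a formal rank-based test. The sketch below uses base R's `cor.test()` with `method = "spearman"` on simulated ratings; the two vectors are stand-ins for the `facilitator` and `satisfaction` columns and their values are assumptions.

```r
# Illustrative sketch only: simulated ratings stand in for df_clean.
set.seed(1)
n <- 54
facilitator  <- sample(1:5, n, replace = TRUE)
satisfaction <- pmin(5, pmax(1, facilitator + sample(-1:1, n, replace = TRUE)))

# exact = FALSE: Likert data are full of ties, so an exact p-value
# cannot be computed and the asymptotic approximation is used instead.
res <- cor.test(facilitator, satisfaction,
                method = "spearman",
                exact  = FALSE)

res$estimate  # Spearman's rho
res$p.value   # p-value for the null hypothesis rho = 0
```

Reporting rho and its p-value alongside the plot would let readers judge the strength of the association without relying on the slope of the fitted line.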
# ---------------------------------------------------------------
# BOXPLOT — SESSIONS vs PRODUCTIVITY
# Why: Testing whether attending more training sessions is
# associated with higher perceived productivity helps management
# determine whether increasing training frequency is worthwhile.
# Similar medians across bands would suggest diminishing returns to
# volume; rising medians would support investment in repeat exposure.
# ---------------------------------------------------------------
ggplot(df_clean,
aes(x = num_sessions,
y = productivity,
fill = num_sessions)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Training Sessions Attended vs. Perceived Productivity Impact",
subtitle = paste("n =", nrow(df_clean), "respondents"),
x = "Number of Training Sessions Attended (Last 12 Months)",
y = "Productivity Rating (1–5)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Interpretation: The boxplot compares perceived productivity ratings across training session frequency bands. If employees who attended more sessions show higher median productivity ratings, the data support increasing training frequency as a productivity driver. If median scores plateau or vary inconsistently across bands, session volume alone is insufficient, and content quality and relevance matter more than how often employees attend. Because some bands may contain only a few respondents, differences between medians should be read cautiously. Management should interpret this finding alongside the facilitator effectiveness and content relevance scores when planning future training calendars and budgets.
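Whether the medians genuinely differ across bands can be examined with a Kruskal-Wallis test, which suits ordinal ratings compared across several groups. This is a minimal sketch on simulated data; the band labels ("1-2", "3-5", "6+") and all values are hypothetical, since the actual categories of `num_sessions` are not shown in this section.

```r
# Illustrative sketch only: hypothetical bands and simulated ratings.
set.seed(7)
toy <- data.frame(
  num_sessions = factor(sample(c("1-2", "3-5", "6+"), 54, replace = TRUE),
                        levels = c("1-2", "3-5", "6+")),
  productivity = sample(1:5, 54, replace = TRUE)
)

# Median rating and group size per band
aggregate(productivity ~ num_sessions, data = toy, FUN = median)
table(toy$num_sessions)

# Rank-based test of whether the distributions differ across bands
kw <- kruskal.test(productivity ~ num_sessions, data = toy)
kw$p.value
```

A large p-value would be consistent with the "plateau" reading above, while a small one would justify looking at which bands differ.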
3.4 Step 3d — Business Insight Visualisation
# ---------------------------------------------------------------
# MANAGEMENT SUMMARY CHART
# Why: This polished horizontal bar chart is designed specifically
# for a management audience. It consolidates all six training
# dimensions into a single ranked visual, enabling executives
# to immediately identify strengths and gaps without requiring
# any statistical knowledge. The colour gradient reinforces the
# ranking visually: orange marks lower scores, teal higher ones.
# ---------------------------------------------------------------
likert_vars %>%
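# Anonymous function: passing na.rm through across()'s `...` is
# deprecated since dplyr 1.1.0
summarise(across(everything(),
\(x) mean(x, na.rm = TRUE))) %>%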
pivot_longer(everything(),
names_to = "Dimension",
values_to = "Average_Score") %>%
mutate(Dimension = recode(Dimension,
relevance = "Relevance to Job Role",
performance = "Improved Job Performance",
productivity = "Increased Productivity",
facilitator = "Facilitator Communication",
satisfaction = "Overall Satisfaction",
recommend = "Would Recommend to Colleagues"
)) %>%
ggplot(aes(x = reorder(Dimension, Average_Score),
y = Average_Score,
fill = Average_Score)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = round(Average_Score, 2)),
hjust = -0.15,
size = 4) +
scale_fill_gradient(low = "#f4a261",
high = "#2a9d8f") +
scale_y_continuous(limits = c(0, 6)) +
coord_flip() +
labs(title = "Average Training Effectiveness Score by Dimension",
subtitle = paste("Primary survey data | n =",
nrow(df_clean),
"respondents | Scale: 1 (Strongly Disagree) — 5 (Strongly Agree)"),
x = NULL,
y = "Average Score (1–5)") +
theme_minimal(base_size = 12)
Interpretation: The management summary chart provides a ranked overview of all six training effectiveness dimensions. Relevance to job role scores highest, suggesting that the organisation is selecting appropriate training content for its workforce. Overall satisfaction and facilitator communication also score strongly, reflecting broad employee approval of how training is designed and delivered. Productivity impact, however, scores lowest, revealing a gap between training attendance and perceived on-the-job improvement. This is the organisation’s most critical training effectiveness challenge: employees find training relevant and well delivered but do not fully convert it into productivity gains. Closing this gap through post-training reinforcement, structured application exercises, and manager follow-up is the highest-priority recommendation from this analysis.
4 STEP 4 — AI-Assisted Analysis and Critical Reflection
4.1 AI Interaction Log
4.1.1 Interaction 1 — Code Generation
AI Tool Used: Claude (claude.ai)
Prompt Submitted: “I have a training effectiveness survey dataset in R exported from Google Forms with very long column names. Write fully annotated tidyverse R code to: (1) rename six Likert-scale columns to short aliases — relevance, performance, productivity, facilitator, satisfaction, and recommend — and (2) create a Spearman correlation matrix using corrplot with proper title margins.”
AI Response Summary: The AI generated a complete rename() pipeline using tidyverse syntax, correctly mapping each long column name to its short alias. It also produced a corrplot() call with method = "color", type = "upper", and addCoef.col = "black". The AI included a note explaining why Spearman was more appropriate than Pearson for ordinal Likert data.
What I Accepted / Modified / Rejected: I accepted the core rename() structure and the corrplot parameters. I added mar = c(0, 0, 2, 0) manually after discovering during a test render that the chart title was being clipped at the top — the AI had not included this margin setting. I also modified the tl.cex and number.cex values to improve readability at the document’s figure width.
4.1.2 Interaction 2 — Analytical Guidance
AI Tool Used: Claude (claude.ai)
Prompt Submitted: “I have Likert-scale survey data from 54 respondents measuring six training effectiveness dimensions. Should I use Pearson or Spearman correlation, and why? What visualisations are most appropriate for this type of ordinal data in an exploratory analysis?”
AI Response Summary: The AI recommended Spearman rank correlation over Pearson, explaining that Likert scales are ordinal — they represent ranked categories rather than equal intervals — and Pearson’s assumption of interval-level measurement is therefore not strictly met. For visualisations, it recommended boxplots (for distribution by group), histograms (for distribution of individual variables), scatter plots with jitter (for relationships between ordinal variables), and a horizontal bar chart for the management summary.
What I Accepted / Modified / Rejected: I accepted the recommendation to use Spearman correlation and updated my correlation code accordingly (method = "spearman"). I adopted the boxplot, histogram, and scatter plot suggestions. I rejected the AI’s suggestion to use a radar/spider chart for the management summary, as this chart type is not directly supported in ggplot2 without additional packages and is less intuitive for a non-statistical management audience. I used a horizontal bar chart instead.
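The difference between the two methods can be made concrete: Spearman's coefficient is simply Pearson's coefficient computed on the ranks of the data, which is why it does not depend on the spacing between Likert categories. A minimal sketch on simulated ratings (the vectors `x` and `y` are assumptions, not survey columns):

```r
# Illustrative sketch only: simulated 1-5 ratings.
set.seed(3)
x <- sample(1:5, 54, replace = TRUE)
y <- pmin(5, pmax(1, x + sample(-2:2, 54, replace = TRUE)))

pearson  <- cor(x, y, method = "pearson")   # uses the raw 1-5 values
spearman <- cor(x, y, method = "spearman")  # uses ranks only

# Spearman is Pearson applied to ranks (ties receive average ranks)
spearman_by_hand <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman)
all.equal(spearman, spearman_by_hand)
```

On well-behaved Likert data the two coefficients are often close, so the choice of Spearman here rests on the measurement-level argument rather than on a large numerical difference.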
4.1.3 Interaction 3 — Result Interpretation
AI Tool Used: Claude (claude.ai)
Prompt Submitted: “My Spearman correlation matrix for a training effectiveness survey shows that facilitator communication has the strongest correlation with overall satisfaction. Productivity impact has the lowest average score out of all six dimensions. Write two sentences interpreting these findings in plain language for a management report.”
AI Response Summary: The AI produced the following interpretation: “The strongest correlation between facilitator communication and overall satisfaction suggests that how training is delivered — the quality and clarity of the facilitator — is a more powerful driver of employee satisfaction than any other single factor, including content relevance or duration. The lowest score for productivity impact indicates a gap between training participation and tangible on-the-job performance improvement, pointing to a need for structured post-training reinforcement mechanisms.”
What I Accepted / Modified / Rejected: I accepted the core interpretation as accurate and well-phrased. I modified the language slightly to match the formal register of this report and extended the second sentence to include a specific recommendation (post-training follow-up mechanisms), which the AI had framed as a diagnosis only. I also added the correlation coefficient value and the specific dimension scores once my analysis was complete, which the AI could not provide as it had not seen the data directly.
4.2 Critical Reflection
Throughout this project, I used Claude (claude.ai) at three key stages of the analytical process: code generation, analytical guidance, and result interpretation. The AI performed most strongly in generating boilerplate R code for data cleaning, column renaming, and chart formatting — tasks that are time-intensive but methodologically straightforward. This significantly reduced the time I spent on syntax debugging and allowed me to concentrate on interpretation and insight development.
However, I identified meaningful limitations that required human judgment to correct. In its initial code generation response, the AI produced a corrplot() call without the mar = c(0, 0, 2, 0) margin parameter, causing the chart title to be clipped in the rendered output, a practical error I only detected during test rendering. More significantly, when I initially prompted the AI without specifying a correlation method, it defaulted to Pearson correlation without flagging that Pearson assumes interval-level data and may be methodologically inappropriate for ordinal Likert scales. I identified this gap by cross-referencing my course materials and overrode the default by specifying method = "spearman".
These experiences reinforced a critical principle: AI tools function best as accelerators for analysts who already understand the underlying methodology. The analyst must possess sufficient statistical knowledge to recognise when an AI suggestion is technically functional but analytically inappropriate. AI output should be treated consistently as a draft to be validated, not a final answer to be accepted at face value.
5 STEP 5 — Management Report
5.1 Executive Summary
This report presents findings from a primary data analytics project examining the effectiveness of workplace training programmes within the organisation. Survey data was collected from 54 employees across multiple departments using a structured questionnaire administered via Google Forms. Analysis was conducted in RStudio using descriptive statistics, Spearman correlation analysis, and ggplot2 visualisations. Three principal findings emerge from the data: training content is broadly perceived as relevant to employees’ job roles; facilitator communication quality is the strongest driver of overall training satisfaction; and productivity impact is the lowest-rated dimension, revealing a gap between training attendance and on-the-job application. Three evidence-based recommendations are presented to guide the organisation’s training investment and improvement strategy.
5.2 Research Question and Methodology
Research Question: “To what extent do workplace training programmes influence employee productivity, job performance, and satisfaction within the organisation?”
Primary data was collected via a structured online survey administered to 54 employees using convenience sampling across departments. The instrument captured 15 variables including demographic information, training attendance patterns, and six Likert-scale effectiveness ratings (1 = Strongly Disagree, 5 = Strongly Agree). Data was imported into RStudio as an Excel file and cleaned systematically: missing values were imputed using column medians; duplicate records were removed; variable types were validated and corrected; and outliers were identified using z-score analysis and removed where values exceeded three standard deviations from the mean. Analysis was conducted using the tidyverse, psych, skimr, and corrplot packages. Spearman rank correlation was selected over Pearson given the ordinal nature of Likert-scale data.
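The cleaning steps described above can be sketched as a short dplyr pipeline. This is an illustration under stated assumptions, not the project's actual cleaning code: the `toy` data frame is made up, and `age` is a hypothetical numeric column chosen to demonstrate the z-score rule, since this section does not say which variables were screened for outliers.

```r
library(dplyr)

# Illustrative sketch only: made-up data, not the survey responses.
toy <- tibble(
  id           = c(1:21, 21),                      # respondent 21 submitted twice
  age          = c(rep(c(28, 30, 32, 34, 36), 4), 95, 95),
  satisfaction = c(rep(c(4, 5, 3, NA, 4), 4), 2, 2)
)

toy_clean <- toy %>%
  # 1. Impute missing ratings with the column median
  mutate(satisfaction = if_else(is.na(satisfaction),
                                median(satisfaction, na.rm = TRUE),
                                satisfaction)) %>%
  # 2. Drop exact duplicate rows
  distinct() %>%
  # 3. Remove rows more than three standard deviations from the mean
  mutate(age_z = (age - mean(age)) / sd(age)) %>%
  filter(abs(age_z) <= 3) %>%
  select(-age_z)

nrow(toy)        # rows before cleaning
nrow(toy_clean)  # rows after cleaning
```

Note that the three-standard-deviation rule can only flag a value when the sample is reasonably large, and that on a bounded 1–5 scale it will rarely remove anything.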
5.3 Key Findings
Finding 1: Facilitator Quality is the Strongest Driver of Training Satisfaction
The Spearman correlation matrix (Figure 3) reveals that facilitator communication effectiveness has the strongest positive association with overall training satisfaction among all six dimensions measured. This suggests that how training is delivered is more closely tied to employee satisfaction than any other dimension measured, including content relevance. Employees who rated their facilitators highly were consistently more satisfied overall (Figure 4). This finding has a direct implication for management: improving facilitator selection, training, and ongoing assessment is likely to produce the greatest return on investment for training satisfaction outcomes.
Finding 2: Productivity Impact is the Lowest-Rated Training Dimension
Despite broadly positive ratings across all dimensions (Figure 6), the item measuring whether training increased productivity consistently received the lowest average score. The histogram (Figure 1) confirms a distribution concentrated at higher values overall, but the productivity dimension lags behind. This reveals a structural disconnect between training attendance and tangible on-the-job performance improvement — the ultimate business objective of any training investment. Employees find training relevant and well-delivered, but do not fully convert participation into measurable productivity gains.
Finding 3: Training Satisfaction Varies Substantially Across Departments
The department-level boxplot (Figure 2) reveals substantial variation in training satisfaction scores across the organisation. Some departments show consistently high median scores with narrow spreads, while others display wide interquartile ranges and lower medians. This variation indicates that training quality is not uniformly experienced across the organisation — likely reflecting differences in the relevance of generic training content to specific job functions, or inconsistencies in the facilitators assigned to different departmental cohorts.
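The medians and interquartile ranges read off the boxplot can also be tabulated, which makes it easier to name the lowest-scoring departments. A minimal sketch on simulated data follows; the department names and all values are hypothetical.

```r
library(dplyr)

# Illustrative sketch only: hypothetical departments, simulated ratings.
set.seed(11)
toy <- tibble(
  department   = sample(c("Finance", "Operations", "Sales"), 54, replace = TRUE),
  satisfaction = sample(1:5, 54, replace = TRUE)
)

dept_summary <- toy %>%
  group_by(department) %>%
  summarise(n          = n(),
            median_sat = median(satisfaction),
            iqr_sat    = IQR(satisfaction),
            .groups    = "drop") %>%
  arrange(median_sat)   # lowest-satisfaction departments first

dept_summary
```

Including the group size `n` matters here: with 54 respondents spread across several departments, some medians will rest on only a handful of responses.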
5.4 Business Recommendations
Recommendation 1: Invest in Facilitator Development and Quality Assurance
Evidence: Facilitator communication effectiveness shows the strongest Spearman correlation with overall training satisfaction across all six dimensions (Figure 3). The scatter plot (Figure 4) confirms this as a strong, consistent linear relationship.
Action: Introduce a structured facilitator assessment framework with quarterly participant feedback reviews. Establish minimum facilitator competency standards and provide targeted coaching to any facilitator rated below 3.5 on the communication scale. Facilitator performance scores should be formally incorporated into the organisation’s training programme evaluation process.
Recommendation 2: Implement Structured Post-Training Application Mechanisms
Evidence: Productivity impact is the lowest-rated dimension despite high scores on relevance and facilitator effectiveness (Figure 6). This indicates the problem is not content design but application — employees are not translating training into on-the-job behaviour change.
Action: Introduce 30/60/90-day post-training follow-up reviews where line managers assess skill application on the job. Complement classroom training with structured on-the-job assignments, practice tasks, or peer learning pairs that reinforce learning transfer. Measure and report productivity metrics before and after training cycles to build an evidence base for future investment decisions.
Recommendation 3: Address the Most Requested Improvement Area with Department-Level Input
Evidence: The improvement area frequency table (Step 3) identifies the specific training area most frequently cited by employees as requiring improvement. The department-level boxplot (Figure 2) confirms that satisfaction varies across departments, suggesting that generic training programmes may not meet all departmental needs equally.
Action: Prioritise a review and redesign of the most-cited improvement area within the next training planning cycle. Involve employees from the three lowest-satisfaction departments in the content co-design process to ensure the revised programme directly addresses their stated needs, increases relevance, and improves departmental buy-in.
5.5 AI Augmentation Summary
AI tools (Claude, claude.ai) were used at three stages of this analytical project: code generation, analytical guidance, and result interpretation. The tools delivered the greatest value in accelerating routine coding tasks — particularly data cleaning pipelines, column renaming, and chart formatting — and in translating statistical outputs into accessible management language, reducing overall project time.
However, human analytical judgment remained essential throughout. The AI produced a rendering error (missing corrplot margin settings) and did not proactively recommend Spearman over Pearson correlation for ordinal data — a methodologically significant omission that required correction after cross-referencing course materials. These limitations confirm that AI tools are most effective as productivity accelerators for analysts who already possess the methodological foundation to validate AI outputs. In this project, AI augmented human analytical capability without replacing the critical judgment that underpins sound statistical practice.
6 Appendix A — AI Interaction Evidence
The following screenshots document the three AI interactions described in Step 4. Each screenshot shows the prompt submitted, the AI’s response, and where applicable, the modification or rejection decision made by the analyst.
Interaction 1 — Code Generation (Column Renaming and Corrplot)
(Insert screenshot of Claude conversation here — show your prompt and the AI response)
Interaction 2 — Analytical Guidance (Spearman vs Pearson and Chart Selection)
(Insert screenshot of Claude conversation here — show your prompt and the AI response)
Interaction 3 — Result Interpretation (Correlation and Productivity Findings)
(Insert screenshot of Claude conversation here — show your prompt and the AI response)
End of Section B Submission
Data Analytics I | MMBA 10 | Lagos Business School, Pan-Atlantic University
Submitted by: Bisola Amuda