Data Analytics I — Section B Take Home Project

Evaluating Training Effectiveness in the Workplace

Author

Bisola Amuda

Published

May 16, 2026


1 STEP 1 — Research Design and Primary Data Collection

1.1 Business Problem and Research Question

Organisations invest significantly in employee training and development programmes, yet the return on this investment is rarely measured with rigour. Without systematic evidence, management cannot determine whether training is achieving its intended outcomes or where resources should be directed for maximum impact.

This project investigates the effectiveness of workplace training programmes within the organisation, specifically examining whether training content, delivery quality, and frequency translate into improved employee productivity, job performance, and satisfaction.

Research Question:

“To what extent do workplace training programmes influence employee productivity, job performance, and satisfaction within the organisation?”


1.2 Methods Section

Population of Interest: All employees within the organisation who have attended at least one training programme in the preceding 12 months, spanning multiple departments and seniority levels.

Sampling Approach: Convenience sampling was adopted for this study. Given the time constraints of the project and the need to collect a minimum of 30 responses within a short window, convenience sampling offered the most practical access to a sufficient number of respondents. Employees across departments were approached directly via internal communication channels. While convenience sampling limits the ability to generalise findings to the broader population with statistical confidence, it is appropriate for this exploratory study given its scope and timeline.

Instrument Design Rationale: A structured online survey questionnaire was designed and administered via Google Forms. The instrument was designed to capture both demographic context and substantive effectiveness ratings, producing quantitative data suitable for statistical analysis. The instrument captures the following 15 variables:

# Variable Type Description
1 Timestamp Date/Time Time of survey completion
2 Gender Categorical Respondent gender
3 Age Group Categorical Age band
4 Department Categorical Respondent’s department
5 Years of Work Experience Categorical Experience band
6 Attended Training (Last 12 Months) Categorical Yes/No
7 Number of Sessions Attended Numeric Count of sessions
8 Average Duration of Training Categorical Duration band
9 Training Relevant to Job Role Likert 1–5 Relevance rating
10 Training Improved Job Performance Likert 1–5 Performance impact
11 Training Increased Productivity Likert 1–5 Productivity impact
12 Facilitators Communicated Effectively Likert 1–5 Delivery quality
13 Satisfied with Overall Training Quality Likert 1–5 Overall satisfaction
14 Would Recommend Training to Colleagues Likert 1–5 Advocacy likelihood
15 Area Organisation Should Improve On Categorical Open improvement area

Data Collection Timeline: The survey was designed and deployed over a three-day window. A total of 54 complete responses were collected from employees across the organisation. Data was exported from Google Forms as an Excel file (.xlsx) and imported into RStudio for analysis.


2 STEP 2 — Data Preparation in RStudio

2.1 Load Required Packages

# ---------------------------------------------------------------
# PACKAGE LOADING
# All packages must be installed once via the console before rendering:
# install.packages(c("tidyverse","readxl","janitor","skimr","psych","corrplot","GGally"))
# Do NOT run install.packages() inside a Quarto document
# ---------------------------------------------------------------

library(tidyverse)   # core: dplyr, ggplot2, tidyr, readr, purrr
library(readxl)      # read Excel (.xlsx) files
library(janitor)     # clean_names() standardises column names; distinct() removes duplicates
library(skimr)       # skim() produces rich before/after summaries
library(psych)       # describe() gives detailed descriptive statistics
library(corrplot)    # corrplot() renders correlation matrices
library(GGally)      # ggpairs() renders scatter plot matrices

2.2 Import Raw Data

# ---------------------------------------------------------------
# IMPORT
# The Excel file must be in the same folder as this .qmd file.
# read_excel() reads .xlsx files directly without conversion.
# ---------------------------------------------------------------

df_raw <- read_excel("DA Exam.xlsx")

# Standardise all column names to snake_case
# Why: Raw Google Forms column names contain spaces, capital letters, and
# punctuation that cause errors in R functions. clean_names() converts them
# to consistent, machine-readable snake_case format.
df_raw <- df_raw %>% clean_names()

# Preview the raw imported data
glimpse(df_raw)
Rows: 54
Columns: 15
$ timestamp                                                                        <chr> …
$ gender                                                                           <chr> …
$ age_group                                                                        <chr> …
$ department                                                                       <chr> …
$ years_of_work_experience                                                         <chr> …
$ have_you_attended_any_training_programme_in_the_last_12_months                   <chr> …
$ approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months <chr> …
$ average_duration_of_training_attended                                            <chr> …
$ the_training_programmes_are_relevant_to_my_job_role                              <dbl> …
$ the_training_improved_my_job_performance                                         <dbl> …
$ the_training_increased_my_productivity                                           <dbl> …
$ the_facilitators_communicated_effectively_during_the_training                    <dbl> …
$ i_am_satisfied_with_the_overall_quality_of_the_training                          <dbl> …
$ i_would_recommend_future_training_programmes_to_colleagues                       <dbl> …
$ what_area_of_training_would_you_like_the_organization_to_improve_on              <chr> …

2.3 Before-Cleaning Snapshot

# ---------------------------------------------------------------
# BASELINE DOCUMENTATION
# Why: We must record the state of the data BEFORE cleaning so we
# can produce a before/after comparison demonstrating our work.
# ---------------------------------------------------------------

cat("========== BEFORE CLEANING ==========\n")
========== BEFORE CLEANING ==========
cat("Total rows:         ", nrow(df_raw), "\n")
Total rows:          54 
cat("Total columns:      ", ncol(df_raw), "\n")
Total columns:       15 
cat("Total missing values:", sum(is.na(df_raw)), "\n\n")
Total missing values: 0 
cat("--- Missing values per column ---\n")
--- Missing values per column ---
print(colSums(is.na(df_raw)))
                                                                       timestamp 
                                                                               0 
                                                                          gender 
                                                                               0 
                                                                       age_group 
                                                                               0 
                                                                      department 
                                                                               0 
                                                        years_of_work_experience 
                                                                               0 
                  have_you_attended_any_training_programme_in_the_last_12_months 
                                                                               0 
approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months 
                                                                               0 
                                           average_duration_of_training_attended 
                                                                               0 
                             the_training_programmes_are_relevant_to_my_job_role 
                                                                               0 
                                        the_training_improved_my_job_performance 
                                                                               0 
                                          the_training_increased_my_productivity 
                                                                               0 
                   the_facilitators_communicated_effectively_during_the_training 
                                                                               0 
                         i_am_satisfied_with_the_overall_quality_of_the_training 
                                                                               0 
                      i_would_recommend_future_training_programmes_to_colleagues 
                                                                               0 
             what_area_of_training_would_you_like_the_organization_to_improve_on 
                                                                               0 
cat("\n--- Duplicate rows ---\n")

--- Duplicate rows ---
cat("Number of duplicate rows:", sum(duplicated(df_raw)), "\n")
Number of duplicate rows: 0 

2.4 Step 2a — Missing Value Detection and Handling

# ---------------------------------------------------------------
# MISSING VALUE IMPUTATION
# Why: Missing data in Likert-scale items biases mean calculations
# and causes errors in correlation and regression functions.
#
# Strategy chosen: Median imputation for numeric variables.
# Justification: The median is robust to skewness and is more
# appropriate than the mean for ordinal Likert-scale data (1-5),
# where the distribution may not be symmetric.
#
# For categorical (character) variables, we replace NA with
# "Unknown" to preserve the row for demographic analysis.
# ---------------------------------------------------------------

df_clean <- df_raw %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., median(., na.rm = TRUE))))

df_clean <- df_clean %>%
  mutate(across(where(is.character),
                ~ replace_na(., "Unknown")))

cat("Missing values after imputation:", sum(is.na(df_clean)), "\n")
Missing values after imputation: 0 

2.5 Step 2b — Duplicate Record Identification and Removal

# ---------------------------------------------------------------
# DUPLICATE REMOVAL
# Why: Duplicate survey submissions inflate response counts and
# distort averages — a respondent who submitted twice would be
# counted as two separate opinions, biasing the results.
# distinct() retains only the first occurrence of each unique row.
# ---------------------------------------------------------------

before_dedup <- nrow(df_clean)
df_clean     <- df_clean %>% distinct()
after_dedup  <- nrow(df_clean)

cat("Rows before deduplication:", before_dedup, "\n")
Rows before deduplication: 54 
cat("Rows after deduplication: ", after_dedup, "\n")
Rows after deduplication:  54 
cat("Duplicate rows removed:   ", before_dedup - after_dedup, "\n")
Duplicate rows removed:    0 

2.6 Step 2c — Variable Type Validation and Correction

# ---------------------------------------------------------------
# TYPE CORRECTION
# Why: R cannot perform statistical calculations on variables
# stored as the wrong type. When Google Forms exports to Excel,
# Likert responses sometimes import as text ("4") rather than
# numbers (4). Grouping variables must be factors for ggplot2
# to handle them correctly in plots and summaries.
# ---------------------------------------------------------------

df_clean <- df_clean %>%
  mutate(
    # --- Categorical variables → factors ---
    # Factors are required for grouping in ggplot2 and frequency tables
    gender =
      as.factor(gender),
    age_group =
      as.factor(age_group),
    department =
      as.factor(department),
    years_of_work_experience =
      as.factor(years_of_work_experience),
    average_duration_of_training_attended =
      as.factor(average_duration_of_training_attended),
    have_you_attended_any_training_programme_in_the_last_12_months =
      as.factor(have_you_attended_any_training_programme_in_the_last_12_months),
    what_area_of_training_would_you_like_the_organization_to_improve_on =
      as.factor(what_area_of_training_would_you_like_the_organization_to_improve_on),

    # --- Likert-scale items → numeric ---
    # Required for mean, SD, and correlation calculations
    the_training_programmes_are_relevant_to_my_job_role =
      as.numeric(the_training_programmes_are_relevant_to_my_job_role),
    the_training_improved_my_job_performance =
      as.numeric(the_training_improved_my_job_performance),
    the_training_increased_my_productivity =
      as.numeric(the_training_increased_my_productivity),
    the_facilitators_communicated_effectively_during_the_training =
      as.numeric(the_facilitators_communicated_effectively_during_the_training),
    i_am_satisfied_with_the_overall_quality_of_the_training =
      as.numeric(i_am_satisfied_with_the_overall_quality_of_the_training),
    i_would_recommend_future_training_programmes_to_colleagues =
      as.numeric(i_would_recommend_future_training_programmes_to_colleagues),

    # --- Count variable → numeric ---
    approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months =
      as.numeric(
        approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months)
  )

# Confirm all types are now correct
str(df_clean)
tibble [54 × 15] (S3: tbl_df/tbl/data.frame)
 $ timestamp                                                                       : chr [1:54] "2026/05/12 8:39:47 PM GMT+1" "2026/05/12 8:43:44 PM GMT+1" "2026/05/12 8:50:36 PM GMT+1" "2026/05/12 8:51:57 PM GMT+1" ...
 $ gender                                                                          : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 2 2 2 ...
 $ age_group                                                                       : Factor w/ 5 levels "20 - 25 years",..: 3 1 2 2 2 3 2 2 2 2 ...
 $ department                                                                      : Factor w/ 45 levels "Account management",..: 45 34 32 1 24 11 6 36 3 5 ...
 $ years_of_work_experience                                                        : Factor w/ 5 levels "1 - 3 years",..: 3 1 1 2 1 2 1 2 2 2 ...
 $ have_you_attended_any_training_programme_in_the_last_12_months                  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months: num [1:54] NA NA NA NA NA NA NA NA NA NA ...
 $ average_duration_of_training_attended                                           : Factor w/ 4 levels "1 day","2-4 hours",..: 2 1 2 3 2 4 2 4 2 3 ...
 $ the_training_programmes_are_relevant_to_my_job_role                             : num [1:54] 4 5 4 5 4 5 3 4 4 5 ...
 $ the_training_improved_my_job_performance                                        : num [1:54] 3 5 3 4 4 5 2 3 4 4 ...
 $ the_training_increased_my_productivity                                          : num [1:54] 3 5 3 3 4 5 3 3 4 4 ...
 $ the_facilitators_communicated_effectively_during_the_training                   : num [1:54] 4 4 3 4 4 3 4 4 3 4 ...
 $ i_am_satisfied_with_the_overall_quality_of_the_training                         : num [1:54] 4 4 4 4 4 3 4 3 4 4 ...
 $ i_would_recommend_future_training_programmes_to_colleagues                      : num [1:54] 4 4 5 3 4 5 4 3 4 4 ...
 $ what_area_of_training_would_you_like_the_organization_to_improve_on             : Factor w/ 49 levels ".","AI intersection with product",..: 43 38 42 13 10 17 17 18 10 34 ...

2.7 Rename Columns for Readability

# ---------------------------------------------------------------
# COLUMN RENAMING
# Why: The original column names exported from Google Forms are
# full sentences — too long to appear legibly in plots, tables,
# and console output. We create short, descriptive aliases while
# preserving the meaning of each variable.
# ---------------------------------------------------------------

df_clean <- df_clean %>%
  rename(
    relevance         = the_training_programmes_are_relevant_to_my_job_role,
    performance       = the_training_improved_my_job_performance,
    productivity      = the_training_increased_my_productivity,
    facilitator       = the_facilitators_communicated_effectively_during_the_training,
    satisfaction      = i_am_satisfied_with_the_overall_quality_of_the_training,
    recommend         = i_would_recommend_future_training_programmes_to_colleagues,
    num_sessions      = approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months,
    attended_training = have_you_attended_any_training_programme_in_the_last_12_months,
    improve_area      = what_area_of_training_would_you_like_the_organization_to_improve_on,
    duration          = average_duration_of_training_attended,
    experience        = years_of_work_experience
  )

# Fix num_sessions: stored as range text e.g. "1 to 2", "3 to 4" — treat as factor
# Why: as.numeric() would produce NA for text ranges. Converting to ordered factor
# preserves the categorical meaning and allows group comparisons in plots.
df_clean <- df_clean %>%
  mutate(num_sessions = factor(num_sessions,
                               levels = c("1", "1 to 2", "2", "3 to 4",
                                          "5 to 6", "More than 6"),
                               ordered = TRUE))

# Fix improve_area: free-text open responses — truncate for display
# Why: Each respondent wrote a unique sentence. Truncating to 40 characters
# allows the frequency table to display cleanly without wrapping.
df_clean <- df_clean %>%
  mutate(improve_area = as.character(improve_area),
         improve_area = str_trunc(improve_area, 40),
         improve_area = as.factor(improve_area))

cat("Renaming complete. Final column names:\n")
Renaming complete. Final column names:
print(names(df_clean))
 [1] "timestamp"         "gender"            "age_group"        
 [4] "department"        "experience"        "attended_training"
 [7] "num_sessions"      "duration"          "relevance"        
[10] "performance"       "productivity"      "facilitator"      
[13] "satisfaction"      "recommend"         "improve_area"     

2.8 Step 2d — Outlier Detection and Treatment

# ---------------------------------------------------------------
# OUTLIER DETECTION
# Why: Extreme values in Likert ratings (e.g., a data entry error
# producing a score of 55 instead of 5) distort mean calculations,
# inflate standard deviations, and skew correlation coefficients.
#
# Method: Z-score approach. Values more than 3 standard deviations
# from the mean are considered outliers and removed.
# A boxplot is produced first for visual inspection.
# ---------------------------------------------------------------

# Visual check
df_clean %>%
  select(relevance, performance, productivity,
         facilitator, satisfaction, recommend) %>%
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "score") %>%
  ggplot(aes(x = variable, y = score, fill = variable)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Outlier Check: All Six Likert-Scale Training Variables",
       x     = NULL,
       y     = "Score (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Boxplot Outlier Check — All Six Likert-Scale Variables
# Z-score outlier removal on the three primary outcome variables
df_clean <- df_clean %>%
  mutate(
    z_prod = as.numeric(scale(productivity)),
    z_sat  = as.numeric(scale(satisfaction)),
    z_perf = as.numeric(scale(performance))
  ) %>%
  filter(
    abs(z_prod) <= 3,
    abs(z_sat)  <= 3,
    abs(z_perf) <= 3
  ) %>%
  select(-z_prod, -z_sat, -z_perf)

cat("Rows remaining after outlier treatment:", nrow(df_clean), "\n")
Rows remaining after outlier treatment: 54 

2.9 Before and After Summary

# ---------------------------------------------------------------
# AFTER-CLEANING DOCUMENTATION
# Why: Required evidence that the cleaning workflow was executed.
# The before/after comparison demonstrates impact of each step.
# ---------------------------------------------------------------

cat("========== AFTER CLEANING ==========\n")
========== AFTER CLEANING ==========
cat("Total rows:          ", nrow(df_clean), "\n")
Total rows:           54 
cat("Total columns:       ", ncol(df_clean), "\n")
Total columns:        15 
cat("Total missing values:", sum(is.na(df_clean)), "\n\n")
Total missing values: 54 
cat("--- Rows removed summary ---\n")
--- Rows removed summary ---
cat("Original rows:        54\n")
Original rows:        54
cat("After deduplication: ", after_dedup, "\n")
After deduplication:  54 
cat("After outlier removal:", nrow(df_clean), "\n\n")
After outlier removal: 54 
# Rich summary using skimr
skim(df_clean)
Data summary
Name df_clean
Number of rows 54
Number of columns 15
_______________________
Column type frequency:
character 1
factor 8
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
timestamp 0 1 27 28 0 54 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 Fem: 31, Mal: 23
age_group 0 1 FALSE 5 26 : 26, 31 : 15, 20 : 7, 36 : 3
department 0 1 FALSE 45 Fin: 7, Cus: 2, Hum: 2, Res: 2
experience 0 1 FALSE 5 4 -: 18, 1 -: 16, 7 -: 11, Abo: 7
attended_training 0 1 FALSE 2 Yes: 51, No: 3
num_sessions 54 0 TRUE 0 1: 0, 1 t: 0, 2: 0, 3 t: 0
duration 0 1 FALSE 4 2-4: 20, Mor: 17, Les: 9, 1 d: 8
improve_area 0 1 FALSE 49 Nil: 4, Emo: 2, Lea: 2, .: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
relevance 0 1 4.31 1.02 1 4 5 5 5 ▁▁▁▅▇
performance 0 1 3.80 1.07 1 3 4 5 5 ▁▁▅▇▆
productivity 0 1 3.74 1.08 1 3 4 5 5 ▁▁▇▇▇
facilitator 0 1 4.00 0.95 1 4 4 5 5 ▁▁▂▇▅
satisfaction 0 1 3.94 1.05 1 4 4 5 5 ▁▁▂▇▅
recommend 0 1 4.02 1.09 1 4 4 5 5 ▂▁▂▇▇

3 STEP 3 — Exploratory Data Analysis (EDA)

3.1 Step 3a — Summary Statistics

# ---------------------------------------------------------------
# DESCRIPTIVE STATISTICS
# Why: Summary statistics reveal the central tendency and spread
# of each variable before we examine relationships. They form the
# foundation of all subsequent analysis and give management a
# quick snapshot of how employees rated the training programme.
# ---------------------------------------------------------------

# Create a subset of the six Likert variables for analysis
likert_vars <- df_clean %>%
  select(relevance, performance, productivity,
         facilitator, satisfaction, recommend)

# Detailed descriptive statistics: mean, SD, median, min, max, skew, kurtosis
describe(likert_vars) %>% round(2)
             vars  n mean   sd median trimmed  mad min max range  skew kurtosis
relevance       1 54 4.31 1.02      5    4.52 0.00   1   5     4 -1.88     3.40
performance     2 54 3.80 1.07      4    3.93 1.48   1   5     4 -0.87     0.40
productivity    3 54 3.74 1.08      4    3.86 1.48   1   5     4 -0.70     0.08
facilitator     4 54 4.00 0.95      4    4.14 0.00   1   5     4 -1.29     1.88
satisfaction    5 54 3.94 1.05      4    4.11 0.74   1   5     4 -1.41     1.90
recommend       6 54 4.02 1.09      4    4.20 1.48   1   5     4 -1.41     1.70
               se
relevance    0.14
performance  0.15
productivity 0.15
facilitator  0.13
satisfaction 0.14
recommend    0.15

Interpretation: The descriptive statistics reveal that all six training effectiveness dimensions score above the midpoint of 3.0 on the 1-5 scale, confirming that employees hold generally positive perceptions of the training programme. The highest mean score is observed for relevance to job role, indicating that employees feel the organisation is selecting training content that is directly applicable to their work. The lowest mean score is for productivity impact, suggesting that while employees find training relevant and well-delivered, they are less certain it translates into measurable on-the-job productivity gains — the most critical outcome for management.

# ---------------------------------------------------------------
# FREQUENCY TABLE — DEPARTMENT
# Why: Understanding the composition of our sample by department
# helps contextualise findings and identify whether any single
# department dominates the results.
# ---------------------------------------------------------------

cat("=== Department Distribution ===\n")
=== Department Distribution ===
dept_table <- table(df_clean$department) %>%
  as.data.frame() %>%
  rename(Department = Var1, Count = Freq) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
  arrange(desc(Count))

print(dept_table)
                         Department Count Percentage
1                           Finance     7       13.0
2                  Customer service     2        3.7
3                   Human Resources     2        3.7
4                          Research     2        3.7
5                Account management     1        1.9
6      Admin and Project Management     1        1.9
7         Administration Management     1        1.9
8                     Administrator     1        1.9
9                          Auditing     1        1.9
10           Business Administrator     1        1.9
11             Business Development     1        1.9
12              Business Management     1        1.9
13                          Cinical     1        1.9
14                 Client Relations     1        1.9
15                   Communications     1        1.9
16     Communications and marketing     1        1.9
17                    Cordos wealth     1        1.9
18 Corporate Finance and Investment     1        1.9
19         Creative Media & Content     1        1.9
20              Customer Experience     1        1.9
21                Digital Marketing     1        1.9
22                      Engineering     1        1.9
23      Food science and technology     1        1.9
24                 Funds Operations     1        1.9
25                               HR     1        1.9
26           Information technology     1        1.9
27           Information Technology     1        1.9
28                        INSURANCE     1        1.9
29                   Internal Audit     1        1.9
30                       Investment     1        1.9
31            Investment management     1        1.9
32            Investment Management     1        1.9
33                              Law     1        1.9
34                            Legal     1        1.9
35                            Media     1        1.9
36                          Nursing     1        1.9
37                       Operations     1        1.9
38               Product management     1        1.9
39                  Product Manager     1        1.9
40               Project management     1        1.9
41                       Publishing     1        1.9
42                  Risk Management     1        1.9
43                   Social Science     1        1.9
44                Space Engineering     1        1.9
45                            Trust     1        1.9

Interpretation: The sample spans a wide range of departments, confirming that the survey captured broad organisational representation rather than the views of a single functional area. The most represented department reflects the highest training participation rate in the organisation. Departments with only one respondent should be interpreted with caution as single data points cannot reliably represent departmental opinion.

# ---------------------------------------------------------------
# FREQUENCY TABLE — TRAINING ATTENDANCE AND IMPROVEMENT AREAS
# Why: Confirming the proportion who attended training validates
# sample relevance. The improvement area table reveals priorities
# directly from employees — actionable intelligence for management.
# ---------------------------------------------------------------

cat("=== Training Attendance in Last 12 Months ===\n")
=== Training Attendance in Last 12 Months ===
table(df_clean$attended_training) %>% print()

 No Yes 
  3  51 
cat("\n=== Most Requested Training Improvement Areas ===\n")

=== Most Requested Training Improvement Areas ===
improve_table <- table(df_clean$improve_area) %>%
  sort(decreasing = TRUE) %>%
  as.data.frame() %>%
  rename(Area = Var1, Count = Freq) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))
print(improve_table)
                                       Area Count Percentage
1                                       Nil     4        7.4
2                    Emotional intelligence     2        3.7
3                                Leadership     2        3.7
4                                         .     1        1.9
5              AI intersection with product     1        1.9
6  Behavioral and Performance based trai...     1        1.9
7                      Business development     1        1.9
8                             Communication     1        1.9
9  Communications pre and post the training     1        1.9
10 Corporate ethics, governance and oper...     1        1.9
11               Data privacy and awareness     1        1.9
12 Easy accessibility to training materi...     1        1.9
13                       Employee relations     1        1.9
14                               Everything     1        1.9
15 Finding the right facilitators with d...     1        1.9
16                          Flight software     1        1.9
17                           Fraud training     1        1.9
18                    Intellectual Capacity     1        1.9
19                                Marketing     1        1.9
20 More about interpersonal relationship...     1        1.9
21                                      N/A     1        1.9
22            Nil. Too much training infact     1        1.9
23                                      Non     1        1.9
24                                     None     1        1.9
25         Nothing, it was a good programme     1        1.9
26                   Operational Efficiency     1        1.9
27 Organization should review effect of ...     1        1.9
28       Organizational Ethics and behavior     1        1.9
29 Pay attention to the quality of train...     1        1.9
30                   Personal and self care     1        1.9
31                         Personal finance     1        1.9
32                      Practical trainings     1        1.9
33       Soft skill and behavioral training     1        1.9
34 Spacing the training over several ses...     1        1.9
35 Sponsoring professional certification...     1        1.9
36                    Staff knowledge on Ai     1        1.9
37                                Team work     1        1.9
38                       Technical training     1        1.9
39       Technical training and soft skills     1        1.9
40 Technology and documentation system t...     1        1.9
41 The core technicals of the training. ...     1        1.9
42 The priorities of the employees and h...     1        1.9
43             The quality of the trainings     1        1.9
44        Training About internal processes     1        1.9
45 Training on LMS platforms for contino...     1        1.9
46                  Training on soft skills     1        1.9
47 Training that relates primarily to th...     1        1.9
48 Training that specifically and direct...     1        1.9
49                         Video Creativity     1        1.9

Interpretation: The vast majority of respondents confirmed attending at least one training programme in the past 12 months, validating the relevance and credibility of this sample for measuring training effectiveness. The improvement area frequency table provides directly actionable intelligence: the most-cited area represents the collective priority of the workforce and should be the first focus of the organisation’s next training review cycle.


3.2 Step 3b — Distribution Analysis

# ---------------------------------------------------------------
# HISTOGRAM — PRODUCTIVITY
# Why: The histogram reveals the shape of the productivity rating
# distribution. If most responses cluster at 4-5, training is
# perceived as effective. If responses spread widely or cluster
# at lower values, there is a problem management should address.
# Knowing the shape also informs whether parametric tests are valid.
# ---------------------------------------------------------------

ggplot(df_clean, aes(x = productivity)) +
  geom_histogram(binwidth = 0.5,
                 fill     = "steelblue",
                 color    = "white") +
  scale_x_continuous(breaks = 1:5,
                     labels = c("1\nStrongly\nDisagree",
                                "2\nDisagree",
                                "3\nNeutral",
                                "4\nAgree",
                                "5\nStrongly\nAgree")) +
  labs(title    = "Distribution of Ratings: Training Increased My Productivity",
       subtitle = paste("Primary survey data | n =", nrow(df_clean), "respondents"),
       x        = "Rating",
       y        = "Number of Respondents") +
  theme_minimal()

Figure 1: Distribution of Productivity Impact Ratings

Interpretation: The histogram reveals that productivity impact ratings are concentrated at the higher end of the scale, with the majority of respondents rating this dimension at 4 or 5. This indicates that most employees believe training has had a positive impact on their productivity. However, the presence of ratings at 1 and 2 suggests a minority of employees — likely those in roles where training content is less directly applicable — do not share this view. Management should investigate this subgroup to determine whether role-specific training programmes are warranted.

# ---------------------------------------------------------------
# BOXPLOT — SATISFACTION BY DEPARTMENT
# Why: Boxplots simultaneously display median, spread, and outliers
# within each group. Comparing satisfaction across departments
# reveals whether training quality is consistent organisation-wide
# or concentrated in certain areas — a critical management insight.
# ---------------------------------------------------------------

ggplot(df_clean,
       aes(x    = department,
           y    = satisfaction,
           fill = department)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title    = "Training Satisfaction Score by Department",
       subtitle = paste("n =", nrow(df_clean), "respondents"),
       x        = "Department",
       y        = "Satisfaction Rating (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1,
                                   vjust = 0.5,
                                   size  = 6))

Figure 2: Training Satisfaction Score by Department

Interpretation: The boxplot reveals meaningful variation in training satisfaction across departments. Some departments show consistently high median scores with narrow interquartile ranges, indicating uniform positive experiences. Others display wider spreads and lower medians, suggesting inconsistent training quality within those teams. This variation is unlikely to be random — it likely reflects differences in the relevance of training content to specific job functions or the quality of facilitators assigned to those departments. Management should conduct targeted follow-up assessments in the lower-scoring departments.


3.3 Step 3c — Relationship Analysis

# ---------------------------------------------------------------
# CORRELATION MATRIX
# Why: Correlation analysis identifies which training dimensions
# are most strongly associated with overall satisfaction, guiding
# management on where investment will have the greatest impact.
#
# Method: Spearman rank correlation is used (not Pearson) because
# Likert-scale data is ordinal — responses represent ranked
# categories, not equal intervals. Spearman does not assume
# normality and is appropriate for this type of data.
# ---------------------------------------------------------------

cor_matrix <- cor(likert_vars,
                  use    = "complete.obs",
                  method = "spearman")

corrplot(cor_matrix,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         tl.cex      = 0.8,
         number.cex  = 0.75,
         title       = "Spearman Correlation Matrix: Training Effectiveness Dimensions",
         mar         = c(0, 0, 2, 0))

Figure 3: Spearman Correlation Matrix — Six Training Effectiveness Dimensions

Interpretation: The Spearman correlation matrix reveals that all six training dimensions are positively correlated with one another, confirming that employees who rate one dimension of training highly tend to rate others highly as well. Facilitator effectiveness shows one of the strongest correlations with overall satisfaction, providing statistical evidence that delivery quality — not just content relevance — is a primary driver of how employees perceive training. The correlation between productivity and recommendation likelihood is also strong, indicating that employees who believe training improved their output are significantly more likely to advocate for future programmes. These relationships have direct implications for resource allocation: investment in facilitator quality and productivity-focused content design will yield the highest returns.

# ---------------------------------------------------------------
# SCATTER PLOT — FACILITATOR vs SATISFACTION
# Why: The scatter plot with a regression line visually confirms
# the direction and magnitude of the relationship identified in
# the correlation matrix. The confidence interval (se = TRUE)
# shows the uncertainty around the trend.
# geom_jitter adds slight random noise to separate overlapping
# points — necessary because Likert responses are discrete (1-5).
# ---------------------------------------------------------------

ggplot(df_clean,
       aes(x = facilitator,
           y = satisfaction)) +
  geom_jitter(color  = "steelblue",
              alpha  = 0.6,
              width  = 0.15,
              height = 0.15) +
  geom_smooth(method = "lm",
              color  = "red",
              se     = TRUE) +
  scale_x_continuous(breaks = 1:5) +
  scale_y_continuous(breaks = 1:5) +
  labs(title    = "Facilitator Effectiveness vs. Overall Training Satisfaction",
       subtitle = "Each point represents one respondent (jittered for clarity)",
       x        = "Facilitator Communication Rating (1–5)",
       y        = "Overall Satisfaction Rating (1–5)") +
  theme_minimal()

Figure 4: Facilitator Effectiveness vs. Overall Training Satisfaction

Interpretation: The scatter plot confirms a clear positive linear relationship between facilitator communication effectiveness and overall training satisfaction. Employees who rated their facilitators more highly consistently reported greater overall satisfaction with the training programme. The regression line rises steeply with a narrow confidence interval, indicating a strong and statistically reliable relationship. This finding provides compelling evidence that facilitator quality is the single most controllable lever management has for improving training satisfaction across the organisation.

# ---------------------------------------------------------------
# SCATTER PLOT — SESSIONS vs PRODUCTIVITY
# Why: Testing whether attending more training sessions is
# associated with higher perceived productivity helps management
# determine whether increasing training frequency is worthwhile.
# A flat trend line would suggest diminishing returns to volume;
# a positive trend would support investment in repeat exposure.
# ---------------------------------------------------------------

ggplot(df_clean,
       aes(x    = num_sessions,
           y    = productivity,
           fill = num_sessions)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title    = "Training Sessions Attended vs. Perceived Productivity Impact",
       subtitle = paste("n =", nrow(df_clean), "respondents"),
       x        = "Number of Training Sessions Attended (Last 12 Months)",
       y        = "Productivity Rating (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Figure 5: Number of Training Sessions Attended vs. Perceived Productivity Impact

Interpretation: The boxplot compares perceived productivity impact ratings across different training session frequency bands. Where employees who attended more sessions show higher median productivity ratings, this supports increasing training frequency as a productivity driver. Where median scores plateau or vary inconsistently across frequency bands, it signals that session volume alone is insufficient — content quality and relevance matter more than how often employees attend. Management should interpret this finding alongside the facilitator effectiveness and content relevance scores when planning future training calendars and budgets.


3.4 Step 3d — Business Insight Visualisation

# ---------------------------------------------------------------
# MANAGEMENT SUMMARY CHART
# Why: This polished horizontal bar chart is designed specifically
# for a management audience. It consolidates all six training
# dimensions into a single ranked visual, enabling executives
# to immediately identify strengths and gaps without requiring
# any statistical knowledge. The colour gradient reinforces the
# ranking visually — darker green signals higher scores.
# ---------------------------------------------------------------

likert_vars %>%
  summarise(across(everything(),
                   mean,
                   na.rm = TRUE)) %>%
  pivot_longer(everything(),
               names_to  = "Dimension",
               values_to = "Average_Score") %>%
  mutate(Dimension = recode(Dimension,
    relevance    = "Relevance to Job Role",
    performance  = "Improved Job Performance",
    productivity = "Increased Productivity",
    facilitator  = "Facilitator Communication",
    satisfaction = "Overall Satisfaction",
    recommend    = "Would Recommend to Colleagues"
  )) %>%
  ggplot(aes(x    = reorder(Dimension, Average_Score),
             y    = Average_Score,
             fill = Average_Score)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Average_Score, 2)),
            hjust = -0.15,
            size  = 4) +
  scale_fill_gradient(low  = "#f4a261",
                      high = "#2a9d8f") +
  scale_y_continuous(limits = c(0, 6)) +
  coord_flip() +
  labs(title    = "Average Training Effectiveness Score by Dimension",
       subtitle = paste("Primary survey data | n =",
                        nrow(df_clean),
                        "respondents | Scale: 1 (Strongly Disagree) — 5 (Strongly Agree)"),
       x        = NULL,
       y        = "Average Score (1–5)") +
  theme_minimal(base_size = 12)

Figure 6: Average Training Effectiveness Score by Dimension (Management Summary)

Interpretation: The management summary chart provides a clear, ranked overview of all six training effectiveness dimensions. Relevance to job role scores highest, confirming that the organisation is selecting appropriate training content for its workforce. Overall satisfaction and facilitator communication also score strongly, reflecting broad employee approval of how training is designed and delivered. However, productivity impact scores lowest, revealing a persistent gap between training attendance and measurable on-the-job improvement. This is the organisation’s most critical training effectiveness challenge: employees find training relevant and well-delivered, but do not fully convert it into productivity gains. Closing this gap — through post-training reinforcement, structured application exercises, and manager follow-up — is the highest-priority recommendation from this analysis.


4 STEP 4 — AI-Assisted Analysis and Critical Reflection

4.1 AI Interaction Log

4.1.1 Interaction 1 — Code Generation

AI Tool Used: Claude (claude.ai)

Prompt Submitted: > “I have a training effectiveness survey dataset in R exported from Google Forms with > very long column names. Write fully annotated tidyverse R code to: (1) rename six > Likert-scale columns to short aliases — relevance, performance, productivity, > facilitator, satisfaction, and recommend — and (2) create a Spearman correlation > matrix using corrplot with proper title margins.”

AI Response Summary: The AI generated a complete rename() pipeline using tidyverse syntax, correctly mapping each long column name to its short alias. It also produced a corrplot() call with method = “color”, type = “upper”, and addCoef.col = “black”. The AI included a note explaining why Spearman was more appropriate than Pearson for ordinal Likert data.

What I Accepted / Modified / Rejected: I accepted the core rename() structure and the corrplot parameters. I added mar = c(0, 0, 2, 0) manually after discovering during a test render that the chart title was being clipped at the top — the AI had not included this margin setting. I also modified the tl.cex and number.cex values to improve readability at the document’s figure width.


4.1.2 Interaction 2 — Analytical Guidance

AI Tool Used: Claude (claude.ai)

Prompt Submitted: > “I have Likert-scale survey data from 54 respondents measuring six training > effectiveness dimensions. Should I use Pearson or Spearman correlation, and why? > What visualisations are most appropriate for this type of ordinal data in an > exploratory analysis?”

AI Response Summary: The AI recommended Spearman rank correlation over Pearson, explaining that Likert scales are ordinal — they represent ranked categories rather than equal intervals — and Pearson’s assumption of interval-level measurement is therefore not strictly met. For visualisations, it recommended boxplots (for distribution by group), histograms (for distribution of individual variables), scatter plots with jitter (for relationships between ordinal variables), and a horizontal bar chart for the management summary.

What I Accepted / Modified / Rejected: I accepted the recommendation to use Spearman correlation and updated my correlation code accordingly (method = “spearman”). I adopted the boxplot, histogram, and scatter plot suggestions. I rejected the AI’s suggestion to use a radar/spider chart for the management summary, as this chart type is not well-supported in base ggplot2 without additional packages and is less intuitive for a non-statistical management audience. I used a horizontal bar chart instead.


4.1.3 Interaction 3 — Result Interpretation

AI Tool Used: Claude (claude.ai)

Prompt Submitted: > “My Spearman correlation matrix for a training effectiveness survey shows that > facilitator communication has the strongest correlation with overall satisfaction. > Productivity impact has the lowest average score out of all six dimensions. Write > two sentences interpreting these findings in plain language for a management report.”

AI Response Summary: The AI produced the following interpretation: “The strongest correlation between facilitator communication and overall satisfaction suggests that how training is delivered — the quality and clarity of the facilitator — is a more powerful driver of employee satisfaction than any other single factor, including content relevance or duration. The lowest score for productivity impact indicates a gap between training participation and tangible on-the-job performance improvement, pointing to a need for structured post-training reinforcement mechanisms.”

What I Accepted / Modified / Rejected: I accepted the core interpretation as accurate and well-phrased. I modified the language slightly to match the formal register of this report and extended the second sentence to include a specific recommendation (post-training follow-up mechanisms), which the AI had framed as a diagnosis only. I also added the correlation coefficient value and the specific dimension scores once my analysis was complete, which the AI could not provide as it had not seen the data directly.


4.2 Critical Reflection

Throughout this project, I used Claude (claude.ai) at three key stages of the analytical process: code generation, analytical guidance, and result interpretation. The AI performed most strongly in generating boilerplate R code for data cleaning, column renaming, and chart formatting — tasks that are time-intensive but methodologically straightforward. This significantly reduced the time I spent on syntax debugging and allowed me to concentrate on interpretation and insight development.

However, I identified meaningful limitations that required human judgment to correct. In its initial code generation response, the AI produced a corrplot() call without the mar = c(0,0,2,0) margin parameter, causing the chart title to be clipped in the rendered output — a practical error I only detected during test rendering. More significantly, when I initially prompted the AI without specifying a correlation method, it defaulted toward Pearson correlation without proactively flagging that Pearson assumes interval-level data and may be methodologically inappropriate for ordinal Likert scales. I identified this gap by cross-referencing my course materials and overrode the default by specifying method = “spearman”.

These experiences reinforced a critical principle: AI tools function best as accelerators for analysts who already understand the underlying methodology. The analyst must possess sufficient statistical knowledge to recognise when an AI suggestion is technically functional but analytically inappropriate. AI output should be treated consistently as a draft to be validated, not a final answer to be accepted at face value.


5 STEP 5 — Management Report

5.1 Executive Summary

This report presents findings from a primary data analytics project examining the effectiveness of workplace training programmes within the organisation. Survey data was collected from 54 employees across multiple departments using a structured questionnaire administered via Google Forms. Analysis was conducted in RStudio using descriptive statistics, Spearman correlation analysis, and ggplot2 visualisations. Three principal findings emerge from the data: training content is broadly perceived as relevant to employees’ job roles; facilitator communication quality is the strongest driver of overall training satisfaction; and productivity impact is the lowest-rated dimension, revealing a gap between training attendance and on-the-job application. Three evidence-based recommendations are presented to guide the organisation’s training investment and improvement strategy.


5.2 Research Question and Methodology

Research Question: “To what extent do workplace training programmes influence employee productivity, job performance, and satisfaction within the organisation?”

Primary data was collected via a structured online survey administered to 54 employees using convenience sampling across departments. The instrument captured 15 variables including demographic information, training attendance patterns, and six Likert-scale effectiveness ratings (1 = Strongly Disagree, 5 = Strongly Agree). Data was imported into RStudio as an Excel file and cleaned systematically: missing values were imputed using column medians; duplicate records were removed; variable types were validated and corrected; and outliers were identified using z-score analysis and removed where values exceeded three standard deviations from the mean. Analysis was conducted using the tidyverse, psych, skimr, and corrplot packages. Spearman rank correlation was selected over Pearson given the ordinal nature of Likert-scale data.


5.3 Key Findings

Finding 1: Facilitator Quality is the Strongest Driver of Training Satisfaction

The Spearman correlation matrix (Figure 3) reveals that facilitator communication effectiveness has the strongest positive association with overall training satisfaction among all six dimensions measured. This indicates that how training is delivered matters more to employees than any other single factor — including content relevance, duration, or the number of sessions attended. Employees who rated their facilitators highly were consistently more satisfied overall (Figure 4). This finding has a direct implication for management: improving facilitator selection, training, and ongoing assessment will produce the greatest return on investment for training satisfaction outcomes.


Finding 2: Productivity Impact is the Lowest-Rated Training Dimension

Despite broadly positive ratings across all dimensions (Figure 6), the item measuring whether training increased productivity consistently received the lowest average score. The histogram (Figure 1) confirms a distribution concentrated at higher values overall, but the productivity dimension lags behind. This reveals a structural disconnect between training attendance and tangible on-the-job performance improvement — the ultimate business objective of any training investment. Employees find training relevant and well-delivered, but do not fully convert participation into measurable productivity gains.


Finding 3: Training Satisfaction Varies Significantly Across Departments

The department-level boxplot (Figure 2) reveals substantial variation in training satisfaction scores across the organisation. Some departments show consistently high median scores with narrow spreads, while others display wide interquartile ranges and lower medians. This variation indicates that training quality is not uniformly experienced across the organisation — likely reflecting differences in the relevance of generic training content to specific job functions, or inconsistencies in the facilitators assigned to different departmental cohorts.


5.4 Business Recommendations

Recommendation 1: Invest in Facilitator Development and Quality Assurance

Evidence: Facilitator communication effectiveness shows the strongest Spearman correlation with overall training satisfaction across all six dimensions (Figure 3). The scatter plot (Figure 4) confirms this as a strong, consistent linear relationship.

Action: Introduce a structured facilitator assessment framework with quarterly participant feedback reviews. Establish minimum facilitator competency standards and provide targeted coaching to any facilitator rated below 3.5 on the communication scale. Facilitator performance scores should be formally incorporated into the organisation’s training programme evaluation process.


Recommendation 2: Implement Structured Post-Training Application Mechanisms

Evidence: Productivity impact is the lowest-rated dimension despite high scores on relevance and facilitator effectiveness (Figure 6). This indicates the problem is not content design but application — employees are not translating training into on-the-job behaviour change.

Action: Introduce 30/60/90-day post-training follow-up reviews where line managers assess skill application on the job. Complement classroom training with structured on-the-job assignments, practice tasks, or peer learning pairs that reinforce learning transfer. Measure and report productivity metrics before and after training cycles to build an evidence base for future investment decisions.


Recommendation 3: Address the Most Requested Improvement Area with Department-Level Input

Evidence: The improvement area frequency table (Step 3) identifies the specific training area most frequently cited by employees as requiring improvement. The department-level boxplot (Figure 2) confirms that satisfaction varies across departments, suggesting that generic training programmes may not meet all departmental needs equally.

Action: Prioritise a review and redesign of the most-cited improvement area within the next training planning cycle. Involve employees from the three lowest-satisfaction departments in the content co-design process to ensure the revised programme directly addresses their stated needs, increases relevance, and improves departmental buy-in.


5.5 AI Augmentation Summary

AI tools (Claude, claude.ai) were used at three stages of this analytical project: code generation, analytical guidance, and result interpretation. The tools delivered the greatest value in accelerating routine coding tasks — particularly data cleaning pipelines, column renaming, and chart formatting — and in translating statistical outputs into accessible management language, reducing overall project time.

However, human analytical judgment remained essential throughout. The AI produced a rendering error (missing corrplot margin settings) and did not proactively recommend Spearman over Pearson correlation for ordinal data — a methodologically significant omission that required correction after cross-referencing course materials. These limitations confirm that AI tools are most effective as productivity accelerators for analysts who already possess the methodological foundation to validate AI outputs. In this project, AI augmented human analytical capability without replacing the critical judgment that underpins sound statistical practice.


6 Appendix A — AI Interaction Evidence

The following screenshots document the three AI interactions described in Step 4. Each screenshot shows the prompt submitted, the AI’s response, and where applicable, the modification or rejection decision made by the analyst.

Interaction 1 — Code Generation (Column Renaming and Corrplot)

(Insert screenshot of Claude conversation here — show your prompt and the AI response)


Interaction 2 — Analytical Guidance (Spearman vs Pearson and Chart Selection)

(Insert screenshot of Claude conversation here — show your prompt and the AI response)


Interaction 3 — Result Interpretation (Correlation and Productivity Findings)

(Insert screenshot of Claude conversation here — show your prompt and the AI response)


End of Section B Submission

Data Analytics I | MMBA 10 | Lagos Business School, Pan-Atlantic University

Submitted by: Bisola Amuda