# NOTE: Run this once in the RStudio console before rendering — do NOT run here
# install.packages(c("tidyverse","readxl","janitor","skimr","psych","corrplot","GGally"))
library(tidyverse) # data wrangling, ggplot2, readr
library(readxl) # import Excel files
library(janitor) # clean column names, remove duplicates
library(skimr) # rich summary statistics
library(psych) # detailed descriptive statistics
library(corrplot) # correlation matrix visualisation
library(GGally) # scatter plot matrix
Data Analytics I — Section B Project
Evaluating Training Effectiveness in the Workplace
1 STEP 1 — Research Design and Primary Data Collection
1.1 Business Problem and Research Question
Organisations invest significantly in employee training and development, yet the effectiveness of these programmes is rarely measured rigorously. This project investigates whether training programmes delivered within our organisation are achieving their intended outcomes in terms of productivity, job performance, and employee satisfaction.
Research Question: “To what extent do workplace training programmes influence employee productivity, job performance, and satisfaction within the organisation?”
1.2 Methods Section
Population of Interest: Employees across departments within the organisation who have attended at least one training programme in the last 12 months.
Sampling Approach: Convenience sampling was adopted due to time constraints and accessibility of respondents within the organisation. A total of 54 responses were collected from employees across departments. While convenience sampling limits generalisability beyond this sample, it is appropriate for this exploratory study given the scope and timeline.
Instrument Design: A structured online survey questionnaire was administered to capture the following 15 variables:
| # | Variable | Type | Description |
|---|---|---|---|
| 1 | timestamp | Date/time | When the survey was completed |
| 2 | gender | Categorical | Respondent gender |
| 3 | age_group | Categorical | Age band of respondent |
| 4 | department | Categorical | Department of respondent |
| 5 | years_of_work_experience | Categorical | Work experience range |
| 6 | have_you_attended_any_training... | Categorical | Training attendance (Yes/No) |
| 7 | approximately_how_many_training_sessions... | Numeric | Number of sessions attended |
| 8 | average_duration_of_training_attended | Categorical | Duration band |
| 9 | the_training_programmes_are_relevant_to_my_job_role | Likert 1-5 | Relevance rating |
| 10 | the_training_improved_my_job_performance | Likert 1-5 | Performance improvement |
| 11 | the_training_increased_my_productivity | Likert 1-5 | Productivity rating |
| 12 | the_facilitators_communicated_effectively... | Likert 1-5 | Facilitator effectiveness |
| 13 | i_am_satisfied_with_the_overall_quality... | Likert 1-5 | Overall satisfaction |
| 14 | i_would_recommend_future_training_programmes... | Likert 1-5 | Recommendation likelihood |
| 15 | what_area_of_training_would_you_like... | Categorical | Open improvement area |
Data Collection Timeline: The survey was administered and 54 complete responses were collected from staff across the organisation. Data was exported from Google Forms as an Excel file for analysis in RStudio.
2 STEP 2 — Data Preparation in RStudio
2.1 Load Required Packages
2.2 Import Data
# Import the Excel file (must be in the same folder as this .qmd file)
df_raw <- read_excel("DA Exam.xlsx")
# Standardise column names to snake_case
# Why: removes spaces and special characters that cause errors in R functions
df_raw <- df_raw %>% clean_names()
# Preview structure
glimpse(df_raw)
Rows: 54
Columns: 15
$ timestamp <chr> …
$ gender <chr> …
$ age_group <chr> …
$ department <chr> …
$ years_of_work_experience <chr> …
$ have_you_attended_any_training_programme_in_the_last_12_months <chr> …
$ approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months <chr> …
$ average_duration_of_training_attended <chr> …
$ the_training_programmes_are_relevant_to_my_job_role <dbl> …
$ the_training_improved_my_job_performance <dbl> …
$ the_training_increased_my_productivity <dbl> …
$ the_facilitators_communicated_effectively_during_the_training <dbl> …
$ i_am_satisfied_with_the_overall_quality_of_the_training <dbl> …
$ i_would_recommend_future_training_programmes_to_colleagues <dbl> …
$ what_area_of_training_would_you_like_the_organization_to_improve_on <chr> …
2.3 Before-Cleaning Snapshot
# Why: Establishing a baseline lets us demonstrate the impact of our cleaning steps
cat("=== BEFORE CLEANING ===\n")
=== BEFORE CLEANING ===
cat("Rows: ", nrow(df_raw), "\n")
Rows: 54
cat("Columns: ", ncol(df_raw), "\n")
Columns: 15
cat("Total missing: ", sum(is.na(df_raw)), "\n\n")
Total missing: 0
cat("Missing values by column:\n")
Missing values by column:
print(colSums(is.na(df_raw)))
timestamp
0
gender
0
age_group
0
department
0
years_of_work_experience
0
have_you_attended_any_training_programme_in_the_last_12_months
0
approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months
0
average_duration_of_training_attended
0
the_training_programmes_are_relevant_to_my_job_role
0
the_training_improved_my_job_performance
0
the_training_increased_my_productivity
0
the_facilitators_communicated_effectively_during_the_training
0
i_am_satisfied_with_the_overall_quality_of_the_training
0
i_would_recommend_future_training_programmes_to_colleagues
0
what_area_of_training_would_you_like_the_organization_to_improve_on
0
cat("\nDuplicate rows:", sum(duplicated(df_raw)), "\n")
Duplicate rows: 0
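The same counts are reported again in the after-cleaning snapshot, so one option is to wrap them in a small helper and stack the results. The sketch below uses base R only; `snapshot()` and the toy data frames are hypothetical names introduced for illustration and are not part of the original analysis.

```r
# Hypothetical helper: summarise a data frame's size and data-quality counts
snapshot <- function(df, label) {
  data.frame(
    stage      = label,
    rows       = nrow(df),
    columns    = ncol(df),
    missing    = sum(is.na(df)),
    duplicates = sum(duplicated(df))
  )
}

# Toy data standing in for the survey (not the real responses)
toy_raw <- data.frame(
  gender       = c("Female", "Male", "Male", NA),
  satisfaction = c(4, 5, 5, 3)
)
# Toy "cleaning": drop the row with a missing value, then drop exact repeats
toy_clean <- unique(toy_raw[!is.na(toy_raw$gender), ])

# One row per stage makes the before/after comparison explicit
comparison <- rbind(snapshot(toy_raw, "before"), snapshot(toy_clean, "after"))
print(comparison)
```

Calling `snapshot(df_raw, "before")` and `snapshot(df_clean, "after")` on the real data would give the same figures as the `cat()` calls, in a form that can be printed as a single table.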
2.4 Missing Value Handling
# Why: Missing Likert-scale responses introduce bias into mean calculations.
# Strategy: Impute numeric columns with the column median.
# Median is preferred over mean for ordinal Likert data as it is robust to skew.
df_clean <- df_raw %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., median(., na.rm = TRUE))))
# For character/categorical columns, replace NA with "Unknown"
df_clean <- df_clean %>%
  mutate(across(where(is.character),
                ~ replace_na(., "Unknown")))
cat("Missing values after imputation:", sum(is.na(df_clean)), "\n")
Missing values after imputation: 0
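To make the median-versus-mean argument concrete, here is a minimal sketch on invented ratings (not the survey data). It also shows one caveat worth keeping in mind: with an even number of observed values, the median can land between two scale points.

```r
# Toy Likert responses with one missing value (not the real survey data)
ratings <- c(5, 5, 4, 4, 1, NA)

med <- median(ratings, na.rm = TRUE)  # middle of the observed values
avg <- mean(ratings, na.rm = TRUE)    # pulled down by the single low rating

# Median imputation keeps the filled-in value on the 1-5 scale here
imputed <- ifelse(is.na(ratings), med, ratings)

# Caveat: with an even number of observed values the median can fall
# between two scale points, which is not a valid Likert response
even_case <- median(c(3, 4, 4, 5, 5, 5))

print(c(median = med, mean = avg, even_case = even_case))
```

If a half-point median appears in real data, one possible choice is to round it to the nearest scale point before imputing; whether to round up or down is a judgement call that should be stated in the methods.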
2.5 Duplicate Record Removal
# Why: Duplicate survey responses inflate counts and distort averages,
# potentially making training appear more or less effective than it is.
before_dedup <- nrow(df_clean)
df_clean <- df_clean %>% distinct()
after_dedup <- nrow(df_clean)
cat("Rows removed as duplicates:", before_dedup - after_dedup, "\n")
Rows removed as duplicates: 0
cat("Rows remaining: ", after_dedup, "\n")
Rows remaining: 54
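One limitation of comparing whole rows is that every submission carries its own timestamp (the skim output later in the report lists 54 unique timestamps for 54 rows), so `distinct()` on all columns cannot detect a respondent who submitted the same answers twice. The sketch below, on invented rows, shows a possible alternative that ignores the timestamp; it is an illustration, not the method used in this report.

```r
# Toy responses: rows 1 and 2 are the same answers submitted twice
toy <- data.frame(
  timestamp    = c("2026/01/01 9:00:00 AM", "2026/01/01 9:02:15 AM",
                   "2026/01/01 9:30:00 AM"),
  gender       = c("Female", "Female", "Male"),
  department   = c("Finance", "Finance", "Legal"),
  satisfaction = c(4, 4, 5)
)

# Comparing whole rows finds nothing, because the timestamps differ
whole_row_dupes <- sum(duplicated(toy))

# Comparing only the answer columns flags the repeat submission
answer_cols <- setdiff(names(toy), "timestamp")
is_repeat   <- duplicated(toy[, answer_cols])
toy_dedup   <- toy[!is_repeat, ]

print(c(whole_row = whole_row_dupes, answers_only = sum(is_repeat)))
```

Two different people can legitimately give identical answers, so rows flagged this way are better reviewed by hand than dropped automatically.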
2.6 Variable Type Correction
# Why: R cannot perform calculations on columns stored as the wrong type.
# Likert-scale responses must be numeric; grouping variables must be factors.
df_clean <- df_clean %>%
  mutate(
    # Categorical grouping variables → factors
    gender = as.factor(gender),
    age_group = as.factor(age_group),
    department = as.factor(department),
    years_of_work_experience = as.factor(years_of_work_experience),
    average_duration_of_training_attended = as.factor(
      average_duration_of_training_attended),
    have_you_attended_any_training_programme_in_the_last_12_months = as.factor(
      have_you_attended_any_training_programme_in_the_last_12_months),
    what_area_of_training_would_you_like_the_organization_to_improve_on = as.factor(
      what_area_of_training_would_you_like_the_organization_to_improve_on),
    # Likert-scale items → numeric
    the_training_programmes_are_relevant_to_my_job_role =
      as.numeric(the_training_programmes_are_relevant_to_my_job_role),
    the_training_improved_my_job_performance =
      as.numeric(the_training_improved_my_job_performance),
    the_training_increased_my_productivity =
      as.numeric(the_training_increased_my_productivity),
    the_facilitators_communicated_effectively_during_the_training =
      as.numeric(the_facilitators_communicated_effectively_during_the_training),
    i_am_satisfied_with_the_overall_quality_of_the_training =
      as.numeric(i_am_satisfied_with_the_overall_quality_of_the_training),
    i_would_recommend_future_training_programmes_to_colleagues =
      as.numeric(i_would_recommend_future_training_programmes_to_colleagues),
    # Count variable → numeric
    # Caution: this column was imported as text. as.numeric() returns NA for any
    # value that is not a plain number, and the str() output below shows that
    # all 54 values became NA, so the column needs a different conversion.
    approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months =
      as.numeric(
        approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months)
  )
# Confirm types
str(df_clean)
tibble [54 × 15] (S3: tbl_df/tbl/data.frame)
$ timestamp : chr [1:54] "2026/05/12 8:39:47 PM GMT+1" "2026/05/12 8:43:44 PM GMT+1" "2026/05/12 8:50:36 PM GMT+1" "2026/05/12 8:51:57 PM GMT+1" ...
$ gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 2 2 2 ...
$ age_group : Factor w/ 5 levels "20 - 25 years",..: 3 1 2 2 2 3 2 2 2 2 ...
$ department : Factor w/ 45 levels "Account management",..: 45 34 32 1 24 11 6 36 3 5 ...
$ years_of_work_experience : Factor w/ 5 levels "1 - 3 years",..: 3 1 1 2 1 2 1 2 2 2 ...
$ have_you_attended_any_training_programme_in_the_last_12_months : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
$ approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months: num [1:54] NA NA NA NA NA NA NA NA NA NA ...
$ average_duration_of_training_attended : Factor w/ 4 levels "1 day","2-4 hours",..: 2 1 2 3 2 4 2 4 2 3 ...
$ the_training_programmes_are_relevant_to_my_job_role : num [1:54] 4 5 4 5 4 5 3 4 4 5 ...
$ the_training_improved_my_job_performance : num [1:54] 3 5 3 4 4 5 2 3 4 4 ...
$ the_training_increased_my_productivity : num [1:54] 3 5 3 3 4 5 3 3 4 4 ...
$ the_facilitators_communicated_effectively_during_the_training : num [1:54] 4 4 3 4 4 3 4 4 3 4 ...
$ i_am_satisfied_with_the_overall_quality_of_the_training : num [1:54] 4 4 4 4 4 3 4 3 4 4 ...
$ i_would_recommend_future_training_programmes_to_colleagues : num [1:54] 4 4 5 3 4 5 4 3 4 4 ...
$ what_area_of_training_would_you_like_the_organization_to_improve_on : Factor w/ 49 levels ".","AI intersection with product",..: 43 38 42 13 10 17 17 18 10 34 ...
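The `str()` output shows the session-count column as `NA` for every respondent, which is what `as.numeric()` produces when the text is not a plain number (for example a range or a phrase). The actual answer labels are not shown in this report, so the values in `raw_sessions` below are invented, and `first_number()` is a hypothetical helper. The sketch shows one way to recover a lower-bound count by extracting the first run of digits.

```r
# Hypothetical raw answers; the real response labels are not shown in the report
raw_sessions <- c("1-2", "3 - 5", "More than 5", "None", "4")

# as.numeric() only succeeds on strings that are already plain numbers
plain <- suppressWarnings(as.numeric(raw_sessions))

# Hypothetical helper: take the first run of digits as a lower-bound count
first_number <- function(x) {
  m   <- regexpr("[0-9]+", x)           # position of first digit run, -1 if none
  ok  <- !is.na(m) & m > 0
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(regmatches(x, m))
  out
}

sessions <- first_number(raw_sessions)
print(data.frame(raw_sessions, plain, sessions))
```

If the answers turn out to be bands rather than counts, an ordered factor may represent them more faithfully than a number; the choice depends on how the question was worded in the form.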
2.7 Rename Columns for Readability
# Why: The original column names are too long for plots and analysis output.
# We create short, meaningful aliases that will appear cleanly in charts and tables.
df_clean <- df_clean %>%
  rename(
    relevance = the_training_programmes_are_relevant_to_my_job_role,
    performance = the_training_improved_my_job_performance,
    productivity = the_training_increased_my_productivity,
    facilitator = the_facilitators_communicated_effectively_during_the_training,
    satisfaction = i_am_satisfied_with_the_overall_quality_of_the_training,
    recommend = i_would_recommend_future_training_programmes_to_colleagues,
    num_sessions = approximately_how_many_training_sessions_have_you_attended_in_the_last_12_months,
    attended_training = have_you_attended_any_training_programme_in_the_last_12_months,
    improve_area = what_area_of_training_would_you_like_the_organization_to_improve_on
  )
cat("Column renaming complete. Current columns:\n")
Column renaming complete. Current columns:
names(df_clean)
[1] "timestamp"
[2] "gender"
[3] "age_group"
[4] "department"
[5] "years_of_work_experience"
[6] "attended_training"
[7] "num_sessions"
[8] "average_duration_of_training_attended"
[9] "relevance"
[10] "performance"
[11] "productivity"
[12] "facilitator"
[13] "satisfaction"
[14] "recommend"
[15] "improve_area"
2.8 Outlier Detection and Treatment
# Why: Extreme values in Likert ratings (e.g., data entry errors like 55 instead of 5)
# can distort correlations and regression results.
# Z-score method: values beyond ±3 standard deviations are flagged and removed.
# Visual boxplot check across all Likert variables
df_clean %>%
  select(relevance, performance, productivity, facilitator, satisfaction, recommend) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "score") %>%
  ggplot(aes(x = variable, y = score, fill = variable)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Outlier Check: All Likert-Scale Variables",
       x = NULL, y = "Score (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
# Z-score outlier removal
df_clean <- df_clean %>%
  mutate(
    z_prod = as.numeric(scale(productivity)),
    z_sat = as.numeric(scale(satisfaction)),
    z_perf = as.numeric(scale(performance))
  ) %>%
  filter(abs(z_prod) <= 3,
         abs(z_sat) <= 3,
         abs(z_perf) <= 3) %>%
  select(-z_prod, -z_sat, -z_perf)
cat("Rows after outlier treatment:", nrow(df_clean), "\n")
Rows after outlier treatment: 54
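For bounded scales, a range check is a more direct test for entry errors than a z-score rule. The sketch below uses invented ratings to compare the two. It relies on a known property of z-scores: in a sample of n values, no |z| can exceed (n - 1) / sqrt(n), so in small samples the ±3 threshold may be unreachable even for an obvious error.

```r
# Toy ratings with one data-entry error (55 typed instead of 5)
scores <- c(4, 5, 3, 4, 55, 4, 5, 3, 4, 4)

# Z-score rule from the text: flag values more than 3 SDs from the mean
z <- as.numeric(scale(scores))
flagged_by_z <- abs(z) > 3

# In a sample of n values, |z| can never exceed (n - 1) / sqrt(n),
# so with n = 10 even the 55 stays under the threshold
max_possible_z <- (length(scores) - 1) / sqrt(length(scores))

# Range rule: any value outside the valid 1-5 scale is an error
flagged_by_range <- scores < 1 | scores > 5

which(flagged_by_z)      # no positions flagged
which(flagged_by_range)  # position 5
```

With 54 respondents the bound is much higher, so the z-score rule can catch a value like 55, but the range rule still has the advantage of not depending on sample size or on the other responses.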
2.9 After-Cleaning Snapshot
# Why: The before/after comparison provides required evidence of systematic cleaning
cat("=== AFTER CLEANING ===\n")
=== AFTER CLEANING ===
cat("Rows: ", nrow(df_clean), "\n")
Rows: 54
cat("Columns: ", ncol(df_clean), "\n")
Columns: 15
cat("Missing values:", sum(is.na(df_clean)), "\n\n")
Missing values: 54
skim(df_clean)
| Name | df_clean |
| Number of rows | 54 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 7 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| timestamp | 0 | 1 | 27 | 28 | 0 | 54 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | Fem: 31, Mal: 23 |
| age_group | 0 | 1 | FALSE | 5 | 26 : 26, 31 : 15, 20 : 7, 36 : 3 |
| department | 0 | 1 | FALSE | 45 | Fin: 7, Cus: 2, Hum: 2, Res: 2 |
| years_of_work_experience | 0 | 1 | FALSE | 5 | 4 -: 18, 1 -: 16, 7 -: 11, Abo: 7 |
| attended_training | 0 | 1 | FALSE | 2 | Yes: 51, No: 3 |
| average_duration_of_training_attended | 0 | 1 | FALSE | 4 | 2-4: 20, Mor: 17, Les: 9, 1 d: 8 |
| improve_area | 0 | 1 | FALSE | 49 | Nil: 4, Emo: 2, Lea: 2, .: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| num_sessions | 54 | 0 | NaN | NA | NA | NA | NA | NA | NA | |
| relevance | 0 | 1 | 4.31 | 1.02 | 1 | 4 | 5 | 5 | 5 | ▁▁▁▅▇ |
| performance | 0 | 1 | 3.80 | 1.07 | 1 | 3 | 4 | 5 | 5 | ▁▁▅▇▆ |
| productivity | 0 | 1 | 3.74 | 1.08 | 1 | 3 | 4 | 5 | 5 | ▁▁▇▇▇ |
| facilitator | 0 | 1 | 4.00 | 0.95 | 1 | 4 | 4 | 5 | 5 | ▁▁▂▇▅ |
| satisfaction | 0 | 1 | 3.94 | 1.05 | 1 | 4 | 4 | 5 | 5 | ▁▁▂▇▅ |
| recommend | 0 | 1 | 4.02 | 1.09 | 1 | 4 | 4 | 5 | 5 | ▂▁▂▇▇ |
3 STEP 3 — Exploratory Data Analysis (EDA)
3.1 Summary Statistics
# Why: Descriptive statistics reveal central tendency and spread of each variable
# before we look at relationships — they are the foundation of all EDA.
likert_vars <- df_clean %>%
  select(relevance, performance, productivity, facilitator, satisfaction, recommend)
describe(likert_vars) %>% round(2)
             vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
relevance       1 54 4.31 1.02      5    4.52 0.00   1   5     4 -1.88     3.40 0.14
performance     2 54 3.80 1.07      4    3.93 1.48   1   5     4 -0.87     0.40 0.15
productivity    3 54 3.74 1.08      4    3.86 1.48   1   5     4 -0.70     0.08 0.15
facilitator     4 54 4.00 0.95      4    4.14 0.00   1   5     4 -1.29     1.88 0.13
satisfaction    5 54 3.94 1.05      4    4.11 0.74   1   5     4 -1.41     1.90 0.14
recommend       6 54 4.02 1.09      4    4.20 1.48   1   5     4 -1.41     1.70 0.15
Interpretation: The mean scores across all six training effectiveness dimensions range between 3.74 and 4.31 on a 1–5 scale, indicating generally positive perceptions of training quality. The item with the highest mean, relevance (4.31), suggests employees feel the training content matches their job roles. The lowest-scoring item, productivity (3.74), represents the most significant gap in the current training function and warrants management attention.
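As a quick consistency check on the table, the `se` column can be reproduced from the reported standard deviations, since the standard error of the mean is sd / sqrt(n). The sketch below copies the rounded `sd` values from the `describe()` output; only the variable names `sds` and `se` are introduced here.

```r
# Values copied from the describe() output above
n   <- 54
sds <- c(relevance = 1.02, performance = 1.07, productivity = 1.08,
         facilitator = 0.95, satisfaction = 1.05, recommend = 1.09)

# Standard error of the mean: sd / sqrt(n)
se <- round(sds / sqrt(n), 2)
print(se)
```

The results agree with the `se` column reported by `describe()`, which confirms that all six items were summarised over the full 54 responses.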
# Frequency tables for key categorical variables
cat("=== Department Distribution ===\n")
=== Department Distribution ===
table(df_clean$department) %>%
  as.data.frame() %>%
  rename(Department = Var1, Count = Freq) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
  print()
Department Count Percentage
1 Account management 1 1.9
2 Admin and Project Management 1 1.9
3 Administration Management 1 1.9
4 Administrator 1 1.9
5 Auditing 1 1.9
6 Business Administrator 1 1.9
7 Business Development 1 1.9
8 Business Management 1 1.9
9 Cinical 1 1.9
10 Client Relations 1 1.9
11 Communications 1 1.9
12 Communications and marketing 1 1.9
13 Cordos wealth 1 1.9
14 Corporate Finance and Investment 1 1.9
15 Creative Media & Content 1 1.9
16 Customer Experience 1 1.9
17 Customer service 2 3.7
18 Digital Marketing 1 1.9
19 Engineering 1 1.9
20 Finance 7 13.0
21 Food science and technology 1 1.9
22 Funds Operations 1 1.9
23 HR 1 1.9
24 Human Resources 2 3.7
25 Information technology 1 1.9
26 Information Technology 1 1.9
27 INSURANCE 1 1.9
28 Internal Audit 1 1.9
29 Investment 1 1.9
30 Investment management 1 1.9
31 Investment Management 1 1.9
32 Law 1 1.9
33 Legal 1 1.9
34 Media 1 1.9
35 Nursing 1 1.9
36 Operations 1 1.9
37 Product management 1 1.9
38 Product Manager 1 1.9
39 Project management 1 1.9
40 Publishing 1 1.9
41 Research 2 3.7
42 Risk Management 1 1.9
43 Social Science 1 1.9
44 Space Engineering 1 1.9
45 Trust 1 1.9
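Several rows in this table differ only in capitalisation or abbreviation ("Information technology" and "Information Technology", "HR" and "Human Resources"), so they are counted as separate departments. A minimal sketch of one way to harmonise such labels is shown below. The labels are copied from the table; the `aliases` lookup is a hypothetical example that would need to be extended after reviewing the full list.

```r
# A few labels copied from the department table, where they count as separate levels
dept <- c("Information technology", "Information Technology", "INSURANCE",
          "Investment management", "Investment Management",
          "HR", "Human Resources")

# Step 1: trim whitespace and lower-case so that capitalisation no longer matters
key <- tolower(trimws(dept))

# Step 2: map known abbreviations to one spelling (hypothetical lookup table)
aliases  <- c("hr" = "human resources")
is_alias <- key %in% names(aliases)
key[is_alias] <- aliases[key[is_alias]]

# Step 3: restore readable labels and convert to a factor
dept_clean <- factor(tools::toTitleCase(key))
table(dept_clean)
```

Pairs such as "Law" and "Legal" cannot be merged mechanically; deciding whether they describe the same function is a judgement call that is best recorded in the methods section.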
cat("\n=== Training Attendance (Last 12 Months) ===\n")
=== Training Attendance (Last 12 Months) ===
table(df_clean$attended_training) %>% print()
No Yes
3 51
cat("\n=== Top Requested Improvement Areas ===\n")
=== Top Requested Improvement Areas ===
table(df_clean$improve_area) %>%
sort(decreasing = TRUE) %>%
print()
Nil
4
Emotional intelligence
2
Leadership
2
.
1
AI intersection with product
1
Behavioral and Performance based training. But with FLEXIBLE HOURS.
1
Business development
1
Communication
1
Communications pre and post the training
1
Corporate ethics, governance and operational efficiency
1
Data privacy and awareness
1
Easy accessibility to training materials at any time
1
Employee relations
1
Everything
1
Finding the right facilitators with depth in training area.
1
Flight software
1
Fraud training
1
Intellectual Capacity
1
Marketing
1
More about interpersonal relationships in a formal organization
1
N/A
1
Nil. Too much training infact
1
Non
1
None
1
Nothing, it was a good programme
1
Operational Efficiency
1
Organization should review effect of training on their employees after few months.
1
Organizational Ethics and behavior
1
Pay attention to the quality of training especially for very crucial roles. Even if it’s a mid level staff, the question is how important is their role to your organization. Role over cadre
1
Personal and self care
1
Personal finance
1
Practical trainings
1
Soft skill and behavioral training
1
Spacing the training over several sessions
1
Sponsoring professional certifications rather than just Training.
1
Staff knowledge on Ai
1
Team work
1
Technical training
1
Technical training and soft skills
1
Technology and documentation system training
1
The core technicals of the training. Most of the facilitators spent more time waking us through the basics
1
The priorities of the employees and how the organization can help them achieve their goals and objectives.
1
The quality of the trainings
1
Training About internal processes
1
Training on LMS platforms for continous knowledge acquisition
1
Training on soft skills
1
Training that relates primarily to the job role of the employees. Aldo, employees should not have to fully attend to work related activities so they can concentrate 100% during the training.
1
Training that specifically and directly relates to my job functions, and no deviation at all
1
Video Creativity
1
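Interpretation: Finance is the most represented department, with 7 of 54 respondents (13.0%); most other departments appear only once, partly because inconsistent spelling and capitalisation split the same department across several labels (for example, “HR” and “Human Resources”, or “Information technology” and “Information Technology”). Of the 54 respondents, 51 (94.4%) attended a training programme in the last 12 months. Because the improvement question was free text, almost every answer is unique: “Nil” is the most common entry (4 responses), followed by “Emotional intelligence” and “Leadership” (2 each). For the organisation’s training strategy, this means department-level comparisons rest on very small groups, and the improvement suggestions need to be grouped into themes before any single area can be named as the priority.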
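Because the improvement question was answered in free text, a frequency table of raw answers mostly counts unique strings. One possible next step is to code the answers into broad themes. The sketch below uses a handful of answers modelled on the responses above; the list of "no suggestion" entries and the keyword rules are hypothetical and would need refining against the full set of responses.

```r
# Toy answers modelled on the free-text responses (not the full data)
answers <- c("Nil", "None", "N/A", ".", "Leadership",
             "Training on soft skills", "Soft skill and behavioral training",
             "Staff knowledge on Ai", "Technical training")

clean <- tolower(trimws(answers))

# Entries that mean "no suggestion" (hypothetical list; extend after reading the data)
no_suggestion <- clean %in% c("nil", "none", "non", "n/a", ".")

# Keyword rules for broad themes (hypothetical; the first matching rule wins)
theme <- ifelse(no_suggestion,                           "No suggestion",
         ifelse(grepl("soft skill", clean),              "Soft skills",
         ifelse(grepl("\\bai\\b", clean, perl = TRUE),   "AI",
         ifelse(grepl("technical", clean),               "Technical",
                                                         "Other"))))

sort(table(theme), decreasing = TRUE)
```

Keyword matching is a rough first pass. Answers that land in "Other" should be read individually, and the final coding scheme should be documented so that the theme counts can be reproduced.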
3.2 Distribution Analysis
# Why: Knowing the shape of the distribution tells us whether responses cluster
# at the high end (ceiling effect) or spread evenly — affecting how we interpret averages.
ggplot(df_clean, aes(x = productivity)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white") +
  scale_x_continuous(breaks = 1:5) +
  labs(title = "Distribution: Training Increased My Productivity",
       subtitle = paste("n =", nrow(df_clean), "respondents"),
       x = "Rating (1 = Strongly Disagree, 5 = Strongly Agree)",
       y = "Number of Respondents") +
  theme_minimal()
Interpretation: [Describe the shape. E.g., “The distribution is left-skewed with the majority of respondents rating productivity impact at 4 or 5, suggesting broad agreement that training has improved output. However, approximately X% rated it at 2 or below, potentially representing employees in roles where the training content was less applicable, a segment management should investigate further.”]
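The percentages needed for this interpretation (the share rating 2 or below, and the share rating 4 or above) can be computed directly rather than read off the chart. The sketch below uses invented ratings; replacing the toy vector with `df_clean$productivity` would give the real figures.

```r
# Toy productivity ratings (not the real responses)
productivity <- c(5, 4, 4, 3, 5, 2, 4, 1, 3, 4)

# Fix the levels at 1-5 so that unused ratings still appear with a zero count
counts <- table(factor(productivity, levels = 1:5))
pct    <- round(100 * prop.table(counts), 1)

pct_low  <- sum(pct[c("1", "2")])   # share rating 2 or below
pct_high <- sum(pct[c("4", "5")])   # share rating 4 or above

print(pct)
```

Fixing the factor levels matters for small samples: without it, a rating nobody chose would be missing from the table and indexing by name would fail.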
# Why: Boxplots reveal whether training satisfaction varies across departments,
# helping management identify where targeted improvements are needed most.
ggplot(df_clean, aes(x = department, y = satisfaction, fill = department)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Training Satisfaction Score by Department",
       x = "Department",
       y = "Satisfaction Rating (1–5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
Interpretation: [Identify departments with the lowest median or widest spread. E.g., “The [X] department shows a notably lower median satisfaction and wider interquartile range, suggesting inconsistent training experiences within that team. This may reflect misalignment between training content and the specific needs of that department’s functions.”]
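With 45 department labels and most of them held by a single respondent, a per-department boxplot collapses to one line per group. One possible remedy is to pool small groups into an "Other" category before plotting. The sketch below uses invented labels, and the cut-off `min_n` is a hypothetical choice rather than a recommendation.

```r
# Toy department labels: one large group and several singletons
department <- c("Finance", "Finance", "Finance", "Research", "Research",
                "Legal", "Nursing", "Publishing")

# Count respondents per department, then relabel groups below the threshold
min_n     <- 2                                   # hypothetical cut-off
counts    <- table(department)
n_in_dept <- as.vector(counts[department])       # group size for each respondent
department_grouped <- factor(ifelse(n_in_dept >= min_n, department, "Other"))

table(department_grouped)
```

Even after pooling, groups of two or three respondents support only descriptive statements, so any department-level conclusion should be worded cautiously.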
3.3 Relationship Analysis
# Why: Correlation analysis identifies which training dimensions are most strongly
# linked to overall satisfaction and productivity, guiding where investment
# will deliver the greatest return.
cor_matrix <- cor(likert_vars, use = "complete.obs", method = "spearman")
# Note: Spearman is appropriate here because Likert scales are ordinal, not truly continuous
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.cex = 0.75,
         number.cex = 0.7,
         title = "Spearman Correlation Matrix: Training Effectiveness",
         mar = c(0, 0, 2, 0))
Interpretation: [Note the strongest pairs. E.g., “Facilitator communication effectiveness shows the strongest positive correlation with overall satisfaction (r = X), indicating that delivery quality is the primary driver of how employees perceive the value of training. The weakest correlation is between [A] and [B], suggesting these dimensions operate independently.”]
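The correlation matrix gives the strength of each association but not how much sampling uncertainty surrounds it. A minimal sketch of how a single pair could be tested with base R's `cor.test()` is shown below, using invented ratings rather than the survey data.

```r
# Toy ratings for two Likert items (not the real responses)
facilitator  <- c(4, 5, 3, 4, 2, 5, 4, 3, 5, 1)
satisfaction <- c(4, 5, 3, 3, 2, 4, 4, 3, 5, 2)

# Spearman's rho with a significance test; exact = FALSE because Likert
# data contain many ties, for which an exact p-value cannot be computed
result <- cor.test(facilitator, satisfaction,
                   method = "spearman", exact = FALSE)

rho     <- unname(result$estimate)  # same value cor(..., method = "spearman") gives
p_value <- result$p.value

print(c(rho = rho, p_value = p_value))
```

Running this for every pair of six items means fifteen tests, so individual p-values are best read as descriptive unless a multiple-comparison adjustment such as `p.adjust()` is applied.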
# Why: The scatter plot with regression line visually confirms the strength and
# direction of the relationship flagged in the correlation matrix.
ggplot(df_clean, aes(x = facilitator, y = satisfaction)) +
  geom_jitter(color = "steelblue", alpha = 0.6, width = 0.15, height = 0.15) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  scale_x_continuous(breaks = 1:5) +
  scale_y_continuous(breaks = 1:5) +
  labs(title = "Facilitator Effectiveness vs. Overall Training Satisfaction",
       x = "Facilitator Communication Rating (1–5)",
       y = "Overall Satisfaction Rating (1–5)") +
  theme_minimal()
Interpretation: [E.g., “The positive upward trend confirms that employees who rate their facilitators more highly also report greater overall training satisfaction. This finding provides strong evidence that investing in facilitator quality (through selection, training, and ongoing assessment) is the single most impactful lever for improving the organisation’s overall training outcomes.”]
3.4 Business Insight Visualisation — Management Summary Chart
# Why: This chart consolidates all six training dimensions into a single,
# easy-to-read visual suitable for a management audience.
# It immediately shows strengths and gaps without requiring statistical knowledge.
likert_vars %>%
  # Use a lambda: passing extra arguments through across() is deprecated in dplyr 1.1.0+
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(),
               names_to = "Dimension",
               values_to = "Average_Score") %>%
  mutate(Dimension = recode(Dimension,
    relevance = "Relevance to Job Role",
    performance = "Improved Job Performance",
    productivity = "Increased Productivity",
    facilitator = "Facilitator Effectiveness",
    satisfaction = "Overall Satisfaction",
    recommend = "Would Recommend to Colleagues"
  )) %>%
  ggplot(aes(x = reorder(Dimension, Average_Score),
             y = Average_Score,
             fill = Average_Score)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Average_Score, 2)),
            hjust = -0.15, size = 3.5) +
  scale_fill_gradient(low = "#f4a261", high = "#2a9d8f") +
  scale_y_continuous(limits = c(0, 6)) +
  coord_flip() +
  labs(title = "Average Training Effectiveness Score by Dimension",
       subtitle = paste("Primary survey data | n =", nrow(df_clean),
                        "respondents | Scale: 1–5"),
       x = NULL,
       y = "Average Score (1–5)") +
  theme_minimal()
Interpretation: [E.g., “Relevance to job role scores highest at X, indicating the organisation is selecting appropriate training content for its workforce. However, productivity impact scores lowest at Y, revealing a disconnect between training attendance and measurable on-the-job improvement. Management should explore post-training reinforcement mechanisms (such as coaching, practice assignments, or line manager check-ins) to close this gap.”]
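A ranking of six means from 54 respondents can look more decisive than it is. As a rough guide to the uncertainty, the sketch below builds approximate 95% intervals from the means and standard errors reported in the `describe()` output in section 3.1. It assumes the Likert scores can be treated as interval data for this purpose, which is a simplification, and it does not account for the fact that the same respondents rated every dimension.

```r
# Means and standard errors copied from the describe() output in section 3.1
stats <- data.frame(
  dimension = c("relevance", "performance", "productivity",
                "facilitator", "satisfaction", "recommend"),
  mean = c(4.31, 3.80, 3.74, 4.00, 3.94, 4.02),
  se   = c(0.14, 0.15, 0.15, 0.13, 0.14, 0.15)
)
n <- 54

# Approximate 95% interval: mean +/- t critical value * standard error
t_crit <- qt(0.975, df = n - 1)
stats$lower <- round(stats$mean - t_crit * stats$se, 2)
stats$upper <- round(stats$mean + t_crit * stats$se, 2)

# Show the dimensions from highest to lowest mean
stats[order(-stats$mean), ]
```

Where intervals for neighbouring dimensions overlap heavily, the order of those bars in the chart should not be over-interpreted in the management summary.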
# Why: If more sessions correlate with higher productivity ratings,
# this supports the business case for increasing training frequency.
# Caution: num_sessions is entirely NA after the type conversion in section 2.6
# (see n_missing = 54 in the skim output in section 2.9), so this plot has no
# points to draw until that conversion is fixed.
ggplot(df_clean, aes(x = num_sessions, y = productivity)) +
  geom_jitter(color = "steelblue", alpha = 0.5, height = 0.1) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Training Sessions Attended vs. Perceived Productivity Impact",
       x = "Number of Sessions (Last 12 Months)",
       y = "Productivity Rating (1–5)") +
  theme_minimal()
Interpretation: [Does attending more sessions lead to higher productivity ratings? Describe the trend and what management should conclude about training frequency and investment in repeat exposure.]
4 STEP 4 — AI-Assisted Analysis and Critical Reflection
4.1 AI Interaction Log
4.1.1 Interaction 1 — Code Generation
Tool used: Claude (claude.ai)
Prompt submitted: “I have a training effectiveness survey dataset in R with long column names. Write annotated tidyverse code to rename six Likert-scale columns to short aliases like relevance, performance, productivity, facilitator, satisfaction, and recommend, then create a Spearman correlation matrix using corrplot.”
AI Response Summary: [Paste or summarise the AI response here]
What I accepted / modified / rejected: [E.g., “I accepted the rename() structure and corrplot parameters. I added mar = c(0,0,2,0) manually after finding that the chart title was being clipped in the rendered PDF output — the AI had not included this.”]
4.1.2 Interaction 2 — Analytical Guidance
Tool used: Claude (claude.ai)
Prompt submitted: “I have Likert-scale survey data from 54 respondents measuring training effectiveness across 6 dimensions. Should I use Pearson or Spearman correlation for my analysis, and what visualisations are most appropriate for this type of data?”
AI Response Summary: [Summarise the AI’s response — note whether it recommended Spearman and what visualisations it suggested]
What I accepted / modified / rejected: [E.g., “The AI correctly recommended Spearman correlation for ordinal Likert data and suggested boxplots and bar charts, both of which I adopted. It also suggested a radar chart but I rejected this as it is not standard in ggplot2 without additional packages not covered in the course.”]
4.1.3 Interaction 3 — Result Interpretation
Tool used: Claude (claude.ai)
Prompt submitted: “My Spearman correlation matrix shows facilitator effectiveness has the strongest correlation with overall satisfaction at r = [your value]. The productivity dimension has the lowest average score at [your value] out of 5. Write two sentences explaining these findings in plain language suitable for a management report.”
AI Response Summary: [Paste the AI’s interpretation here]
What I accepted / modified / rejected: [E.g., “The interpretation was clear and accurate. I retained the core message but added a specific recommendation the AI did not include, and adjusted the phrasing to match the formal tone of this report.”]
4.2 Critical Reflection
(Approximately 200 words — write this in your own voice)
[Example to adapt: “Throughout this project, I used Claude (claude.ai) at three key stages: code generation, analytical guidance, and result interpretation. The AI performed particularly well in generating boilerplate R code for data cleaning, column renaming, and ggplot2 chart formatting — reducing the time I spent on syntax and allowing me to focus on interpretation and insight.
However, I observed meaningful limitations. When initially prompted about correlation methods, the AI did not proactively flag that Pearson correlation assumes interval-level data and may be less appropriate for ordinal Likert scales. I identified this gap through cross-referencing course materials and overrode the initial suggestion, using Spearman correlation instead. I also found that the AI’s corrplot code omitted margin settings, causing the chart title to be clipped in the rendered PDF — a detail I caught only during proofing.
These experiences reinforced a critical principle: AI tools function best as accelerators for analysts who already possess methodological knowledge. The analyst must understand the statistical assumptions well enough to recognise when an AI suggestion is technically functional but analytically inappropriate. AI output should always be treated as a starting point, not a final answer.”]
5 STEP 5 — Management Report
5.1 Executive Summary
This study evaluated the effectiveness of workplace training programmes within the organisation, drawing on primary survey data from 54 employees across departments. Using a structured questionnaire and exploratory data analysis in RStudio, six training effectiveness dimensions were assessed: relevance to job role, performance improvement, productivity impact, facilitator communication, overall satisfaction, and likelihood to recommend. Key findings indicate that [insert your top 3 findings once you see the actual data outputs]. Three evidence-based recommendations are presented to guide the organisation’s training investment and improvement strategy.
5.2 Research Question and Methodology
Research Question: “To what extent do workplace training programmes influence employee productivity, job performance, and satisfaction within the organisation?”
Primary data was collected via a structured online survey administered to 54 employees using convenience sampling. The instrument captured 15 variables including demographic information, training attendance patterns, and six Likert-scale effectiveness ratings. Data was imported into RStudio, cleaned using the janitor and tidyverse packages (handling missing values, duplicates, variable types, and outliers), and analysed using descriptive statistics, Spearman correlation analysis, and ggplot2 visualisations.
5.3 Key Findings
Finding 1: Facilitator quality is the strongest driver of training satisfaction
The Spearman correlation matrix (see Step 3) reveals that facilitator communication effectiveness has the strongest positive association with overall training satisfaction among all measured dimensions. This suggests that how training is delivered is more closely tied to employees’ satisfaction than any other measured dimension, including content relevance. Because the survey is cross-sectional and self-reported, the association does not by itself show that better facilitation causes higher satisfaction.
Finding 2: Productivity impact is the lowest-rated training dimension
Despite generally positive ratings across most dimensions, the item measuring whether training increased productivity received the lowest average score (3.74 out of 5, compared with 4.31 for relevance). This gap between training attendance and perceived on-the-job impact represents the organisation’s most urgent training effectiveness challenge.
Finding 3: [Insert your third finding based on actual data output]
[2–3 sentences of interpretation linked to a business implication]
5.4 Business Recommendations
Recommendation 1: Invest in Facilitator Development and Quality Assurance
Evidence: Facilitator effectiveness shows the strongest Spearman correlation with overall training satisfaction (correlation matrix, Step 3).
Action: Introduce a structured facilitator assessment framework with quarterly participant feedback reviews. Provide targeted coaching to facilitators rated below 3.5 and establish minimum competency standards before any facilitator leads a programme.
Recommendation 2: Implement Post-Training Application Mechanisms
Evidence: Productivity impact is the lowest-rated dimension, indicating a gap between training attendance and tangible workplace application.
Action: Introduce 30/60/90-day post-training follow-up reviews where line managers assess skill application on the job. Complement classroom sessions with structured on-the-job assignments to reinforce learning transfer and close the attendance-to-impact gap.
Recommendation 3: Address the Most Requested Improvement Area
Evidence: The frequency table of requested improvement areas (Step 3) shows that [most common area from your data] was cited most frequently by respondents.
Action: Review and redesign training content or delivery in this area within the next planning cycle. Involve employees from lower-satisfaction departments in the content co-design process to improve relevance and buy-in.
5.5 AI Augmentation Summary
AI tools (Claude, claude.ai) were used at three stages of this project: code generation, analytical guidance, and result interpretation. The tools delivered the greatest value in accelerating routine coding tasks — particularly data cleaning, column renaming, and chart formatting — and in translating statistical outputs into accessible management language. Human judgment remained essential throughout: the analyst identified and corrected a methodological error (Pearson recommended over Spearman for ordinal data) and a rendering issue not caught by the AI. Overall, AI served as a productivity enhancer rather than a decision-maker, consistent with responsible data analytics practice.
6 Appendix A — AI Interaction Evidence
Insert screenshots or paste transcripts of your three AI interactions below. Each entry should clearly show your prompt, the AI’s response, and what you accepted, modified, or rejected.
Screenshot / Transcript 1: Code generation — column renaming and corrplot (embed image or paste text)
Screenshot / Transcript 2: Analytical guidance — Spearman vs Pearson and chart types (embed image or paste text)
Screenshot / Transcript 3: Result interpretation — correlation and productivity findings (embed image or paste text)
End of submission. Submitted in partial fulfilment of the requirements for Data Analytics I, MMBA 10, Lagos Business School, Pan-Atlantic University.