Data Cleaning

This section prepares the dataset for analysis by addressing inconsistencies, missing values, and formatting issues across all three data sources. The goal is to ensure the data is sufficiently standardized and reliable for summary and reporting, while maintaining transparency around any limitations.

Cleaning and Standardizing TTA_Data tab

# Trim whitespace on all character columns
# Normalize empty strings to NA
# Standardize Text across columns

TTAData_clean <- TTAData_Raw %>%
  mutate(across(where(is.character), ~ stringr::str_squish(.x))) %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  mutate(across(c("Campaign Name", "Campaign Type", "Delivery Method", "T/TA State"), ~ str_to_title(.x)))


 
## Campaign Names
## Removing TA/Technical Assistance Markers
TTAData_clean <- TTAData_clean %>%
  mutate( `Campaign Name` = str_trim(str_remove(`Campaign Name`, regex("^(TA:|Training:)\\s*", ignore_case = TRUE))))



## Campaign Type
## Identify and change non-categorical variables
TTAData_clean <- TTAData_clean %>% 
  mutate(`Campaign Type` = replace(`Campaign Type`, `Campaign Name` == "Program Coaching Call 18" & `Campaign Type` == "Ta", "Technical Assistance")) %>%
  mutate(`Campaign Type` = replace(`Campaign Type`, `Campaign Name` == "Introductory Training 10" & is.na(`Campaign Type`), "Training")) %>% 
  mutate(`Campaign Name` = str_trim(str_remove(`Campaign Name`, regex("^ta\\s*;\\s*", ignore_case = TRUE))))



## Delivery Method
## Remove filler "Face to Face" information and standardize

TTAData_clean <- TTAData_clean %>% 
  mutate(`Delivery Method` = str_trim(str_remove(`Delivery Method`, regex("^Face To Face\\s*-?\\s*", ignore_case = TRUE)))) %>% 
  mutate(`Delivery Method` = if_else(`Delivery Method` %in% c("Video Call", "In Person", "Conference Workshop", "Webinar"), `Delivery Method`, `Delivery Method`))

TTAData_clean %>% 
  filter(!`Delivery Method` %in% c("Video Call", "In Person", "Conference Workshop", "Webinar")) %>% 
  distinct(`Delivery Method`)%>%
  View()

TTAData_clean <- TTAData_clean %>% 
  mutate(`Delivery Method` = replace(`Delivery Method`, `Delivery Method` == "Conf. Workshop", "Conference Workshop"))



## Dates
## Removing 2 entries because further details are needed to correct them. 
 TTAData_clean <- TTAData_clean %>% filter(!is.na(`Start Date`) & !is.na(`End Date`) & `End Date` >= `Start Date`)
 
 
 
## Number Trained
 TTAData_clean %>% 
  filter(is.na(suppressWarnings(as.numeric(`Number Trained`))) & !is.na(`Number Trained`)) %>% 
  View()

 TTAData_clean <- TTAData_clean %>% 
  mutate(`Number Trained` = replace(`Number Trained`, str_to_lower(str_trim(`Number Trained`)) == "ten", "10"))
 
 
##T/TA State
## Clean and Standardize column 
TTAData_clean <- TTAData_clean %>%
   mutate(`T/TA State` = if_else( str_to_lower(str_trim(`T/TA State`)) == "remote","Remote", str_to_upper(str_trim(`T/TA State`)))) %>%
   mutate (`T/TA State` = if_else( str_to_upper(str_trim(`T/TA State`)) == "N/A", NA_character_, `T/TA State`))



## Drop Duplicates

duplicates <- TTAData_clean %>% 
  group_by(`Campaign Name`, `Start Date`, `End Date`, `Number Trained`, `T/TA State`) %>% 
  filter(n() > 1)

View(duplicates)

TTAData_clean <- TTAData_clean %>% 
  distinct(`Campaign Name`, `Start Date`, `End Date`, `T/TA State`, .keep_all = TRUE)



View(TTAData_clean)

Cleaning and Standardizing RegistrationData_Raw

glimpse(RegistrationData_Raw)

## Rows: 85
## Columns: 5
## $ Participant_Name  <chr> "Participant_LWM", "Participant_VBY", "Participant_U…
## $ Organization      <chr> "nonprofit org", "County Health Dept", NA, NA, "coun…
## $ Training_Title    <chr> "Intro Training", "Special Topic", "Intro Training",…
## $ Date              <dttm> 2025-03-25, 2025-03-03, 2025-01-31, 2025-03-10, 202…
## $ Attendance_Status <chr> NA, "Attended", NA, "attended", "attended", NA, NA, …

# Trim whitespace on all character columns
# Normalize empty strings to NA
# Standardize Text across columns


RegistrationData_clean <- RegistrationData_Raw %>%
  mutate(across(where(is.character), ~ stringr::str_squish(.x))) %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  mutate(across(c("Participant_Name", "Organization", "Training_Title", "Attendance_Status"), ~ str_to_title(.x)))

View(RegistrationData_clean)



## Organization
## Clean and Standardize column 

RegistrationData_clean %>% 
  distinct(Organization) %>% 
  View()

RegistrationData_clean <- RegistrationData_clean %>% 
  mutate(Organization = str_to_title(str_replace_all(Organization, regex("Dept\\.?|\\.", ignore_case = TRUE), function(x) ifelse(grepl("Dept", x, ignore.case = TRUE), "Department", ""))))

View(RegistrationData_clean)


## Training Title
## Clean and Standardize column - No Change needed

RegistrationData_clean %>% 
  distinct(Training_Title) %>% 
  View()



## Attendance_Status
## Clean and Standardize column 
RegistrationData_clean %>% 
  distinct(Attendance_Status) %>% 
  View()



## Drop Duplicate

duplicates <- RegistrationData_clean %>% 
  group_by(Participant_Name, Training_Title, Date) %>% 
  filter(n() > 1)

View(duplicates)

RegistrationData_clean <- RegistrationData_clean %>% 
  distinct(Participant_Name, Training_Title, Date, .keep_all = TRUE)


View(RegistrationData_clean)

Cleaning and Standardizing SurveyData_Raw

glimpse(SurveyData_Raw)

## Rows: 60
## Columns: 4
## $ Participant_Name   <chr> "Participant_OUA", "Participant_YHN", "Participant_…
## $ Training_Title     <chr> "Intro Training", "Intro Training", "Advanced Train…
## $ Satisfaction_Score <dbl> 2, 1, 3, 3, 4, 1, 5, 5, NA, 1, 4, 3, NA, 5, 5, 5, 5…
## $ Would_Recommend    <chr> "Yes", "yes", "yes", "Yes", "yes", "yes", "Yes", "y…

# Trim whitespace on all character columns
# Normalize empty strings to NA
# Standardize Text across columns


SurveyData_clean <- SurveyData_Raw %>%
  mutate(across(where(is.character), ~ stringr::str_squish(.x))) %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  mutate(across(c("Participant_Name", "Training_Title", "Would_Recommend"), ~ str_to_title(.x)))

View(SurveyData_clean)

## Training Title
## Clean and Standardize column - No Change needed

SurveyData_clean %>% 
  distinct(Training_Title) %>% 
  View()

## Would Recommend
## Clean and Standardize column - No Change needed

SurveyData_clean %>% 
  distinct(Would_Recommend) %>% 
  View()

Save File

View(TTAData_clean)

View(RegistrationData_clean)

View(SurveyData_clean)

##writexl::write_xlsx(list(TTA_Data = TTAData_clean,Registration_Data = RegistrationData_clean, Survey_Data = SurveyData_clean), "PCA_board_report_sample_dataset_clean.xlsx")

Data Analysis

Following cleaning and standardization, the datasets are structured for analysis across training delivery, attendance, and survey feedback. While key inconsistencies have been addressed, some gaps remain, particularly in linking records across datasets, which informs how results are interpreted in later sections.

TTA Data Tab Summary

Number of Trainings Summarize total sessions and total participants by Campaign Type

# 1. Remove rows where Campaign Type is missing
# 2. Count the number of Technical Assistance and Training sessions
# 3. Sum the total number of people trained in each category
# 4. Reshape the summary so sessions and participants appear as rows
# 5. Add a Total column combining Technical Assistance and Training

View(TTAData_clean)

TrainingsNumber <- TTAData_clean %>%
  filter(!is.na(`Campaign Type`)) %>% 
  summarise(
    TA_Sessions = sum(`Campaign Type` == "Technical Assistance"),
    Training_Sessions = sum(`Campaign Type` == "Training"),
    TA_People = sum(as.numeric(`Number Trained`)[`Campaign Type` == "Technical Assistance"], na.rm = TRUE),
    Training_People = sum(as.numeric(`Number Trained`)[`Campaign Type` == "Training"], na.rm = TRUE)
  ) %>% 
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>% 
  separate(Metric, into = c("Type", "Measure"), sep = "_") %>% 
  pivot_wider(names_from = Type, values_from = Value) %>% 
  mutate(
    Measure = recode(Measure, Sessions = "Sessions", People = "Participants"),
    Total = TA + Training
  )

  TrainingsNumber %>%
  kable(
    caption = "Training and Technical Assistance Summary",
    col.names = c("Measure", "Technical Assistance", "Training", "Total"),
    align = "lccc"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#4A6FA5"
  )

Training and Technical Assistance Summary
Measure	Technical Assistance	Training	Total
Sessions	41	12	53
Participants	440	428	868

The data shows the total number of trainings delivered and the number of participants reached through each activity type. Training sessions account for about half of participants.

Registration Data Tab Summary

Attendance by rate Compare number of people registered with actual attendance outcomes

# 1. Group data by each training title
# 2. Count total registrations for each training
# 3. Count how many participants attended
# 4. Count how many were no-shows
# 5. Count missing attendance status entries
# 6. Calculate attendance rate using attended vs attended + no-shows
# 7. Add a total row summarizing across all trainings

View(RegistrationData_clean)

Attendance <- RegistrationData_clean %>% 
  group_by(Training_Title) %>% 
  summarise(
    Registered = n(),
    Attended = sum(Attendance_Status == "Attended", na.rm = TRUE),
    No_Show = sum(Attendance_Status == "No Show", na.rm = TRUE),
    Missing = sum(is.na(Attendance_Status)),
    Attendance_Rate = Attended / (Attended + No_Show),
    .groups = "drop"
  ) %>% 
  bind_rows(
    summarise(
      .,
      Training_Title = "Total",
      Registered = sum(Registered),
      Attended = sum(Attended),
      No_Show = sum(No_Show),
      Missing = sum(Missing),
      Attendance_Rate = Attended / (Attended + No_Show)
    )
  )

Attendance %>%
  kable(
    caption = "Attendance by Training",
    col.names = c("Training", "Registered", "Attended", "No Show", "Missing", "Attendance"),
    align = "lccccc"
  ) %>%
 kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#4A6FA5"
  )

Attendance by Training
Training	Registered	Attended	No Show	Missing	Attendance
Advanced Training	17	3	6	8	0.3333333
Intro Training	44	24	11	9	0.6857143
Special Topic	19	10	6	3	0.6250000
Total	80	37	23	20	0.6166667

The data shows that not all registered participants attend trainings, with a noticeable drop-off between registration and attendance. Attendance rates vary by training, suggesting differences in participant engagement or scheduling effectiveness across sessions.

Note: Missing data can change what these results indicate.

Survey Data Tab Summary

Summary of Satisfaction Summarize participant satisfaction and recommendation rates by training

# 1. Group survey data by training title
# 2. Calculate average satisfaction score for each training
# 3. Calculate recommendation rate based on Yes vs No responses
# 4. Count total number of survey responses
# 5. Format satisfaction scores and recommendation rates for readability
# 6. Select and organize final columns for reporting

View(SurveyData_clean)

SatisfactionTable <- SurveyData_clean %>% 
  group_by(Training_Title) %>% 
  summarise(
    Avg_Satisfaction = mean(Satisfaction_Score, na.rm = TRUE),
    Recommend_Rate = sum(Would_Recommend == "Yes", na.rm = TRUE) /
      sum(Would_Recommend %in% c("Yes", "No"), na.rm = TRUE),
    Responses = n(),
    .groups = "drop"
  ) %>% 
  mutate(
    Satisfaction = round(Avg_Satisfaction, 2),
    Recommendation = paste0(round(Recommend_Rate * 100, 1), "%")
  ) %>% 
  select(
    Training_Title,
    Responses,
    Satisfaction,
    Recommendation
  )

SatisfactionTable %>%
  kable(
    caption = "Survey Satisfaction by Training",
    col.names = c("Training", "Responses", "Avg Satisfaction", "Recommend (%)"),
    align = "lccc"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#4A6FA5"
  )

Survey Satisfaction by Training
Training	Responses	Avg Satisfaction	Recommend (%)
Advanced Training	14	3.09	81.8%
Intro Training	33	2.67	65.5%
Special Topic	13	3.67	72.7%

The data shows generally moderate to high satisfaction among respondents, with most participants indicating they would recommend the training. However, satisfaction levels vary across training types, suggesting differences in perceived quality or relevance.

Participants vs Survey Respondents

Compare total training participants with total survey respondents

# 1. Pull the total number of training participants from the TrainingsNumber table

Total_Participants <- TrainingsNumber %>% 
  filter(Measure == "Participants") %>% 
  pull(Training)

# 2. Count the number of completed survey responses
# 3. Calculate the number of participants who did not respond to the survey

Total_Respondents <- nrow(SurveyData_clean)
Non_Respondents <- Total_Participants - Total_Respondents

# 4. Build a small dataset for respondent vs non respondent groups
PieData <- data.frame(
  Group = c("Respondents", "Non-Respondents"),
  Count = c(Total_Respondents, Non_Respondents)
)

# 5. Plot the data as a pie chart to show survey coverage visually
# 6. Add percentage labels to show the share of participants represented in the survey

ggplot(PieData, aes(x = "", y = Count, fill = Group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(
    aes(label = paste0(round(Count / sum(Count) * 100, 1), "%")),
    position = position_stack(vjust = 0.5),
    color = "white"
  ) +
  labs(
    title = "Survey Response Coverage",
    fill = ""
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  ) +
  scale_fill_manual(values = c("#4A6FA5", "#DCE6F5"))

This distribution highlights limited survey coverage, meaning that satisfaction results are based on a relatively small subset of participants.

Coverage Adjusted Satisfaction

Compare average satisfaction with satisfaction adjusted by survey response coverage

# 1. Count total registered participants for each training
# 2. Count survey responses and average satisfaction score for each training

TrainingTotals <- RegistrationData_clean %>%
  group_by(Training_Title) %>%
  summarise(Total_Participants = n(), .groups = "drop")

# 3. Join registration totals with survey results by training title
# 4. Calculate response rate for each training
TrainingSurvey <- SurveyData_clean %>%
  group_by(Training_Title) %>%
  summarise(
    Responses = n(),
    Avg_Satisfaction = mean(Satisfaction_Score, na.rm = TRUE),
    .groups = "drop")

# 5. Adjust satisfaction by multiplying average satisfaction by response rate
PlotData <- TrainingTotals %>%
  left_join(TrainingSurvey, by = "Training_Title") %>%
  mutate(
    Response_Rate = Responses / Total_Participants,
    Adjusted_Satisfaction = Avg_Satisfaction * Response_Rate
  )


# 6. Reshape the data so average and adjusted satisfaction can be plotted together
PlotData_long <- PlotData %>%
  select(Training_Title, Avg_Satisfaction, Adjusted_Satisfaction) %>%
  pivot_longer(
    cols = c(Avg_Satisfaction, Adjusted_Satisfaction),
    names_to = "Metric",
    values_to = "Value"
  )

# 7. Plot both scores side by side for comparison
ggplot(PlotData_long, aes(x = Training_Title, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(
    title = "Satisfaction by Training Type",
    subtitle = "Adjusted satisfaction accounts for survey response coverage",
    x = "Training Type",
    y = "Score"
  ) +
  scale_fill_manual(
    values = c("#4A6FA5", "#A5B8D8"),
    labels = c("Adjusted Satisfaction", "Average Satisfaction")
  ) +
  theme_minimal() +
    theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

The data shows that only a subset of participants completed the survey, and in some cases survey responses exceed recorded attendance. This indicates inconsistencies across datasets and suggests that survey results may not fully represent all participants.

key Insights

The main insight is that while the program reaches a large number of participants, survey response rates are relatively low compared to total participation. This means that satisfaction results reflect only a small subset of participants, limiting how representative those findings are. As a result, feedback metrics should be interpreted with caution, and improving survey coverage would strengthen future analysis.

Recomendation

Improve survey response rates by integrating survey completion into training workflows or follow ups.This could include setting aside time at the end of each session for participants to complete the survey, sharing the link while attendees are still present, or having facilitators prompt completion before participants leave. Follow-up reminders, such as emails or text messages shortly after the training, can reinforce this. Integrating surveys into the workflow increases the likelihood of response while the experience is still fresh, leading to more complete and representative feedback.

Brief Notes

Data cleaning and standardization: Cleaning was done in two steps. First, standard cleaning was applied across all datasets, including trimming extra spaces, converting empty cells to missing values, and standardizing text fields to ensure consistency. Then, each dataset was reviewed column by column to address specific issues.

Assumptions made: The analysis assumes that survey respondents are broadly representative of training participants, despite limited coverage. It also treats registration and attendance data as reasonable proxies for participation, given the lack of consistent identifiers across datasets.

Limitations: Survey responses represent only a small share of total participants (approximately 14%), meaning the data does not fully capture the experience of the broader training population. As a result, satisfaction and feedback metrics should be interpreted with caution, as they reflect only a limited subset of participants.

Handling missing data: Missing values were addressed on a case-by-case basis. Records with missing or misaligned dates were removed where dates were essential for the analysis. For other fields, entries were retained if they contained relevant information in other columns. Missing values were then included or excluded in calculations as appropriate, ensuring that summaries were not unnecessarily reduced while still avoiding distortion in key metrics.

key decision Key Decision: The datasets were not fully joined across training, registration, and survey data because the counts did not align across sources. In some cases, the number of survey responses exceeded the number of registered or attended participants, indicating that records were not consistently linked. Forcing a merge in this situation would have created misleading results. Instead, each dataset was summarized separately and compared at a high level, ensuring the analysis reflects what the data can reliably support.

Training Participation, Attendance, and Feedback Summary

NRCAC Data Coordinator Take-Home Assignment

Adina Aamir

2026-04-29

Overall Summary

Import Data