Empowering Community Voices Through Videography

For the Month of February - March Viz for social good posted a project for Video Volunteers - India and I want to detail how I gathered, wrangled, and analyzed the provided data to create a visualization dashboard and recommendations from my findings.

Note I have already created a dashboard condensing this Rmarkdown file into a TABLEAU visualization view it HERE

SCENARIO

Video Volunteers model over the last 12+ years has been simple. They empower a community member in every district to become a community journalist and social worker and we call them Community Correspondents. They identify a community issue (could be health related, water, gender, etc) and make the first video called the issue video capturing the issue details. Post this, they show this to relevant local officials and work on solving the issue the video is about. As the issue is resolved, they create another video calling it the impact video. This video inspires others to act. Each video is tagged to multiple data relating to the issue and impact. 1 out of every 5 videos today are resolved. And we believe, being able to show our community data in a powerful visualization would help us double our impact rate.

Knowing this:

HOW CAN DATA AMPLIFY THE STORIES OF THESE MARGENALIZED COMMUNITIES?

TARGET AUDIENCE

The Community Correspondent, the community and local government officials. We want to create a visualization that a community when looking at feels that their voices are captured effectively on the platform and officials are able to act on insights of the community.

PREPARE

Lets load some packages:

library(tidyverse)
library(dplyr)
library(readxl)
library(lubridate)

The data source was provided my Video Volunteers and Viz for Social Good and I’ll use the (R.O.C.C.C) method to varify the Reliability, Comprehensiveness, currency, and Cited.

RELIABLE: This data sourece was provided by Video Volunteers and Viz for Social Good.

ORIGINAL: This data source is original because it is gathered first hand through the Community Correspondents and Organization.

COMPREHENSIVE: This data contains an expansive amount of fields and varaibles needed to analyze key aspects of the data. Many fields are multi-choice, yes/no, and descriptions so there are many NULL values as related to these fields.

CURRENT: The data provided is current due to the noted impact and publish dates along with the insigts gained needed to utilized for near future decision making.

CITED: The data as far as I know is publicly available provided by the Viz for social Good Project page.

Now lets load the provided data over last few years: HERE

Lests assign the data to a data frame

df <- read_excel("Data_Video Volunteers.xlsx")

print(df)

## # A tibble: 2,002 × 133
##       id uid    state…¹ state…² distr…³ cc_id cc_name mentor story_pitch_date   
##    <dbl> <chr>    <dbl> <chr>   <chr>   <dbl> <chr>   <chr>  <dttm>             
##  1  2465 JH_52…      16 Jharkh… Palamu    438 Bhaska… Shikh… 2021-08-10 00:00:00
##  2  2476 JH_52…      16 Jharkh… Khunti    409 Ravind… Shikh… 2021-08-13 00:00:00
##  3  2879 JH_52…      16 Jharkh… Garhwa    440 Prasha… Gradu… 2021-10-22 00:00:00
##  4  2881 JH_52…      16 Jharkh… Garhwa    440 Prasha… Gradu… 2021-10-23 00:00:00
##  5  2639 JH_52…      16 Jharkh… Dhanbad    83 Halima… Gurup… 2021-08-01 00:00:00
##  6   953 JH_50…      16 Jharkh… Dhanbad    83 Halima… Gurup… 2020-10-15 00:00:00
##  7  4364 JH_53…      16 Jharkh… Godda     452 Mohd G… Shikh… 2022-08-25 00:00:00
##  8  5057 JH_54…      16 Jharkh… Giridih    86 Beena … Shikh… 2022-12-20 00:00:00
##  9  2543 JH_52…      16 Jharkh… Giridih    86 Beena … Shikh… 2021-08-19 00:00:00
## 10  4345 JH_53…      16 Jharkh… Chatra    407 Saroj … Gradu… 2022-08-22 00:00:00
## # … with 1,992 more rows, 124 more variables: description <chr>,
## #   village_name <chr>, block_name <chr>, panchayat_name <chr>, approval <chr>,
## #   reason_to_hold <chr>, cc_creative_idea_for_issue_video <chr>,
## #   shoot_plan <chr>, topic_identification_reason <chr>,
## #   other_topic_identification_reason <chr>, issue <chr>, issue_2 <chr>,
## #   themes <chr>, government_program_details <chr>, project <chr>,
## #   problem_identification_by_cc <chr>, impact_problem_reporting <chr>, …

This data source contains 2002 observations and 133 variables.

That is a lot of variables and many are not needed for my findings so I will trim it down to the variables that I need.

vv_df <- df%>%
  select(id,state_name, district, cc_name, mentor, story_pitch_date,issue, issue_2, themes, government_program_details, approval, impact_in_process, duration_of_problem, underlying_reason_of_issue, primary_affected_groups, actors_who_need_to_take_action, no_of_individuals_affected, impact_govt_official_support_details, impact_youtube_publish_date )

print(vv_df)

## # A tibble: 2,002 × 19
##       id state…¹ distr…² cc_name mentor story_pitch_date    issue issue_2 themes
##    <dbl> <chr>   <chr>   <chr>   <chr>  <dttm>              <chr> <chr>   <chr> 
##  1  2465 Jharkh… Palamu  Bhaska… Shikh… 2021-08-10 00:00:00 Anti… Health  A Mat…
##  2  2476 Jharkh… Khunti  Ravind… Shikh… 2021-08-13 00:00:00 Heal… Liveli… A Mat…
##  3  2879 Jharkh… Garhwa  Prasha… Gradu… 2021-10-22 00:00:00 Anti… Liveli… An en…
##  4  2881 Jharkh… Garhwa  Prasha… Gradu… 2021-10-23 00:00:00 Live… Anti P… An en…
##  5  2639 Jharkh… Dhanbad Halima… Gurup… 2021-08-01 00:00:00 Heal… Anti P… A Mat…
##  6   953 Jharkh… Dhanbad Halima… Gurup… 2020-10-15 00:00:00 Heal… Anti P… A Mat…
##  7  4364 Jharkh… Godda   Mohd G… Shikh… 2022-08-25 00:00:00 Powe… Anti P… A Mat…
##  8  5057 Jharkh… Giridih Beena … Shikh… 2022-12-20 00:00:00 Water Anti P… An en…
##  9  2543 Jharkh… Giridih Beena … Shikh… 2021-08-19 00:00:00 Anti… Health  A Mat…
## 10  4345 Jharkh… Chatra  Saroj … Gradu… 2022-08-22 00:00:00 Forc… Forced… A Sys…
## # … with 1,992 more rows, 10 more variables: government_program_details <chr>,
## #   approval <chr>, impact_in_process <chr>, duration_of_problem <chr>,
## #   underlying_reason_of_issue <chr>, primary_affected_groups <chr>,
## #   actors_who_need_to_take_action <chr>, no_of_individuals_affected <dbl>,
## #   impact_govt_official_support_details <chr>,
## #   impact_youtube_publish_date <dttm>, and abbreviated variable names
## #   ¹state_name, ²district

Now that I have a data frame with all my needed variables I need to clean it so it will be easier to visualize in tableau.

I will need to specifically wrangle the Government Officials, Affected groups variables, Issues, and .

Government Support Officals

First starting with impact_govt_official_support_details I noticed this is a multi-choice variable but each choice is separated by commas instead of each having their own field, so I am going to need to split the possible options into 3 separate columns.

This is so options are not separate by commas (i,e a = 1,2,3) and instead have their own fields (i.e a=1, b=2, c=3).

## replace NA value to null
vv_df <- vv_df %>%
  mutate_at(c('impact_govt_official_support_details'), ~replace_na(.,''))

## delimiter split govt official agencies

vv_df$govt <- vv_df$impact_govt_official_support_details

vv_df <-vv_df %>%
  separate(impact_govt_official_support_details, c('govt_split1','govt_split2','govt_split3'), sep = ',')

vv_df <- vv_df%>%
  mutate_at(c('govt_split2', 'govt_split3'), ~replace_na(.,'')) ##removed 'NA'

## Need to trim field b/c split2 & split3 have spaces after respective commas.

vv_df$govt_split2 <- trimws(vv_df$govt_split2, "both" )

vv_df$govt_split3 <- trimws(vv_df$govt_split3, "both" )

Now that the 3 choice options now have their own columns, I need to create a new data frame where each govt official have their own row.

This is so when a issues appears it does not appear as (Issue1 = govt_official_1, govt_official_2) , but as (Issue1 = govt_official_1) (Issue1 = govt_official_2)

Govt_Officials_df <- data.frame( govt_org = c(vv_df$govt_split1, vv_df$govt_split2, vv_df$govt_split3),
                         govt_org_vv_ID = c(vv_df$id),
                         govt_org_vv_district = c(vv_df$district))

## check to make sure there are no duplicates & to confirm TRIM worked.
unique(Govt_Officials_df$govt_org)

##  [1] "Others officials"                                   
##  [2] "Block development officer or other block officials" 
##  [3] "District Collector or district officials"           
##  [4] "Panchayat officials"                                
##  [5] "Block development officer or other block officials "
##  [6] ""                                                   
##  [7] "MLA"                                                
##  [8] "Other village officials (non-panchayat)"            
##  [9] "Police"                                             
## [10] "MP"

Affected Marginalized Groups

Primary_affected_groups columns have the same issues as Govt_officials_df so I need to clean it the same way, so I am going to need to split the possible options into 3 separate columns.

vv_df <- vv_df %>%
  separate(primary_affected_groups, c('affected_groups_1', 'affected_groups_2', 'affected_groups_3'), sep = ',')

vv_df <- vv_df %>%
  mutate_at(c('affected_groups_1','affected_groups_2','affected_groups_3'), ~replace_na(.,''))

## Need to trim field b/c split2 & split3 have spaces after respective commas.

vv_df$affected_groups_2 <- trimws(vv_df$affected_groups_2, "both" )

vv_df$affected_groups_3 <- trimws(vv_df$affected_groups_3, "both" )

Now that the 3 choice options now have their own columns, I need to create a new data frame where each affected marginalized group have their own row.

Affected_Groups_df <- data.frame(affected_groups = c(vv_df$affected_groups_1, vv_df$affected_groups_2, vv_df$affected_groups_3),
           affected_groups_vv_id = c(vv_df$id),
           affected_groups_vv_district = c(vv_df$district))

summary(Affected_Groups_df)

##  affected_groups    affected_groups_vv_id affected_groups_vv_district
##  Length:6006        Min.   :    1         Length:6006                
##  Class :character   1st Qu.:  521         Class :character           
##  Mode  :character   Median : 3463         Mode  :character           
##                     Mean   : 3760                                    
##                     3rd Qu.: 6247                                    
##                     Max.   :12459

## check to make sure there are no duplicates & to confirm TRIM worked.
unique(Affected_Groups_df$affected_groups)

##  [1] "No particular groups is affected -- the problem is cross-cutting"
##  [2] "Tribals"                                                         
##  [3] "Muslims"                                                         
##  [4] "Women"                                                           
##  [5] "Dalits"                                                          
##  [6] "Other religious minorities"                                      
##  [7] ""                                                                
##  [8] "Children"                                                        
##  [9] "Youth"                                                           
## [10] "Other lower castes (e.g. OBC)"                                   
## [11] "Elderly"                                                         
## [12] "Adivasi"                                                         
## [13] "Scheduled Castes"                                                
## [14] "Transgender"                                                     
## [15] "Scheduled Tribe"

Issues

Issues variable is another multi-choice field, fortunately the selected options are already in their own fields (2 choices maximum).

So I do not have to delimiter split like with the government officials and affected groups, so I can go straight into to creating the Issues dataframe.

Issue_df <- data.frame(Issue = c(vv_df$issue, vv_df$issue_2),
           Issue_vv_ID = c(vv_df$id),
           Issue_vv_District = c(vv_df$district),
           Issue_vv_CCNames = c(vv_df$cc_name),
           Issue_vv_YT_Published = c(vv_df$impact_youtube_publish_date),
           Issue_vv_Duration_of_Issue = c(vv_df$duration_of_problem),
           Issue_vv_Story_Pitch_Date = c(vv_df$story_pitch_date))

Issue_df <- Issue_df %>%
  mutate(Issue_vv_YT_Published = as.character(Issue_vv_YT_Published), Issue_vv_Story_Pitch_Date = as.character(Issue_vv_Story_Pitch_Date))

##Issue_df <- replace(Issue_df, is.na(Issue_df),'')

summary(Issue_df)

##     Issue            Issue_vv_ID    Issue_vv_District  Issue_vv_CCNames  
##  Length:4004        Min.   :    1   Length:4004        Length:4004       
##  Class :character   1st Qu.:  521   Class :character   Class :character  
##  Mode  :character   Median : 3463   Mode  :character   Mode  :character  
##                     Mean   : 3760                                        
##                     3rd Qu.: 6247                                        
##                     Max.   :12459                                        
##  Issue_vv_YT_Published Issue_vv_Duration_of_Issue Issue_vv_Story_Pitch_Date
##  Length:4004           Length:4004                Length:4004              
##  Class :character      Class :character           Class :character         
##  Mode  :character      Mode  :character           Mode  :character         
##                                                                            
##                                                                            
##

## To find all Issues presented in data set
unique(Issue_df$Issue)

##  [1] "Anti Poverty Programs"         "Health"                       
##  [3] "Livelihoods"                   "Power and Energy"             
##  [5] "Water"                         "Forced Evictions"             
##  [7] "Sanitation"                    "Rural Innovation"             
##  [9] "Crumbling Infrastructure"      "Education"                    
## [11] "Agrarian Crisis"               "Art & Culture"                
## [13] "Governance and Accountability" "Indigenous People"            
## [15] "Natural Disaster"              "Environment"                  
## [17] "Mining"                        "Courage & Inspiration"        
## [19] "Trafficking & Migration"       "Labour Rights"                
## [21] "Technology"                    "Corruption"                   
## [23] "Reproductive Rights"           "Gender"                       
## [25] "Religion & Faith"              "State Repression"             
## [27] "Development"                   "Impact"                       
## [29] "Infrastructure"                "Caste"                        
## [31] "Conflict"                      NA                             
## [33] "Food & Social Security"

ANALYZE

Guiding Questions

Viz for social good provided some guiding questions to help with Analysis:

How can we use this data and videos to tell a better story and give a better picture of various districts
How can we view this data from a public communications perspective to capture the attention of partners and potential collaborators? How can we present this data in a way that makes it stand out and be impactful ?
What are the major issues affecting various regions, groups of stakeholders and how many people are being affected by these issues?
What is the average duration of a community issue (across issues) before it is resolved?
How does community engagement and participation impact the resolution of reported issues?

knowing this I want to find: + How many Issues were Produced. + How many Issues were Resolved. + Produced to Resolved Ratio. + Number of Individuals Impacted + How long it takes for issues to be resolved.

How many Issues were Produced.

Issue_df %>%
  filter(!is.na(Issue)) %>%
  mutate(n = cumsum(Issue)) %>%
  summarise(Issues_Produced = n())

##   Issues_Produced
## 1            3146

How many Issues were Resolved.

I’ll use Issue_vv_YT_Published which contains a url to the video of resolved issues. Since there is no varable that clearly notes that an Issue is resolved, so this is the closest mode of confirming the resolution of an issue.

Issue_df %>%
  filter(!is.na(Issue_vv_YT_Published)) %>%
  mutate(n = cumsum(Issue_vv_YT_Published)) %>%
  summarise(Issues_Resolved = n())

##   Issues_Resolved
## 1             556

Produced to Resolved Ratio.

Issue_df %>%
  filter(!is.na(Issue)) %>%
  mutate(n = cumsum(Issue)) %>%
  summarise(n()) -> P

Issue_df %>%
  filter(!is.na(Issue_vv_YT_Published)) %>%
  mutate(n = cumsum(Issue_vv_YT_Published)) %>%
  summarise(n()) -> R

 Ratio <- P/R
 colnames(Ratio)[1] <- 'Ratio'
 print(Ratio)

##      Ratio
## 1 5.658273

Number of Individuals Impacted

sum(vv_df$no_of_individuals_affected, na.rm = TRUE)

## [1] 3274503

How long it takes for issues to be resolved.

Month <- as.tibble(interval(Issue_df$Issue_vv_Story_Pitch_Date, Issue_df$Issue_vv_YT_Published)%>%
        as.numeric('months'))
Month <- na.omit(Month)

AvgMonths <- sum(Month)/count(Month)
colnames(AvgMonths)[1] <- "Average Time to resolve Issue in MONTHS"
print(AvgMonths)

##   Average Time to resolve Issue in MONTHS
## 1                                18.95988

VISUALIZATIONS

No to visualize the data given.

NOTE I am not the most familiar with visualizations in Rstudio, but I always love to try and improve my skills. However I MUCH prefer visualizing my data in TABLEAU so here is a link to my TABLEAU visualization of this data source HERE

Visualizing Issues Produced

filter(Issue_df, Issue != '')%>%
  ggplot(.) +
  geom_col(aes(x = Issue, y = Issue_vv_ID, fill = Issue))

Visualizing Government Officials support

filter(Govt_Officials_df, govt_org != '')%>%
  ggplot(.) +
  geom_col(aes(x = govt_org, y = govt_org_vv_ID, fill = govt_org))

Visualizing Marginalized Groups

filter(Affected_Groups_df, affected_groups != '')%>%
  ggplot(.) +
  geom_col(aes(x = affected_groups, y = affected_groups_vv_id, fill = affected_groups))

Visualizing how many individuals are impacted per issue

vv_df%>%
  ggplot(.) +
  geom_col(aes(x = reorder(issue, -no_of_individuals_affected), y = no_of_individuals_affected, fill =  issue))

RECOMMENDATIONS

Now to conclude. I have Gathered, Wrangled, Analyzed, Visualized and followed various rules of thumb to ensure data integrity and from my findings I would recommend.

Focusing more Video resources on the Chatra District due to the high Issues Produced to Resolved ratio at 3.6 along with it taking about 1 year for issues to be resolved.
The District of West Singhbhum community would benefit the most from Videos and an overall emphasis on finding solutions for health-related issues, due to this district encompassing 99% of individuals suffering from Health-related issues.
Dalits face uniquely harsh harassment due to the Caste System. Videos empathetically detailing the plight of these individuals and how social acceptance and comrodery will pay dividends for the state of Jharkhand.
Continued networking with government officials, shareholders, and benevolent organizations to acquire necessary resources and labor to resolve issues and affect government policies.

Now if you would like to view this wall of text in a visually appealing dashboard check out my TABLEAU LINK HERE