For the Month of February - March Viz for social good posted a project for Video Volunteers - India and I want to detail how I gathered, wrangled, and analyzed the provided data to create a visualization dashboard and recommendations from my findings.
Note I have already created a dashboard condensing this Rmarkdown file into a TABLEAU visualization view it HERE
Video Volunteers model over the last 12+ years has been simple. They empower a community member in every district to become a community journalist and social worker and we call them Community Correspondents. They identify a community issue (could be health related, water, gender, etc) and make the first video called the issue video capturing the issue details. Post this, they show this to relevant local officials and work on solving the issue the video is about. As the issue is resolved, they create another video calling it the impact video. This video inspires others to act. Each video is tagged to multiple data relating to the issue and impact. 1 out of every 5 videos today are resolved. And we believe, being able to show our community data in a powerful visualization would help us double our impact rate.
Knowing this:
HOW CAN DATA AMPLIFY THE STORIES OF THESE MARGENALIZED COMMUNITIES?
The Community Correspondent, the community and local government officials. We want to create a visualization that a community when looking at feels that their voices are captured effectively on the platform and officials are able to act on insights of the community.
Lets load some packages:
library(tidyverse)
library(dplyr)
library(readxl)
library(lubridate)
The data source was provided my Video Volunteers and Viz for Social Good and I’ll use the (R.O.C.C.C) method to varify the Reliability, Comprehensiveness, currency, and Cited.
RELIABLE: This data sourece was provided by Video Volunteers and Viz for Social Good.
ORIGINAL: This data source is original because it is gathered first hand through the Community Correspondents and Organization.
COMPREHENSIVE: This data contains an expansive amount of fields and varaibles needed to analyze key aspects of the data. Many fields are multi-choice, yes/no, and descriptions so there are many NULL values as related to these fields.
CURRENT: The data provided is current due to the noted impact and publish dates along with the insigts gained needed to utilized for near future decision making.
CITED: The data as far as I know is publicly available provided by the Viz for social Good Project page.
Now lets load the provided data over last few years: HERE
Lests assign the data to a data frame
df <- read_excel("Data_Video Volunteers.xlsx")
print(df)
## # A tibble: 2,002 × 133
## id uid state…¹ state…² distr…³ cc_id cc_name mentor story_pitch_date
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <dttm>
## 1 2465 JH_52… 16 Jharkh… Palamu 438 Bhaska… Shikh… 2021-08-10 00:00:00
## 2 2476 JH_52… 16 Jharkh… Khunti 409 Ravind… Shikh… 2021-08-13 00:00:00
## 3 2879 JH_52… 16 Jharkh… Garhwa 440 Prasha… Gradu… 2021-10-22 00:00:00
## 4 2881 JH_52… 16 Jharkh… Garhwa 440 Prasha… Gradu… 2021-10-23 00:00:00
## 5 2639 JH_52… 16 Jharkh… Dhanbad 83 Halima… Gurup… 2021-08-01 00:00:00
## 6 953 JH_50… 16 Jharkh… Dhanbad 83 Halima… Gurup… 2020-10-15 00:00:00
## 7 4364 JH_53… 16 Jharkh… Godda 452 Mohd G… Shikh… 2022-08-25 00:00:00
## 8 5057 JH_54… 16 Jharkh… Giridih 86 Beena … Shikh… 2022-12-20 00:00:00
## 9 2543 JH_52… 16 Jharkh… Giridih 86 Beena … Shikh… 2021-08-19 00:00:00
## 10 4345 JH_53… 16 Jharkh… Chatra 407 Saroj … Gradu… 2022-08-22 00:00:00
## # … with 1,992 more rows, 124 more variables: description <chr>,
## # village_name <chr>, block_name <chr>, panchayat_name <chr>, approval <chr>,
## # reason_to_hold <chr>, cc_creative_idea_for_issue_video <chr>,
## # shoot_plan <chr>, topic_identification_reason <chr>,
## # other_topic_identification_reason <chr>, issue <chr>, issue_2 <chr>,
## # themes <chr>, government_program_details <chr>, project <chr>,
## # problem_identification_by_cc <chr>, impact_problem_reporting <chr>, …
This data source contains 2002 observations and 133 variables.
That is a lot of variables and many are not needed for my findings so I will trim it down to the variables that I need.
vv_df <- df%>%
select(id,state_name, district, cc_name, mentor, story_pitch_date,issue, issue_2, themes, government_program_details, approval, impact_in_process, duration_of_problem, underlying_reason_of_issue, primary_affected_groups, actors_who_need_to_take_action, no_of_individuals_affected, impact_govt_official_support_details, impact_youtube_publish_date )
print(vv_df)
## # A tibble: 2,002 × 19
## id state…¹ distr…² cc_name mentor story_pitch_date issue issue_2 themes
## <dbl> <chr> <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 2465 Jharkh… Palamu Bhaska… Shikh… 2021-08-10 00:00:00 Anti… Health A Mat…
## 2 2476 Jharkh… Khunti Ravind… Shikh… 2021-08-13 00:00:00 Heal… Liveli… A Mat…
## 3 2879 Jharkh… Garhwa Prasha… Gradu… 2021-10-22 00:00:00 Anti… Liveli… An en…
## 4 2881 Jharkh… Garhwa Prasha… Gradu… 2021-10-23 00:00:00 Live… Anti P… An en…
## 5 2639 Jharkh… Dhanbad Halima… Gurup… 2021-08-01 00:00:00 Heal… Anti P… A Mat…
## 6 953 Jharkh… Dhanbad Halima… Gurup… 2020-10-15 00:00:00 Heal… Anti P… A Mat…
## 7 4364 Jharkh… Godda Mohd G… Shikh… 2022-08-25 00:00:00 Powe… Anti P… A Mat…
## 8 5057 Jharkh… Giridih Beena … Shikh… 2022-12-20 00:00:00 Water Anti P… An en…
## 9 2543 Jharkh… Giridih Beena … Shikh… 2021-08-19 00:00:00 Anti… Health A Mat…
## 10 4345 Jharkh… Chatra Saroj … Gradu… 2022-08-22 00:00:00 Forc… Forced… A Sys…
## # … with 1,992 more rows, 10 more variables: government_program_details <chr>,
## # approval <chr>, impact_in_process <chr>, duration_of_problem <chr>,
## # underlying_reason_of_issue <chr>, primary_affected_groups <chr>,
## # actors_who_need_to_take_action <chr>, no_of_individuals_affected <dbl>,
## # impact_govt_official_support_details <chr>,
## # impact_youtube_publish_date <dttm>, and abbreviated variable names
## # ¹​state_name, ²​district
Now that I have a data frame with all my needed variables I need to clean it so it will be easier to visualize in tableau.
I will need to specifically wrangle the Government Officials, Affected groups variables, Issues, and .
First starting with impact_govt_official_support_details I noticed this is a multi-choice variable but each choice is separated by commas instead of each having their own field, so I am going to need to split the possible options into 3 separate columns.
This is so options are not separate by commas (i,e a = 1,2,3) and instead have their own fields (i.e a=1, b=2, c=3).
## replace NA value to null
vv_df <- vv_df %>%
mutate_at(c('impact_govt_official_support_details'), ~replace_na(.,''))
## delimiter split govt official agencies
vv_df$govt <- vv_df$impact_govt_official_support_details
vv_df <-vv_df %>%
separate(impact_govt_official_support_details, c('govt_split1','govt_split2','govt_split3'), sep = ',')
vv_df <- vv_df%>%
mutate_at(c('govt_split2', 'govt_split3'), ~replace_na(.,'')) ##removed 'NA'
## Need to trim field b/c split2 & split3 have spaces after respective commas.
vv_df$govt_split2 <- trimws(vv_df$govt_split2, "both" )
vv_df$govt_split3 <- trimws(vv_df$govt_split3, "both" )
Now that the 3 choice options now have their own columns, I need to create a new data frame where each govt official have their own row.
This is so when a issues appears it does not appear as (Issue1 = govt_official_1, govt_official_2) , but as (Issue1 = govt_official_1) (Issue1 = govt_official_2)
Govt_Officials_df <- data.frame( govt_org = c(vv_df$govt_split1, vv_df$govt_split2, vv_df$govt_split3),
govt_org_vv_ID = c(vv_df$id),
govt_org_vv_district = c(vv_df$district))
## check to make sure there are no duplicates & to confirm TRIM worked.
unique(Govt_Officials_df$govt_org)
## [1] "Others officials"
## [2] "Block development officer or other block officials"
## [3] "District Collector or district officials"
## [4] "Panchayat officials"
## [5] "Block development officer or other block officials "
## [6] ""
## [7] "MLA"
## [8] "Other village officials (non-panchayat)"
## [9] "Police"
## [10] "MP"
Primary_affected_groups columns have the same issues as Govt_officials_df so I need to clean it the same way, so I am going to need to split the possible options into 3 separate columns.
vv_df <- vv_df %>%
separate(primary_affected_groups, c('affected_groups_1', 'affected_groups_2', 'affected_groups_3'), sep = ',')
vv_df <- vv_df %>%
mutate_at(c('affected_groups_1','affected_groups_2','affected_groups_3'), ~replace_na(.,''))
## Need to trim field b/c split2 & split3 have spaces after respective commas.
vv_df$affected_groups_2 <- trimws(vv_df$affected_groups_2, "both" )
vv_df$affected_groups_3 <- trimws(vv_df$affected_groups_3, "both" )
Now that the 3 choice options now have their own columns, I need to create a new data frame where each affected marginalized group have their own row.
Affected_Groups_df <- data.frame(affected_groups = c(vv_df$affected_groups_1, vv_df$affected_groups_2, vv_df$affected_groups_3),
affected_groups_vv_id = c(vv_df$id),
affected_groups_vv_district = c(vv_df$district))
summary(Affected_Groups_df)
## affected_groups affected_groups_vv_id affected_groups_vv_district
## Length:6006 Min. : 1 Length:6006
## Class :character 1st Qu.: 521 Class :character
## Mode :character Median : 3463 Mode :character
## Mean : 3760
## 3rd Qu.: 6247
## Max. :12459
## check to make sure there are no duplicates & to confirm TRIM worked.
unique(Affected_Groups_df$affected_groups)
## [1] "No particular groups is affected -- the problem is cross-cutting"
## [2] "Tribals"
## [3] "Muslims"
## [4] "Women"
## [5] "Dalits"
## [6] "Other religious minorities"
## [7] ""
## [8] "Children"
## [9] "Youth"
## [10] "Other lower castes (e.g. OBC)"
## [11] "Elderly"
## [12] "Adivasi"
## [13] "Scheduled Castes"
## [14] "Transgender"
## [15] "Scheduled Tribe"
Issues variable is another multi-choice field, fortunately the selected options are already in their own fields (2 choices maximum).
So I do not have to delimiter split like with the government officials and affected groups, so I can go straight into to creating the Issues dataframe.
Issue_df <- data.frame(Issue = c(vv_df$issue, vv_df$issue_2),
Issue_vv_ID = c(vv_df$id),
Issue_vv_District = c(vv_df$district),
Issue_vv_CCNames = c(vv_df$cc_name),
Issue_vv_YT_Published = c(vv_df$impact_youtube_publish_date),
Issue_vv_Duration_of_Issue = c(vv_df$duration_of_problem),
Issue_vv_Story_Pitch_Date = c(vv_df$story_pitch_date))
Issue_df <- Issue_df %>%
mutate(Issue_vv_YT_Published = as.character(Issue_vv_YT_Published), Issue_vv_Story_Pitch_Date = as.character(Issue_vv_Story_Pitch_Date))
##Issue_df <- replace(Issue_df, is.na(Issue_df),'')
summary(Issue_df)
## Issue Issue_vv_ID Issue_vv_District Issue_vv_CCNames
## Length:4004 Min. : 1 Length:4004 Length:4004
## Class :character 1st Qu.: 521 Class :character Class :character
## Mode :character Median : 3463 Mode :character Mode :character
## Mean : 3760
## 3rd Qu.: 6247
## Max. :12459
## Issue_vv_YT_Published Issue_vv_Duration_of_Issue Issue_vv_Story_Pitch_Date
## Length:4004 Length:4004 Length:4004
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## To find all Issues presented in data set
unique(Issue_df$Issue)
## [1] "Anti Poverty Programs" "Health"
## [3] "Livelihoods" "Power and Energy"
## [5] "Water" "Forced Evictions"
## [7] "Sanitation" "Rural Innovation"
## [9] "Crumbling Infrastructure" "Education"
## [11] "Agrarian Crisis" "Art & Culture"
## [13] "Governance and Accountability" "Indigenous People"
## [15] "Natural Disaster" "Environment"
## [17] "Mining" "Courage & Inspiration"
## [19] "Trafficking & Migration" "Labour Rights"
## [21] "Technology" "Corruption"
## [23] "Reproductive Rights" "Gender"
## [25] "Religion & Faith" "State Repression"
## [27] "Development" "Impact"
## [29] "Infrastructure" "Caste"
## [31] "Conflict" NA
## [33] "Food & Social Security"
Viz for social good provided some guiding questions to help with Analysis:
How can we use this data and videos to tell a better story and give a better picture of various districts
How can we view this data from a public communications perspective to capture the attention of partners and potential collaborators? How can we present this data in a way that makes it stand out and be impactful ?
What are the major issues affecting various regions, groups of stakeholders and how many people are being affected by these issues?
What is the average duration of a community issue (across issues) before it is resolved?
How does community engagement and participation impact the resolution of reported issues?
knowing this I want to find: + How many Issues were Produced. + How many Issues were Resolved. + Produced to Resolved Ratio. + Number of Individuals Impacted + How long it takes for issues to be resolved.
Issue_df %>%
filter(!is.na(Issue)) %>%
mutate(n = cumsum(Issue)) %>%
summarise(Issues_Produced = n())
## Issues_Produced
## 1 3146
I’ll use Issue_vv_YT_Published which contains a url to the video of resolved issues. Since there is no varable that clearly notes that an Issue is resolved, so this is the closest mode of confirming the resolution of an issue.
Issue_df %>%
filter(!is.na(Issue_vv_YT_Published)) %>%
mutate(n = cumsum(Issue_vv_YT_Published)) %>%
summarise(Issues_Resolved = n())
## Issues_Resolved
## 1 556
Issue_df %>%
filter(!is.na(Issue)) %>%
mutate(n = cumsum(Issue)) %>%
summarise(n()) -> P
Issue_df %>%
filter(!is.na(Issue_vv_YT_Published)) %>%
mutate(n = cumsum(Issue_vv_YT_Published)) %>%
summarise(n()) -> R
Ratio <- P/R
colnames(Ratio)[1] <- 'Ratio'
print(Ratio)
## Ratio
## 1 5.658273
sum(vv_df$no_of_individuals_affected, na.rm = TRUE)
## [1] 3274503
Month <- as.tibble(interval(Issue_df$Issue_vv_Story_Pitch_Date, Issue_df$Issue_vv_YT_Published)%>%
as.numeric('months'))
Month <- na.omit(Month)
AvgMonths <- sum(Month)/count(Month)
colnames(AvgMonths)[1] <- "Average Time to resolve Issue in MONTHS"
print(AvgMonths)
## Average Time to resolve Issue in MONTHS
## 1 18.95988
No to visualize the data given.
NOTE I am not the most familiar with visualizations in Rstudio, but I always love to try and improve my skills. However I MUCH prefer visualizing my data in TABLEAU so here is a link to my TABLEAU visualization of this data source HERE
filter(Issue_df, Issue != '')%>%
ggplot(.) +
geom_col(aes(x = Issue, y = Issue_vv_ID, fill = Issue))
filter(Govt_Officials_df, govt_org != '')%>%
ggplot(.) +
geom_col(aes(x = govt_org, y = govt_org_vv_ID, fill = govt_org))
filter(Affected_Groups_df, affected_groups != '')%>%
ggplot(.) +
geom_col(aes(x = affected_groups, y = affected_groups_vv_id, fill = affected_groups))
vv_df%>%
ggplot(.) +
geom_col(aes(x = reorder(issue, -no_of_individuals_affected), y = no_of_individuals_affected, fill = issue))
Now to conclude. I have Gathered, Wrangled, Analyzed, Visualized and followed various rules of thumb to ensure data integrity and from my findings I would recommend.
Focusing more Video resources on the Chatra District due to the high Issues Produced to Resolved ratio at 3.6 along with it taking about 1 year for issues to be resolved.
The District of West Singhbhum community would benefit the most from Videos and an overall emphasis on finding solutions for health-related issues, due to this district encompassing 99% of individuals suffering from Health-related issues.
Dalits face uniquely harsh harassment due to the Caste System. Videos empathetically detailing the plight of these individuals and how social acceptance and comrodery will pay dividends for the state of Jharkhand.
Continued networking with government officials, shareholders, and benevolent organizations to acquire necessary resources and labor to resolve issues and affect government policies.
Now if you would like to view this wall of text in a visually appealing dashboard check out my TABLEAU LINK HERE