class: center, middle, inverse, title-slide

.title[
# Data Science for Social Good ✨
]
.subtitle[
## Understanding what’s in the toolbox of data analysis for non-profits
]
.author[
### Rachel L. Wilkerson
]
.date[
### 2024-05-15
]

---
<br>
<br>
<center><img src="datascience.png" alt="Data Science Process" height="450px" /></center>

.footnote[Cathy O'Neil's **Doing Data Science: Straight Talk from the Frontline**]

---
class: inverse center middle

# Real World

What if the way we respond to the crisis is part of the crisis?

.footnote[Báyò Akómoláfé of [Ten](https://www.emergencenetwork.org/), formerly the Emergence Network]

---
<br>
<center><img src="bureaucratic.png" alt="comic about the bureaucratic and legal obstacles of obtaining data" height="500px" /></center>

---
# Garbage in ➡ Garbage out 💩

- Sexism in ➡ sexism out
- Racism in ➡ racism out
- Systemic oppression in ➡ systemic oppression out

### Pervasive Lie: that you can magic social justice out of the burning ashes of datified oppression 🔥

---
# Art as data 🎨

.pull-left[
<center><img src="ashley.png" alt="photo of Ashley and her son" height="400px" /></center>
]

.pull-right[
> When the public sees our children . . . I just want them to know that we don’t have a label on us. There shouldn’t be a stereotype. We shouldn’t generalize. . . . We’re not a number. I am you. You are me. We’re the same, we’re equals. I’m no different because I’m on welfare. We’re equal. No label. No number. My name is Ashley. Your name is Jennifer. We’re the same.
]

.footnote[Chilton, Mariana, et al. "Witnesses to hunger: participation through photovoice to ensure the right to food." ***Health & Hum. Rts.*** 11 (2009): 73.]

---
# Impact Outcomes

.pull-left[
- Who will change
- What will change
- By how much
- By when
- ***How the change will be measured***
]

.pull-right[
]

.footnote[Image by [Rob Cottingham](https://www.robcottingham.ca/cartoon/archive/measuring-the-networked-nonprofit-1-cute-animal-theory/)]

???
- Who will change: the [people] you are training/providing services to
- What will change: the knowledge, attitudes, and skills you expect to change
- By how much: how much change you think you can realistically achieve
- By when: the timeframe within which you hope to see change
- How the change will be measured: the surveys, tests, interviews, or other methods you will use to measure the different changes specified

---
# Leadership

.pull-left[
- ✅ Leadership aware of the value of data
- ✅ Team members have data analytics expertise 👩‍💻
- ✅ Leaders committed to investing enough in data-related resources 👩‍🏫
- ✅ Leadership has a business plan with defined, aligned, and measurable goals
- ✅ Data seen as a team effort, rather than just one person's responsibility
- ✅ Data used by staff to help solve a variety of problems across departments and functions 🧮
- ✅ Staff are able to communicate about data use and are data literate
]

.pull-right[
- Alice Daish: first data scientist at the British Museum
]

---
# Knowledge & Skills

- ✅ Individuals responsible for data collection and storage systems (ex: database administrator or data engineer)
- ✅ Individuals responsible for analyzing internal or external data using statistical tools and/or models (ex: data analyst/scientist)
- ✅ Broader staff who have the general skills needed to use data
- ✅ Staff trainings to collect, manage, use, store, and interpret data

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

---
# Activity

### Where is your organization on the data maturity scale?
.pull-left[
[https://www.menti.com/alz531ukcufq](https://www.menti.com/alz531ukcufq)

<center><img src="mentimeter_qr_code.png" alt="mentimeter" height="350px" /></center>
]

.pull-right[
<center><img src="datamaturity.png" alt="data_maturity" height="350px" /></center>
]

.footnote[[Image Credit Link](https://www.heap.io/blog/the-four-stages-of-data-maturity)]

---
class: inverse center middle

# Data Collection

Data is not truth, and tech is not an answer in-and-of-itself. Without designing for the humans on the other end, our work is in vain.

.footnote[Jake Porway, [Five principles for applying data science for social good](https://www.oreilly.com/radar/five-principles-for-applying-data-science-for-social-good/)]

---
# How do we get data?

1. Find some useful data
    - Open source data
    - Public agency data
    - Creative webscraping
1. Collect our own data
    - surveys
    - internal databases
1. Ask "what if" questions
    - A focus group for measuring uncertainty 🤨

---
# Existing data for local non-profits

#### 🏡 Local Level

.pull-left[
[Waco Roundtable](https://www.wacoroundtable.org/)
]

.pull-right[
<center><img src="roundtable.png" alt="roundtable" height="350px" /></center>
]

---
# Existing data for local non-profits

#### State level

.pull-left[
- [Texas Data.gov](https://data.texas.gov/dataset/Summer-Meal-Programs-Seamless-Summer-Option-SSO-Co/skk3-4frt/about_data)
- State agency data (TEA, HHSC, TDA)
]

.pull-right[
<center><img src="txdata.png" alt="txdata" height="350px" /></center>
]

---
# Existing data for local non-profits

#### 🇺🇸 National level

.pull-left[
- [Data.gov](https://catalog.data.gov/dataset/chicago-public-schools-progress-report-cards-2011-2012)
- Census Bureau: aggregate data at spatial levels
- [IPUMS Extraction Tool](https://cps.ipums.org/cps/): microdata
]

.pull-right[
<center><img src="ipums.png" alt="ipums" height="350px" /></center>
]

---
# Data Collection

### Avoiding the classic blunders:

- Collect data with an identifier to merge with other datasets
- Don't aggregate data: collect at the level you want insight
- Maintain unique identifiers to enable matching

### ***What else counts as data?***

---
# Data Collection: Motivation

#### Against the data exhaust pipe

***What happens if the people own the data?***

- **Data cooperatives:** 🤝 sharing and owning data through peer-to-peer repositories
- **Data sovereignty:** digital rights for citizens and communities
- Case study: UN-Habitat programme “People-Centred Smart Cities”
- Resource: [Disco Coop](https://www.disco.coop/)

.footnote[Calzada, Igor. "Data co-operatives through data sovereignty." Smart Cities 4.3 (2021): 1158-1172.]

---
# Expert elicitation

.pull-left[
- Assesses ***uncertainty and risk*** for policy questions
  - Ex: What learning loss do educators anticipate with and without a summer reading program?
- ***Domain experts*** are calibrated using questions with answers that are known but not readily available
  - Ex: What was the change in reading scores for Dean Highland Elementary from 2020 to 2021?
- Applicable when a focus group is more feasible than a survey
]

.pull-right[
<center><img src="superforecasting.png" alt="Superforecasting" height="400px" /></center>
]

---
# Expert elicitation

.footnote[Hemming, Victoria, et al. "Eliciting improved quantitative judgements using the IDEA protocol: A case study in natural resource management." PloS one 13.6 (2018): e0198468.]

---
# Activity

#### What data would you like to collect?
[https://www.menti.com/alz531ukcufq](https://www.menti.com/alz531ukcufq)

<center><img src="mentimeter_qr_code.png" alt="mentimeter" height="350px" /></center>

---
class: inverse center middle

# Data Processing

---
# Data Quality

- ✅ Processes for keeping data up to date
- ✅ Trust that data is accurate and complete
- ✅ Quality assured data
- ✅ Central database for data access

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

---
# Privacy

.pull-left[
- ✅ Minimum policy and practice in place to ensure data is safeguarded
  - ✅ rules on passwords
  - ✅ how data is stored
  - ✅ when and how data is deleted
- ✅ Staff understand important regulations and governance policies
]

.pull-right[
]

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist. Illustration by Seb Agresti]

---
# Quality checks

.pull-left[
[Poverty Action Lab Data Quality Checks](https://www.povertyactionlab.org/resource/data-quality-checks)

- **High-Frequency Checks**: daily or weekly checks for data irregularities.
- **Back-checks**: short, audit-style surveys of respondents who have already been surveyed.
- **Spot-checks**: unanticipated visits by senior field staff to verify enumerators are surveying when and where they should be.
]

.pull-right[
]

---
# Missing Data

#### Types of missing data:

- Missing completely at ***random***
- Missing at random (within ***subpopulations***)
- Missing not at random
  - missingness depends on the ***missing value*** itself
  - missingness depends on an ***unobserved variable***

.footnote[Case study: [Deworming example](https://ssir.org/articles/entry/why_the_social_sector_needs_the_scientific_method)]

---
# Missing Data

#### What do we do about it?
- Remove observations
- Methods that use all available data (e.g., maximum likelihood, EM algorithm)
- Imputation
- Bayesian methods

💡 When designing data collection, remember to collect information that may inform missingness

---
# Missing data

## How do we minimize it?

- Better training and involvement of staff and participants
- Better data tracking
- Collect information from participants regarding the likelihood they will drop out
  - ex: insurance status or employment status tends to have a strong influence on a participant's decision to remain in the study.
- Better participant experience

.footnote[Little et al. (2012a,b)]

???
- Better training and involvement of staff and participants
  - shared understanding and goals for intervention
- Better data tracking
  - Set acceptable rates of missingness
  - Monitor progress of program
  - Up to date information for participants
- Collect information from participants regarding the likelihood they will drop out
  - ex: insurance status or employment status tends to have a strong influence on a participant's decision to remain in the study.
- Better participant experience
  - Offer incentives
  - Positive experience
  - Elevated platform

---
# Activity: Geocoding

[Geocoding in Google Drive Resource](https://handsondataviz.org/geocode.html)

.footnote[Jack Dougherty & Ilya Ilyankou ***Hands-On Data Visualization***]

---
class: inverse center middle

# Exploratory Data Analysis

> “The real purpose of the scientific method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know.”

***Zen and the Art of Motorcycle Maintenance***

---
# One Dimension

#### Barcharts & Histograms

.pull-left[
<center><img src="barchart.png" alt="barchart" height="380px" /></center>
]

.pull-right[
<center><img src="bhistogram.png" alt="histogram" height="380px" /></center>
]

.footnote[Image from Nathan Yau's Flowing Data [How to use and read a histogram in R](https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/)]

---
# Case Study:

<img src="template_files/figure-html/unnamed-chunk-1-1.png" width="100%" />

---
# Case Study:

<img src="template_files/figure-html/unnamed-chunk-2-1.png" width="100%" />

---
# Case Study:

<img src="template_files/figure-html/unnamed-chunk-3-1.png" width="100%" />

---
# Outliers

<center><img src="outlier.png" alt="outlier" height="400px" /></center>

- Outliers can be defined as points more than some number of standard deviations from the mean
- Outliers can be visually detected in a scatter plot

---
# Outliers and Boxplots

<img src="template_files/figure-html/unnamed-chunk-4-1.png" width="100%" />

---
# Two Dimensions

#### Scatterplots

<center><img src="scatter-plot.png" alt="scatterplot" height="400px" /></center>

---
# Case Study

<img src="template_files/figure-html/unnamed-chunk-5-1.png" width="100%" />

---
# Activity: Mapping

[Geocoding in Google Drive Resource](https://handsondataviz.org/geocode.html)

.footnote[Jack Dougherty & Ilya Ilyankou ***Hands-On Data Visualization***]

---
class: inverse center middle

# Modeling

"All models are wrong, but some are useful"

.footnote[George Box]

---
# Levels of Analysis

.pull-left[
#### What we want to know

- ✅ Basic counts and/or charts used
- ✅ Descriptive analytics about what happened used
  - ✅ summarizing overall averages
  - ✅ variation
  - ✅ range
  - ✅ past trends
]

.pull-right[
#### What we use

- Summary Statistics
  - Resource: [Statistical Literacy Guide from the UK](https://researchbriefings.files.parliament.uk/documents/SN04944/SN04944.pdf)
- Pre-post (before & after)
]

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

???

---
# Case Study

- Summary Statistics
  - How many books did each student check out at each site visit over the summer?
  - Average number of books per site?
  - How does participation change over the summer?
- Pre/Post Test
  - Reading level pre and post test

---
# Levels of Analysis

### Simple Differences

.pull-left[
#### What we want to know

- ✅ Diagnostic analytics about why it happened
  - ✅ Discovering differences
  - ✅ Correlations
  - ✅ Patterns and anomalies
  - ✅ Drilling down to explore causes
]

.pull-right[
#### What we use

- Simple difference
- Difference-in-differences
- Linear Modelling
- Root cause analysis
- Standard Deviation
]

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

---
# Case Study

### Differences

- Simple difference
  - Post test reading scores for children who did not enter the program (for any reason)
  - Data collected **after** the summer
- Difference-in-differences
  - Pre and post test reading scores for children who did not enter the program (for any reason)
  - Data collected **before** and **after** the summer

---
# Case Study:

### Correlations & Outliers

- Correlations
  - Ex: 80% of the variability in reading scores can be explained by the attendance rate
  - Data collected: post test scores and independent variables
- Standard Deviation
  - Ex: School has < 85% attendance rates
  - Data collected: confounders of interest

---
# Levels of Analysis

### Predictive Analytics

.pull-left[
#### What we want to know

- ✅ Predictive analytics about what will happen in future
  - ✅ forecasting
  - ✅ modeling trends
  - ✅ behavior patterns
  - ✅ machine learning
]

.pull-right[
#### What we use

- Multivariate Regression
- Logistic Regression
]

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

---
# Case Study

### Predictive Analytics

- Multivariate Regression
  - Ex: How can we predict reading scores?
  - Data collected: reading scores and confounders of interest for study & control group
- Logistic Regression: Yes/no questions
  - Ex: Can we predict whether a site will have sufficient participation to be solvent?
  - Data collected: solvency data and confounders of interest

---
# Levels of Analysis

### Towards Causation

.pull-left[
#### What we want to know

- ✅ Prescriptive analytics about how you can do it in the best way used
  - ✅ optimization
  - ✅ recommending decisions for effective intervention
  - ✅ experimental design
  - ✅ simulation
  - ✅ artificial intelligence
]

.pull-right[
#### What we use

- Randomized Evaluation
- Instrumental Variables
- Statistical Matching
- Regression Discontinuity
]

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

---
# Case Study

### Towards Causation

- Randomized Evaluation
  - Ex: What is the impact of summer meal reading enrichment on reading test scores?
  - Data collected: Outcome data for students randomly assigned to attend a summer meal site with reading enrichment or not
- Instrumental Variables
  - Ex: What is the impact of summer meal reading enrichment if we know propensity to enroll?
  - Data collected: Outcome data for program participants and non-participants, as well as an “instrumental variable”

---
# Case Study

### Towards Causation

- Statistical Matching
  - Ex: What is the impact of summer meal reading enrichment on reading test scores at participating schools?
  - Data collected: Outcome data for program participants and for a comparison group of non-participants, as well as “matching variables” for both groups.
- Regression Discontinuity
  - Ex: What is the impact of summer meal reading enrichment if there is a cut off for program participation?
  - Data collected: Outcome data for program participants and non-participants, as well as the “ordering variable” (also called “forcing variable”).

---
# Activity

#### What technique makes the most sense for your impact measure?

[https://www.menti.com/alz531ukcufq](https://www.menti.com/alz531ukcufq)

<center><img src="mentimeter_qr_code.png" alt="mentimeter" height="350px" /></center>

---
class: inverse center middle

# Communication

---
# Communication

.pull-left[
- Is data from different sources brought together and analyzed in an automated way (e.g., a dashboard, business intelligence system, or data warehouse pulling data from different tools and systems)?
- Is data used in reports as part of strategic discussions?
- Are datasets and/or results shared with external stakeholders when applicable?
]

.pull-right[
<center><img src="empathy.png" alt="empathy.png" height="400px" /></center>
]

.footnote[from the [Data Kind](https://www.datakind.org/join-us/partner/) Data Maturity Checklist]

[image from Alice Daish's blog](https://dataimpact.co.uk/2017/10/11/blog/)

---
# Communication

<center><img src="postcards.png" alt="postcards" height="400px" /></center>

.footnote[a selection of Stefanie Posavec's postcards from [Dear Data](https://www.stefanieposavec.com/dear-data)]

---
# Communication

- Shiny apps are an open-source way to develop dashboards
- Resource: [Shiny Gallery](https://shiny.posit.co/r/gallery/)
- Examples:
  - [Country Profiles App](https://rachwhatsit.shinyapps.io/countryProfilesApp/)
  - [Blueprint App](https://rachwhatsit.shinyapps.io/blueprintapp/)

---
# Resource Library

### Books

- [Hands-On Data Visualization](https://handsondataviz.org/) by Jack Dougherty & Ilya Ilyankou
- Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (W.W. Norton & Company, 2013)
- Edward R Tufte, Envisioning Information (Cheshire, CT: Graphics Press, 1990)
- David Spiegelhalter, The Art of Statistics: How to Learn from Data (Basic Books, 2019)

### Visualization

- [Flowing Data](https://flowingdata.com/) by Nathan Yau
- Data Points by Nathan Yau
- [Analog Data Visualization for Storytelling](https://www.domestika.org/en/courses/3964-analog-data-visualization-for-storytelling/stefanie_posavec) workshop by Stefanie Posavec

---
# Resource Library

### People

- Data Kind
- Data Science for Social Good
- R-Ladies Global
- Code for America

---
# Acknowledgements

- Dr. Linda Engish for inspiration
- Dr. Emily Hinojosa for data conversations
- Khristian Howard for being a sounding board
- Kelly Ezell for organization

---
class: center, middle

# Thanks!

Rachel L. Wilkerson<br>
[rachel@tesserwell.co](mailto:rachel@tesserwell.co)

[tesserwell.co](www.tesserwell.co)