As an MPH student getting more comfortable with R, I found it valuable to work with manageable, real-world data sets. I prefer a data set that is large enough to be interesting but not so large as to be overwhelming. It should have good documentation and need some, but not a lot of, data cleaning. It was also very important to me that the data could be used to develop meaningful insights about a real population.

CalEnviroScreen 4.0 is one of these data sets for me. I worked with it during my practicum experience, then as a TA supporting students on class projects, and as a student in a data visualization course. It is rich in key environmental and social indicators related to health in the state of California. It is an approachable data set useful for creating visualizations that are empowering and inspiring for the new R user.

Here are a few examples of data visualizations I created looking at just a couple of variables.

  1. Scatter plot - Average CES 4.0 score and percentage of population living in poverty

This graph shows how county-level poverty relates to the CES 4.0 score. I grouped census tracts by county and then calculated each county's population, mean CES 4.0 score, and mean poverty (%), with both means weighted by tract population. To calculate the CES 4.0 score, CalEnviroScreen 4.0 uses 21 indicators to rank each census tract by its cumulative pollution burden and the associated social vulnerabilities (e.g., poverty, education level) that put the population at risk. A higher score indicates a higher risk of health concerns. Poverty is defined as the percentage of the population living below two times the federal poverty level (OEHHA, 2021).
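For readers who want to try the county-level summary themselves, here is a minimal sketch of a population-weighted aggregation with dplyr. The toy data and the column names (`county`, `population`, `ces_score`, `poverty`) are my own placeholders, not the official CalEnviroScreen 4.0 field names, so adjust them to match your download.

```r
library(dplyr)

# Toy stand-in for the tract-level table; values and column names are
# illustrative only, not taken from CalEnviroScreen 4.0.
tracts <- tibble(
  county     = c("Alpha", "Alpha", "Beta", "Beta"),
  population = c(1000, 3000, 2000, 2000),
  ces_score  = c(10, 30, 20, 40),
  poverty    = c(20, 40, 30, 50)
)

county_summary <- tracts %>%
  group_by(county) %>%
  summarise(
    # Weight each tract's value by its population
    mean_ces     = weighted.mean(ces_score, population),
    mean_poverty = weighted.mean(poverty, population),
    # Sum population last so the weights above still refer to tract counts
    population   = sum(population)
  )

print(county_summary)
```

With real data you will likely need to handle tracts with missing scores or zero population before weighting, for example by filtering them out first.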

  2. Box plots - Distribution of CES 4.0 scores and poverty by rural county status

Next, I looked at how these county-level values are distributed by rural status. Forty-one of California’s 58 counties are considered rural. The median CES 4.0 score did not differ greatly by rural status (23.71 for non-rural counties compared with 20.84 for rural counties). However, the rural group had more observations, a smaller spread, and more outliers.

In the second box plot, however, there is a more noticeable difference in the distribution of poverty. The median poverty percentage is 28.82% in non-rural counties and 35.63% in rural counties. The spread is similar between the rural and non-rural groups.
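The comparison of medians and the box plot can be sketched roughly as follows. The six counties and their values are made up for illustration (they are not the figures reported above), and the column names `rural`, `mean_ces`, and `mean_poverty` are assumptions about how a county-level summary table might be laid out.

```r
library(dplyr)
library(ggplot2)

# Made-up county-level summary; not real CalEnviroScreen values.
county_summary <- tibble(
  county       = paste("County", LETTERS[1:6]),
  rural        = rep(c("rural", "non rural"), each = 3),
  mean_ces     = c(18, 21, 25, 20, 24, 30),
  mean_poverty = c(30, 36, 41, 25, 29, 33)
)

# Median of each measure within each rural-status group
medians <- county_summary %>%
  group_by(rural) %>%
  summarise(
    median_ces     = median(mean_ces),
    median_poverty = median(mean_poverty)
  )

# One box per rural-status group for the poverty measure
p_box <- ggplot(county_summary, aes(x = rural, y = mean_poverty)) +
  geom_boxplot() +
  labs(x = "County rural status", y = "Mean poverty (%)")

print(medians)
```

Swapping `mean_poverty` for `mean_ces` in the `aes()` call gives the corresponding box plot for CES 4.0 score.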

  3. Scatter plot - Association between CES 4.0 score and poverty by rural county status

Finally, I plotted the association between average CES 4.0 score and average percentage of the population in poverty, split by rural status. This visualization illustrates the rural-urban poverty divide well. There are two distinct distributions here in terms of rural status. Rural counties generally show a more positive association between CES 4.0 score and poverty.
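One way to draw this kind of grouped scatter plot is to map rural status to colour and add a separate linear trend line per group. The sketch below uses made-up numbers and assumed column names, and the per-group linear fit is my own illustration of "association", not necessarily how the original figure was built.

```r
library(ggplot2)

# Made-up county-level values chosen so each group lies on a straight line.
county_summary <- data.frame(
  rural        = rep(c("rural", "non rural"), each = 3),
  mean_poverty = c(30, 36, 42, 24, 28, 32),
  mean_ces     = c(18, 24, 30, 22, 24, 26)
)

# Points coloured by rural status, with one fitted line per group
p_scatter <- ggplot(
  county_summary,
  aes(x = mean_poverty, y = mean_ces, colour = rural)
) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Mean poverty (%)", y = "Mean CES 4.0 score", colour = "County type")

# Slope of CES score on poverty within each group, as a numeric check
slopes <- sapply(
  split(county_summary, county_summary$rural),
  function(d) coef(lm(mean_ces ~ mean_poverty, data = d))[["mean_poverty"]]
)

print(slopes)
```

A steeper slope for one group means the CES score rises faster with poverty in that group, which is what "a more positive association" refers to.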

Data:

For these visualizations, I used the California Communities Environmental Health Screening Tool: CalEnviroScreen 4.0, which was released in October 2021. I created an indicator to categorize each census tract as located in a “rural” or “non rural” county, based on a list of California’s 41 rural counties available at the Rural County Representatives of California website (https://www.rcrcnet.org/counties).
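The rural indicator can be built by checking each tract's county against a vector of rural county names. In this sketch the county names are placeholders; in practice the vector would hold the 41 names from the RCRC list, and the `county` column name is an assumption about the data layout.

```r
library(dplyr)

# Placeholder names; replace with the county names from the RCRC list.
rural_counties <- c("County A", "County C")

# Toy tract table with an assumed `county` column.
tracts <- tibble(
  tract  = c(1, 2, 3),
  county = c("County A", "County B", "County C")
)

# Label each tract by whether its county appears in the rural list
tracts <- tracts %>%
  mutate(rural = if_else(county %in% rural_counties, "rural", "non rural"))

print(tracts)
```

It is worth running `setdiff(rural_counties, tracts$county)` afterwards to catch names that failed to match because of spelling or spacing differences between the two sources.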

Conclusions & Limitations:

References: