INFO 201 Data Assignment 1: Birth Data Visualization
Author
Jc Fletcher
Published
April 23, 2025
Assignment Overview
In this assignment, you will look at a subset of 2018 United States birth records, provided by the National Center for Health Statistics. You will:
Create information visualizations for core visualization types
Include titles and alt-text to tell a story with the visualizations
Justify visualization choices and describe some ways the visualizations may be interpreted
To complete this assignment, you will need to write code or text for each of the exercises below. Make sure to work from top to bottom to complete the assignment. For specific R code issues, feel free to consult outside resources, in particular the R4DS textbook chapters on data visualization.
To complete this assignment, you need to check out the example file posted on Canvas!! Your job is partly to recreate the visualizations there!
Warning
Making inferences or forming predictions about newborn birth weight (and other features in pregnancy and birth) is complex, and an area of intense research in many different disciplines. Nothing you are able to see in this data given our current skills in this class is sufficient to form reasonable, justified beliefs about health.
The Data Setup
This assignment uses a subset of the 2018 US natality data set. The data set is very large and includes all reported births in the US states and District of Columbia in 2018 for single births only. In other words, it excludes childbirths with twins, triplets, or quadruplets. It also does not include childbirths that occurred in the US territories, which are administered distinctly from the US states and DC for many public health and data collection purposes (so, the data does not include births in Puerto Rico, the US Virgin Islands, Guam, American Samoa, or the Northern Mariana Islands, which are populated US territories). The data also does not include unreported births, but it is estimated that the number of births that go unreported in the US is negligible for most purposes, as it is incredibly small.
Load Libraries
Set up the libraries required for visualizations.
# If you get an error saying tidyverse has not been found, then you have not
# installed it yet! To install it, "uncomment" the next line by deleting the
# hash/pound symbol and try again! After you've installed it, make sure to put
# the hash/pound symbol back; Quarto doesn't like rendering documents if you
# have installation lines.
# install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Prefer fixed notation (e.g. 3000000) over scientific notation (3e+06) in output
options(scipen = 999)
Load Data
To load the data, make sure you saved the .Rdata file in the same location as this Quarto file.
load("single_births_US2018.Rdata")
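If `load()` fails, the usual cause is that the file is not in the folder R is looking in. The sketch below shows how `load()` behaves, using a small stand-in file written to a temporary folder; the object name `births_demo` and the file name `births_demo.Rdata` are hypothetical and are not part of the assignment files.

```r
# Toy stand-in for the real data; the object and file names are hypothetical.
births_demo <- data.frame(dbwt_g = c(3657, 3242, 2125))
demo_path <- file.path(tempdir(), "births_demo.Rdata")
save(births_demo, file = demo_path)
rm(births_demo)

# Check that the file is where R expects it before trying to load it.
stopifnot(file.exists(demo_path))

# load() restores objects under the names they were saved with and
# invisibly returns those names as a character vector.
loaded_names <- load(demo_path)
print(loaded_names)
```

Capturing the return value of `load()` is a quick way to find out which objects a .Rdata file added to your environment.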
Have a look at the variables recorded in this dataset.
glimpse(births)
Rows: 3,616,658
Columns: 7
$ preterm <fct> No, No, No, Yes, No, No, No, No, No, No, No, No, No, …
$ cig_rec <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No, …
$ wic <fct> No, No, Yes, No, Yes, No, No, No, No, No, No, Yes, No…
$ sex <fct> Male, Female, Male, Female, Male, Female, Female, Fem…
$ dbwt_g <dbl> 3657, 3242, 3470, 3140, 2125, 4082, 3180, 3700, 3430,…
$ dbwt_lb <dbl> 8.062295, 7.147378, 7.650031, 6.922507, 4.684818, 8.9…
$ low_birth_weight <chr> "Not Low Birth Weight", "Not Low Birth Weight", "Not …
Data Dictionary
Each row in this data represents one live childbirth in the US in 2018 and features of the newborn, parents and preceding pregnancy. The data includes all reported childbirths in the US states and District of Columbia in 2018 for single births only. In other words, it excludes childbirths with twins, triplets, or quadruplets. It also does not include childbirths that occurred in the US territories, which are administered distinctly from the US states and DC for many public health and data collection purposes (so, the data does not include births in Puerto Rico, the US Virgin Islands, Guam, American Samoa, or the Northern Mariana Islands, which are populated US territories). The data also does not include unreported births, but it is estimated that the number of births that go unreported in the US is negligible for most purposes, as it is incredibly small.
Note that this dataset is very large and sparse, but in odd scenarios with particularly unique newborn weights, it is conceivable that you could identify unique individual newborns born in 2018. Even if you were technically able to do this, do not do so. The use conditions set forth by the National Center for Health Statistics note that this is not allowed with the dataset this one is based on, and it may be a finable offense. So, even if you had a child in 2018, don’t try to look up a specific individual! This dataset is only for attempting to understand big-picture trends in US public health.
This data set includes a lot of features about US births (and this is already trimmed from the full US natality data set available online). This document won’t go into too many details, but a lot more can be found at the website for the National Center for Health Statistics. Here are some features to consider in particular:
dbwt_g - the weight of a live newborn in grams
dbwt_lb - the weight of a live newborn in pounds
low_birth_weight - a categorical label that describes whether a newborn is classified as “low birth weight” (less than 2,500 grams). This article published by the Boston Children’s Hospital describes these categories and some reasons why physicians and parents may be interested in them. Note that in this dataset, we only have “low birth weight” and “not low birth weight”; we don’t have a distinct label for “very low birth weight”.
preterm - a designation of whether or not a newborn is a “preemie”, that is, born before the normative 37-week cutoff for “normal” pregnancy length. Here is some information on why physicians and parents might be concerned about preterm status of newborns.
wic - a recording of whether or not a mother was enrolled in the US federal government’s Special Supplemental Nutrition Program for Women, Infants, and Children (usually called WIC, pronounced like “wick”). This is a program that offers healthcare and nutrition assistance to low-income pregnant and breastfeeding women.
cig_rec - a label that describes if a mother reported consuming tobacco cigarettes during pregnancy. This collapses over frequency of cigarette consumption and doesn’t distinguish between early and later pregnancy.
sex - the sex determination made for a newborn, typically based on morphological or anatomical features of a newborn. Note that the NCHS only records female or male as options. Physicians or parents make a determination to choose a category for each newborn – this is a complicated and contentious process, and especially so in cases where infants are intersex.
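The relationship between the two weight columns and the low birth weight label can be checked by hand. The sketch below uses the first five gram weights printed by `glimpse()` above. The conversion factor `0.00220462` is an assumption inferred from the printed pound values, not something the data documentation states, and the label text "Low Birth Weight" is also assumed, since only "Not Low Birth Weight" is visible in the output.

```r
# First five gram weights shown in the glimpse() output.
dbwt_g <- c(3657, 3242, 3470, 3140, 2125)

# Assumed conversion factor (pounds per gram), inferred from the printed values.
lb_per_gram <- 0.00220462
dbwt_lb <- dbwt_g * lb_per_gram
print(dbwt_lb)

# Low birth weight is defined as less than 2,500 grams.
# The "Low Birth Weight" label text is an assumption.
low_birth_weight <- ifelse(dbwt_g < 2500,
                           "Low Birth Weight",
                           "Not Low Birth Weight")
print(low_birth_weight)
```

Under these assumptions, only the fifth newborn (2,125 grams) falls below the threshold.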
Data Interpretation Exercises
In each of the following exercises, you’ll create a visualization that addresses a prompt. Your visualization should be as close as possible to the examples in the accompanying example document. For each visualization, include alt-text in the correct location for it to be encoded as alt text and copy that same text below the visualization where indicated (which is just to make it easier to grade). For the interpretation questions, leave the question, and replace the …s with your answer inside the Response: … part.
Exercise 0: Birthweights and Smoking in Pregnancy
Exercise 0 is an example to show you how following exercises may be formatted. You don’t need to change anything.
Cigarette smoking during pregnancy is an issue that has been of great interest in the biomedical research world – and also at times a source of guilt and stigma for pregnant women. Create a visualization that can be used to demonstrate how well you could or could not predict the weight of a newborn based solely on whether or not a mother was a smoker. Here, let’s visualize some trends in newborn weight and consider what they could mean.
births |>
  ggplot(aes(y = cig_rec, x = dbwt_lb)) +
  geom_boxplot(outliers = FALSE) +
  labs(
    x = "Weight (lbs)",
    y = "Smoking Habit of Mother during Pregnancy",
    title = "Newborns with mothers who smoke often weighed less.",
    subtitle = "Weight data for US live single births in 2018.",
    caption = "Data available at: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm"
  ) +
  theme_minimal()
Alt-text
Box plots of newborn weight data split by whether mothers smoked cigarettes in pregnancy where the newborns with smoking mothers typically have lower birthweight, but where there are still many infants with higher birthweights in both groups.
Interpretation
Question: As a group, overall, women who smoked cigarettes had newborns who weighed less than newborns with mothers who did not smoke at all. There is a lot of research that provides evidence that smoking cigarettes can contribute to low birth weight (and increase the risk of several birth-related complications). While smoking is linked to low birthweight, some expecting mothers may have difficulty quitting smoking completely and feel a strong sense of stigma and fear for the health of their future infants. Imagine that you hear someone talking about a pregnant friend who is in this situation. That person says “Obviously we know that her baby will be small – the data tell us that all you have to know is that she smokes cigarettes.” Based on this data, is this person correct? Write 1-2 sentences why or why not.
Response: No, that isn’t really a conclusion supported by this data because there is a really wide range of weights for newborns whether their mothers smoked or not. Even if mothers who smoke tend to have lighter babies as a group, there are still many medium and heavy babies with mothers who smoke!
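One way to make this point concrete is to ask how often a newborn from the lighter group still outweighs the typical newborn from the heavier group. The sketch below uses made-up weights (the vectors `no_smoke` and `smoke` are hypothetical and not taken from the dataset), so only the method carries over, not the numbers.

```r
# Hypothetical weights in pounds, invented purely for illustration.
no_smoke <- c(6.0, 6.8, 7.2, 7.5, 7.9, 8.4, 9.0)
smoke    <- c(5.2, 6.1, 6.6, 7.0, 7.6, 8.1, 8.8)

# The group medians differ ...
median_no  <- median(no_smoke)
median_yes <- median(smoke)

# ... yet a sizeable share of the "smoke" group is heavier than the
# median newborn in the "no_smoke" group.
share_above <- mean(smoke > median_no)

print(c(median_no = median_no, median_yes = median_yes))
print(share_above)
```

When the two groups overlap this much, knowing only the group tells you little about any one newborn, even though the medians differ.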
Exercise 1: Birthweights
Create a visualization that displays the distribution of all birthweights for single newborns in the US in 2018 and which communicates how many of those qualify as low birth weight.
For this graph, you can choose to display weights in pounds or grams. Consider why one choice or the other may be a better choice for different audiences when making this choice. Make sure that the labels and text match your choice! You only need to include one visualization.
Hint for visualization: Check how to set the width of bins in histograms and try a few values until you find one that is similar to the examples!
births |>
  ggplot(aes(fill = low_birth_weight, x = dbwt_lb)) +
  geom_histogram(binwidth = 0.25) +
  labs(
    x = "Birth Weight (lbs)",
    y = "Count",
    title = "Distribution of birth weights in the United States for 2018.",
    subtitle = "7% of newborns were low birth weight.",
    caption = "Data available at: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm",
    fill = "Newborn Weight"
  ) +
  theme_minimal()
Alt-text
Histogram of newborn birth weight data where typical births make up 93% while 7% of reported births are under the low birth weight threshold.
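A percentage like the one in the subtitle can be computed from the label column rather than typed by hand. This is a minimal sketch on a made-up four-element vector; the label text "Low Birth Weight" is an assumption, since only "Not Low Birth Weight" appears in the `glimpse()` output.

```r
# Toy labels, invented for illustration (1 of 4 is low birth weight).
low_birth_weight <- c("Not Low Birth Weight", "Not Low Birth Weight",
                      "Low Birth Weight", "Not Low Birth Weight")

# The mean of a logical vector is the share of TRUE values.
pct_low <- round(100 * mean(low_birth_weight == "Low Birth Weight"))

# Build the subtitle text from the computed value.
subtitle_text <- paste0(pct_low, "% of newborns were low birth weight.")
print(subtitle_text)
```

The resulting string could then be passed to `labs(subtitle = subtitle_text)` so the text always matches the data.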
Interpretation
Question: Did you choose to display information in grams or in pounds? In about a sentence or two, why is your choice to use pounds or grams a good choice for this information visualization? Who is the intended audience you imagine for this visualization?
Response: For this visualization, I chose to display the data in pounds, to match the standard unit of measurement within the United States of America. The intended audience for this visualization would be interested citizens, potential parents, public health professionals, and media outlets who wish to report on this data.
Exercise 2: Birthweights and Recorded Sex
Create a visualization that can be used to compare birth weights for male and female newborns.
Hint for visualization: Check how to manually choose the colors. The colors in the example are purple and orange.
births |>
  ggplot(aes(fill = sex, x = dbwt_lb)) +
  geom_histogram(binwidth = 0.25) +
  scale_fill_manual(
    name = "Sex",
    values = c("Male" = "purple", "Female" = "gold")
  ) +
  facet_wrap(~sex, ncol = 1) +
  labs(
    x = "Birth Weight (lbs)",
    y = "Count",
    title = "Typical weights were slightly lower for females than males.",
    subtitle = "Weight data for US live single births in 2018.",
    caption = "Data available at: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm"
  ) +
  theme_minimal()
Alt-text
Two histograms of birth weight data in the US for 2018 split by sex, showing that the distribution for males is shifted slightly higher.
Interpretation
Question: As a group, male newborns tend to weigh more than female newborns. Consider if this pattern is particularly useful for prediction purposes. Some people think that they are able to determine the sex (and also the gender) of an infant based on how heavy the infant is. Based on the data here, do you think this is something that anyone could do with accuracy or precision? In about 1-2 sentences, why or why not?
Response: I am assuming that the claim “Some people think that they are able to determine the sex (and also the gender) of an infant based on how heavy the infant is” refers to a prenatal estimate, since once the baby is born the sex can be determined visually. This claim leaves a myriad of variables unaccounted for, such as variations in the mother’s physiology, genetics, medical conditions, and daily diet (including water), to name just a few. For these reasons, and because technology exists that can safely determine the sex of the child, I would not recommend making inferences from the unborn baby’s weight in this way, and I would urge parents to consult a physician regarding the health of the child.
Exercise 3: Birthweights and Food Assistance
In the US, the Special Supplemental Nutrition Program for Women, Infants, and Children (or WIC) is one of the most widespread welfare programs offered by the federal government. Almost 7 million people used WIC in 2018. One of the stated goals of WIC is to enhance the health outcomes of mothers and children. One thing you might think to do given this goal is to naively check out the distribution of newborn weights based on whether mothers used WIC or not. Check this out.
Hint for visualization: Check how to remove outliers from boxplots.
births |>
  ggplot(aes(y = wic, x = dbwt_lb)) +
  geom_boxplot(outliers = FALSE) +
  labs(
    x = "Weight (lbs)",
    y = "WIC Participation",
    title = "Mothers enrolled in WIC tended to have slightly lighter newborns.",
    subtitle = "Weight data for US live single births in 2018.",
    caption = "Data available at: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm"
  ) +
  theme_minimal()
Alt-text
Box plots of newborn weight data split by WIC participation, showing that mothers enrolled in WIC had a slightly lower median birth weight.
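There are two common ways to keep outliers out of a ggplot2 box plot, and they behave slightly differently. The sketch below compares them on a small made-up data frame (`toy` and its values are hypothetical). It assumes ggplot2 3.5.0 or later for the `outliers` argument, which is satisfied by the version 3.5.2 loaded in this document.

```r
library(ggplot2)

# Made-up weights with one extreme value in each group, for illustration only.
toy <- data.frame(
  wic = rep(c("No", "Yes"), each = 6),
  dbwt_lb = c(6.9, 7.2, 7.5, 7.8, 8.1, 13.0,
              6.7, 7.0, 7.3, 7.6, 7.9, 1.5)
)

# Option 1: discard outliers, so the axis adapts to boxes and whiskers only.
p_discard <- ggplot(toy, aes(x = dbwt_lb, y = wic)) +
  geom_boxplot(outliers = FALSE)

# Option 2: hide the outlier points but keep the full data range on the axis.
p_hide <- ggplot(toy, aes(x = dbwt_lb, y = wic)) +
  geom_boxplot(outlier.shape = NA)
```

Printing `p_discard` and `p_hide` side by side should show the same boxes with different axis ranges.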
Interpretation
Question: Based on the data shown above, mothers who were enrolled in WIC tended to have newborn infants that had lower birth weight than those who were not enrolled in WIC. However, it is important to note that this is not evidence on its own that WIC was ineffective or bad for newborn health. In about 1-2 sentences, provide an alternative explanation for this pattern of data to someone who tries to argue that this is evidence that WIC is not helpful.
Response: I would redirect their attention to what the data does and does not show: while it does indicate that WIC participation may not yield higher birth weights, WIC offers nutrition that helps mothers and families experiencing food scarcity meet their daily dietary needs. Birth weight is not the only indicator of a newborn baby’s health but rather one metric among many.
Exercise 4: Birthweight and Preemie Status
Many parents worry about the risk of giving birth prematurely to an infant. Indeed, there is a lot of research looking into the risks of early and very early births, and there are elevated risks of a number of birth and health complications tied to early births. But one thing to consider is that early births do not always mean that an infant will be unhealthy (or have low birth weight, which is just one of many ways to consider infant health). Let’s consider how preterm birth relates in particular to low birth weight status. Recall that preterm is defined here as birth before the 37-week cutoff.
Hint for visualization: Check how to create stacked barplots with proportions.
births |>
  count(preterm, low_birth_weight) |>
  ggplot(aes(x = preterm, y = n, fill = low_birth_weight)) +
  geom_col(position = "fill") +
  labs(
    x = "Preterm status",
    y = "Proportion",
    title = "Preterm infants had a higher rate of low birth weight.",
    subtitle = "Weight data for US live single births in 2018.",
    caption = "Data available at: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm"
  ) +
  theme_minimal()
Alt-text
A stacked bar chart showing that a higher proportion of preterm infants than full-term infants in the US in 2018 fell under the low birth weight threshold.
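With `position = "fill"`, each bar is rescaled so that its segments add up to 1. The same proportions can be computed explicitly, which is useful for writing titles and alt-text. This is a minimal sketch on a made-up data frame (`toy` and its counts are hypothetical); the label text "Low Birth Weight" is an assumption.

```r
library(dplyr)

# Made-up records: 8 full-term (1 low) and 4 preterm (2 low) newborns.
toy <- data.frame(
  preterm = c(rep("No", 8), rep("Yes", 4)),
  low_birth_weight = c(rep("Not Low Birth Weight", 7), "Low Birth Weight",
                       rep("Not Low Birth Weight", 2), rep("Low Birth Weight", 2))
)

# Count each combination, then divide by the total within each preterm group.
props <- toy |>
  count(preterm, low_birth_weight) |>
  group_by(preterm) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

print(props)
```

In this toy data, 12.5% of full-term and 50% of preterm newborns are low birth weight; the real proportions would come from running the same pipeline on `births`.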
Interpretation
Question: As seen in this data, there is a greater risk of low birth weight for preterm than for full-term newborns. Based on this data, should you be surprised to hear that a preterm infant has a healthy birthweight? Why or why not?
Response: I am not surprised by this. Newborn weight can vary widely even among perfectly healthy mothers, so I would save these questions for my doctor and urge others to consult their physician as well.
Exercise 5: Coding Notes
All of the tools (R functions, operators, and code formalisms) needed to complete these activities are covered in the lecture activities and labs. You can use R tools that were not covered in class for these activities, but if you do so, you need to be able to relate them to material covered in class.
If you didn’t use any functions, operators, or formalisms that were not covered in class, then you should write something along the lines of “I used tools covered in class materials” for the first bullet point below and delete other bullet points. If you used outside tools (functions, operators, or formalisms), then for each such tool, you should write 3-4 sentences. These sentences must describe: a) What the goal of the tools is (what is it used for in general – not just in the code here), b) Which tool that has been covered in class does the closest thing to the tool that you decided to use, and c) How you learned about the tool and a web link to a resource on how to learn how the tool works.
If you do not include the information below, then you will be marked off on exercises that use outside tools. If you later realize you accidentally used an outside tool and forgot to mark it, you can attend office hours and explain its use to have those points reassigned.
Tools Used:
Tool 1: facet_wrap. facet_wrap is used to create multi-panel plots by splitting data into subsets and displaying each subset in its own panel. I’m not sure if we covered it in class, but the closest tool is facet_grid in ggplot2. I learned about facet_wrap from OpenStack.
Tool 2: …
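To illustrate how `facet_wrap` relates to `facet_grid`, the sketch below builds the same stacked one-column layout both ways on a small made-up data frame (`toy` and its values are hypothetical).

```r
library(ggplot2)

# Made-up weights in pounds, for illustration only.
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 5),
  dbwt_lb = c(6.4, 6.9, 7.2, 7.6, 8.1,
              6.7, 7.1, 7.5, 7.9, 8.5)
)

base_plot <- ggplot(toy, aes(x = dbwt_lb, fill = sex)) +
  geom_histogram(binwidth = 0.5)

# facet_wrap: one panel per level, wrapped into the requested number of columns.
p_wrap <- base_plot + facet_wrap(~sex, ncol = 1)

# facet_grid: panels arranged in a strict grid; here one row per level.
p_grid <- base_plot + facet_grid(rows = vars(sex))
```

With a single faceting variable the two give similar layouts; `facet_grid` is mainly useful when two variables define the rows and columns.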
All Done!
When you have completed this assignment, make sure to render it as an html file. Check that the html file looks as you expect and upload the html report. If you do not complete the assignment and are unable to render the document, you can submit the .qmd for partial credit.