Overview

In this report, we discuss data visualization best practices, provide an example of bad data visualization, and investigate the relationship between maternal smoking and infant birthweights.

In Part 1 of this report we discuss data visualization best practices and examine an example of bad data visualization.

In Part 2 of this report we conduct and present the findings of our investigation on the relationship between maternal smoking and infant health.

Part 1

The following discussion is animated by the article How to Display Data Badly by Howard Wainer.

Discussion

Much to our surprise, this paper was entertaining. The author’s dry academic language makes his criticisms in the introduction funnier than they should be, and his “Dirty Dozen” of strategies for confusion and misinformation continues to amuse the reader while delivering important information.

Wainer’s paper is an excellent guide for both seasoned and novice statisticians on what not to do. The twelve rules Wainer sets out cover a wide variety of common mistakes and explain exactly how dangerous they can be. And although this paper was written in 1984, its warnings are still relevant today, as confusion and chartjunk continue to thrive in the age of the internet. But despite these positive attributes, we don’t think that “How to Display Data Badly” is flawless.

Although this “compendium” of warnings is helpful to those with some knowledge of data analysis, it could be inaccessible to the general public. Wainer illustrates his points with 25 statistically deplorable graphs, and the sheer number becomes overwhelming as readers juggle his commentary while deciphering each figure. With close reading, these problems can mostly be overcome, but it’s unfortunate that this paper isn’t more digestible. The general public is the group most likely to be duped by poorly presented data, and Wainer’s paper predominantly criticizes figures from mass media sources like The Washington Post, demonstrating how common misinformation is and how useful a guide like this could be for a more general audience. For statisticians, however, this paper is invaluable, and it puts pressure on them to avoid Wainer’s dirty dozen and inform the public with accuracy.

Bad Data Visualization

The following image is from the blog post COVID-19 In Charts: Examples of Good & Bad Data Visualization by Stephen Tracy, on the site https://analythical.com/.

The image follows these “dirty dozen” rules:

  • Rule 4: Only Order Matters

    • The chart aggregates all of the data (both new confirmed COVID-19 cases and patients discharged) and orders it by day from January 23rd to March 14th of 2020.
  • Rule 7: Emphasizing the Trivial

    • Because the chart lets the ordering of the aggregated data take precedence over clarity, the trends in confirmed cases (red) and patients discharged (blue) are not apparent.

Original Chart Publisher: Channel 2 News Asia

Part 2

The health warning found on the side panel of cigarette packages claims that:

Smoking by pregnant women may result in fetal injury, premature birth, and low birthweight

US Surgeon General

In this report we will investigate the subclaim that birthweights of babies born to smokers are lower than birthweights of babies born to nonsmokers.

Preliminary Remarks on Data

The data provided are a subset of the Child Health and Development Studies (CHDS), an investigation of pregnancies between 1960 and 1967 among women in the San Francisco-East Bay area who were members of the Kaiser Foundation Health Plan.

The data examined in this report consists of:

1,236 baby boys born during one year of the study, that lived at least 28 days, and who were single births (not part of a set of twins or triplets).

Stat Labs by Nolan and Speed

Investigation Overview

The central purpose of Part 2 of the report is to use exploratory data analysis to assess whether the data set supports the subclaim that birthweights of babies born to smoking mothers are lower than birthweights of babies born to nonsmokers.

The outline of our analysis is as follows:

  • Import the data into R

  • Perform a univariate exploratory analysis to inspect the quality and patterns of the data from the data set

  • Summarize the distribution of birthweights for babies of smoking and nonsmoking mothers separately

    • Summarize numerically by providing the 5-number summary along with the mean and standard deviation for each group

    • Summarize graphically by providing a box plot and histogram for each group

Importing Data in R

Before importing the data set babies.csv into R, we identify the names of the columns of interest and inspect the data to build a table describing what the imported data should look like. After inspecting the data we find the following:

Column Name   Description
(blank)       observation number (ordinal)
bwt           birthweight (numeric)
smoke         maternal smoking status (0 = False, 1 = True, 9 = unknown)

The following R code shows the method used to import the data set babies.csv:

#vector of names for first 3 columns
col_names <- c("Observation", "Birthweight", "Smoking_Status")

#vector of data-types for first 3 columns
col_types <- c("character", "numeric", "integer")

#path for data
data_path <- "C:\\Users\\cdgm9\\Desktop\\Academic\\FALL 2020\\STAT 20\\Projects\\Project 1\\babies.csv"

#"read.table()" import method, we skip first row with headers
my_babies <- read.table(file = data_path, sep = ",", dec = ".",
                        colClasses = col_types,
                        skip = 1, 
                        stringsAsFactors = FALSE)

#re-name columns with col_names
colnames(my_babies) <- col_names

#confirm desired output
str(my_babies, vec.len = 1)
## 'data.frame':    1236 obs. of  3 variables:
##  $ Observation   : chr  "1" ...
##  $ Birthweight   : num  120 113 ...
##  $ Smoking_Status: int  0 0 ...
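
The import call above can be tried without access to babies.csv. The sketch below writes a small stand-in file to a temporary location and reads it back with the same read.table() arguments. The rows are made-up values for illustration only, not CHDS observations, and the header layout (a blank first column name followed by bwt and smoke) is an assumption based on the column table above.

```r
# Hypothetical stand-in for babies.csv: a header row plus four made-up rows.
demo_lines <- c('"","bwt","smoke"',
                '"1",120,0',
                '"2",113,0',
                '"3",128,1',
                '"4",123,9')
demo_path <- tempfile(fileext = ".csv")
writeLines(demo_lines, demo_path)

col_names <- c("Observation", "Birthweight", "Smoking_Status")
col_types <- c("character", "numeric", "integer")

# Same arguments as the import above: skip the header row, then rename.
demo_babies <- read.table(file = demo_path, sep = ",", dec = ".",
                          colClasses = col_types,
                          skip = 1,
                          stringsAsFactors = FALSE)
colnames(demo_babies) <- col_names

# Check the result against the column table: types and allowed smoking codes.
stopifnot(is.character(demo_babies$Observation),
          is.numeric(demo_babies$Birthweight),
          all(demo_babies$Smoking_Status %in% c(0L, 1L, 9L)))
str(demo_babies)
```

The stopifnot() call fails loudly if a column has the wrong type or an unexpected smoking code appears, which makes the check repeatable instead of relying on reading the str() output by eye.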

Exploratory Data Analysis

We now perform a cursory exploration of the imported data (stored as my_babies) and check that it matches what the column table above describes.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

We tally the frequency of smoking status of mothers in the data set.

#Tally Smoking_Status levels
table(factor(pull(my_babies, Smoking_Status)))
## 
##   0   1   9 
## 742 484  10
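
The numeric codes can also be turned into labelled categories, so that the unknown code 9 is not mistaken for a third smoking level. The sketch below is one possible approach, not the method used in this report; the vector smoke_codes and the label names are hypothetical.

```r
# Hypothetical vector of smoking codes (0 = non-smoker, 1 = smoker, 9 = unknown).
smoke_codes <- c(0L, 1L, 9L, 0L, 1L, 0L)

# Codes not listed in `levels` (here 9) become NA in the resulting factor.
smoke_group <- factor(smoke_codes,
                      levels = c(0L, 1L),
                      labels = c("Non-smoker", "Smoker"))

# useNA = "ifany" keeps the unknowns visible in the tally.
table(smoke_group, useNA = "ifany")
```

Recoding unknowns to NA keeps them out of grouped summaries by default while still letting them be counted explicitly.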

We provide the 5-number summary along with the mean and standard deviation of birthweights by smoking group. Comparing the two groups, the quartiles, maximum, and mean are all lower for babies of smoking mothers; only the minimum and the standard deviation are higher for that group.

#filter our data and remove observations related to Smoking_Status of type
#unknown
my_babies2 <- filter(my_babies, Smoking_Status != 9)

#Summarise Birthweights by Smoking_Status
my_babies2 %>% 
  group_by(Smoking_Status) %>%
  summarise("Min."  = quantile(Birthweight, probs= 0),
            "Q1"  = quantile(Birthweight, probs= 0.25),
            "Q2"  = quantile(Birthweight, probs= 0.5),
            "Q3"  = quantile(Birthweight, probs= 0.75),
            "Max." = quantile(Birthweight, probs= 1),
            "Avg." = mean(Birthweight),
            "S.D." = sd(Birthweight),
            .groups = "keep") %>%
  knitr::kable()
Smoking_Status   Min.   Q1    Q2    Q3    Max.   Avg.       S.D.
0                55     113   123   134   176    123.0472   17.39869
1                58     102   115   126   163    114.1095   18.09895
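
The size of the gap between the groups can be read off the table with a little arithmetic. The sketch below re-enters the reported summary values by hand (the object name reported is ours, not part of the original analysis) and computes the differences in mean and median along with each group's interquartile range.

```r
# Summary values copied from the table above
# (first row: 0 = non-smoking, second row: 1 = smoking).
reported <- data.frame(
  Smoking_Status = c(0, 1),
  Q1  = c(113, 102),
  Q2  = c(123, 115),
  Q3  = c(134, 126),
  Avg = c(123.0472, 114.1095)
)

# Non-smoking minus smoking, for the mean and the median.
mean_gap   <- reported$Avg[1] - reported$Avg[2]
median_gap <- reported$Q2[1] - reported$Q2[2]

# Interquartile range (Q3 - Q1) for each group.
iqr <- reported$Q3 - reported$Q1

round(mean_gap, 2)  # 8.94
median_gap          # 8
iqr                 # 21 24
```

Both measures of centre differ by roughly 8 to 9 units of the bwt column, which is about half of either group's standard deviation.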

We visualize the birthweight distributions of the smoking and non-smoking groups with the overlaid histograms below. The distribution in black belongs to babies of smoking mothers, while the one in red belongs to babies of non-smoking mothers.

Observe that the histogram of birthweights for babies of smoking mothers is shifted leftward relative to the distribution for babies of non-smoking mothers, which also indicates that a mother's smoking status is associated with the birthweight of her child.

#plot of two overlaid histograms on a density scale, so that groups of
#different sizes are comparable
ggplot() +
  #distribution of Birthweights for non-smoking group
  geom_histogram(data = filter(my_babies2, Smoking_Status == 0), 
                 aes(x = Birthweight, y = after_stat(density)), 
                 fill = "red", 
                 color = "red", 
                 alpha = 0.75, 
                 bins = 40) +
  #distribution of Birthweights for smoking group
  geom_histogram(data = filter(my_babies2, Smoking_Status == 1), 
                 aes(x = Birthweight, y = after_stat(density)), 
                 fill = "black", 
                 color = "black", 
                 alpha = 0.75, 
                 bins = 40) +
  #labels
  labs(title = "Distributions of Birthweights", 
       subtitle = "Smoking Mothers (black) vs. Non-Smoking Mothers (red)",
       x = "Birthweights",
       y = "Density")
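
The outline also calls for a box plot of each group. As a minimal sketch, the boxes can be drawn directly from the five-number summaries reported earlier using base R's bxp(). This is not the report's own figure: it uses base graphics rather than ggplot2 so that it runs without the data file, and its whiskers extend to the minimum and maximum rather than following the usual 1.5 × IQR rule.

```r
# Five-number summaries from the summary table (min, Q1, median, Q3, max),
# one column per group.
five_num <- cbind(c(55, 113, 123, 134, 176),   # non-smoking mothers
                  c(58, 102, 115, 126, 163))   # smoking mothers

# bxp() draws box plots from precomputed statistics; n is the group size
# taken from the tally of Smoking_Status.
box_stats <- list(stats = five_num,
                  n     = c(742, 484),
                  names = c("Non-smoking", "Smoking"))

# bxp() invisibly returns the x positions of the boxes.
box_at <- bxp(box_stats,
              main = "Birthweights by Maternal Smoking Status",
              ylab = "Birthweight")
```

With the full data available, a box plot computed from the individual observations would also mark outliers, which this summary-based version cannot show.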


Conclusion

From our summary statistics and data visualization, we find evidence to support the claim that there is a relationship between maternal smoking and infant health. Specifically, birthweights tend to be lower for babies of smoking mothers than for babies of non-smokers: the mean birthweight is about 114.1 in the smoking group versus about 123.0 in the non-smoking group.

That said, because the data come from an observational study, confounding factors may be at work. Smoking may be a proxy for other determinants of health. For example, smoking is associated with lower socioeconomic status, and we know that the social determinants of health for this group of people tend to be worse. This could explain the apparent discrepancy in birthweights: poorer women tend to have poorer health overall, and smoking may be only one part of that pattern.