Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will be using these colors in the next questions.
# replace "#FFFFFF" with your own colors
colors <- c( "#a07af1", "#bbe970", "#ebc20b", "#0e4fca", "#fa081b")
swatchplot(colors)
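For comparison with the hand-picked colors, a qualitative scale can also be generated programmatically. The following is only a sketch, assuming the colorspace package is installed; the palette name "Dark 3" and the variable name `auto_colors` are choices made for this illustration, not part of the assignment.

```r
library(colorspace)

# Generate five qualitative colors from a built-in HCL palette.
# "Dark 3" is an illustrative choice; other qualitative palettes work the same way.
auto_colors <- qualitative_hcl(5, palette = "Dark 3")

# Preview them the same way as the hand-picked colors
swatchplot(auto_colors)
```

Colors produced this way vary mainly in hue while keeping chroma and luminance roughly constant, which is one reason qualitative palettes are often built in HCL space.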
Problem 2: We will be using this dataset
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NCbirths)
## # A tibble: 6 × 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0
Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?
To answer this question, you will plot the distribution of birth weight by smoking status, and you will also plot the number of mothers who are smokers and non-smokers, respectively.
Introduction: This dataset is documented on rdocumentation.org and contains data on births in North Carolina in 2001. It is described as a sample of 1450 birth records that John Holcomb, a mathematician primarily known for his work in statistics, selected from the North Carolina State Center for Health and Environmental Statistics; the file loaded above contains 1,409 rows. The question for this project is whether there is a relationship between a mother’s smoking status and her baby’s weight at birth. Three variables are used to answer it: “Smoke”, which records whether the mother is a smoker or a non-smoker; “BirthWeightGm”, which gives the baby’s birth weight in grams; and “Plural”, which indicates whether the birth was a single, twin, or triplet birth.
Approach: First, I will familiarize myself with the variables I will be using and with the overarching question of this investigation. Then I will write the code for a box plot that shows the distribution of birth weight by smoking status, using the colors I picked at the beginning of the project. Next, I will consider which other variables to use in my bar graph and make sure I understand how to code it. I will consult the ggplot2.tidyverse.org website and the in-class notes to help me build my plots.
Analysis:
Relabeling the values of the variable “Smoke” (0 = non-smoker, 1 = smoker).
# Recode both values in one call so that no value falls through to NA
NCbirths$Smoke <- recode(NCbirths$Smoke, `0` = "Non-Smoker", `1` = "Smoker")
NCbirths
## # A tibble: 1,409 × 10
##    Plural   Sex MomAge Weeks Gained Smoke     BirthWeightGm   Low Premie Marital
##     <dbl> <dbl>  <dbl> <dbl>  <dbl> <chr>             <dbl> <dbl>  <dbl>   <dbl>
##  1      1     1     32    40     38 Non-Smok…         3147.     0      0       0
##  2      1     2     32    37     34 Non-Smok…         3289.     0      0       0
##  3      1     1     27    39     12 Non-Smok…         3912.     0      0       0
##  4      1     1     27    39     15 Non-Smok…         3856.     0      0       0
##  5      1     1     25    39     32 Non-Smok…         3430.     0      0       0
##  6      1     1     28    43     32 Non-Smok…         3317.     0      0       0
##  7      1     2     25    39     75 Non-Smok…         4054.     0      0       0
##  8      1     2     15    42     25 Non-Smok…         3204.     0      0       1
##  9      1     2     21    39     28 Non-Smok…         3402      0      0       0
## 10      1     2     27    40     37 Non-Smok…         3515.     0      0       1
## # ℹ 1,399 more rows
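The relabeling step can be checked on a small example before it is applied to the full data. The following is a minimal sketch, assuming dplyr is installed; `smoke_toy` and `smoke_labels` are hypothetical names, and the toy values are not taken from the dataset. Supplying a replacement for every value in one `recode()` call means nothing is turned into NA, so no follow-up step is needed to fill missing labels.

```r
library(dplyr)

# Hypothetical toy vector standing in for the numeric Smoke column
smoke_toy <- c(0, 1, 0, 0, 1)

# Recode both values at once; backticks are needed because the names are numbers
smoke_labels <- recode(smoke_toy, `0` = "Non-Smoker", `1` = "Smoker")

# Tabulate the result, showing NA as its own category if any value was missed
table(smoke_labels, useNA = "ifany")
```

Filling NA with "Smoker" after a partial recode gives the same labels only when the original column has no missing values; a genuinely missing smoking status would otherwise be counted as a smoker.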
Box plot
# Box plot of birth weight by smoking status
ggplot(NCbirths, aes(x = Smoke, y = BirthWeightGm, fill = Smoke)) +
  geom_boxplot() +
  labs(y = "Birth Weight (g)", title = "Distribution of Birthweight Based on Smoking Status") +
  scale_fill_manual(values = c("#bbe970", "#a07af1")) +
  theme_minimal()
Bar plot
ggplot(NCbirths, aes(y = Smoke)) +
  geom_bar(aes(fill = factor(Plural)), position = position_stack(reverse = TRUE)) +
  scale_fill_manual(name = "Birth Type",
                    values = c("#ebc20b", "#0e4fca", "#fa081b"),
                    labels = c("Single", "Twins", "Triplets")) +
  labs(title = "Distribution of Birth Type Based on Smoking Status") +
  facet_wrap(vars(Smoke))
Extra Credit
# Plural is numeric, so convert it to a factor for the discrete x scale
ggplot(NCbirths, aes(x = factor(Plural), fill = factor(Plural))) +
  geom_bar() +
  scale_x_discrete(name = "Birth Type") +
  scale_y_continuous(name = "Count of Birth Type") +
  labs(title = "Distribution of Birth Type Based on Smoking Status") +
  scale_fill_manual(name = "Birth Type",
                    values = c("#ebc20b", "#0e4fca", "#fa081b"),
                    labels = c("Single", "Twins", "Triplets")) +
  facet_wrap(vars(Smoke))
Discussion:
The box plot, “Distribution of Birthweight Based on Smoking Status”, shows a lower median birth weight for smokers: the horizontal line inside the “Smoker” box sits visibly below the one inside the “Non-Smoker” box. There are also more outliers in the non-smoker group than in the smoker group. The lower median for smokers could be due to the developing fetus being exposed to the harmful substances in cigarette smoke, which can reduce the baby’s weight and affect its overall health. The larger number of outliers in the non-smoker group most likely reflects group size: the sample contains fewer smokers than non-smokers, and a larger group has more opportunities to produce extreme values.
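The comparison of medians, group sizes, and outliers can also be made numerically. The following is a sketch on hypothetical toy data (the weights are invented for illustration and are not values from NCbirths), assuming dplyr is installed. `boxplot.stats()` from base R flags points lying more than 1.5 times the box length beyond the hinges, which is close to, though not guaranteed identical to, what `geom_boxplot()` draws by default.

```r
library(dplyr)

# Hypothetical toy data, NOT the NCbirths values
toy_weights <- tibble(
  Smoke = c(rep("Non-Smoker", 6), rep("Smoker", 4)),
  BirthWeightGm = c(3100, 3300, 3400, 3500, 3600, 5200,
                    2900, 3000, 3100, 3200)
)

# One row per group: group size, median weight, and number of flagged outliers
toy_summary <- toy_weights %>%
  group_by(Smoke) %>%
  summarise(
    n = n(),
    median_weight = median(BirthWeightGm),
    n_outliers = length(boxplot.stats(BirthWeightGm)$out)
  )

toy_summary
```

Running the same summary on the real data frame would put the group sizes next to the outlier counts, which makes it easier to judge whether the difference in outliers is mostly a matter of sample size.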
Although the two bar plots titled “Distribution of Birth Type Based on Smoking Status” look different, they convey the same information. Single births are the overwhelming majority among both smokers and non-smokers. Twins and triplets are both rare, but twins occur more often than triplets. The lower counts for every birth type in the smoker category most likely reflect the smaller number of mothers who smoke in the sample. The small number of twins relative to single births in both groups, smoker or not, is presumably due to the natural distribution of birth types in the general population.
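Because the two groups differ in size, raw counts make the smoker bars look small for every birth type. One way to compare the groups on equal footing is to compute the share of each birth type within each smoking group. The following is a sketch on hypothetical toy data (not the NCbirths values), assuming dplyr is installed; `toy_births` and `birth_type_share` are names made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data, NOT the NCbirths values
toy_births <- tibble(
  Smoke  = c(rep("Non-Smoker", 8), rep("Smoker", 2)),
  Plural = c(1, 1, 1, 1, 1, 1, 2, 3, 1, 2)
)

# Count each birth type per group, then divide by the group total
birth_type_share <- toy_births %>%
  count(Smoke, Plural) %>%
  group_by(Smoke) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

birth_type_share
```

In a plot, a similar effect can be obtained by stacking the bars with `position = "fill"` in `geom_bar()`, which rescales each bar to proportions instead of counts.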