Assignment 1

Note: please write your answers in text. For code, please write and insert a chunk. You need to knit the final work into an HTML file to submit. The output of each answer chunk must be included.

library(dplyr)
library(ggplot2)
library(readr)

The Data

Load the data

nc <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTm2WZwNBoQdZhMgot7urbtu8eG7tzAq-60ZJsQ_nupykCAcW0OXebVpHksPWyR4x8xJTVQ8KAulAFS/pub?gid=202410847&single=true&output=csv")

Q1.1: What does “message = F” mean in {r, message = F}? (answer in text)

The F stands for FALSE. When message = FALSE, any messages produced by the code will not appear in the output when the document is rendered.
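As a rough illustration of what counts as a "message", the sketch below uses a hypothetical helper, noisy_value(), that emits a message and returns a value. Wrapping the call in suppressMessages() is only an analogy for message = FALSE (it is not how knitr implements the option): the value is kept and the message is hidden.

```r
# Hypothetical helper: emits a message (like a package start-up note) and returns a value.
noisy_value <- function() {
  message("Loading something...")  # a message, not a printed result
  42
}

# Called normally, the message is shown and the value 42 is returned.
shown <- noisy_value()

# suppressMessages() hides the message but still returns the value,
# similar in spirit to running the chunk with message = FALSE.
hidden <- suppressMessages(noisy_value())
```

Warnings are a separate kind of condition and are not affected by hiding messages.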

Q1.2: What does “eval = F” mean in {r, eval = F}? (answer in text)

The F stands for FALSE. When eval = FALSE, the code in that chunk is not executed, so it produces no output in the rendered document. The code itself is still displayed unless echo = FALSE is also set.

Q1.3. What type of variable is R considering the variable habit to be? What variable type is visits? (answer in text)

Because the data were loaded with read_csv(), habit, which holds text categories (e.g., “smoker”, “nonsmoker”), is read as a character (chr) variable rather than a factor. It is categorical in meaning, but it would need factor() to be stored as a factor. visits is a count and is read as a numeric variable, which read_csv() stores as a double (dbl) by default rather than an integer.
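The variable types can be checked directly rather than inferred. The sketch below is a minimal example on a toy CSV with made-up values (the object toy and its contents are hypothetical), assuming readr 2.0 or later for I() and show_col_types. With the real data, the same class() calls would be applied to nc.

```r
library(readr)

# Toy CSV with made-up values; the real assignment would inspect `nc` instead.
toy <- read_csv(
  I("habit,visits\nsmoker,10\nnonsmoker,12\nnonsmoker,15\n"),
  show_col_types = FALSE
)

class(toy$habit)   # text columns are read as character, not factor
class(toy$visits)  # number columns are read as double, reported as "numeric"

# A character column can be converted to a factor explicitly if needed.
habit_factor <- factor(toy$habit)
```

Running glimpse(nc) from dplyr would show the type of every column at once.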

Histograms

Q2. Check this code, and answer each of the following questions.

ggplot(data = nc, aes(x = weeks)) + geom_histogram(binwidth = 1, color = "white", fill="#D8BFD8")

Q2.1 The y axis is labeled count. What is specifically being counted in this case? Hint: think about what each case is in this data set. (answer in text)

Each case in nc is one birth, so the count is the number of births whose pregnancy length (weeks) falls into each one-week bin.

Q2.2 What appears to be roughly the average length of pregnancies in weeks? (answer in text)

Roughly 39 weeks.

Q2.3 If we changed the binwidth to 100, how many bins would there be? Roughly how many cases would be in each bin? (answer in text)

There would be only one bin, because a width of 100 weeks covers every pregnancy length in the data. That single bin would contain all the cases in the dataset that have a recorded value for weeks.
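The bin arithmetic can be sketched in base R. The weeks values below are made up for illustration, and the sketch assumes that bins are centered on multiples of the bin width, so that the first bin covers -50 to 50.

```r
# Made-up pregnancy lengths in weeks; the real values come from nc$weeks.
weeks <- c(28, 33, 36, 37, 38, 39, 39, 40, 41, 42, 44)

binwidth <- 100

# Assumption: bins are centered on multiples of the bin width,
# so the bins cover [-50, 50), [50, 150), and so on.
bin_id <- floor((weeks + binwidth / 2) / binwidth)

n_bins <- length(unique(bin_id))  # number of non-empty bins
cases_per_bin <- table(bin_id)    # number of cases in each bin
```

Because every value lies well below 50, all of them share the same bin, whatever the exact alignment rule.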

Q3 Make a histogram of the birth weight of newborns (which is in lbs), including a title and axis labels. (answer in code chunk)

ggplot(data = nc, aes(x = weight)) + geom_histogram(binwidth = 1, color = "white", fill="#D8BFD8") + labs(title = "Distribution of Newborn Birth Weights", x = "Birth Weight (lbs)", y = "Count") + theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Faceting

Learn the concept and application of facet here: https://moderndive.com/2-viz.html#facets

Q4.1 Make a histogram of newborn birth weight split by gender of the child. Set the binwidth to 0.5. (answer in code chunk)

ggplot(data = nc, mapping = aes(x = weight)) + geom_histogram(binwidth = 0.5, color = "white", fill="#D8BFD8") + facet_wrap(~ gender) + labs(title = "Distribution of Newborn Birth Weight by Gender", x = "Birth Weight (lbs)", y = "Count") + theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Q4.2 Which gender appears to have a slightly larger average birth weight? (answer in text)

Male newborns.

Boxplots

Q5.1 Make a boxplot of the weight gained by moms, split by the maturity status of the mothers (mature). Include axis labels and a title on your plot. (answer in code chunk)

ggplot(data = nc, mapping = aes(x = mature, y = gained)) + geom_boxplot(color = "black") + labs(title = "Weight Gained by Mothers During Pregnancy", x = "Maturity Status", y = "Weight Gained (lbs)") + theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Q5.2 Is the median weight gain during pregnancy larger for younger or older moms? (answer in text)

The median weight gain during pregnancy is approximately 30 lbs for younger moms and 28 lbs for mature moms. Therefore, younger moms tend to gain more weight during pregnancy compared to mature moms.
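Medians read off a boxplot are approximate, so a grouped summary is a useful cross-check. The sketch below runs on a toy data frame with made-up values and placeholder category labels (toy is hypothetical). Replacing toy with nc would apply the same summary to the real data.

```r
library(dplyr)

# Made-up values for illustration; replace `toy` with `nc` to use the real data.
toy <- tibble(
  mature = c("younger mom", "younger mom", "younger mom",
             "mature mom", "mature mom", "mature mom"),
  gained = c(25, 30, 38, 20, 28, NA)
)

# Median weight gain per maturity group; na.rm = TRUE drops missing values.
median_gain <- toy %>%
  group_by(mature) %>%
  summarize(median_gained = median(gained, na.rm = TRUE))
```

Without na.rm = TRUE, any group containing a missing value would return NA for its median.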

Q6.1 Make a boxplot of pregnancy duration in weeks by smoking habit. (answer in code chunk)

ggplot(data = nc, mapping = aes(x = habit, y = weeks, fill = habit)) + geom_boxplot(color = "black") + labs(title = "Pregnancy Duration by Smoking Habit", x = "Smoking Habit", y = "Pregnancy Duration (Weeks)") + theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Q6.2 Is the duration of pregnancy more variable for smokers or non-smokers? (i.e. which group has the greater spread for the variable weeks?). (answer in text)

Non-smokers. In the boxplot, the weeks values for non-smokers show the greater spread.
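Spread can also be compared numerically. The sketch below computes the interquartile range, standard deviation, and range per group on a toy data frame with made-up values (toy is hypothetical). The same pipeline with nc in place of toy would summarize the real data.

```r
library(dplyr)

# Made-up values for illustration; replace `toy` with `nc` to use the real data.
toy <- tibble(
  habit = c(rep("nonsmoker", 5), rep("smoker", 5)),
  weeks = c(30, 36, 39, 41, 44, 36, 38, 39, 40, 41)
)

# Three common measures of spread, computed separately for each habit group.
spread_by_habit <- toy %>%
  group_by(habit) %>%
  summarize(
    iqr_weeks = IQR(weeks, na.rm = TRUE),
    sd_weeks = sd(weeks, na.rm = TRUE),
    range_weeks = max(weeks, na.rm = TRUE) - min(weeks, na.rm = TRUE)
  )
```

The IQR corresponds to the height of the box, while the range and the standard deviation are also sensitive to the points plotted as outliers.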

Q7.1. Develop a suitable visualization so as to visually assess: Is the variable for father’s age (fage) symmetrical, or does it have a skew? (code in chunk and answer in text)

ggplot(data = nc, aes(x = fage)) + geom_histogram(binwidth = 1, color = "white", fill="#D8BFD8") + labs(title = "Distribution of Father's Age", x = "Father's Age", y = "Count") + theme(plot.title = element_text(hjust = 0.5, face = "bold"))

No, the distribution of the father’s age is not symmetrical. It has a right skew, meaning that most fathers are younger, but there are a few older fathers that pull the distribution to the right.

Q7.2. Which one is more appropriate for the central tendency: mean or median? Please explain why (answer in text)

The median is more appropriate than the mean for measuring the central tendency because the distribution of father’s age is right-skewed. In a right-skewed distribution, the mean is pulled toward the long tail by the few unusually large values, while the median depends only on the middle of the ordered data and so better represents a typical father’s age.
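A small numeric example shows the effect. The ages below are made up to form a right-skewed sample and are not taken from nc.

```r
# Made-up, right-skewed ages: most values are in the 20s and 30s, one is much larger.
ages <- c(22, 25, 27, 28, 30, 31, 33, 60)

mean_age <- mean(ages)      # pulled upward by the single large value
median_age <- median(ages)  # middle of the sorted values, barely affected

# With the real data, fage may contain missing values, so na.rm = TRUE may be needed:
# mean(nc$fage, na.rm = TRUE); median(nc$fage, na.rm = TRUE)
```

Here the mean is 32 and the median is 29. Removing the value 60 would drop the mean to about 28 while the median would only move to 28.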

Make sure your document knits, and that your HTML file includes Exercise headers, text, and code. The knitted HTML file will be saved in the same folder as your Rmd file.