Self-grading Instructions

1. Self-grading

Ensure that for each question, you have included:

Correct code: Verify the accuracy of the syntax, labels, and logic.
R outputs: Check if the outputs generated by R are correctly interpreted and relevant to the question.
Written answer: Make sure answers are in complete sentences. Reference the data dictionary/documentation for precision and thoroughness.

2. Revisions

If revisions are necessary, specify the question numbers requiring adjustments. Update the code and/or the written responses in your original submission to correct any inaccuracies or to provide additional clarity.

3. Reflection

Summarize the key concepts and skills you have practiced this week. Reflect on how these exercises have contributed to your understanding of the material and your ability to apply statistical analysis using R. Consider the challenges you faced, how you overcame them, what you learned from the process, and how you will get more support and clarity if challenges were not resolved.

A Reminder

Write code only inside code chunks, and keep the # comments in a code chunk short.
Write your answer above or below the code chunks.
Only self-graded assignments (i.e., graded with corrections and a reflection) will receive full credit, in recognition of your hard work and problem-solving. An assignment that is not self-graded will receive zero points.
To qualify for self-grading credit, you must first submit a genuine attempt at the assignment by the original deadline.

Self-grading for Lab 2 Assignment

Question numbers with corrections:

Your Reflection for this week:

Answer Key

Load all the necessary libraries

library(tidyverse) 
library(dplyr) # tidyverse or dplyr for the filter function
library(psych) # for the describe function
library(ggplot2) # for the ggplot function
library(ggthemes) # for theme_fivethirtyeight for Q7

Q1

Import the dataset “birthweight_smoking.csv” and name your dataset.

smoking_data <- read.csv("birthweight_smoking.csv")

Q2

From the summary() statistics output, identify which variables in the dataset are numerical and which are categorical. (Tips: A categorical variable takes on two or more values that represent categories or labels without inherent numerical meaning. A numerical variable takes countable or infinitely many values within a given range.) [1 point]

1 point for identifying the type of every variable correctly. (Explanation not required.)
0.25 points deducted for each misclassification.

Numerical variables include birthweight, drinks, nprevist, age, and educ as they take on numeric values between a minimum and maximum value, such as number of drinks, number of visits and years of education as defined by the data codebook.

The dummy (categorical) variables in the dataset include smoker, alcohol, tripre1, tripre2, tripre3, tripre0, and unmarried because each contains two distinct categories or groups.

Note that numerical variables can always be converted into categorical variables, but not the other way round. Age and years of education, as measured in the codebook, are examples of discrete variables: they are recorded in whole years, so they take countable values within clear bounds, whereas a continuous variable can take any value within its range. If education were instead defined as ordered categories (such as high school, bachelor’s degree or above), it would become a categorical (ordinal) variable.

summary(smoking_data)

##     nprevist        alcohol           tripre1         tripre2     
##  Min.   : 0.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 9.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median :12.00   Median :0.00000   Median :1.000   Median :0.000  
##  Mean   :10.99   Mean   :0.01933   Mean   :0.804   Mean   :0.153  
##  3rd Qu.:13.00   3rd Qu.:0.00000   3rd Qu.:1.000   3rd Qu.:0.000  
##  Max.   :35.00   Max.   :1.00000   Max.   :1.000   Max.   :1.000  
##     tripre3         tripre0      birthweight       smoker        unmarried     
##  Min.   :0.000   Min.   :0.00   Min.   : 425   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00   1st Qu.:3062   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00   Median :3420   Median :0.000   Median :0.0000  
##  Mean   :0.033   Mean   :0.01   Mean   :3383   Mean   :0.194   Mean   :0.2267  
##  3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:3750   3rd Qu.:0.000   3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.00   Max.   :5755   Max.   :1.000   Max.   :1.0000  
##       educ            age            drinks        
##  Min.   : 0.00   Min.   :14.00   Min.   : 0.00000  
##  1st Qu.:12.00   1st Qu.:23.00   1st Qu.: 0.00000  
##  Median :12.00   Median :27.00   Median : 0.00000  
##  Mean   :12.91   Mean   :26.89   Mean   : 0.05833  
##  3rd Qu.:14.00   3rd Qu.:31.00   3rd Qu.: 0.00000  
##  Max.   :17.00   Max.   :44.00   Max.   :21.00000
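
In the summary output, every dummy variable has a minimum of 0 and a maximum of 1. One way to cross-check the classification is to flag columns whose only observed values are 0 and 1. The sketch below uses a small made-up data frame (`toy`, not the real file) and a hypothetical helper `is_dummy()`. A count variable that happens to contain only 0s and 1s would also be flagged, so the data dictionary remains the final word.

```r
# Toy data standing in for a few columns of smoking_data (values are made up)
toy <- data.frame(
  birthweight = c(3400, 2900, 3650, 3100),
  smoker      = c(0, 1, 0, 1),
  age         = c(24, 31, 19, 27)
)

# Hypothetical helper: TRUE when a column contains only 0s and 1s
is_dummy <- function(col) all(col %in% c(0, 1))

# One TRUE/FALSE flag per column
dummy_flags <- sapply(toy, is_dummy)
dummy_flags
```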

Q3

Change the data type of another dummy variable [other than smoker] (name a new variable), and assign the value labels. Then use summary() to discuss the summary results. [2 points]

Correct code for factor() [0.5 points]
Correct code for the levels and labels arguments, as shown below. [0.5 points]
Choose a dummy variable from any of these: alcohol, tripre1, tripre2, tripre3, tripre0, unmarried (smoker should not be used per instruction)
Discuss summary results [1 point]

Example of a dummy variable: unmarried

Given that unmarried is a dummy variable, I changed the variable structure from numeric to factor using the factor() function. Following the data documentation, unmarried mothers were coded as 1 and married mothers as 0. The summary function shows that 2320 out of 3000 mothers were married, and 680 mothers were unmarried in the sample.

# if you overwrite the original variable, run this labeling code only once, otherwise NAs occur
# though you can flip around the labels to reverse them 
# note: the documentation labels `unmarried` as 1. 
smoking_data$unmarried1 <- factor(smoking_data$unmarried, 
                                 levels = c(0,1), 
                                 labels = c("married", "unmarried"))

summary(smoking_data$unmarried1)

##   married unmarried 
##      2320       680
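
The warning about NAs in the code comments can be reproduced on a toy vector. The sketch below uses made-up values and overwrites nothing in `smoking_data`; it only shows what happens when factor() with `levels = c(0, 1)` is applied a second time to a variable that has already been labelled.

```r
# Toy 0/1 indicator (made-up values, not the real data)
unmarried <- c(0, 1, 1, 0, 1)

# First run: numeric codes 0/1 become labelled factor levels
labelled <- factor(unmarried, levels = c(0, 1),
                   labels = c("married", "unmarried"))
summary(labelled)

# Second run on the already-labelled variable: the values are now
# "married"/"unmarried", so none of them match levels 0 and 1
relabelled <- factor(labelled, levels = c(0, 1),
                     labels = c("married", "unmarried"))
summary(relabelled)
```

Saving the result under a new name, as the answer key does with unmarried1, keeps the original 0/1 column intact, so the chunk can be re-run safely.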

Q4

In your response, specify the condition(s) for at least one variable to create a new data frame. In the code chunk below, use filter() to generate a new data frame (specify a new name) using the conditions and operator. Lastly, use the summary() function to inspect the new dataset. [2 points]

Specify the condition(s) for filtering the dataset in your response (0.5 points)
Correct code should include (1) a name for the new dataset, (2) the function filter(), (3) the original dataset smoking_data, and (4) the criteria you use to filter the results. [1 point]
Check summary() of the new data. [0.5 points]
Discussion of the summary results is not required, but good job if you have done this!

Note: If you use filter(), you need to load the tidyverse or dplyr library. For setting the criteria, use any operators suggested above to define a correct variable range based on the data documentation. For instance, it is incorrect to use smoking_data$smoker > 18 as the criterion because smoker only takes the values 0 and 1. [-0.5 points for incorrect criteria.]

Here, I used the filter() function to create a new dataset called data_new, using two criteria: the mothers smoked during pregnancy and were unmarried. This smaller dataset contained only 250 observations. The average birthweight for unmarried mothers who smoked was lower (3102 grams), compared with that of our entire sample (3383 grams).

data_new <- filter(smoking_data,
                   smoker == 1 & unmarried == 1)

summary(data_new)

##     nprevist       alcohol         tripre1         tripre2        tripre3     
##  Min.   : 0.0   Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.: 6.0   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:0.000  
##  Median :10.0   Median :0.000   Median :1.000   Median :0.00   Median :0.000  
##  Mean   : 9.2   Mean   :0.072   Mean   :0.544   Mean   :0.32   Mean   :0.096  
##  3rd Qu.:12.0   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:1.00   3rd Qu.:0.000  
##  Max.   :30.0   Max.   :1.000   Max.   :1.000   Max.   :1.00   Max.   :1.000  
##     tripre0      birthweight       smoker    unmarried      educ      
##  Min.   :0.00   Min.   : 510   Min.   :1   Min.   :1   Min.   : 7.00  
##  1st Qu.:0.00   1st Qu.:2836   1st Qu.:1   1st Qu.:1   1st Qu.:10.00  
##  Median :0.00   Median :3147   Median :1   Median :1   Median :12.00  
##  Mean   :0.04   Mean   :3102   Mean   :1   Mean   :1   Mean   :11.39  
##  3rd Qu.:0.00   3rd Qu.:3487   3rd Qu.:1   3rd Qu.:1   3rd Qu.:12.00  
##  Max.   :1.00   Max.   :4508   Max.   :1   Max.   :1   Max.   :16.00  
##       age            drinks           unmarried1 
##  Min.   :15.00   Min.   : 0.000   married  :  0  
##  1st Qu.:20.00   1st Qu.: 0.000   unmarried:250  
##  Median :22.50   Median : 0.000                  
##  Mean   :23.56   Mean   : 0.324                  
##  3rd Qu.:27.00   3rd Qu.: 0.000                  
##  Max.   :38.00   Max.   :21.000

# example with a labelled variable
data_new1 <- filter(smoking_data,
                   smoker == 1 & unmarried1 == "unmarried")

summary(data_new1)

##     nprevist       alcohol         tripre1         tripre2        tripre3     
##  Min.   : 0.0   Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.: 6.0   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:0.000  
##  Median :10.0   Median :0.000   Median :1.000   Median :0.00   Median :0.000  
##  Mean   : 9.2   Mean   :0.072   Mean   :0.544   Mean   :0.32   Mean   :0.096  
##  3rd Qu.:12.0   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:1.00   3rd Qu.:0.000  
##  Max.   :30.0   Max.   :1.000   Max.   :1.000   Max.   :1.00   Max.   :1.000  
##     tripre0      birthweight       smoker    unmarried      educ      
##  Min.   :0.00   Min.   : 510   Min.   :1   Min.   :1   Min.   : 7.00  
##  1st Qu.:0.00   1st Qu.:2836   1st Qu.:1   1st Qu.:1   1st Qu.:10.00  
##  Median :0.00   Median :3147   Median :1   Median :1   Median :12.00  
##  Mean   :0.04   Mean   :3102   Mean   :1   Mean   :1   Mean   :11.39  
##  3rd Qu.:0.00   3rd Qu.:3487   3rd Qu.:1   3rd Qu.:1   3rd Qu.:12.00  
##  Max.   :1.00   Max.   :4508   Max.   :1   Max.   :1   Max.   :16.00  
##       age            drinks           unmarried1 
##  Min.   :15.00   Min.   : 0.000   married  :  0  
##  1st Qu.:20.00   1st Qu.: 0.000   unmarried:250  
##  Median :22.50   Median : 0.000                  
##  Mean   :23.56   Mean   : 0.324                  
##  3rd Qu.:27.00   3rd Qu.: 0.000                  
##  Max.   :38.00   Max.   :21.000
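
The point about choosing criteria within a variable's actual range can be checked before filtering. The sketch below is an illustration on a made-up data frame (`toy`), and it uses base R's subset() so that it runs without any packages; the same conditions can be passed to filter().

```r
# Made-up rows standing in for smoking_data
toy <- data.frame(
  smoker      = c(0, 1, 1, 0, 1),
  unmarried   = c(1, 1, 0, 0, 1),
  birthweight = c(3500, 2950, 3200, 3600, 3050)
)

# Check the observed range before writing a condition
range(toy$smoker)

# Valid condition: smoker and unmarried are both coded 0/1
valid_subset <- subset(toy, smoker == 1 & unmarried == 1)
nrow(valid_subset)

# Out-of-range condition: no value of smoker exceeds 1
empty_subset <- subset(toy, smoker > 18)
nrow(empty_subset)
```

An out-of-range condition does not raise an error; it silently returns a data frame with zero rows, which is why checking summary() of the new dataset matters.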

Q5

Describe the distribution of the variable birthweight, including mean, median, the IQR, and the shape. Discuss possible outliers using the measures of skewness (symmetry) and kurtosis (masses in tails). [2 points]

Your distribution does not necessarily match the ones shown in the answer key, but a complete response should describe the following in detail:

shape of the distribution [0.5 points]
mean and median [0.5 points]
IQR [0.5 points]
skewness and kurtosis [0.5 points]

Note: As a recap of the lecture, a normal distribution has zero skewness (it is symmetrical) and a kurtosis of three. Any symmetric data should have a skewness near zero. The Central Limit Theorem states that, in many situations, as long as the sample size is large ($n > 30$), the distribution of the sample means will be approximately normal. However, a highly skewed distribution calls for a variable transformation; future lectures cover the log transformation.

Original dataset (approximately normal distribution): The distribution of birthweight follows a bell shape and is approximately normal. The mean or the average birthweight is 3382.93 grams, with a median of 3420 grams and an interquartile range of 688 grams. It is slightly skewed to the left (or negatively skewed) with a skewness of -0.83. Note that describe() reports excess kurtosis (kurtosis minus 3, so a normal distribution scores 0). The reported value of 2.54 therefore indicates heavier tails than a normal distribution, which points to possible outliers such as the minimum of 425 grams.

library(psych) # for describe() function

# original dataset
hist(smoking_data$birthweight)

describe(smoking_data$birthweight, IQR=TRUE)

##    vars    n    mean     sd median trimmed    mad min  max range  skew kurtosis
## X1    1 3000 3382.93 592.16   3420 3412.04 520.39 425 5755  5330 -0.83     2.54
##       se IQR
## X1 10.81 688
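
To see where the skewness and kurtosis figures come from, the sketch below computes them by hand on a made-up vector. It uses one common moment-based definition with the sample standard deviation; this is an assumption for illustration and may differ slightly from the exact formula describe() applies.

```r
# Made-up, left-skewed toy values (not the real birthweights)
x <- c(1, 8, 9, 10, 10)

n <- length(x)
d <- x - mean(x)   # deviations from the mean

# Third and fourth standardized moments, using the sample sd
skewness        <- sum(d^3) / (n * sd(x)^3)
kurtosis        <- sum(d^4) / (n * sd(x)^4)
excess_kurtosis <- kurtosis - 3   # a normal distribution scores 0 here

c(skewness = skewness, kurtosis = kurtosis, excess = excess_kurtosis)
```

The single low value pulls the third moment below zero, which is what a negative (left) skew means. Subtracting 3 from the kurtosis gives the excess scale, on which positive values indicate heavier tails than a normal distribution.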

If you continued to use the new dataset generated from Q4, it is acceptable this time as long as you were able to describe the distribution as below.

Subsetted dataset (skewed distribution): The distribution of birthweight does not follow a bell shape and is skewed to the left or negatively skewed, with a skewness of -1.11. The mean or the average birthweight is 3101.63 grams, with a median of 3147 grams and an interquartile range of 650.75 grams. It has heavier tails than a normal distribution, with an excess kurtosis of 2.72 as reported by describe() (a normal distribution scores 0 on this scale).

# new dataset with a skewed distribution
hist(data_new$birthweight)

describe(data_new$birthweight, IQR=TRUE)

##    vars   n    mean     sd median trimmed    mad min  max range  skew kurtosis
## X1    1 250 3101.63 623.19   3147 3150.52 504.08 510 4508  3998 -1.11     2.72
##       se    IQR
## X1 39.41 650.75
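
The IQR reported by describe() can be cross-checked against the quartiles printed by summary(). The sketch below does this arithmetic for the full sample, using the 1st and 3rd quartiles of birthweight from the Q2 output, and then repeats the idea on a made-up vector `x`.

```r
# Quartiles of birthweight as printed by summary(smoking_data)
q1 <- 3062
q3 <- 3750
iqr_full <- q3 - q1
iqr_full

# The same idea on a made-up vector: IQR() equals the gap between
# the 25th and 75th percentiles returned by quantile()
x <- c(2, 4, 4, 5, 7, 9, 12)
quartiles <- quantile(x, c(0.25, 0.75))
all.equal(IQR(x), unname(diff(quartiles)))
```

For the subsetted data, the quartiles printed by summary() (2836 and 3487) give 651 rather than 650.75, because summary() rounds its output to a few significant digits.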

Q6

Use ggplot() to create a histogram with a density plot for another continuous variable other than birthweight. What can you observe from the graph in terms of the distribution? [1 point]

Correct code for ggplot() with geom_histogram() [0.5 points]
Any one of the continuous variables other than birthweight: drinks, nprevist, age, educ [-0.5 points for incorrect variable]
Discuss the distribution (indicate: approximately normal or skewed to the left/right) [0.5 points]

Note: You should put only one variable within aes(), and add the geom_histogram() layer to plot the histogram. See more examples and illustrations here: https://ggplot2.tidyverse.org/reference/geom_histogram.html

nprevist: Based on the histogram, we can see that the distribution of nprevist is skewed to the right (positively skewed). (Some mothers had more than 30 visits, but the majority had fewer, as the mean was 11 visits.)

age: The distribution of age is slightly skewed to the right (positively skewed).

drinks: The distribution of drinks is highly skewed to the right (positively skewed). (A few mothers in the sample drank heavily, while most did not drink.)

educ: The distribution of educ is skewed to the left (negatively skewed). (Most mothers had at least 12 years of education, but a small number had very few years of education, which skews the distribution to the left.)

library(ggplot2)

# any one of the plots below
# after_stat(density) replaces the deprecated ..density.. notation

ggplot(smoking_data, aes(nprevist)) + 
  labs(title = "Histogram and Kernel Density of Prenatal Visits") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "orange")

ggplot(smoking_data, aes(age)) + 
  labs(title = "Histogram and Kernel Density of Mothers' age") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "pink")

ggplot(smoking_data, aes(drinks)) + 
  labs(title = "Histogram and Kernel Density of Drinks during Pregnancy") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "lightgreen")

ggplot(smoking_data, aes(educ)) + 
  labs(title = "Histogram and Kernel Density of Years of Mothers' Education") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "steelblue")
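
The plots above draw the histogram on the density scale so that it shares a y-axis with the kernel density curve. The sketch below illustrates what that scale means, using base R's hist() on made-up visit counts: on the density scale the bars have a total area of 1, just like a density curve.

```r
# Made-up visit counts (not the real nprevist values)
visits <- c(0, 6, 8, 9, 10, 11, 12, 12, 13, 14, 15, 20, 30)

h <- hist(visits, plot = FALSE)   # compute bins without drawing

# On the density scale, bar height times bin width sums to 1
total_area <- sum(h$density * diff(h$breaks))
total_area

# On the count scale, bar heights sum to the sample size instead
sum(h$counts)
```

If the histogram were left on the count scale, the density curve would sit almost flat along the bottom of the plot, because its area is 1 while the bars add up to the number of observations.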

Q7

Plot a new scatterplot using ggplot() for another continuous variable (x) and birthweight, and interpret the relationship. You can add themes, labels, colors that make your graph look neat and professional. [2 points]

Correct code for plotting one scatterplot with ggplot() and geom_point() [1 point]
Discuss the slope of the line [0.5 points]
Professional presentation with labels and a trend line [0.5 points]

For the scatterplot, use any one of these continuous variables for x: drinks, nprevist, age, educ; y should be birthweight. Only one scatterplot is needed.

drinks: The scatterplot reveals that drinks is negatively related to birthweight, as indicated by the downward trend (negative slope). As the number of drinks goes up, birthweight tends to decrease.

nprevist: The scatterplot reveals that number of prenatal visits is positively related to birthweight as indicated by the upward trend (positive slope). As the number of prenatal visits increases, birthweight tends to go up.

age: The scatterplot reveals that age is positively related to birthweight as indicated by the upward trend (a gentle positive slope). As the mothers’ age increases, birthweight tends to go up.

educ: The scatterplot reveals that years of education is positively related to birthweight, as indicated by the upward trend (positive slope). As mothers’ years of education increase, birthweight tends to increase.

# Drinks
plot_drinks <- ggplot(smoking_data, aes(x = drinks, y = birthweight)) + 
  geom_point(color = 'lightgreen') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Drinking and Birthweight of Infants", 
    subtitle = "Lab 3",
    caption = "(Based on data in Pennsylvania in 1989)",
    x = "Number of Drinks per week",
    y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans")
plot_drinks

# Prenatal visits
plot_nprevist <- ggplot(smoking_data, aes(x = nprevist, y = birthweight)) + 
  geom_point(color = 'orange') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Prenatal Visits and Birthweight of Infants", 
    subtitle = "Lab 3",
    caption = "(Based on data in Pennsylvania in 1989)",
    x = "Number of Prenatal Visits",
    y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans")
plot_nprevist

# Age
plot_age <- ggplot(smoking_data, aes(x = age, y = birthweight)) + 
  geom_point(color = 'lightblue') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Mothers' Age and Birthweight of Infants", 
    subtitle = "Lab 3",
    caption = "(Based on data in Pennsylvania in 1989)",
    x = "Mothers' Age",
    y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans")
plot_age

# Education
plot_educ <- ggplot(smoking_data, aes(x = educ, y = birthweight)) + 
  geom_point(color = 'salmon') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Mother's Education and Birthweight of Infants", 
    subtitle = "Lab 3",
    caption = "(Based on data in Pennsylvania in 1989)",
    x = "Mother's Education Level",
    y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans")
plot_educ
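
The trend line drawn by geom_smooth(method = "lm") is a fitted linear regression, so the direction discussed in the written answer can be confirmed numerically. The sketch below fits lm() to made-up values (the vectors `visits` and `birthweight` are illustrative, not the real data) and reads off the slope.

```r
# Made-up values standing in for nprevist and birthweight
visits      <- c(5, 8, 10, 12, 15)
birthweight <- c(3000, 3100, 3300, 3350, 3600)

fit   <- lm(birthweight ~ visits)
slope <- coef(fit)[["visits"]]

slope                       # sign gives the direction of the trend line
cor(visits, birthweight)    # same sign as the slope
```

A positive slope corresponds to an upward trend line and a negative slope to a downward one; the correlation always carries the same sign as the slope.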

This is the end of Lab 2 Assignment. Keep up the good work!

SPP608 Statistical Methods Lab 2 Answer Key

Viviana Wu

2/18/2025