Ensure that for each question, you have included:
Correct codes: Verify the accuracy of the coding syntax, labels and logic.
R outputs: Check if the outputs generated by R are correctly interpreted and relevant to the question.
Written answer: Make sure answers are in complete sentences. Reference the data dictionary/documentation for precision and thoroughness.
If revisions are necessary, specify the question numbers requiring adjustments. Update both the codes and/or the written responses in your original submission to correct any inaccuracies or to provide additional clarity.
Summarize the key concepts and skills you have practiced this week. Reflect on how these exercises have contributed to your understanding of the material and your ability to apply statistical analysis using R. Consider the challenges you faced, how you overcame them, what you learned from the process, and how you will get more support and clarity if challenges were not resolved.
Question numbers with corrections:
Your Reflection for this week:
Import the dataset “birthweight_smoking.csv”, name your dataset.
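A minimal sketch of the import step. To keep it runnable without the real file, it first writes a tiny stand-in CSV with made-up values to a temporary folder; in the lab only the read.csv() call with the real file name is needed. The name toy_data and the three columns shown are assumptions for illustration.

```r
# Stand-in file so the sketch runs anywhere; the values are made up.
csv_path <- file.path(tempdir(), "birthweight_smoking.csv")
writeLines(c("birthweight,smoker,age",
             "3400,0,27",
             "2900,1,22",
             "3650,0,31"), csv_path)

# In the lab, pass "birthweight_smoking.csv" (located in the working
# directory) instead of csv_path, and choose your own dataset name.
toy_data <- read.csv(csv_path)

str(toy_data)      # variable names and storage types
summary(toy_data)  # minimum, quartiles, mean and maximum per column
```

Running str() right after the import is a quick way to confirm that every column was read as a number before moving on to summary().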
From the summary() statistics output, distinguish which are numerical or categorical variables in the dataset. (Tips: A categorical variable takes on two or more values which represent categories or labels without inherent numerical meaning. A numerical variable has countable or infinite values within a given range.) [1 point]
Numerical variables include birthweight, drinks, nprevist, age, and educ, as they take on numeric values between a minimum and maximum value, such as number of drinks, number of visits and years of education as defined by the data codebook.
The dataset also includes a number of dummy (categorical) variables: smoker, alcohol, tripre1, tripre2, tripre3, tripre0, and unmarried. They are categorical because each contains two distinct categories or groups.
Note that numerical variables can always be converted into categorical variables, but not the other way round. The variables age and years of education, as measured in the codebook, are examples of discrete variables (a numerical quantity recorded in whole units). Discrete variables (integer-based) are similar to continuous variables in that both are numerical, but discrete variables take countable values, whereas continuous variables can take any value within a range. If education is defined as ordered categories (such as high school, bachelor’s degree or above), then it becomes a categorical (ordinal) variable.
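The distinction above can be checked in code. The sketch below uses a toy data frame with made-up values (not the lab data) to flag columns that only take the values 0 and 1, and then converts years of education into ordered categories with cut(). The cut points and category labels are assumptions chosen for illustration, not taken from the codebook.

```r
# Toy data (hypothetical values) mimicking a few columns of the lab dataset
toy_data <- data.frame(
  birthweight = c(3400, 2900, 3650, 3100, 3800),
  educ        = c(10, 12, 16, 12, 9),
  smoker      = c(0, 1, 0, 1, 0),
  unmarried   = c(0, 1, 0, 0, 1)
)

# A column that only ever takes the values 0 and 1 is a dummy variable
is_dummy <- sapply(toy_data, function(x) all(na.omit(x) %in% c(0, 1)))
names(toy_data)[is_dummy]   # candidates for factor()
names(toy_data)[!is_dummy]  # treated as numerical

# Numerical -> ordinal: group years of education into ordered categories.
# The breaks (up to 11, exactly 12, above 12) are illustrative assumptions.
toy_data$educ_level <- cut(toy_data$educ,
                           breaks = c(-Inf, 11, 12, Inf),
                           labels = c("below high school",
                                      "high school",
                                      "above high school"),
                           ordered_result = TRUE)
table(toy_data$educ_level)
```

Going the other way is not possible: once only the category is stored, the exact number of years cannot be recovered.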
## nprevist alcohol tripre1 tripre2
## Min. : 0.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 9.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median :12.00 Median :0.00000 Median :1.000 Median :0.000
## Mean :10.99 Mean :0.01933 Mean :0.804 Mean :0.153
## 3rd Qu.:13.00 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :35.00 Max. :1.00000 Max. :1.000 Max. :1.000
## tripre3 tripre0 birthweight smoker unmarried
## Min. :0.000 Min. :0.00 Min. : 425 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3062 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.00 Median :3420 Median :0.000 Median :0.0000
## Mean :0.033 Mean :0.01 Mean :3383 Mean :0.194 Mean :0.2267
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:3750 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.00 Max. :5755 Max. :1.000 Max. :1.0000
## educ age drinks
## Min. : 0.00 Min. :14.00 Min. : 0.00000
## 1st Qu.:12.00 1st Qu.:23.00 1st Qu.: 0.00000
## Median :12.00 Median :27.00 Median : 0.00000
## Mean :12.91 Mean :26.89 Mean : 0.05833
## 3rd Qu.:14.00 3rd Qu.:31.00 3rd Qu.: 0.00000
## Max. :17.00 Max. :44.00 Max. :21.00000
Change the data type of another dummy variable [other than smoker] (name a new variable), and assign the value labels. Then use summary() to discuss the summary results. [2 points]
factor() [0.5 points]
levels and labels as follows. [0.5 points]
alcohol, tripre1, tripre2, tripre3, tripre0, unmarried (smoker should not be used per instruction)
Example of a dummy variable: unmarried
Given that unmarried is a dummy variable, I changed the variable structure from numeric to factor using the factor() function. Following the data documentation, unmarried mothers were coded as 1 and married mothers as 0. The summary() function shows that 2320 out of 3000 mothers were married and 680 were unmarried in the sample.
# the following labeling code should be run only once, otherwise NAs occur
# though you can flip around the labels to reverse them
# note: the documentation labels `unmarried` as 1.
smoking_data$unmarried1 <- factor(smoking_data$unmarried,
levels = c(0,1),
labels = c("married", "unmarried"))
summary(smoking_data$unmarried1)
## married unmarried
## 2320 680
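As a hedged illustration of the labelling step and of the NA warning in the code comments, the sketch below uses a five-row toy data frame (made-up values). It cross-tabulates the original codes against the new labels, then shows what happens when factor() is applied with numeric levels to a variable that has already been labelled.

```r
# Toy data (hypothetical values)
toy_data <- data.frame(unmarried = c(0, 1, 0, 0, 1))

toy_data$unmarried1 <- factor(toy_data$unmarried,
                              levels = c(0, 1),
                              labels = c("married", "unmarried"))

# Every 0 should fall under "married" and every 1 under "unmarried"
check <- table(code = toy_data$unmarried, label = toy_data$unmarried1)
check

# Applying the same call to the already-labelled variable yields NAs,
# because its values are now "married"/"unmarried" rather than 0/1
relabelled <- factor(toy_data$unmarried1,
                     levels = c(0, 1),
                     labels = c("married", "unmarried"))
sum(is.na(relabelled))
```

Storing the labelled version under a new name, as the answer key does with unmarried1, keeps the original 0/1 column intact, so the labelling code can be rerun safely.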
In your response, specify the condition(s) for at least one variable to create a new data frame. In the code chunk below, use filter() to generate a new data frame (specify a new name) using the conditions and operator. Lastly, use the summary() function to inspect the new dataset. [2 points]
Specify the condition(s) for filtering the dataset in your response (0.5 points)
Correct code should include (1) a variable to name a new dataset, (2) the function filter(), (3) the original dataset smoking_data, and (4) the criteria you use to filter the results. [1 point]
Check summary() of the new data. [0.5 points]
Discussion of the summary results is not required, but good job if you have done this!
Note: If you use filter(), you need to load the tidyverse or dplyr library. For setting the criteria, use any operators suggested above to define a correct variable range based on the data documentation. For instance, it is incorrect to use smoking_data$smoker > 18 as the criterion because smoker only takes the values 0 and 1. [-0.5 points for incorrect criteria]
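The point about valid criteria can be sketched on a toy data frame (made-up values, assuming the dplyr package is installed). Checking range() first shows which conditions a variable can actually satisfy; a condition outside that range silently returns an empty data frame rather than an error.

```r
library(dplyr)

# Toy data (hypothetical values)
toy_data <- data.frame(
  birthweight = c(3400, 2900, 3650, 3100, 3800, 2500),
  smoker      = c(0, 1, 0, 1, 0, 1),
  age         = c(27, 22, 31, 19, 35, 17)
)

# Check the observed range first, so the criteria are attainable
range(toy_data$smoker)  # a dummy variable: only 0 and 1
range(toy_data$age)

# Valid criteria: smokers aged 18 or older
adult_smokers <- filter(toy_data, smoker == 1 & age >= 18)

# Invalid criterion: no observation can satisfy smoker > 18
impossible <- filter(toy_data, smoker > 18)

nrow(adult_smokers)
nrow(impossible)
```

An empty result from filter() is therefore a useful warning sign that the criteria do not match how the variable is coded.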
Here, I used the filter() function to create a new dataset called data_new, using two criteria: that mothers smoked during pregnancy and were unmarried. This smaller dataset contained only 250 observations. The average birthweight of infants born to unmarried mothers who smoked was lower (3102 grams) than that of our entire sample (3383 grams).
## nprevist alcohol tripre1 tripre2 tripre3
## Min. : 0.0 Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.: 6.0 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :10.0 Median :0.000 Median :1.000 Median :0.00 Median :0.000
## Mean : 9.2 Mean :0.072 Mean :0.544 Mean :0.32 Mean :0.096
## 3rd Qu.:12.0 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :30.0 Max. :1.000 Max. :1.000 Max. :1.00 Max. :1.000
## tripre0 birthweight smoker unmarried educ
## Min. :0.00 Min. : 510 Min. :1 Min. :1 Min. : 7.00
## 1st Qu.:0.00 1st Qu.:2836 1st Qu.:1 1st Qu.:1 1st Qu.:10.00
## Median :0.00 Median :3147 Median :1 Median :1 Median :12.00
## Mean :0.04 Mean :3102 Mean :1 Mean :1 Mean :11.39
## 3rd Qu.:0.00 3rd Qu.:3487 3rd Qu.:1 3rd Qu.:1 3rd Qu.:12.00
## Max. :1.00 Max. :4508 Max. :1 Max. :1 Max. :16.00
## age drinks unmarried1
## Min. :15.00 Min. : 0.000 married : 0
## 1st Qu.:20.00 1st Qu.: 0.000 unmarried:250
## Median :22.50 Median : 0.000
## Mean :23.56 Mean : 0.324
## 3rd Qu.:27.00 3rd Qu.: 0.000
## Max. :38.00 Max. :21.000
# example with a labelled variable
library(dplyr)  # provides filter()
data_new1 <- filter(smoking_data,
                    smoker == 1 & unmarried1 == "unmarried")
summary(data_new1)
## nprevist alcohol tripre1 tripre2 tripre3
## Min. : 0.0 Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.: 6.0 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :10.0 Median :0.000 Median :1.000 Median :0.00 Median :0.000
## Mean : 9.2 Mean :0.072 Mean :0.544 Mean :0.32 Mean :0.096
## 3rd Qu.:12.0 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :30.0 Max. :1.000 Max. :1.000 Max. :1.00 Max. :1.000
## tripre0 birthweight smoker unmarried educ
## Min. :0.00 Min. : 510 Min. :1 Min. :1 Min. : 7.00
## 1st Qu.:0.00 1st Qu.:2836 1st Qu.:1 1st Qu.:1 1st Qu.:10.00
## Median :0.00 Median :3147 Median :1 Median :1 Median :12.00
## Mean :0.04 Mean :3102 Mean :1 Mean :1 Mean :11.39
## 3rd Qu.:0.00 3rd Qu.:3487 3rd Qu.:1 3rd Qu.:1 3rd Qu.:12.00
## Max. :1.00 Max. :4508 Max. :1 Max. :1 Max. :16.00
## age drinks unmarried1
## Min. :15.00 Min. : 0.000 married : 0
## 1st Qu.:20.00 1st Qu.: 0.000 unmarried:250
## Median :22.50 Median : 0.000
## Mean :23.56 Mean : 0.324
## 3rd Qu.:27.00 3rd Qu.: 0.000
## Max. :38.00 Max. :21.000
Describe the distribution of the variable birthweight, including mean, median, the IQR, and the shape. Discuss possible outliers using the measures of skewness (symmetry) and kurtosis (masses in tails). [2 points]
Your distribution does not necessarily follow the ones shown in the answer key, but a complete response should describe the following in detail:
Note: As a recap of the lecture, a normal distribution has zero skewness (symmetrical) and a kurtosis of three. Any symmetric data should have a skewness near zero. The Central Limit Theorem posits that, in many situations, as long as the sample size is large (\(n > 30\)), the distribution of the sample means will be approximately normal. However, when a distribution is highly skewed, we will need to transform the variable; more on this in future lectures on Log Transformation.
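Original dataset (approximately normal distribution): The distribution of birthweight is roughly bell-shaped and approximately normal. The mean birthweight is 3382.93 grams, with a median of 3420 grams and an interquartile range of 688 grams. It is skewed to the left (negatively skewed), with a skewness of -0.83. The reported kurtosis is 2.54. On the raw scale, where a normal distribution scores three, this indicates slightly lighter tails than a normal distribution. If the output comes from describe() in the psych package, however, the figure is excess kurtosis (a normal distribution scores zero), and 2.54 then indicates heavier tails than a normal distribution.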
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 3000 3382.93 592.16 3420 3412.04 520.39 425 5755 5330 -0.83 2.54
## se IQR
## X1 10.81 688
If you continued to use the new dataset generated from Q4, it is acceptable this time as long as you were able to describe the distribution as below.
Subsetted dataset (skewed distribution): The distribution of birthweight does not follow a bell shape and is skewed to the left (negatively skewed), with a skewness of -1.11. The mean birthweight is 3101.63 grams, with a median of 3147 grams and an interquartile range of 650.75 grams. Its kurtosis of 2.72 is higher than the 2.54 of the original dataset, indicating heavier tails than in the full sample.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 250 3101.63 623.19 3147 3150.52 504.08 510 4508 3998 -1.11 2.72
## se IQR
## X1 39.41 650.75
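The skewness and kurtosis figures reported in such output can be reproduced from their moment definitions. The sketch below is a base R illustration on simulated left-skewed values (not the lab data); the variable names are hypothetical. It computes kurtosis on both the raw scale and the excess scale, since functions differ in which one they report.

```r
# Simulated left-skewed values; not the lab data
set.seed(1)
x <- 4000 - rexp(500, rate = 1 / 600)

m2 <- mean((x - mean(x))^2)  # second central moment (variance, divisor n)
m3 <- mean((x - mean(x))^3)  # third central moment
m4 <- mean((x - mean(x))^4)  # fourth central moment

skewness        <- m3 / m2^(3 / 2)  # symmetric data: near 0
kurtosis_raw    <- m4 / m2^2        # normal distribution: 3
kurtosis_excess <- kurtosis_raw - 3 # normal distribution: 0

c(mean = mean(x), median = median(x), IQR = IQR(x))
c(skewness = skewness,
  kurtosis_raw = kurtosis_raw,
  kurtosis_excess = kurtosis_excess)
```

A negative skewness goes together with a long left tail, which typically pulls the mean below the median. Packaged functions may apply small-sample corrections, so their values can differ slightly from these plain moment ratios.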
Use ggplot() to create a histogram with a density plot for another continuous variable other than birthweight. What can you observe from the graph in terms of the distribution? [1 point]
ggplot() with the argument geom_histogram [0.5 points]
drinks, nprevist, age, educ other than birthweight [-0.5 points for incorrect variable]
Note: You should put only one variable within aes(), and use the argument geom_histogram() for plotting a histogram.
See more examples and illustrations here: https://ggplot2.tidyverse.org/reference/geom_histogram.html
nprevist: Based on the histogram, we can see that the distribution of nprevist is skewed to the right (positively skewed). (Some mothers had more than 30 visits, but a majority had fewer, as the mean was 11 visits.)
age: The distribution of age is slightly skewed to the right (positively skewed).
drinks: The distribution of drinks is highly skewed to the right (positively skewed). (There were a few heavier drinkers in the sample while most of the others did not drink.)
educ: The distribution of educ is skewed to the left (negatively skewed). (Most mothers had at least 12 years of education, but a small number had very few years of education, which negatively skews the sample.)
library(ggplot2)
# any one of the plots below is sufficient
# after_stat(density) puts the histogram on the density scale so the
# density curve overlays it (it replaces the deprecated ..density..)
ggplot(smoking_data, aes(nprevist)) +
  labs(title = "Histogram and Kernel Density of Prenatal Visits") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "orange")
ggplot(smoking_data, aes(age)) +
  labs(title = "Histogram and Kernel Density of Mothers' age") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "pink")
ggplot(smoking_data, aes(drinks)) +
  labs(title = "Histogram and Kernel Density of Drinks during Pregnancy") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "lightgreen")
ggplot(smoking_data, aes(educ)) +
  labs(title = "Histogram and Kernel Density of Years of Mothers' Education") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "steelblue")
Plot a new scatterplot using ggplot() for another continuous variable (x) and birthweight, and interpret the relationship. You can add themes, labels, colors that make your graph look neat and professional. [2 points]
ggplot() and geom_point() [1 point]
For the scatterplot, use any of these continuous variables for x: drinks, nprevist, age, educ; y should be birthweight. Only one scatterplot is needed.
drinks: The scatterplot reveals that drinks is negatively related to birthweight as indicated by the downward trend (negative slope). As the number of drinks goes up, birthweight tends to decrease.
nprevist: The scatterplot reveals that number of prenatal visits is positively related to birthweight as indicated by the upward trend (positive slope). As the number of prenatal visits increases, birthweight tends to go up.
age: The scatterplot reveals that age is positively related to birthweight as indicated by the upward trend (a gentle positive slope). As the mothers’ age increases, birthweight tends to go up.
educ: The scatterplot reveals that years of education is positively related to birthweight as indicated by the upward trend (positive slope). As the mothers' years of education increase, birthweight tends to increase.
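The direction read off a scatterplot can be backed up with two numbers: the correlation coefficient and the slope of the fitted line. The sketch below uses simulated data (hypothetical values, not the lab data) in which birthweight is generated to rise with the number of visits; visits and birthweight are illustrative names.

```r
# Simulated data (hypothetical): not the lab dataset
set.seed(2)
visits      <- rpois(200, lambda = 11)
birthweight <- 3000 + 30 * visits + rnorm(200, sd = 500)

# Pearson correlation: the sign gives the direction, the size the strength
r <- cor(visits, birthweight)

# Slope of the least-squares line, the same kind of line that
# geom_smooth(method = "lm") draws on a scatterplot
fit   <- lm(birthweight ~ visits)
slope <- coef(fit)[["visits"]]

r
slope
```

The correlation and the slope always share the same sign, so either can confirm whether the trend is upward or downward. The slope also reports the change in birthweight, in grams, associated with one additional unit of x.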
library(ggplot2)
library(ggthemes)  # provides theme_fivethirtyeight()

# theme_fivethirtyeight() hides axis titles by default, so
# theme(axis.title = element_text()) is added to show the x and y labels.

# Drinks
plot_drinks <- ggplot(smoking_data, aes(x = drinks, y = birthweight)) +
  geom_point(color = 'lightgreen') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Drinking and Birthweight of Infants",
       subtitle = "Lab 3",
       caption = "(Based on data in Pennsylvania in 1989)",
       x = "Number of Drinks per week",
       y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans") +
  theme(axis.title = element_text())
plot_drinks
# Prenatal visits
plot_nprevist <- ggplot(smoking_data, aes(x = nprevist, y = birthweight)) +
  geom_point(color = 'orange') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Prenatal Visits and Birthweight of Infants",
       subtitle = "Lab 3",
       caption = "(Based on data in Pennsylvania in 1989)",
       x = "Number of Prenatal Visits",
       y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans") +
  theme(axis.title = element_text())
plot_nprevist
# Age
plot_age <- ggplot(smoking_data, aes(x = age, y = birthweight)) +
  geom_point(color = 'lightblue') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Mothers' Age and Birthweight of Infants",
       subtitle = "Lab 3",
       caption = "(Based on data in Pennsylvania in 1989)",
       x = "Mothers' Age",
       y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans") +
  theme(axis.title = element_text())
plot_age
# Education
plot_educ <- ggplot(smoking_data, aes(x = educ, y = birthweight)) +
  geom_point(color = 'salmon') +
  geom_smooth(method = "lm", colour = "grey", linewidth = 0.5, se = FALSE) +
  labs(title = "Scatterplot of Mother's Education and Birthweight of Infants",
       subtitle = "Lab 3",
       caption = "(Based on data in Pennsylvania in 1989)",
       x = "Mother's Education Level",
       y = "Infants' birthweight (in grams)") +
  theme_fivethirtyeight(base_size = 10, base_family = "sans") +
  theme(axis.title = element_text())
plot_educ
This is the end of Lab 2 Assignment. Keep up the good work!