Slides for Week 3 is available here
In this tutorial, you will learn how to:
Tips: You should have installed the following packages in Lab 2 using
install.packages
code. Remember you shoud only install
packages once. In every R session, you need to load the library before
you can use their corresponding functions.
library(psych)
library(ggplot2)
library(ggthemes)
library(ggpubr) # for ggboxplot()
library(tidyverse) # for filter()
library(dplyr)
library(gapminder)
options(scipen=999) # remove scientific notation
Want to include emojis and make it fun? Install the emo
package! (Remove the hashtag to run the code and put it back once you
download it once.) For a complete list of emojis that rmarkdown supports
please follow the link here.
Step 0. Look at the Data Documentation
Check the Documentation for Birthweight_Smoking
in your
project folder.
Your filename.csv
needs to be exact and indicate the
file format, which is a .csv file (Comma Separated Values file)–a
delimited text file that uses a comma to separate values. It is one
types of Excel spreadsheet besides the common .xlsx format)
smoking_data <- read.csv("birthweight_smoking.csv")
smoking_data1 <- smoking_data
# you can save a copy of the original dataset
# smoking_data1 is now the spare copy shall we "mess up" smoking_data!
Before proceeding to analyze our data, we have to do some data
cleaning, especially for our IVs and DV. We have done some of that in
Lab 2. We have also inspected the summary statistics of our continuous
variables. Now, let’s review our last demonstration on
smoker
.
Create a new variable with a factor structure
smoking_data$smoker1 <- factor(smoking_data$smoker)
smoking_data$smoker2 <- factor(smoking_data$smoker,
levels = c(0,1),
labels = c("nonsmoker", "smoker"))
# Now, check if the data structure and summary for `smoker1` has changed
str(smoking_data$smoker2)
## Factor w/ 2 levels "nonsmoker","smoker": 2 1 2 1 1 1 2 1 1 1 ...
summary(smoking_data$smoker2)
## nonsmoker smoker
## 2418 582
Let’s turn unmarried into a factor variable and label the two levels.
smoking_data$unmarried1 <- factor(smoking_data$unmarried)
smoking_data$unmarried2 <- factor(smoking_data$unmarried,
levels = c(0,1),
labels = c("married", "unmarried"))
# Now, check if the data structure and summary has changed
str(smoking_data$unmarried2)
## Factor w/ 2 levels "married","unmarried": 2 1 1 1 1 1 1 1 1 1 ...
summary(smoking_data$unmarried2)
## married unmarried
## 2320 680
In this lab, we will focus on exploring our dummy or categorical variables, which constitutes two or more mutually exclusive categories.
Create a frequency table of a factor variable (dummy or categorical); The frequencies are ordered and labeled by the level attributes of the factor.
table(smoking_data$smoker2)
##
## nonsmoker smoker
## 2418 582
We can express table as proportion tables or fraction of marginal table of two variables.
Tips: the function round
helps us round the values to a
specified number of decimal places (default is 0), here we give it
2
. Note that there is a comma
before the
number!
# for one variable
round(prop.table(table(smoking_data$smoker2)), 2)
##
## nonsmoker smoker
## 0.81 0.19
# or use nrow
round(table(smoking_data$unmarried2) / nrow(smoking_data), 2)
##
## married unmarried
## 0.77 0.23
# a frequency table for two variables
round(prop.table(table(smoking_data$unmarried2, smoking_data$smoker2), margin = 2), 2)
##
## nonsmoker smoker
## married 0.82 0.57
## unmarried 0.18 0.43
# the first variable (unmarried) is the row, the second variable (smoker) is the column
boxplot()
Using boxplot() we tell R to create a horizontal box plot. Horizontal and vertical box plots display the data distribution using a rectangular box and whiskers. Whiskers are lines that indicate a data range outside of the box. The “interquartile range”, abbreviated “IQR”, is just the width of the box in the box-and-whisker plot.
hist(smoking_data$birthweight)
boxplot(smoking_data$birthweight)
Here we will only focus on the dplyr
way:
Use pipe %>%
to emphasize a sequence of
actions
Note: %>%
should always have a space before it,
and should usually be followed by a new line.
Let’s use the verb group_by
to specify the grouping
variable for comparison. Here, suppose the grouping variable is
smoker2
, we want to find out the count, mean, median and
IQR of birthweight by each subgroup, i.e., non-smoking and smoking
mothers. Besides the statistics functions, we can use the
summerise
function from the dplyr library to create columns
of summary statistics, i.e., count, mean_bw, median_bw and IQR_bw:
smoking_data %>%
group_by(smoker2) %>%
summarise(count = n(),
mean_bw = mean(birthweight),
median_bw = median(birthweight),
IQR_bw = IQR(birthweight),
skew_bw = skew(birthweight))
## # A tibble: 2 × 6
## smoker2 count mean_bw median_bw IQR_bw skew_bw
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker 2418 3432. 3459 680 -0.818
## 2 smoker 582 3179. 3220 624 -1.02
# count, mean_bw, median_bw, etc. are variables we created, and can be named differently.
ggboxplot
: Boxplot of two variables by groupsWe can compare the outcomes of the two groups by creating a boxplot by the values of categorical variable, say for instance, the birthweight of infants whose mothers were smokers against that of the counterparts whose mothers were non-smokers.
Check out this color cheatsheet for setting the palette: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf
library(ggpubr)
ggboxplot(smoking_data, x = "smoker2", y = "birthweight",
color = "smoker2", palette = c("#00AFBB", "#FC4E07"),
order = c("nonsmoker", "smoker"),
ylab = "Infants' Birthweight (in grams)", xlab = "")
Based on the summary statistics by group and the boxplot, what can you say about the differences in median and IQR of infants’ birthweight between smoking mothers and non-smoking mothers?
Suppose you are interested in the means of two different populations, denote them \(\mu1\) and \(\mu2\). More specifically, you are interested whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations.
We set the significance level, also denoted as alpha or \(\alpha\) at the 0.05 level, which is the probability of rejecting the null hypothesis when it is true (i.e., incorrectly rejecting the null hypothesis). In other words, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
Before we run a hypothesis test, we have to first define our null hypothesis and alternate hypothesis. In the case of a two-tailed test:
The null hypothesis: There is no difference in infants’ birthweight between smoking and non-smoking mothers.
\[H_0: \mu1 - \mu2 =0\]
The alternative hypothesis: There is a difference in infants’ birthweight between smoking and non-smoking mothers.
\[H_1: \mu1 - \mu2 \neq 0\]
t.test(birthweight ~ smoker2, data = smoking_data)
##
## Welch Two Sample t-test
##
## data: birthweight by smoker2
## t = 9.4414, df = 887.15, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
## 200.5882 305.8685
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 3432.060 3178.832
# for t-test, variable label or factor structure doesn't matter.
If this \(p\)-value of the t-test falls into (i.e., below or equal to) the critical region (\(\alpha\) = 0.05), it indicates strong evidence against the null hypothesis. Hence, we reject the null hypothesis, and conclude that there is a significant difference between the groups at the 0.05 level.
Mean Difference: The average birthweight among infants of non-smoking mothers was 3432.06, which was higher than the average birthweight among infants of smoking mothers (mean = 3178.83) by 253.22 grams.
P-value: The p-value of the Welch two sample t-test is less than 0.05. (This means that if the null hypothesis is true, the probability of observing a test statistic of 9.44 in a two‐sided test, is extremely small.)
Interpretation: The t-test result suggests that the difference in the birthweight between the two groups is statistically significant at the 0.05 level, indicating that for infants whose mothers were smokers, their average birthweight was significantly lower than that of infants whose mothers were not smokers.
Conclusion: We thus reject the null hypothesis and find evidence to support the alternative hypothesis at the 0.05 level. There is a statistically significant difference in infants’ birthweight between smoking and non-smoking mothers. (Or, the average birthweight among infants of non-smoking mothers is significantly higher than the average birthweight among infants of smoking mothers)
The one-way analysis of variance (ANOVA) is similar to the independent two-samples t-test we did earlier, but can be used for comparing means across more than two groups. Think of one categorical factor variable with 3 levels that denote the three types of volunteers to test the validity of vaccines: (1) control group, (2) vaccine 1 group, (3) vaccine 2 group. This is a good explanation and demonstration of conducting one-way anova test in R. http://www.sthda.com/english/wiki/one-way-anova-test-in-r
The gapminder
dataset contains data on life expectancy,
GDP per capita and population by country between 1952 and 2007.
Suppose we want to know the average life expectancy for each continent in 2007.
library(dplyr)
data2007 <- filter(gapminder, year == 2007)
We want to know if there is any significant difference between the average life expectancy across the five continents in 2007.
As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups highlighted with “*” in the model summary.
# Compute the analysis of variance
life.aov <- aov(lifeExp ~ continent, data = data2007)
# Summary of the analysis
summary(life.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 13061 3265 59.71 <0.0000000000000002 ***
## Residuals 137 7491 55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences) for performing multiple pairwise-comparison between the means of groups.
The function TukeyHD()
takes the fitted ANOVA as an
argument.
* diff: difference between means of the two groups
* lwr, upr: the lower and the upper end point of the confidence interval at 95% (default)
* p adj: p-value after adjustment for the multiple comparisons.
TukeyHSD(life.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lifeExp ~ continent, data = data2007)
##
## $continent
## diff lwr upr p adj
## Americas-Africa 18.802082 13.827042 23.777121 0.0000000
## Asia-Africa 15.922446 11.372842 20.472051 0.0000000
## Europe-Africa 22.842562 18.155858 27.529265 0.0000000
## Oceania-Africa 25.913462 11.183451 40.643472 0.0000305
## Asia-Americas -2.879635 -8.299767 2.540497 0.5844901
## Europe-Americas 4.040480 -1.495233 9.576193 0.2630707
## Oceania-Americas 7.111380 -7.910342 22.133102 0.6862848
## Europe-Asia 6.920115 1.763372 12.076858 0.0027416
## Oceania-Asia 9.991015 -4.895221 24.877251 0.3464970
## Oceania-Europe 3.070900 -11.857807 17.999607 0.9793776
Can you tell from the output, which pairs of continents have significantly different life expectancy at the 0.05 level?
Note: Materials below are for your reference and will not be quizzed.*
dplyr
library 🍎We often need to transform our dataset in order to better understand our data. We also manipulate data for analysis and visualization. dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
Here we will only focus on the dplyr
way:
Use pipe %>%
to emphasize a sequence of
actions
Note: %>%
should always have a space before it,
and should usually be followed by a new line.
Let’s take a closer look at the life expectancy in 2007 across continents.
ggplot(data2007,
aes(x= lifeExp, y = continent, color = continent)) +
geom_point() +
labs(title = "Life Expectancy by Continents",
subtitle = "Gapminder Data in 2007",
x = "Life Expectancy",
y = "Continents") +
theme_minimal()
# If the levels are not automatically in the correct order, re-order them as follow:
data2007$continent1 <- ordered(data2007$continent,
levels = c("Africa", "Americas", "Asia","Europe", "Oceania"))
# Compute summary statistics by groups - count, mean, sd:
group_by(data2007, continent1) %>%
summarise(
count = n(),
mean = mean(lifeExp),
sd = sd(lifeExp)
)
## # A tibble: 5 × 4
## continent1 count mean sd
## <ord> <int> <dbl> <dbl>
## 1 Africa 52 54.8 9.63
## 2 Americas 25 73.6 4.44
## 3 Asia 33 70.7 7.96
## 4 Europe 30 77.6 2.98
## 5 Oceania 2 80.7 0.729
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(data2007, x = "continent1", y = "lifeExp",
color = "continent1",
order = c("Africa", "Americas", "Asia","Europe", "Oceania"),
ylab = "Life Expectancy", xlab = "Continents")
Cleveland plots are useful when you want to compare a numeric statistic for a large number of groups. For example, say that you want to compare the 2007 life expectancy for Americas using the gapminder dataset.
data(gapminder)
# subset Americas countries in 2007
library(dplyr)
Americas2007 <- gapminder %>%
filter(continent == "Americas" &
year == 2007)
# basic Cleveland plot of life expectancy by country
ggplot(Americas2007,
aes(x= lifeExp, y = country)) +
geom_point()
Comparisons are usually easier if the y-axis is sorted.
# Sorted Cleveland plot
ggplot(Americas2007,
aes(x=lifeExp,
y=reorder(country, lifeExp))) +
geom_point()
Suppose we want to know the average life expectancy and average GDP per capita for each continent in each year. We need to group the data by continent and year, then compute the average life expectancy and average GDP per capita.
# compute the required statistics
gapminder_dplyr <- gapminder %>%
group_by(continent, year) %>%
summarise(count = n(),
mean_lifeexp = mean(lifeExp),
mean_gdppercap = mean(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
Now we could look at the result in gapminder_dplyr
, or
compute some statistics from it.
ggplot(data = gapminder_dplyr,
mapping = aes(x = mean_lifeexp,
y = mean_gdppercap,
color = continent,
size = count)) +
geom_point(alpha = 1/2) +
labs(x = "Average life expectancy",
y = "Average GDP per capita",
color = "Continent",
size = "Number of countries") +
theme_bw()
The code for this graph is taken from ScPoEconometrics.
library(gapminder)
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent))
# Compare multiple countries of a continent
Americas <- gapminder %>%
filter(continent == "Americas")
ggplot(data = Americas) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country))
# Compare selected countries
UJN <- gapminder %>%
filter(country == c("United States", "Japan", "Norway"))
ggplot(data = UJN) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country))
When there are several groups and several time points, a slope graph
can be helpful. Let’s plot life expectancy for six Central American
countries in 1992, 1997, 2002, and 2007. Again we’ll use the gapminder
data. To create a slope graph, we’ll use the newggslopegraph function
from the CGPfunctions
package.
The newggslopegraph function parameters are (in order) * data frame * time variable (which must be a factor) * numeric variable to be plotted * and grouping variable (creating one line per group).
# install.packages("CGPfunctions")
library(CGPfunctions)
# Select Central American countries data
# for 1992, 1997, 2002, and 2007
CentralAmericas <- gapminder %>%
filter(year %in% c(1992, 1997, 2002, 2007) &
country %in% c("Panama", "Costa Rica",
"Nicaragua", "Honduras",
"El Salvador", "Guatemala",
"Belize")) %>%
mutate(year = factor(year),
lifeExp = round(lifeExp))
# create slope graph
newggslopegraph(CentralAmericas, year, lifeExp, country) +
labs(title="Life Expectancy by Country",
subtitle="Central America",
caption="source: gapminder")
##
## Converting 'year' to an ordered factor
Q1: Import the dataset named “birthweight_smoking.csv”, name your dataset.
Q2: Use ggboxplot()
to create a boxplot for another
categorical variable and birthweight
. Find the mean, median
and IQR using the summarise()
function from the
dplyr
library. Discuss any differences in mean, median and
IQR based on the graph and outputs.
Q3: Which dummy variable(s) among unmarried
,
tripre0
and alcohol
might make a significant
difference in the birthweight of infants? Run a t.test()
to
test the differences across groups in the sample, respectively.
Q4: Based on Q3, what conclusion can you draw from the t-test results? Interpret your findings based on the mean difference of the two groups and the p-value at the 0.05 alpha level. (Refer to WK3 lecture slides)
Statastic – Well done!
Step 1: Double check if you answered all the questions thoroughly and check for accuracy ALWAYS!
Step 2: If you use RMarkdown (.Rmd) document, Knit
your
R Markdown document–move your cursor to the face-down triangle next to
Knit
, and choose for PDF. If you use an R
script (.R), then transfer your codes, results and work on the
assignment in a word document, then convert it to a PDF.
Step 3: Submit your assignment to Gradescope https://www.gradescope.com/.