Lecture

Slides for Week 3 is available here

In this tutorial, you will learn how to:

  • run a t-test to examine differences in group means
  • interpret the significance and find the p-value
  • summarize categorical variables
  • make decisions to transform the variables
  • visualize data using ggplot2 library
  • visualize longitudinal data

A Reminder

  • Keep your #comment short in a code chunk.
  • You should write paragraphs above or below a code chunk.

Basic Set Up: Install and load all necessary libraries

Tips: You should have installed the following packages in Lab 2 using install.packages code. Remember you shoud only install packages once. In every R session, you need to load the library before you can use their corresponding functions.

library(psych)
library(ggplot2)
library(ggthemes)  
library(ggpubr)  # for ggboxplot()
library(tidyverse) # for filter()
library(dplyr)
library(gapminder)

options(scipen=999) # remove scientific notation

Want to include emojis and make it fun? Install the emo package! (Remove the hashtag to run the code and put it back once you download it once.) For a complete list of emojis that rmarkdown supports please follow the link here.

Let’s get started!

Step 0. Look at the Data Documentation

Check the Documentation for Birthweight_Smoking in your project folder.

Step 1. Import a dataset (+Save a Copy!)

Your filename.csv needs to be exact and indicate the file format, which is a .csv file (Comma Separated Values file)–a delimited text file that uses a comma to separate values. It is one types of Excel spreadsheet besides the common .xlsx format)

smoking_data <- read.csv("birthweight_smoking.csv")

smoking_data1 <- smoking_data 
# you can save a copy of the original dataset 
# smoking_data1 is now the spare copy shall we "mess up" smoking_data!

Step 2. Data Cleaning and Transformation

Before proceeding to analyze our data, we have to do some data cleaning, especially for our IVs and DV. We have done some of that in Lab 2. We have also inspected the summary statistics of our continuous variables. Now, let’s review our last demonstration on smoker.

Create a new variable with a factor structure

smoking_data$smoker1 <- factor(smoking_data$smoker)   

smoking_data$smoker2 <- factor(smoking_data$smoker, 
                              levels = c(0,1),
                              labels = c("nonsmoker", "smoker"))
# Now, check if the data structure and summary for `smoker1` has changed 
str(smoking_data$smoker2)  
##  Factor w/ 2 levels "nonsmoker","smoker": 2 1 2 1 1 1 2 1 1 1 ...
summary(smoking_data$smoker2)  
## nonsmoker    smoker 
##      2418       582

Let’s turn unmarried into a factor variable and label the two levels.

smoking_data$unmarried1 <- factor(smoking_data$unmarried)   

smoking_data$unmarried2 <- factor(smoking_data$unmarried, 
                              levels = c(0,1),
                              labels = c("married", "unmarried"))
# Now, check if the data structure and summary has changed 
str(smoking_data$unmarried2)  
##  Factor w/ 2 levels "married","unmarried": 2 1 1 1 1 1 1 1 1 1 ...
summary(smoking_data$unmarried2)  
##   married unmarried 
##      2320       680

Step 3. Exploratory Data Analysis

In this lab, we will focus on exploring our dummy or categorical variables, which constitutes two or more mutually exclusive categories.

Categorical: table() and prop.table()

Create a frequency table of a factor variable (dummy or categorical); The frequencies are ordered and labeled by the level attributes of the factor.

table(smoking_data$smoker2)   
## 
## nonsmoker    smoker 
##      2418       582

We can express table as proportion tables or fraction of marginal table of two variables.

Tips: the function round helps us round the values to a specified number of decimal places (default is 0), here we give it 2. Note that there is a comma before the number!

# for one variable
round(prop.table(table(smoking_data$smoker2)), 2) 
## 
## nonsmoker    smoker 
##      0.81      0.19
# or use nrow
round(table(smoking_data$unmarried2) / nrow(smoking_data), 2) 
## 
##   married unmarried 
##      0.77      0.23
# a frequency table for two variables
round(prop.table(table(smoking_data$unmarried2, smoking_data$smoker2), margin = 2), 2)   
##            
##             nonsmoker smoker
##   married        0.82   0.57
##   unmarried      0.18   0.43
# the first variable (unmarried) is the row, the second variable (smoker) is the column 

Step 4. Data Visualization of Bivariate Statistics

Find Extreme Outliers using boxplot()

Using boxplot() we tell R to create a horizontal box plot. Horizontal and vertical box plots display the data distribution using a rectangular box and whiskers. Whiskers are lines that indicate a data range outside of the box. The “interquartile range”, abbreviated “IQR”, is just the width of the box in the box-and-whisker plot.

hist(smoking_data$birthweight)

boxplot(smoking_data$birthweight)

Summary statistics by Groups

Here we will only focus on the dplyr way:

  • Use pipe %>% to emphasize a sequence of actions

  • Note: %>% should always have a space before it, and should usually be followed by a new line.

Let’s use the verb group_by to specify the grouping variable for comparison. Here, suppose the grouping variable is smoker2, we want to find out the count, mean, median and IQR of birthweight by each subgroup, i.e., non-smoking and smoking mothers. Besides the statistics functions, we can use the summerise function from the dplyr library to create columns of summary statistics, i.e., count, mean_bw, median_bw and IQR_bw:

smoking_data %>% 
  group_by(smoker2) %>% 
  summarise(count = n(),
            mean_bw = mean(birthweight),
            median_bw = median(birthweight),
            IQR_bw = IQR(birthweight),
            skew_bw = skew(birthweight))
## # A tibble: 2 × 6
##   smoker2   count mean_bw median_bw IQR_bw skew_bw
##   <fct>     <int>   <dbl>     <dbl>  <dbl>   <dbl>
## 1 nonsmoker  2418   3432.      3459    680  -0.818
## 2 smoker      582   3179.      3220    624  -1.02
# count, mean_bw, median_bw, etc. are variables we created, and can be named differently.

ggboxplot: Boxplot of two variables by groups

We can compare the outcomes of the two groups by creating a boxplot by the values of categorical variable, say for instance, the birthweight of infants whose mothers were smokers against that of the counterparts whose mothers were non-smokers.

Check out this color cheatsheet for setting the palette: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf

library(ggpubr)
ggboxplot(smoking_data, x = "smoker2", y = "birthweight", 
          color = "smoker2", palette = c("#00AFBB", "#FC4E07"),
          order = c("nonsmoker",  "smoker"),
          ylab = "Infants' Birthweight (in grams)", xlab = "")

Based on the summary statistics by group and the boxplot, what can you say about the differences in median and IQR of infants’ birthweight between smoking mothers and non-smoking mothers?


Step 5: Hypothesis Testing

Welch Two Sample t-test

Suppose you are interested in the means of two different populations, denote them \(\mu1\) and \(\mu2\). More specifically, you are interested whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations.

We set the significance level, also denoted as alpha or \(\alpha\) at the 0.05 level, which is the probability of rejecting the null hypothesis when it is true (i.e., incorrectly rejecting the null hypothesis). In other words, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

Before we run a hypothesis test, we have to first define our null hypothesis and alternate hypothesis. In the case of a two-tailed test:

The null hypothesis: There is no difference in infants’ birthweight between smoking and non-smoking mothers.

\[H_0: \mu1 - \mu2 =0\]

The alternative hypothesis: There is a difference in infants’ birthweight between smoking and non-smoking mothers.

\[H_1: \mu1 - \mu2 \neq 0\]

t.test(birthweight ~ smoker2, data = smoking_data) 
## 
##  Welch Two Sample t-test
## 
## data:  birthweight by smoker2
## t = 9.4414, df = 887.15, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
##  200.5882 305.8685
## sample estimates:
## mean in group nonsmoker    mean in group smoker 
##                3432.060                3178.832
# for t-test, variable label or factor structure doesn't matter.

If this \(p\)-value of the t-test falls into (i.e., below or equal to) the critical region (\(\alpha\) = 0.05), it indicates strong evidence against the null hypothesis. Hence, we reject the null hypothesis, and conclude that there is a significant difference between the groups at the 0.05 level.

Interpretation

  • Mean Difference: The average birthweight among infants of non-smoking mothers was 3432.06, which was higher than the average birthweight among infants of smoking mothers (mean = 3178.83) by 253.22 grams.

  • P-value: The p-value of the Welch two sample t-test is less than 0.05. (This means that if the null hypothesis is true, the probability of observing a test statistic of 9.44 in a two‐sided test, is extremely small.)

  • Interpretation: The t-test result suggests that the difference in the birthweight between the two groups is statistically significant at the 0.05 level, indicating that for infants whose mothers were smokers, their average birthweight was significantly lower than that of infants whose mothers were not smokers.

  • Conclusion: We thus reject the null hypothesis and find evidence to support the alternative hypothesis at the 0.05 level. There is a statistically significant difference in infants’ birthweight between smoking and non-smoking mothers. (Or, the average birthweight among infants of non-smoking mothers is significantly higher than the average birthweight among infants of smoking mothers)


One-way ANOVA test

The one-way analysis of variance (ANOVA) is similar to the independent two-samples t-test we did earlier, but can be used for comparing means across more than two groups. Think of one categorical factor variable with 3 levels that denote the three types of volunteers to test the validity of vaccines: (1) control group, (2) vaccine 1 group, (3) vaccine 2 group. This is a good explanation and demonstration of conducting one-way anova test in R. http://www.sthda.com/english/wiki/one-way-anova-test-in-r

gapminder dataset

The gapminder dataset contains data on life expectancy, GDP per capita and population by country between 1952 and 2007.

Suppose we want to know the average life expectancy for each continent in 2007.

library(dplyr)
data2007 <- filter(gapminder, year == 2007)

We want to know if there is any significant difference between the average life expectancy across the five continents in 2007.

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups highlighted with “*” in the model summary.

# Compute the analysis of variance
life.aov <- aov(lifeExp ~ continent, data = data2007)
# Summary of the analysis
summary(life.aov)
##              Df Sum Sq Mean Sq F value              Pr(>F)    
## continent     4  13061    3265   59.71 <0.0000000000000002 ***
## Residuals   137   7491      55                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tukey multiple pairwise-comparisons

As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences) for performing multiple pairwise-comparison between the means of groups.

The function TukeyHD() takes the fitted ANOVA as an argument.

* diff: difference between means of the two groups

* lwr, upr: the lower and the upper end point of the confidence interval at 95% (default)

* p adj: p-value after adjustment for the multiple comparisons.

TukeyHSD(life.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = lifeExp ~ continent, data = data2007)
## 
## $continent
##                       diff        lwr       upr     p adj
## Americas-Africa  18.802082  13.827042 23.777121 0.0000000
## Asia-Africa      15.922446  11.372842 20.472051 0.0000000
## Europe-Africa    22.842562  18.155858 27.529265 0.0000000
## Oceania-Africa   25.913462  11.183451 40.643472 0.0000305
## Asia-Americas    -2.879635  -8.299767  2.540497 0.5844901
## Europe-Americas   4.040480  -1.495233  9.576193 0.2630707
## Oceania-Americas  7.111380  -7.910342 22.133102 0.6862848
## Europe-Asia       6.920115   1.763372 12.076858 0.0027416
## Oceania-Asia      9.991015  -4.895221 24.877251 0.3464970
## Oceania-Europe    3.070900 -11.857807 17.999607 0.9793776

Can you tell from the output, which pairs of continents have significantly different life expectancy at the 0.05 level?

Note: Materials below are for your reference and will not be quizzed.*

Data Manipulation with dplyr library 🍎

We often need to transform our dataset in order to better understand our data. We also manipulate data for analysis and visualization. dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
  • These all combine naturally with group_by() which allows you to perform any operation “by group”.

Here we will only focus on the dplyr way:

  • Use pipe %>% to emphasize a sequence of actions

  • Note: %>% should always have a space before it, and should usually be followed by a new line.


Let’s take a closer look at the life expectancy in 2007 across continents.

ggplot(data2007, 
       aes(x= lifeExp, y = continent, color = continent)) +
  geom_point() +
  labs(title = "Life Expectancy by Continents", 
       subtitle = "Gapminder Data in 2007",
       x = "Life Expectancy",
       y = "Continents") +
  theme_minimal()

# If the levels are not automatically in the correct order, re-order them as follow:
data2007$continent1 <- ordered(data2007$continent,
                         levels = c("Africa", "Americas", "Asia","Europe", "Oceania"))

# Compute summary statistics by groups - count, mean, sd:

group_by(data2007, continent1) %>%
  summarise(
    count = n(),
    mean = mean(lifeExp),
    sd = sd(lifeExp)
  )
## # A tibble: 5 × 4
##   continent1 count  mean    sd
##   <ord>      <int> <dbl> <dbl>
## 1 Africa        52  54.8 9.63 
## 2 Americas      25  73.6 4.44 
## 3 Asia          33  70.7 7.96 
## 4 Europe        30  77.6 2.98 
## 5 Oceania        2  80.7 0.729
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(data2007, x = "continent1", y = "lifeExp", 
          color = "continent1",
          order = c("Africa", "Americas", "Asia","Europe", "Oceania"),
          ylab = "Life Expectancy", xlab = "Continents")


Categorical x Continuous

Cleveland Dot Charts

Cleveland plots are useful when you want to compare a numeric statistic for a large number of groups. For example, say that you want to compare the 2007 life expectancy for Americas using the gapminder dataset.

data(gapminder)

# subset Americas countries in 2007
library(dplyr)
Americas2007 <- gapminder %>%
  filter(continent == "Americas" & 
         year == 2007)

# basic Cleveland plot of life expectancy by country
ggplot(Americas2007, 
       aes(x= lifeExp, y = country)) +
  geom_point()

Sorted y-axis

Comparisons are usually easier if the y-axis is sorted.

# Sorted Cleveland plot
ggplot(Americas2007, 
       aes(x=lifeExp, 
           y=reorder(country, lifeExp))) +
  geom_point()

Suppose we want to know the average life expectancy and average GDP per capita for each continent in each year. We need to group the data by continent and year, then compute the average life expectancy and average GDP per capita.

# compute the required statistics
gapminder_dplyr <- gapminder %>% 
  group_by(continent, year) %>% 
  summarise(count = n(),
            mean_lifeexp = mean(lifeExp),
            mean_gdppercap = mean(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

Now we could look at the result in gapminder_dplyr, or compute some statistics from it.

ggplot(data = gapminder_dplyr, 
       mapping = aes(x = mean_lifeexp,
                     y = mean_gdppercap,
                     color = continent,
                     size = count)) +
  geom_point(alpha = 1/2) +
  labs(x = "Average life expectancy",
       y = "Average GDP per capita",
       color = "Continent",
       size = "Number of countries") +
  theme_bw()

The code for this graph is taken from ScPoEconometrics.

Continuous x Continuous x Categorical Variables

Scatterplot

library(gapminder)
ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent))

Line graphs by groups

# Compare multiple countries of a continent
Americas <- gapminder %>%
  filter(continent == "Americas")

ggplot(data = Americas) +
  geom_line(mapping = aes(x = year, y = lifeExp, color = country))

# Compare selected countries
UJN <- gapminder %>%
  filter(country == c("United States", "Japan", "Norway"))

ggplot(data = UJN) +
  geom_line(mapping = aes(x = year, y = lifeExp, color = country))

Slope graphs

When there are several groups and several time points, a slope graph can be helpful. Let’s plot life expectancy for six Central American countries in 1992, 1997, 2002, and 2007. Again we’ll use the gapminder data. To create a slope graph, we’ll use the newggslopegraph function from the CGPfunctions package.

The newggslopegraph function parameters are (in order) * data frame * time variable (which must be a factor) * numeric variable to be plotted * and grouping variable (creating one line per group).

# install.packages("CGPfunctions")
library(CGPfunctions)
# Select Central American countries data
# for 1992, 1997, 2002, and 2007
CentralAmericas <- gapminder %>%
        filter(year %in% c(1992, 1997, 2002, 2007) &
        country %in% c("Panama", "Costa Rica",
                        "Nicaragua", "Honduras",
                        "El Salvador", "Guatemala",
                        "Belize")) %>%
        mutate(year = factor(year),
        lifeExp = round(lifeExp))

# create slope graph
newggslopegraph(CentralAmericas, year, lifeExp, country) +
labs(title="Life Expectancy by Country",
subtitle="Central America",
caption="source: gapminder")
## 
## Converting 'year' to an ordered factor


Lab 3 Assignment

Q1: Import the dataset named “birthweight_smoking.csv”, name your dataset.

Q2: Use ggboxplot() to create a boxplot for another categorical variable and birthweight. Find the mean, median and IQR using the summarise() function from the dplyr library. Discuss any differences in mean, median and IQR based on the graph and outputs.

Q3: Which dummy variable(s) among unmarried, tripre0 and alcohol might make a significant difference in the birthweight of infants? Run a t.test() to test the differences across groups in the sample, respectively.

Q4: Based on Q3, what conclusion can you draw from the t-test results? Interpret your findings based on the mean difference of the two groups and the p-value at the 0.05 alpha level. (Refer to WK3 lecture slides)


Submit your Assignment

Statastic – Well done!

Step 1: Double check if you answered all the questions thoroughly and check for accuracy ALWAYS!

Step 2: If you use RMarkdown (.Rmd) document, Knit your R Markdown document–move your cursor to the face-down triangle next to Knit, and choose for PDF. If you use an R script (.R), then transfer your codes, results and work on the assignment in a word document, then convert it to a PDF.

Step 3: Submit your assignment to Gradescope https://www.gradescope.com/.