Disclaimer: The content of this RMarkdown note comes from the DataCamp course Introduction to Data.
Scientists seek to answer questions using rigorous methods and careful observations. These observations—collected from the likes of field notes, surveys, and experiments—form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. It is helpful to put statistics in the context of a general process of investigation: 1) identify a question or problem; 2) collect relevant data on the topic; 3) analyze the data; and 4) form a conclusion. In this course, you’ll focus on the first two steps of the process.
This chapter introduces the terminology of datasets and data frames in R. A reference manual for the openintro package is available on the package’s CRAN page.
# Install packages
#install.packages("dplyr") #Once it's installed, you won't have to run this code again
#install.packages("ggplot2")
#install.packages("openintro")
# Load packages
library(openintro) #for the use of email50 and county data
library(dplyr) #for the use of dplyr functions such as mutate
library(ggplot2) #for the use of ggplot2 functions such as ggplot()
# Load data
data(email50) #this data is from the openintro package
# View its structure
str(email50)
## 'data.frame': 50 obs. of 21 variables:
## $ spam : num 0 0 1 0 0 0 0 0 0 0 ...
## $ to_multiple : num 0 0 0 0 0 0 0 0 0 0 ...
## $ from : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int 0 0 4 0 0 0 0 0 1 0 ...
## $ sent_email : num 1 0 0 0 0 0 0 1 1 0 ...
## $ time : POSIXct, format: "2012-01-04 13:19:16" "2012-02-16 20:10:06" ...
## $ image : num 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num 0 0 2 0 0 0 0 0 0 0 ...
## $ dollar : num 0 0 0 0 9 0 0 0 0 23 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ viagra : num 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num 0 0 0 0 1 0 0 0 0 0 ...
## $ num_char : num 21.705 7.011 0.631 2.454 41.623 ...
## $ line_breaks : int 551 183 28 61 1088 5 17 88 242 578 ...
## $ format : num 1 1 0 0 1 0 0 1 1 1 ...
## $ re_subj : num 1 0 0 0 0 0 0 1 1 0 ...
## $ exclaim_subj: num 0 0 0 0 0 0 0 0 1 0 ...
## $ urgent_subj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num 8 1 2 1 43 0 0 2 22 3 ...
## $ number : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...
The glimpse() function from dplyr provides a handy alternative to str() for previewing a dataset. In addition to telling you the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.
# Glimpse email50
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time <dttm> 2012-01-04 13:19:16, 2012-02-16 20:10:06, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, y...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number <fct> small, big, none, small, small, small, small, sma...
Categorical data are often stored as factors in R. Get some practice working with a factor variable, number, which tells you what type of number (none, small, or big) an email contains.
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
filter(number == "big")
# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time <dttm> 2012-02-16 20:10:06, 2012-02-04 23:26:09, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner <fct> no, no, yes, no, no, no, no
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks <int> 183, 198, 712, 692, 140, 512, 225
## $ format <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number <fct> big, big, big, big, big, big, big
The droplevels() function removes unused levels of factor variables from your dataset. Before dropping anything, it is often useful to check which levels are unused (i.e. have a count of zero) with the table() function.
# Table of number variable
table(email50_big$number)
##
## none small big
## 0 0 7
# Drop levels
email50_big$number <- droplevels(email50_big$number)
# Another table of number variable
table(email50_big$number)
##
## big
## 7
Interpretation
Create a categorical version of the num_char variable in the email50 dataset, which tells you the number of characters in an email, in thousands. This new variable will have two levels—“below median” and “at or above median”—depending on whether an email has less than the median number of characters or equal to or more than that value.
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
med_num_char
## [1] 6.8895
# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
email50_fortified
## spam to_multiple from cc sent_email time image attach
## 1 0 0 1 0 1 2012-01-04 13:19:16 0 0
## 2 0 0 1 0 0 2012-02-16 20:10:06 0 0
## 3 1 0 1 4 0 2012-01-04 15:36:23 0 2
## 4 0 0 1 0 0 2012-01-04 17:49:52 0 0
## 5 0 0 1 0 0 2012-01-27 09:34:45 0 0
## 6 0 0 1 0 0 2012-01-17 17:31:57 0 0
## 7 0 0 1 0 0 2012-03-18 04:18:55 0 0
## 8 0 0 1 0 1 2012-03-31 13:58:56 0 0
## 9 0 0 1 1 1 2012-01-11 01:57:54 0 0
## 10 0 0 1 0 0 2012-01-07 19:29:16 0 0
## 11 0 0 1 0 0 2012-02-23 00:57:02 0 0
## 12 0 0 1 0 0 2012-02-04 23:26:09 0 0
## 13 1 0 1 0 0 2012-01-24 16:15:56 0 0
## 14 1 1 1 0 0 2012-02-09 02:22:46 0 2
## 15 0 0 1 0 0 2012-03-09 18:46:12 0 0
## 16 0 0 1 0 1 2012-01-12 16:17:53 0 0
## 17 0 0 1 0 1 2012-01-31 19:44:22 0 0
## 18 1 0 1 0 0 2012-03-21 02:00:30 0 1
## 19 0 0 1 1 1 2012-01-03 19:39:06 0 0
## 20 0 1 1 4 0 2012-03-29 00:48:08 0 0
## 21 0 0 1 0 0 2012-01-09 15:04:18 0 0
## 22 0 0 1 0 0 2012-01-14 10:07:03 0 0
## 23 0 0 1 0 1 2012-03-24 15:00:57 0 0
## 24 0 0 1 2 1 2012-01-12 21:43:42 0 0
## 25 0 0 1 0 0 2012-03-02 19:05:22 0 0
## 26 0 0 1 0 0 2012-02-16 04:01:40 0 0
## 27 0 0 1 0 1 2012-02-09 13:51:43 0 0
## 28 0 1 1 5 0 2012-01-23 14:03:19 0 0
## 29 0 0 1 0 1 2012-02-01 16:12:20 0 0
## 30 0 1 1 0 0 2012-03-23 17:42:28 0 0
## 31 0 0 1 0 0 2012-02-14 13:43:48 0 0
## 32 0 0 1 0 0 2012-01-19 23:33:55 0 0
## 33 0 0 1 0 0 2012-01-21 17:35:48 0 0
## 34 0 0 1 0 0 2012-01-25 22:37:06 0 0
## 35 0 1 1 0 0 2012-03-06 17:03:41 0 0
## 36 0 0 1 0 1 2012-03-25 21:08:44 0 0
## 37 0 0 1 0 0 2012-02-15 00:17:09 0 0
## 38 0 0 1 0 0 2012-03-01 10:00:01 0 0
## 39 0 0 1 0 0 2012-02-10 18:34:42 0 0
## 40 0 0 1 0 1 2012-01-12 21:44:54 0 0
## 41 0 1 1 0 0 2012-01-06 20:14:47 0 0
## 42 0 0 1 1 1 2012-03-21 15:39:21 0 0
## 43 0 0 1 0 1 2012-02-13 20:19:36 0 0
## 44 0 0 1 0 0 2012-01-25 22:18:37 0 0
## 45 0 0 1 0 0 2012-01-24 23:44:52 0 0
## 46 1 0 1 0 0 2012-02-29 23:36:55 0 0
## 47 0 1 1 0 0 2012-03-06 14:10:00 0 0
## 48 0 0 1 0 1 2012-03-14 17:08:27 0 0
## 49 0 0 1 1 1 2012-02-10 16:27:48 0 0
## 50 0 0 1 0 0 2012-01-04 18:27:36 0 0
## dollar winner inherit viagra password num_char line_breaks format
## 1 0 no 0 0 0 21.705 551 1
## 2 0 no 0 0 0 7.011 183 1
## 3 0 no 0 0 0 0.631 28 0
## 4 0 no 0 0 0 2.454 61 0
## 5 9 no 0 0 1 41.623 1088 1
## 6 0 no 0 0 0 0.057 5 0
## 7 0 no 0 0 0 0.809 17 0
## 8 0 no 0 0 0 5.229 88 1
## 9 0 no 0 0 0 9.277 242 1
## 10 23 no 0 0 0 17.170 578 1
## 11 4 no 0 0 0 64.401 1167 1
## 12 0 no 0 0 2 10.368 198 1
## 13 3 yes 0 0 0 42.793 712 1
## 14 2 no 0 0 0 0.451 24 0
## 15 0 no 0 0 0 29.233 604 1
## 16 0 no 0 0 0 9.794 197 1
## 17 0 no 0 0 0 2.139 60 1
## 18 0 no 0 0 0 0.130 5 0
## 19 0 no 0 0 8 4.945 120 1
## 20 2 no 0 0 0 11.533 291 1
## 21 0 no 0 0 0 5.682 87 1
## 22 0 no 0 0 0 6.768 81 1
## 23 0 no 0 0 0 0.086 5 0
## 24 0 no 0 0 0 3.070 65 1
## 25 2 no 0 0 0 26.520 692 1
## 26 0 no 0 0 0 26.255 654 1
## 27 0 no 0 0 0 5.259 140 1
## 28 0 no 0 0 0 2.780 69 0
## 29 0 no 0 0 0 5.864 142 1
## 30 0 no 0 0 0 9.928 219 1
## 31 0 no 0 0 2 25.209 725 1
## 32 0 no 0 0 0 6.563 140 1
## 33 0 no 0 0 0 24.599 621 1
## 34 0 no 0 0 0 25.757 645 1
## 35 0 no 0 0 0 0.409 13 0
## 36 0 no 0 0 0 11.223 512 1
## 37 0 no 0 0 0 3.778 98 1
## 38 0 no 0 0 2 1.493 35 0
## 39 0 no 0 0 8 10.613 225 1
## 40 0 no 0 0 0 0.493 13 1
## 41 0 no 0 0 0 4.415 61 0
## 42 0 no 0 0 0 14.156 300 1
## 43 0 no 0 0 0 9.491 233 1
## 44 0 no 0 0 0 24.837 629 1
## 45 0 no 0 0 0 0.684 17 1
## 46 0 no 0 0 0 13.502 193 0
## 47 0 no 0 0 0 2.789 44 0
## 48 0 no 0 0 0 1.169 35 1
## 49 0 no 0 0 0 8.937 211 1
## 50 0 no 0 0 0 15.829 242 1
## re_subj exclaim_subj urgent_subj exclaim_mess number num_char_cat
## 1 1 0 0 8 small at or above median
## 2 0 0 0 1 big at or above median
## 3 0 0 0 2 none below median
## 4 0 0 0 1 small below median
## 5 0 0 0 43 small at or above median
## 6 0 0 0 0 small below median
## 7 0 0 0 0 small below median
## 8 1 0 0 2 small below median
## 9 1 1 0 22 small at or above median
## 10 0 0 0 3 small at or above median
## 11 0 0 0 13 small at or above median
## 12 0 0 0 1 big at or above median
## 13 0 0 0 2 big at or above median
## 14 0 0 0 2 small below median
## 15 0 0 0 21 small at or above median
## 16 1 0 0 10 small at or above median
## 17 1 0 0 0 small below median
## 18 0 0 0 0 none below median
## 19 0 0 0 2 small below median
## 20 1 0 0 4 small at or above median
## 21 0 0 0 0 small below median
## 22 0 0 0 3 small below median
## 23 0 1 0 0 none below median
## 24 1 0 0 0 small below median
## 25 0 1 0 7 big at or above median
## 26 0 0 0 1 small at or above median
## 27 1 0 0 8 small below median
## 28 1 0 0 1 small below median
## 29 1 0 0 6 small below median
## 30 0 0 0 4 small at or above median
## 31 0 0 0 2 small at or above median
## 32 0 0 0 2 big below median
## 33 0 0 0 1 small at or above median
## 34 0 0 0 1 small at or above median
## 35 0 0 0 1 small below median
## 36 0 0 0 9 big at or above median
## 37 0 0 0 0 small below median
## 38 0 0 0 1 none below median
## 39 0 0 0 9 big at or above median
## 40 0 0 0 0 none below median
## 41 0 0 0 1 small below median
## 42 1 0 0 0 small at or above median
## 43 1 0 0 18 small at or above median
## 44 0 0 0 1 small at or above median
## 45 0 0 0 1 small below median
## 46 0 0 0 1 none at or above median
## 47 0 0 0 0 small below median
## 48 1 0 0 0 small below median
## 49 1 0 0 2 small at or above median
## 50 0 0 0 4 small at or above median
# Count emails in each category
email50_fortified %>%
count(num_char_cat)
## # A tibble: 2 x 2
## num_char_cat n
## <chr> <int>
## 1 at or above median 25
## 2 below median 25
Interpretation
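The even 25/25 split is what a median split tends to produce when the number of observations is even and no value sits exactly at the median. A minimal base R sketch on a made-up vector (the values in `x` are hypothetical, not taken from email50) shows the same behavior:

```r
# Hypothetical character counts (in thousands); not from email50
x <- c(0.5, 2.1, 6.7, 7.1, 9.9, 21.7)

# With an even number of values, the median is the mean of the two middle values
med <- median(x)

# Same recoding rule as above: strictly below the median vs. at or above it
x_cat <- ifelse(x < med, "below median", "at or above median")

# Each category holds half of the observations
table(x_cat)
```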
Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels “none”, “small”, and “big”, but suppose you’re only interested in whether an email contains a number.
# Create number_yn variable in email50
email50_fortified <- email50 %>%
mutate(number_yn = case_when(
number == "none" ~ "no",
number != "none" ~ "yes"
)
)
# Visualize number_yn
ggplot(email50_fortified, aes(x = number_yn)) +
geom_bar()
Visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam.
Recall that in the ggplot() function, the first argument gives the dataset, then the aesthetics map the variables to certain features of the plot, and finally the geom_*() layer informs the type of plot you want to make.
# Load ggplot2
library(ggplot2)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
geom_point()
In this chapter, you will learn about observational studies and experiments, scope of inference, and Simpson’s paradox.
Look at data from a different study on country characteristics. You’ll load the data first and view it, then you’ll be asked to identify the type of study. Remember, an experiment requires random assignment.
# Install gapminder R package
#install.packages("gapminder") #Once it's installed, you won't have to run this code again
# Load gapminder R package
library(gapminder)
# Load data
data(gapminder)
# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Identify type of study
type_of_study <- "observational"
Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.
Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.
The results of the study cannot be generalized to all people and a causal link between believing information is stored and memory can be inferred based on these results.
There is no random sampling since the subjects of the study were volunteers, so the results cannot be generalized to all people. However, because the subjects were randomly assigned to the two conditions, a causal link between believing information is stored and memory can be inferred from these results.
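Random assignment of the kind used in this study can be sketched in base R. The pool of 20 volunteer IDs and the group labels below are hypothetical, and the seed is set only so the shuffle is reproducible.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Hypothetical pool of 20 volunteers (a convenience sample, not a random sample)
volunteers <- data.frame(id = 1:20)

# Shuffle a vector holding 10 "saved" and 10 "erased" labels,
# so each volunteer is equally likely to land in either group
volunteers$told <- sample(rep(c("saved", "erased"), each = 10))

# Check that the two groups are the same size
table(volunteers$told)
```

Note that the shuffle decides only who gets which condition. It does nothing about how the volunteers were recruited, which is why the results still cannot be generalized.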
What is Simpson’s paradox? It is a phenomenon in probability and statistics where a trend appears in different groups of data but disappears or reverses when these groups are combined.
Calculate the number of males and females admitted
# Import data
ucb_admit <- read.csv("/resources/rstudio/Bus Statistics/data/Introduction to data/ucb_admit.csv")
ucb_admit$Dept <- as.character(ucb_admit$Dept)
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admit...
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, M...
## $ Dept <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
summary(ucb_admit)
## Admit Gender Dept
## Admitted:1755 Female:1835 Length:4526
## Rejected:2771 Male :2691 Class :character
## Mode :character
# Load packages
library(dplyr)
# Count number of male and female applicants admitted
ucb_admit %>%
count(Gender, Admit)
## # A tibble: 4 x 3
## Gender Admit n
## <fct> <fct> <int>
## 1 Female Admitted 557
## 2 Female Rejected 1278
## 3 Male Admitted 1198
## 4 Male Rejected 1493
Calculate the proportion of male and female applicants admitted.
# Define ucb_admission_counts
ucb_admission_counts <-
ucb_admit %>%
count(Gender, Admit)
ucb_admission_counts %>%
# Group by gender
group_by(Gender) %>%
# Create new variable
mutate(prop = n / sum(n)) %>%
# Filter for admitted
filter(Admit == "Admitted")
## # A tibble: 2 x 4
## # Groups: Gender [2]
## Gender Admit n prop
## <fct> <fct> <int> <dbl>
## 1 Female Admitted 557 0.304
## 2 Male Admitted 1198 0.445
Make a table similar to the one you constructed earlier, except you will first group the data by department. Then, you’ll use this table to calculate the proportion of males admitted in each department.
ucb_admission_counts <- ucb_admit %>%
# Counts by department, then gender, then admission status
count(Dept, Gender, Admit)
# See the result
ucb_admission_counts
## # A tibble: 24 x 4
## Dept Gender Admit n
## <chr> <fct> <fct> <int>
## 1 A Female Admitted 89
## 2 A Female Rejected 19
## 3 A Male Admitted 512
## 4 A Male Rejected 313
## 5 B Female Admitted 17
## 6 B Female Rejected 8
## 7 B Male Admitted 353
## 8 B Male Rejected 207
## 9 C Female Admitted 202
## 10 C Female Rejected 391
## # ... with 14 more rows
ucb_admission_counts %>%
# Group by department, then gender
group_by(Dept, Gender) %>%
# Create new variable
mutate(prop = n / sum(n)) %>%
# Filter for male and admitted
filter(Admit == "Admitted", Gender == "Male")
## # A tibble: 6 x 5
## # Groups: Dept, Gender [6]
## Dept Gender Admit n prop
## <chr> <fct> <fct> <int> <dbl>
## 1 A Male Admitted 512 0.621
## 2 B Male Admitted 353 0.630
## 3 C Male Admitted 120 0.369
## 4 D Male Admitted 138 0.331
## 5 E Male Admitted 53 0.277
## 6 F Male Admitted 22 0.0590
Interpretation
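One way to read the two tables above is to put the overall and the per-department proportions side by side for both genders. The sketch below does not use ucb_admit. It assumes that R's built-in `UCBAdmissions` table (from the datasets package) holds the same counts, which is consistent with the Dept A rows printed above.

```r
# UCBAdmissions is a 3-way table: Admit x Gender x Dept (ships with base R)
overall <- apply(UCBAdmissions, c(1, 2), sum)             # Admit x Gender counts
overall_prop <- overall["Admitted", ] / colSums(overall)  # admitted / applied
round(overall_prop, 3)

# Proportion admitted within each department, by gender
admitted <- UCBAdmissions["Admitted", , ]                 # Gender x Dept counts
applied <- apply(UCBAdmissions, c(2, 3), sum)             # Gender x Dept totals
prop_by_dept <- admitted / applied
round(prop_by_dept, 3)
```

Overall, men are admitted at a higher rate than women, yet within Dept A the female rate is higher than the male rate. Comparing the two printouts department by department is how Simpson's paradox would show up in these data.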
Why not take a census?
Sampling is like cooking. You taste a spoonful of soup to get an idea of the dish as a whole, for example whether it is too salty; you would not eat the whole pot. Tasting the spoonful is an exploratory analysis. If you then generalize and conclude that the entire pot needs more salt, you are making an inference. For the inference to be valid, the spoonful you tasted (your sample) should be representative of the entire pot (your population).
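The soup analogy can be sketched as a small simulation. The population below is made up (10,000 simulated values), and the sample size of 100 is an arbitrary choice.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Hypothetical population: the "whole pot"
population <- rnorm(10000, mean = 50, sd = 10)

# Simple random sample without replacement: the "spoonful"
spoonful <- sample(population, size = 100)

# Exploratory analysis describes the sample itself
mean(spoonful)

# Inference treats the sample mean as an estimate of the population mean,
# which only a census would give exactly
mean(population)
```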
Sampling methods
Sampling in R
Suppose we want to collect data from counties in the United States but do not have the resources to collect data from all of them. Conveniently, the list of all counties is contained in the openintro R package.
# Load county data
data(county) #this data is from the openintro package
# Remove DC
county_noDC <- county %>%
filter(state != "District of Columbia") %>%
droplevels()
Simple random sample
# Simple random sample of 150 counties
county_srs <- county_noDC %>%
sample_n(size = 150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name <fct> Hale County, Perry County, Jasper County, Nolan ...
## $ state <fct> Texas, Pennsylvania, Iowa, Texas, Arkansas, Mich...
## $ pop2000 <dbl> 36602, 43602, 37213, 15802, 17119, 31314, 51335,...
## $ pop2010 <dbl> 36273, 45969, 36842, 15216, 17264, 29598, 53597,...
## $ fed_spend <dbl> 7.912855, 5.985207, 6.660849, 8.638473, 9.883341...
## $ poverty <dbl> 19.0, 9.1, 12.3, 19.4, 21.5, 15.9, 18.9, 14.4, 1...
## $ homeownership <dbl> 65.3, 80.9, 73.4, 68.1, 80.8, 80.6, 78.3, 84.6, ...
## $ multiunit <dbl> 12.1, 8.6, 16.4, 12.2, 3.6, 10.6, 4.8, 5.4, 13.0...
## $ income <dbl> 16322, 23701, 23160, 19973, 16570, 21140, 19600,...
## $ med_income <dbl> 36509, 52659, 46396, 37102, 31135, 36695, 37580,...
# State distribution of SRS counties
county_srs %>%
group_by(state) %>%
count()
## # A tibble: 43 x 2
## # Groups: state [43]
## state n
## <fct> <int>
## 1 Alabama 2
## 2 Alaska 3
## 3 Arizona 2
## 4 Arkansas 4
## 5 California 1
## 6 Colorado 4
## 7 Florida 4
## 8 Georgia 4
## 9 Illinois 5
## 10 Indiana 3
## # ... with 33 more rows
Stratified Sampling
# Stratified sample of 150 counties, each state is a stratum
county_str <- county_noDC %>%
group_by(state) %>%
sample_n(size = 3) # 3 counties from each of the 50 states
glimpse(county_str)
## Observations: 150
## Variables: 10
## $ name <fct> St. Clair County, Sumter County, Barbour County,...
## $ state <fct> Alabama, Alabama, Alabama, Alaska, Alaska, Alask...
## $ pop2000 <dbl> 64742, 14798, 29038, 30711, 3436, 9196, 3072149,...
## $ pop2010 <dbl> 83593, 13763, 27457, 31275, 2150, 9492, 3817117,...
## $ fed_spend <dbl> 5.738698, 13.621086, 8.752158, 37.590184, 24.156...
## $ poverty <dbl> 10.6, 34.8, 25.0, 6.5, 15.9, 24.6, 13.9, 13.5, 1...
## $ homeownership <dbl> 82.2, 68.3, 68.0, 64.0, 64.0, 56.2, 66.3, 46.9, ...
## $ multiunit <dbl> 5.5, 14.5, 11.1, 32.2, 8.6, 17.4, 25.1, 6.1, 4.8...
## $ income <dbl> 22192, 14460, 15875, 34923, 24932, 20549, 27816,...
## $ med_income <dbl> 48837, 25338, 33219, 75517, 43750, 53899, 55054,...
Suppose you want to collect some data from a sample of eight states.
# Import us_regions
us_regions <- read.csv("/resources/rstudio/Bus Statistics/data/Introduction to data/us_regions.csv")
# Simple random sample: states_srs
states_srs <- us_regions %>%
sample_n(size = 8)
states_srs
## state region
## 51 Washington West
## 49 Hawaii West
## 26 North Carolina South
## 14 Wisconsin Midwest
## 42 Montana West
## 15 Iowa Midwest
## 30 West Virginia South
## 34 Tennessee South
# Count states by region
states_srs %>%
count(region)
## # A tibble: 3 x 2
## region n
## <fct> <int>
## 1 Midwest 2
## 2 South 3
## 3 West 3
Interpretation
With stratified sampling, select an equal number of states from each region.
# Stratified sample
states_str <- us_regions %>%
group_by(region) %>%
sample_n(size = 2)
states_str
## # A tibble: 8 x 2
## # Groups: region [4]
## state region
## <fct> <fct>
## 1 North Dakota Midwest
## 2 South Dakota Midwest
## 3 Rhode Island Northeast
## 4 New Hampshire Northeast
## 5 Florida South
## 6 Arkansas South
## 7 Hawaii West
## 8 Oregon West
# Count states by region
states_str %>%
count(region)
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <fct> <int>
## 1 Midwest 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
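The same contrast can be reproduced without the us_regions file. The sketch below is a stand-in built on several assumptions: it uses R's built-in `state.name` and `state.region` vectors (50 states, no District of Columbia, and a region labelled "North Central" rather than "Midwest"), base R instead of dplyr, and an arbitrary seed so the draws are repeatable.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Built-in data: 50 state names and their census regions
states <- data.frame(state = state.name, region = state.region)

# Simple random sample of 8 states: regions can be unevenly represented
srs <- states[sample(nrow(states), size = 8), ]
table(srs$region)

# Stratified sample: draw 2 states at random within each region
picked <- lapply(split(states, states$region),
                 function(stratum) stratum[sample(nrow(stratum), size = 2), ])
str_sample <- do.call(rbind, picked)
table(str_sample$region)
```

Setting a seed before sampling is a design choice worth considering for notes like this one, since otherwise every re-run of the document prints a different sample.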
Principles of experimental design
Example: A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.
Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for. In the example above, light and noise levels are explanatory variables, gender is a blocking variable, and exam performance is the response.
In random sampling, you use stratification to control for a variable. In random assignment, you use blocking to achieve the same goal.
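A blocked random assignment for the light-and-noise example might look like the base R sketch below. The 40 students, the four condition labels, and the seed are all hypothetical; the point is only that the shuffle happens separately within each gender.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical participants: 20 female and 20 male students
students <- data.frame(id = 1:40,
                       gender = rep(c("female", "male"), each = 20))

# Four hypothetical light/noise combinations
conditions <- c("bright/quiet", "bright/noisy", "dim/quiet", "dim/noisy")

# Blocking: randomize separately within each gender,
# so every condition gets the same number of students from each block
students$condition <- NA_character_
for (g in unique(students$gender)) {
  rows <- which(students$gender == g)
  students$condition[rows] <- sample(rep(conditions, length.out = length(rows)))
}

# Each gender contributes 5 students to every condition
table(students$gender, students$condition)
```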
Consider a case study looking at how the physical appearance of instructors affects their students’ course evaluations. The data are student evaluations collected at the University of Texas at Austin. In addition, six students were shown photos of the professors and asked to rate their physical attractiveness.
# Import data
evals <- read.csv("/resources/rstudio/Bus Statistics/data/Introduction to data/evals.csv")
# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity <fct> minority, minority, minority, minority, not mino...
## $ gender <fct> female, female, female, female, male, male, male...
## $ language <fct> english, english, english, english, english, eng...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs <fct> single, single, single, single, multiple, multip...
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color <fct> color, color, color, color, color, color, color,...
What type of study is this?
The data from this study were gathered by randomly selecting classes. Only the students who took a class could fill out evaluations of the professor who taught it. Because nothing was randomly assigned, this is an observational study.
Start your exploration of a dataset by identifying variable types. The results from this exercise will help you design appropriate visualizations and calculate useful summary statistics later in your analysis.
# Inspect variable types
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity <fct> minority, minority, minority, minority, not mino...
## $ gender <fct> female, female, female, female, male, male, male...
## $ language <fct> english, english, english, english, english, eng...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs <fct> single, single, single, single, multiple, multip...
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color <fct> color, color, color, color, color, color, color,...
# Names of the categorical (factor) variables in evals
cat_vars <- c("rank", "ethnicity", "gender", "language", "cls_level", "cls_profs",
"cls_credits", "pic_outfit", "pic_color")
The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is small, midsize, or large.
# Recode cls_students as cls_type: evals
evals <- evals %>%
# Create new variable
mutate(cls_type = ifelse(cls_students <= 18, "small",
ifelse(cls_students >= 60, "large", "midsize")))
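An alternative to nested ifelse() calls is base R's cut(), which bins a numeric vector in one step. The class sizes below are hypothetical; the break points mirror the rule above (18 or fewer is small, 60 or more is large), which works here because class sizes are whole numbers.

```r
# Hypothetical class sizes, chosen to hit the boundaries
cls_students <- c(10, 18, 19, 59, 60, 125)

# Intervals are (-Inf, 18], (18, 59], (59, Inf] because right = TRUE by default
cls_type <- cut(cls_students,
                breaks = c(-Inf, 18, 59, Inf),
                labels = c("small", "midsize", "large"))

data.frame(cls_students, cls_type)
```

Unlike ifelse(), cut() returns a factor whose levels keep the small, midsize, large order, which carries over to plot legends.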
The bty_avg variable shows the average beauty rating of the professor by the six students who were asked to rate the attractiveness of these faculty. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point()
Interpretation
Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).
# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(x = bty_avg, y = score, color = cls_type)) +
geom_point()
How can we revitalize a region’s economy? You are tasked with examining whether federal spending is positively related to the standard of living. Use the county data set in the openintro package. Examine the relationship between fed_spend and income by following the instructions below.
# Randomly sample 150 counties in the US.
county_srs <- county %>%
sample_n(size = 150)
# What type of variables are they?
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name <fct> Dodge County, Colonial Heights city, Baraga Coun...
## $ state <fct> Wisconsin, Virginia, Michigan, Texas, North Caro...
## $ pop2000 <dbl> 85897, 16897, 8746, 16361, 130454, 7304, 14422, ...
## $ pop2010 <dbl> 88759, 17411, 8860, 16921, 141752, 7818, 18395, ...
## $ fed_spend <dbl> 4.285087, 21.815404, 9.549097, 7.399149, 5.31524...
## $ poverty <dbl> 7.8, 7.5, 12.0, 17.7, 17.2, 24.1, 15.9, 10.9, 9....
## $ homeownership <dbl> 73.9, 65.8, 75.5, 71.0, 73.6, 67.5, 75.9, 79.6, ...
## $ multiunit <dbl> 21.4, 20.4, 9.6, 10.9, 10.1, 11.6, 4.8, 11.5, 5....
## $ income <dbl> 23663, 26115, 19107, 22424, 21297, 15635, 19497,...
## $ med_income <dbl> 52571, 50571, 40541, 42401, 40346, 29513, 40455,...
# Create a scatterplot with fed_spend on the x axis and income on the y axis. Interpret.
ggplot(county_srs, aes(x = fed_spend, y = income)) +
geom_point()
Can you think of any confounding variable? Briefly discuss.
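As a starting point for that discussion, the simulation below sketches what a confounder does. Everything in it is made up: `z` stands for a hypothetical third variable that drives both `x` and `y`, while `x` has no effect on `y` at all.

```r
set.seed(7)  # arbitrary seed, only for reproducibility
n <- 1000

z <- rnorm(n)      # hypothetical confounder
x <- z + rnorm(n)  # depends on z only
y <- z + rnorm(n)  # depends on z only, not on x

# x and y are clearly correlated even though neither affects the other
cor(x, y)

# After removing the part explained by z, the association largely vanishes
cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
```

In the county data, a candidate confounder would play the role of `z`: something that plausibly influences both federal spending and income.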
Census API: have students choose these variables and retrieve data on their own?
Real estate data of your neighborhood
+ Have students ask a research question that they want to answer given Zillow data
+ Have students choose a data set of their own interest for their research question
+ Is it an observational study or experiment? And why? Explain in at least 100 words.
+ This may not work b/c the data sets are already pre-tabulated. For example, we can’t calculate mean, standard deviation and such.
+ Zillow real estate data
+ geographic unit: state, metro, county, city, zip code, neighborhood
+ metrics: home types and housing stock (e.g., condo, multifamily unit); types of ZHVI (e.g., median estimated home value for all homes with one bedroom within a given region); rental metrics (e.g., Median Rent List Price Per Sq Ft); other metrics (e.g., Homes Foreclosed)