Disclaimer: The content of this RMarkdown note comes from the DataCamp course Introduction to Data.

Introduction to Data

Scientists seek to answer questions using rigorous methods and careful observations. These observations—collected from the likes of field notes, surveys, and experiments—form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. It is helpful to put statistics in the context of a general process of investigation: 1) identify a question or problem; 2) collect relevant data on the topic; 3) analyze the data; and 4) form a conclusion. In this course, you’ll focus on the first two steps of the process.

Chapter 1: Language of data

This chapter introduces terminology of datasets and data frames in R. A reference manual for the openintro package can be found here.

# Install packages
#install.packages("dplyr") #Once it's installed, you won't have to run this code again
#install.packages("ggplot2")
#install.packages("openintro") 

# Load packages
library(openintro) #for the use of email50 and county data
library(dplyr)     #for the use of dplyr functions such as mutate
library(ggplot2)   #for the use of ggplot2 functions such as ggplot()

1.1 Loading data into R

# Load data
data(email50) #this data is from the openintro package

# View its structure
str(email50)
## 'data.frame':    50 obs. of  21 variables:
##  $ spam        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct, format: "2012-01-04 13:19:16" "2012-02-16 20:10:06" ...
##  $ image       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num  0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num  21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int  551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num  1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

1.2 Identify variable types

The glimpse() function from dplyr provides a handy alternative to str() for previewing a dataset. In addition to telling you the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.

# Glimpse email50
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 13:19:16, 2012-02-16 20:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, y...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fct> small, big, none, small, small, small, small, sma...

1.3 Filtering based on a factor

Categorical data are often stored as factors in R. Get some practice working with a factor variable, number, which tells you what type of number (none, small, or big) an email contains.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 20:10:06, 2012-02-04 23:26:09, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fct> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fct> big, big, big, big, big, big, big

1.4 Complete filtering based on a factor

The droplevels() function removes unused levels of factor variables from your dataset. It’s often useful to first determine which levels are unused (i.e., have a count of zero) with the table() function.

# Table of number variable
table(email50_big$number)
## 
##  none small   big 
##     0     0     7

# Drop levels
email50_big$number <- droplevels(email50_big$number)

# Another table of number variable
table(email50_big$number)
## 
## big 
##   7

Interpretation

  • Note that dropping the levels of the number variable gets rid of the levels with counts of zero. This will be useful when you’re creating visualizations later on.
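
To see why the extra step matters, here is a minimal sketch on a toy factor. The vector `sizes` is made up for illustration and is not taken from email50. Subsetting a factor keeps every original level, and droplevels() discards the ones that no longer occur.

```r
# Hypothetical toy factor, standing in for email50$number
sizes <- factor(c("small", "big", "none", "big"),
                levels = c("none", "small", "big"))

# Subsetting keeps all three levels, even those with no values left
big_only <- sizes[sizes == "big"]
levels(big_only)

# droplevels() removes the levels that are no longer used
big_only_clean <- droplevels(big_only)
levels(big_only_clean)
```

The same idea applies to a whole data frame: calling droplevels() on the data frame cleans every factor column at once.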

1.5 Discretize a different variable

Create a categorical version of the num_char variable in the email50 dataset, which tells you the number of characters in an email, in thousands. This new variable will have two levels—“below median” and “at or above median”—depending on whether an email has less than the median number of characters or equal to or more than that value.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
med_num_char
## [1] 6.8895

# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
email50_fortified  
##    spam to_multiple from cc sent_email                time image attach
## 1     0           0    1  0          1 2012-01-04 13:19:16     0      0
## 2     0           0    1  0          0 2012-02-16 20:10:06     0      0
## 3     1           0    1  4          0 2012-01-04 15:36:23     0      2
## 4     0           0    1  0          0 2012-01-04 17:49:52     0      0
## 5     0           0    1  0          0 2012-01-27 09:34:45     0      0
## 6     0           0    1  0          0 2012-01-17 17:31:57     0      0
## 7     0           0    1  0          0 2012-03-18 04:18:55     0      0
## 8     0           0    1  0          1 2012-03-31 13:58:56     0      0
## 9     0           0    1  1          1 2012-01-11 01:57:54     0      0
## 10    0           0    1  0          0 2012-01-07 19:29:16     0      0
## 11    0           0    1  0          0 2012-02-23 00:57:02     0      0
## 12    0           0    1  0          0 2012-02-04 23:26:09     0      0
## 13    1           0    1  0          0 2012-01-24 16:15:56     0      0
## 14    1           1    1  0          0 2012-02-09 02:22:46     0      2
## 15    0           0    1  0          0 2012-03-09 18:46:12     0      0
## 16    0           0    1  0          1 2012-01-12 16:17:53     0      0
## 17    0           0    1  0          1 2012-01-31 19:44:22     0      0
## 18    1           0    1  0          0 2012-03-21 02:00:30     0      1
## 19    0           0    1  1          1 2012-01-03 19:39:06     0      0
## 20    0           1    1  4          0 2012-03-29 00:48:08     0      0
## 21    0           0    1  0          0 2012-01-09 15:04:18     0      0
## 22    0           0    1  0          0 2012-01-14 10:07:03     0      0
## 23    0           0    1  0          1 2012-03-24 15:00:57     0      0
## 24    0           0    1  2          1 2012-01-12 21:43:42     0      0
## 25    0           0    1  0          0 2012-03-02 19:05:22     0      0
## 26    0           0    1  0          0 2012-02-16 04:01:40     0      0
## 27    0           0    1  0          1 2012-02-09 13:51:43     0      0
## 28    0           1    1  5          0 2012-01-23 14:03:19     0      0
## 29    0           0    1  0          1 2012-02-01 16:12:20     0      0
## 30    0           1    1  0          0 2012-03-23 17:42:28     0      0
## 31    0           0    1  0          0 2012-02-14 13:43:48     0      0
## 32    0           0    1  0          0 2012-01-19 23:33:55     0      0
## 33    0           0    1  0          0 2012-01-21 17:35:48     0      0
## 34    0           0    1  0          0 2012-01-25 22:37:06     0      0
## 35    0           1    1  0          0 2012-03-06 17:03:41     0      0
## 36    0           0    1  0          1 2012-03-25 21:08:44     0      0
## 37    0           0    1  0          0 2012-02-15 00:17:09     0      0
## 38    0           0    1  0          0 2012-03-01 10:00:01     0      0
## 39    0           0    1  0          0 2012-02-10 18:34:42     0      0
## 40    0           0    1  0          1 2012-01-12 21:44:54     0      0
## 41    0           1    1  0          0 2012-01-06 20:14:47     0      0
## 42    0           0    1  1          1 2012-03-21 15:39:21     0      0
## 43    0           0    1  0          1 2012-02-13 20:19:36     0      0
## 44    0           0    1  0          0 2012-01-25 22:18:37     0      0
## 45    0           0    1  0          0 2012-01-24 23:44:52     0      0
## 46    1           0    1  0          0 2012-02-29 23:36:55     0      0
## 47    0           1    1  0          0 2012-03-06 14:10:00     0      0
## 48    0           0    1  0          1 2012-03-14 17:08:27     0      0
## 49    0           0    1  1          1 2012-02-10 16:27:48     0      0
## 50    0           0    1  0          0 2012-01-04 18:27:36     0      0
##    dollar winner inherit viagra password num_char line_breaks format
## 1       0     no       0      0        0   21.705         551      1
## 2       0     no       0      0        0    7.011         183      1
## 3       0     no       0      0        0    0.631          28      0
## 4       0     no       0      0        0    2.454          61      0
## 5       9     no       0      0        1   41.623        1088      1
## 6       0     no       0      0        0    0.057           5      0
## 7       0     no       0      0        0    0.809          17      0
## 8       0     no       0      0        0    5.229          88      1
## 9       0     no       0      0        0    9.277         242      1
## 10     23     no       0      0        0   17.170         578      1
## 11      4     no       0      0        0   64.401        1167      1
## 12      0     no       0      0        2   10.368         198      1
## 13      3    yes       0      0        0   42.793         712      1
## 14      2     no       0      0        0    0.451          24      0
## 15      0     no       0      0        0   29.233         604      1
## 16      0     no       0      0        0    9.794         197      1
## 17      0     no       0      0        0    2.139          60      1
## 18      0     no       0      0        0    0.130           5      0
## 19      0     no       0      0        8    4.945         120      1
## 20      2     no       0      0        0   11.533         291      1
## 21      0     no       0      0        0    5.682          87      1
## 22      0     no       0      0        0    6.768          81      1
## 23      0     no       0      0        0    0.086           5      0
## 24      0     no       0      0        0    3.070          65      1
## 25      2     no       0      0        0   26.520         692      1
## 26      0     no       0      0        0   26.255         654      1
## 27      0     no       0      0        0    5.259         140      1
## 28      0     no       0      0        0    2.780          69      0
## 29      0     no       0      0        0    5.864         142      1
## 30      0     no       0      0        0    9.928         219      1
## 31      0     no       0      0        2   25.209         725      1
## 32      0     no       0      0        0    6.563         140      1
## 33      0     no       0      0        0   24.599         621      1
## 34      0     no       0      0        0   25.757         645      1
## 35      0     no       0      0        0    0.409          13      0
## 36      0     no       0      0        0   11.223         512      1
## 37      0     no       0      0        0    3.778          98      1
## 38      0     no       0      0        2    1.493          35      0
## 39      0     no       0      0        8   10.613         225      1
## 40      0     no       0      0        0    0.493          13      1
## 41      0     no       0      0        0    4.415          61      0
## 42      0     no       0      0        0   14.156         300      1
## 43      0     no       0      0        0    9.491         233      1
## 44      0     no       0      0        0   24.837         629      1
## 45      0     no       0      0        0    0.684          17      1
## 46      0     no       0      0        0   13.502         193      0
## 47      0     no       0      0        0    2.789          44      0
## 48      0     no       0      0        0    1.169          35      1
## 49      0     no       0      0        0    8.937         211      1
## 50      0     no       0      0        0   15.829         242      1
##    re_subj exclaim_subj urgent_subj exclaim_mess number       num_char_cat
## 1        1            0           0            8  small at or above median
## 2        0            0           0            1    big at or above median
## 3        0            0           0            2   none       below median
## 4        0            0           0            1  small       below median
## 5        0            0           0           43  small at or above median
## 6        0            0           0            0  small       below median
## 7        0            0           0            0  small       below median
## 8        1            0           0            2  small       below median
## 9        1            1           0           22  small at or above median
## 10       0            0           0            3  small at or above median
## 11       0            0           0           13  small at or above median
## 12       0            0           0            1    big at or above median
## 13       0            0           0            2    big at or above median
## 14       0            0           0            2  small       below median
## 15       0            0           0           21  small at or above median
## 16       1            0           0           10  small at or above median
## 17       1            0           0            0  small       below median
## 18       0            0           0            0   none       below median
## 19       0            0           0            2  small       below median
## 20       1            0           0            4  small at or above median
## 21       0            0           0            0  small       below median
## 22       0            0           0            3  small       below median
## 23       0            1           0            0   none       below median
## 24       1            0           0            0  small       below median
## 25       0            1           0            7    big at or above median
## 26       0            0           0            1  small at or above median
## 27       1            0           0            8  small       below median
## 28       1            0           0            1  small       below median
## 29       1            0           0            6  small       below median
## 30       0            0           0            4  small at or above median
## 31       0            0           0            2  small at or above median
## 32       0            0           0            2    big       below median
## 33       0            0           0            1  small at or above median
## 34       0            0           0            1  small at or above median
## 35       0            0           0            1  small       below median
## 36       0            0           0            9    big at or above median
## 37       0            0           0            0  small       below median
## 38       0            0           0            1   none       below median
## 39       0            0           0            9    big at or above median
## 40       0            0           0            0   none       below median
## 41       0            0           0            1  small       below median
## 42       1            0           0            0  small at or above median
## 43       1            0           0           18  small at or above median
## 44       0            0           0            1  small at or above median
## 45       0            0           0            1  small       below median
## 46       0            0           0            1   none at or above median
## 47       0            0           0            0  small       below median
## 48       1            0           0            0  small       below median
## 49       1            0           0            2  small at or above median
## 50       0            0           0            4  small at or above median

# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
## # A tibble: 2 x 2
##   num_char_cat           n
##   <chr>              <int>
## 1 at or above median    25
## 2 below median          25

Interpretation

  • Exactly half of the emails (25) fall below the median, and the other half (25) fall at or above it.
  • This makes sense because the median marks the 50th percentile, or midpoint, of a distribution, so half of the emails should fall in one category and the other half in the other.
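
The even split is not guaranteed in general: it depends on there being no values tied at the median. The sketch below uses two made-up vectors (`no_ties` and `with_ties`, both hypothetical) and a small helper, `split_at_median()`, that applies the same ifelse() rule as above.

```r
# Helper applying the same rule used for num_char_cat
split_at_median <- function(x) {
  ifelse(x < median(x), "below median", "at or above median")
}

no_ties   <- c(1, 3, 5, 7)     # median is 4, no value equals it
with_ties <- c(1, 2, 2, 2, 5)  # median is 2, three values equal it

# Even split when nothing is tied at the median
table(split_at_median(no_ties))

# Lopsided split when several values sit exactly at the median
table(split_at_median(with_ties))
```

With ties, every tied value lands in the "at or above median" group, so that group can be noticeably larger.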

1.6 Combining levels of a different factor

Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels “none”, “small”, and “big”, but suppose you’re only interested in whether an email contains a number.

# Create number_yn variable in email50
email50_fortified <- email50 %>%
  mutate(number_yn = case_when(
    number == "none" ~ "no",
    number != "none" ~ "yes"
    )
  )

# Visualize number_yn
ggplot(email50_fortified, aes(x = number_yn)) +
  geom_bar()
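
As a rough cross-check of the recoding, the sketch below does the same collapse in base R on a toy factor (the values are made up, not taken from email50) and cross-tabulates the old levels against the new ones.

```r
# Hypothetical toy factor, standing in for email50$number
number <- factor(c("none", "small", "big", "small"),
                 levels = c("none", "small", "big"))

# Collapse "small" and "big" into a single "yes" level
number_yn <- factor(ifelse(number == "none", "no", "yes"),
                    levels = c("no", "yes"))

# Cross-tabulate old against new levels to check the mapping
table(number, number_yn)
```

Each original level should have counts in only one column of the table; anything else would indicate a mistake in the mapping.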

1.7 Visualizing numerical and categorical data

Visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam.

Recall that in the ggplot() function, the first argument gives the dataset, then the aesthetics map the variables to certain features of the plot, and finally the geom_*() layer informs the type of plot you want to make.

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

Chapter 2: Study types and cautionary tales

In this chapter, you will learn about observational studies and experiments, scope of inference, and Simpson’s paradox.

  • Observational studies
    • Collect data in a way that does not directly interfere with how the data arise
    • Only association can be inferred, not causation
  • Experiments
    • Randomly assign various subjects to various treatments
    • Causation can be inferred
  • Example: evaluating the relationship between using screens at bedtime and attention span during the day. We can design the study as an observational study or as an experiment. In an observational study, we sample two types of people from the population (those who choose to use screens at bedtime and those who don’t), find the average attention span for each group, and compare. In an experiment, we sample a group of people from the population and randomly assign them to two groups: those who are asked to use screens at bedtime and those who are asked not to. The difference is that the decision of whether to use screens is not left up to the subjects, as it was in the observational study, but is instead imposed by the researcher.
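
The random assignment step in the screen-time example can be sketched in a few lines of base R. The 20 subject IDs, the group labels, and the seed are all made up for illustration.

```r
set.seed(1)  # arbitrary seed so the assignment is reproducible

# Hypothetical sample of 20 subjects drawn from the population
subjects <- data.frame(id = 1:20)

# Shuffle the two treatment labels so that exactly 10 subjects get each
subjects$group <- sample(rep(c("screens", "no screens"), each = 10))

# Check the group sizes
table(subjects$group)
```

Shuffling a fixed set of labels, rather than drawing a label independently for each subject, guarantees equal group sizes.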

2.1 Identify the type of study

Look at data from a different study on country characteristics. You’ll load the data first and view it, then you’ll be asked to identify the type of study. Remember, an experiment requires random assignment.

# Install gapminder R package
#install.packages("gapminder") #Once it's installed, you won't have to run this code again

# Load gapminder R package
library(gapminder)

# Load data
data(gapminder)

# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

# Identify type of study
type_of_study <- "observational"

2.2 Random sampling or random assignment

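  • Random sampling: subjects are selected at random from the population, so the results can be generalized to that population. On its own (as in an observational study), it only supports inferring association.
  • Random assignment (experiments): subjects are assigned to treatments at random, so causation can be inferred. On its own, it does not make the results generalizable.
  • Consider the following example.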
    • One of the early studies linking smoking and lung cancer compared patients who are already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, proportions of smokers for patients with and without lung cancer were compared.
    • Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study; random sampling is not employed because the study records the patients who are already hospitalized, so it wouldn’t be appropriate to apply the findings back to the population as a whole.

2.3 Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

The results of the study cannot be generalized to all people and a causal link between believing information is stored and memory can be inferred based on these results.

There is no random sampling since the subjects of the study were volunteers, so the results cannot be generalized to all people. However, due to random assignment, a causal link between believing information is stored and memory can be inferred based on these results.

2.4 Number of males and females admitted

Simpson’s paradox is a phenomenon in probability and statistics where a trend appears in different groups of data but disappears or reverses when these groups are combined.

Calculate the number of males and females admitted

# Import data
ucb_admit <- read.csv("/resources/rstudio/Bus Statistics/data/Introduction to data/ucb_admit.csv") 
ucb_admit$Dept <- as.character(ucb_admit$Dept)
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit  <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admit...
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, M...
## $ Dept   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
summary(ucb_admit)
##       Admit         Gender         Dept          
##  Admitted:1755   Female:1835   Length:4526       
##  Rejected:2771   Male  :2691   Class :character  
##                                Mode  :character

# Load packages
library(dplyr)

# Count number of male and female applicants admitted
ucb_admit %>%
  count(Gender, Admit)
## # A tibble: 4 x 3
##   Gender Admit        n
##   <fct>  <fct>    <int>
## 1 Female Admitted   557
## 2 Female Rejected  1278
## 3 Male   Admitted  1198
## 4 Male   Rejected  1493

2.5 Proportion of males admitted overall

Calculate the proportion of males admitted overall.

# Define ucb_admission_counts
ucb_admission_counts <-
  ucb_admit %>%
  count(Gender, Admit)

ucb_admission_counts %>%
  # Group by gender
  group_by(Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for admitted
  filter(Admit == "Admitted")
## # A tibble: 2 x 4
## # Groups:   Gender [2]
##   Gender Admit        n  prop
##   <fct>  <fct>    <int> <dbl>
## 1 Female Admitted   557 0.304
## 2 Male   Admitted  1198 0.445

2.6 Proportion of males admitted for each department

Make a table similar to the one you constructed earlier, except you will first group the data by department. Then, you’ll use this table to calculate the proportion of males admitted in each department.

ucb_admission_counts <- ucb_admit %>%
  # Counts by department, then gender, then admission status
  count(Dept, Gender, Admit)

# See the result
ucb_admission_counts
## # A tibble: 24 x 4
##    Dept  Gender Admit        n
##    <chr> <fct>  <fct>    <int>
##  1 A     Female Admitted    89
##  2 A     Female Rejected    19
##  3 A     Male   Admitted   512
##  4 A     Male   Rejected   313
##  5 B     Female Admitted    17
##  6 B     Female Rejected     8
##  7 B     Male   Admitted   353
##  8 B     Male   Rejected   207
##  9 C     Female Admitted   202
## 10 C     Female Rejected   391
## # ... with 14 more rows

ucb_admission_counts  %>%
  # Group by department, then gender
  group_by(Dept, Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for male and admitted
  filter(Admit == "Admitted", Gender == "Male")
## # A tibble: 6 x 5
## # Groups:   Dept, Gender [6]
##   Dept  Gender Admit        n   prop
##   <chr> <fct>  <fct>    <int>  <dbl>
## 1 A     Male   Admitted   512 0.621 
## 2 B     Male   Admitted   353 0.630 
## 3 C     Male   Admitted   120 0.369 
## 4 D     Male   Admitted   138 0.331 
## 5 E     Male   Admitted    53 0.277 
## 6 F     Male   Admitted    22 0.0590

Interpretation

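  • Within most departments, female applicants are more likely to be admitted than male applicants, even though the overall admission rate is higher for males (0.445) than for females (0.304). This reversal is an instance of Simpson’s paradox.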
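
To compare both genders side by side, here is a sketch that uses the UCBAdmissions table shipped with base R. It assumes that this built-in table holds the same counts as ucb_admit.csv; the totals printed earlier (for example, 1198 admitted males) are consistent with that, but the file itself is not checked here.

```r
# UCBAdmissions is a 3-way table (Admit x Gender x Dept) in the datasets package

# Overall admission rate by gender, ignoring department
overall_counts <- apply(UCBAdmissions, c(1, 2), sum)
overall_rate <- overall_counts["Admitted", ] / colSums(overall_counts)
round(overall_rate, 3)

# Admission rate by gender within each department
dept_totals <- apply(UCBAdmissions, c(2, 3), sum)
dept_rate <- UCBAdmissions["Admitted", , ] / dept_totals
round(dept_rate, 3)

# Departments where the female rate exceeds the male rate
names(which(dept_rate["Female", ] > dept_rate["Male", ]))
```

Comparing `overall_rate` with the rows of `dept_rate` shows the direction of the gender gap changing once department is taken into account.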

Chapter 3: Sampling strategies and experimental design

Why not take a census?

  • It’s cost-prohibitive.
  • It can be impossible to collect data from some individuals. If these individuals differ from the rest of the population, the resulting data would be biased.
  • Populations constantly change.

Sampling is like cooking. You taste a spoonful of soup to get an idea of the dish as a whole, e.g., whether it’s too salty; you wouldn’t eat the whole pot. Tasting the spoonful is an exploratory analysis. If you then generalize and conclude that the entire soup needs more salt, you are making an inference. For that inference to be valid, the spoonful you tasted (your sample) should be representative of the entire pot (your population).

Sampling methods

  • simple random sampling: we randomly select cases such that each case is equally likely to be selected.
  • stratified sampling: we first divide the population into homogeneous groups called strata, and then randomly sample from each stratum. For example, stratified sampling may be used if we want to make sure that low-, medium-, and high-income classes are equally represented in a study.
  • cluster sampling: we divide the population into clusters, randomly sample a few clusters, and use all observations within these clusters. While clusters are heterogeneous within themselves, each cluster is similar to the other clusters, so we can get away with sampling just a few of them.
  • multistage sampling: we add another step to cluster sampling. Just like in cluster sampling, we divide the population into clusters and randomly sample a few clusters; but instead of using all observations within these clusters, we randomly sample observations from within them. Multistage sampling and cluster sampling are often used for economic reasons. For example, one might divide a city into geographical regions that on average are similar to each other and then sample randomly from within a few randomly picked regions in order to avoid traveling to all regions.
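
The difference between cluster and multistage sampling can be sketched in base R. The population below (10 regions of 30 households each), the sample sizes, and the seed are all hypothetical.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical population: 10 regions (clusters) of 30 households each
population <- data.frame(
  region    = rep(paste0("R", 1:10), each = 30),
  household = 1:300
)

# Cluster sampling: pick 3 regions at random and keep every household in them
chosen <- sample(unique(population$region), size = 3)
cluster_sample <- population[population$region %in% chosen, ]

# Multistage sampling: within each chosen region, keep only 5 random households
multistage_sample <- do.call(rbind, lapply(chosen, function(r) {
  in_region <- population[population$region == r, ]
  in_region[sample(nrow(in_region), size = 5), ]
}))

nrow(cluster_sample)     # 3 regions x 30 households
nrow(multistage_sample)  # 3 regions x 5 households
```

Both samples touch only the three chosen regions; the multistage version simply observes fewer households within each of them.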

Sampling in R

Suppose we want to collect data from counties in the United States, but we don’t have the resources to collect data from all of them. Conveniently, however, the list of all counties is contained in the openintro R package.

# Load county data
data(county) #this data is from the openintro package

# Remove DC
county_noDC <- county %>%
  filter(state != "District of Columbia") %>%
  droplevels()

Simple random sample

# Simple random sample of 150 counties
county_srs <- county_noDC %>%
  sample_n(size = 150)

glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name          <fct> Hale County, Perry County, Jasper County, Nolan ...
## $ state         <fct> Texas, Pennsylvania, Iowa, Texas, Arkansas, Mich...
## $ pop2000       <dbl> 36602, 43602, 37213, 15802, 17119, 31314, 51335,...
## $ pop2010       <dbl> 36273, 45969, 36842, 15216, 17264, 29598, 53597,...
## $ fed_spend     <dbl> 7.912855, 5.985207, 6.660849, 8.638473, 9.883341...
## $ poverty       <dbl> 19.0, 9.1, 12.3, 19.4, 21.5, 15.9, 18.9, 14.4, 1...
## $ homeownership <dbl> 65.3, 80.9, 73.4, 68.1, 80.8, 80.6, 78.3, 84.6, ...
## $ multiunit     <dbl> 12.1, 8.6, 16.4, 12.2, 3.6, 10.6, 4.8, 5.4, 13.0...
## $ income        <dbl> 16322, 23701, 23160, 19973, 16570, 21140, 19600,...
## $ med_income    <dbl> 36509, 52659, 46396, 37102, 31135, 36695, 37580,...

# State distribution of SRS counties
county_srs %>%
  group_by(state) %>%
  count()
## # A tibble: 43 x 2
## # Groups:   state [43]
##    state          n
##    <fct>      <int>
##  1 Alabama        2
##  2 Alaska         3
##  3 Arizona        2
##  4 Arkansas       4
##  5 California     1
##  6 Colorado       4
##  7 Florida        4
##  8 Georgia        4
##  9 Illinois       5
## 10 Indiana        3
## # ... with 33 more rows
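
Because the sample is drawn at random, the counties and state counts shown above will differ from run to run. A minimal sketch of making a draw reproducible with set.seed() follows; it uses base R's sample() on made-up county IDs, and the seed value is arbitrary. Calling set.seed() before sample_n() should have the same effect, on the assumption that dplyr draws from R's standard random number generator.

```r
# Hypothetical county IDs standing in for the rows of county_noDC
county_ids <- 1:3000

set.seed(2024)                      # arbitrary seed value
first_draw <- sample(county_ids, size = 150)

set.seed(2024)                      # same seed again
second_draw <- sample(county_ids, size = 150)

# The two draws match because the seed was reset before each one
identical(first_draw, second_draw)
```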

Stratified Sampling

# Stratified sample of 150 counties, each state is a stratum
county_str <- county_noDC %>%
  group_by(state) %>%
  sample_n(size = 3)    # 3 counties from each of the 50 states

glimpse(county_str)
## Observations: 150
## Variables: 10
## $ name          <fct> St. Clair County, Sumter County, Barbour County,...
## $ state         <fct> Alabama, Alabama, Alabama, Alaska, Alaska, Alask...
## $ pop2000       <dbl> 64742, 14798, 29038, 30711, 3436, 9196, 3072149,...
## $ pop2010       <dbl> 83593, 13763, 27457, 31275, 2150, 9492, 3817117,...
## $ fed_spend     <dbl> 5.738698, 13.621086, 8.752158, 37.590184, 24.156...
## $ poverty       <dbl> 10.6, 34.8, 25.0, 6.5, 15.9, 24.6, 13.9, 13.5, 1...
## $ homeownership <dbl> 82.2, 68.3, 68.0, 64.0, 64.0, 56.2, 66.3, 46.9, ...
## $ multiunit     <dbl> 5.5, 14.5, 11.1, 32.2, 8.6, 17.4, 25.1, 6.1, 4.8...
## $ income        <dbl> 22192, 14460, 15875, 34923, 24932, 20549, 27816,...
## $ med_income    <dbl> 48837, 25338, 33219, 75517, 43750, 53899, 55054,...
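
A likely reason for removing the District of Columbia before stratifying is that it contributes too few rows to supply three counties, and sampling without replacement fails when a group is smaller than the requested size. The base-R sketch below checks stratum sizes before sampling; the `units` data frame and its strata A, B, and C are hypothetical.

```r
# Hypothetical frame: three strata of very different sizes
units <- data.frame(
  stratum = rep(c("A", "B", "C"), times = c(10, 5, 1)),
  id      = 1:16
)

# Count the units available in each stratum
stratum_sizes <- table(units$stratum)
stratum_sizes

# Strata too small to supply 3 units without replacement
n_per_stratum <- 3
too_small <- names(stratum_sizes)[stratum_sizes < n_per_stratum]
too_small  # "C"

# Drop those strata, then sample within each remaining one
eligible <- units[!units$stratum %in% too_small, ]
set.seed(42)
stratified <- do.call(rbind, lapply(split(eligible, eligible$stratum), function(s) {
  s[sample(nrow(s), size = n_per_stratum), ]
}))
table(stratified$stratum)  # 3 from A, 3 from B
```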

3.1 Simple random sample in R

Suppose you want to collect some data from a sample of eight states.

# Import us_regions
us_regions <- read.csv("/resources/rstudio/Bus Statistics/data/Introduction to data/us_regions.csv")

# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)
states_srs
##             state  region
## 51     Washington    West
## 49         Hawaii    West
## 26 North Carolina   South
## 14      Wisconsin Midwest
## 42        Montana    West
## 15           Iowa Midwest
## 30  West Virginia   South
## 34      Tennessee   South

# Count states by region
states_srs %>%
  count(region)
## # A tibble: 3 x 2
##   region      n
##   <fct>   <int>
## 1 Midwest     2
## 2 South       3
## 3 West        3

Interpretation

  • Notice that this strategy selects an unequal number of states from each region. In this particular draw, no state from the Northeast was selected at all.
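
How often a simple random sample of eight states leaves out a whole region can be estimated by simulation. This sketch does not use us_regions.csv; as an assumption, it builds a stand-in frame from R's built-in `state.name` and `state.region` vectors (50 states, with the label "North Central" where us_regions uses "Midwest"). The seed and the number of simulations are arbitrary.

```r
# Stand-in frame built from R's built-in state data (50 states, 4 regions).
# Assumption: this only approximates the us_regions file used above.
states <- data.frame(state = state.name, region = state.region)

set.seed(123)
n_sims <- 5000

# For each simulated SRS of 8 states, record how many regions it covers
regions_covered <- replicate(n_sims, {
  picked <- sample(nrow(states), size = 8)
  length(unique(states$region[picked]))
})

# Share of simulated samples that leave out at least one region
mean(regions_covered < 4)
```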

3.2 Stratified sample in R

With stratified sampling, select an equal number of states from each region.

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)
states_str
## # A tibble: 8 x 2
## # Groups:   region [4]
##   state         region   
##   <fct>         <fct>    
## 1 North Dakota  Midwest  
## 2 South Dakota  Midwest  
## 3 Rhode Island  Northeast
## 4 New Hampshire Northeast
## 5 Florida       South    
## 6 Arkansas      South    
## 7 Hawaii        West     
## 8 Oregon        West

# Count states by region
states_str %>%
  count(region)
## # A tibble: 4 x 2
## # Groups:   region [4]
##   region        n
##   <fct>     <int>
## 1 Midwest       2
## 2 Northeast     2
## 3 South         2
## 4 West          2
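
Recent dplyr releases (1.0.0 and later) mark sample_n() as superseded in favor of slice_sample(). A sketch of the same stratified draw with the newer function follows. As an assumption, it uses a stand-in for us_regions built from R's built-in `state.name` and `state.region` vectors, and an arbitrary seed.

```r
library(dplyr)

# Stand-in for us_regions built from R's built-in state data (assumption)
states <- data.frame(state = state.name, region = state.region)

set.seed(7)
states_str <- states %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%   # 2 states from each region
  ungroup()

# Count states by region
count(states_str, region)
```

Dropping the grouping with ungroup() at the end keeps later steps from silently operating region by region.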

Principles of experimental design

  • Control: compare treatment of interest to a control group
  • Randomize: randomly assign subjects to treatments
  • Replicate: collect a sufficiently large sample within a study, or replicate the entire study
  • Block: account for the potential impact of confounding variables
    • Group subjects into blocks based on these variables
    • Randomize within each block to treatment groups
    • Example: Suppose that we want to investigate whether students learn better in a traditional lecture-based course or on an interactive online platform. Two courses are designed to teach exactly the same material, so that the only difference between them is the method of delivery. We sample a group of students for our study and randomly assign them to these courses. But we need to consider possible confounding variables: prior programming experience may affect how students learn in these two settings, and we know that some of the students in our sample have prior experience and some don’t. So we decide to block on prior programming experience. To do so, we divide our sample into two blocks, those with prior experience and those without, and then randomly assign individuals from each block to the two courses. This ensures that students with and without prior experience are equally represented in the two treatment groups. The explanatory variable is the course type (lecture versus interactive online), and the variable we are blocking on is prior programming experience. This way, if we find a difference in mastery of the R language between the two courses, we can attribute it to the course type and be assured that it is not due to previous programming experience, since those with and without experience are equally represented in both courses.
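
The blocking step in the example above can be sketched in base R. The `students` data frame, the block sizes (12 with prior experience, 8 without), and the seed are hypothetical; this is one simple way to randomize within blocks, not a prescribed procedure.

```r
set.seed(10)

# Hypothetical sample: 12 students with prior experience, 8 without
students <- data.frame(
  id         = 1:20,
  experience = rep(c("prior", "none"), times = c(12, 8))
)

# Within each block, shuffle a balanced vector of course labels
students$course <- NA_character_
for (block in unique(students$experience)) {
  in_block <- which(students$experience == block)
  labels   <- rep(c("lecture", "online"), length.out = length(in_block))
  students$course[in_block] <- sample(labels)
}

# Each experience level is split evenly between the two courses
table(students$experience, students$course)
```

Randomizing separately inside each block is what guarantees the even split; a single randomization over all 20 students would balance experience only on average.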

3.3 Identifying components of a study

Example: A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

  • 2 explanatory variables: light and noise
  • 1 blocking variable: gender
  • 1 response variable: exam performance

3.4 Experimental design terminology

Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

3.5 Connect blocking and stratifying

In random sampling, you use stratifying to control for a variable. In random assignment, you use blocking to achieve the same goal.

Chapter 4: Case study

Consider a case study looking at how the physical appearance of instructors impacts their students’ course evaluations. The data are student evaluations collected at the University of Texas at Austin. In addition, six students were shown photos of the professors and asked to rate their physical attractiveness.

4.1 Inspect the data

# Import data
evals <- read.csv("/resources/rstudio/Bus Statistics/data/Introduction to data/evals.csv") 

# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity     <fct> minority, minority, minority, minority, not mino...
## $ gender        <fct> female, female, female, female, male, male, male...
## $ language      <fct> english, english, english, english, english, eng...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs     <fct> single, single, single, single, multiple, multip...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color     <fct> color, color, color, color, color, color, color,...

4.2 Identify type of study

What type of study is this?

  • It’s an observational study: the data were simply observed as they arose, and students were not assigned to take a class with a more or less attractive instructor.

4.3 Sampling / experimental attributes

The data from this study were gathered by randomly selecting classes. Only the students who took a class could fill out an evaluation of the instructor who taught it. The study therefore uses random sampling but not random assignment.

4.4 Identify variable types

Start your exploration of a dataset by identifying variable types. The results from this exercise will help you design appropriate visualizations and calculate useful summary statistics later in your analysis.

# Inspect variable types
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity     <fct> minority, minority, minority, minority, not mino...
## $ gender        <fct> female, female, female, female, male, male, male...
## $ language      <fct> english, english, english, english, english, eng...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs     <fct> single, single, single, single, multiple, multip...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color     <fct> color, color, color, color, color, color, color,...

# Names of the categorical (factor) variables in evals
cat_vars <- c("rank", "ethnicity", "gender", "language", "cls_level", "cls_profs",
              "cls_credits", "pic_outfit", "pic_color")

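The factor columns can also be picked out programmatically rather than by reading the glimpse() output. This sketch uses a small hypothetical `toy` data frame whose column names are borrowed from evals; the values are made up.

```r
# Toy data frame standing in for evals (values are invented)
toy <- data.frame(
  score        = c(4.7, 4.1, 3.9),
  rank         = factor(c("tenure track", "tenured", "teaching")),
  gender       = factor(c("female", "male", "male")),
  cls_students = c(43L, 125L, 20L)
)

# Names of the columns stored as factors
cat_vars <- names(toy)[sapply(toy, is.factor)]
cat_vars  # "rank" "gender"
```

One caveat: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so text columns arrive as character rather than factor. Passing stringsAsFactors = TRUE, or testing with is.character instead, covers that case.
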
4.5 Recode a variable

The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is small, midsize, or large.

# Recode cls_students as cls_type: evals
evals <- evals %>%
  # Create new variable
  mutate(cls_type = ifelse(cls_students <= 18, "small", 
                      ifelse(cls_students >= 60, "large", "midsize")))
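
The same recoding can be written with base R's cut() instead of nested ifelse() calls. This is an alternative sketch, not the course's solution; the `cls_students` vector below is a hypothetical set of class sizes chosen to hit the boundaries.

```r
# Hypothetical class sizes, including the boundary values
cls_students <- c(10, 18, 19, 59, 60, 125)

# Intervals are right-closed by default: (-Inf, 18], (18, 59], (59, Inf]
cls_type <- cut(
  cls_students,
  breaks = c(-Inf, 18, 59, Inf),
  labels = c("small", "midsize", "large")
)

data.frame(cls_students, cls_type)
```

cut() returns a factor whose levels run small, midsize, large, which tends to give a more sensible legend order than the alphabetical order of a character column. The breaks match the ifelse() version only because class sizes are whole numbers.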

4.6 Create a scatterplot

The bty_avg variable is the professor’s average beauty rating from the six students who were asked to rate the faculty’s attractiveness. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.

# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point()

Interpretation

  • There appears to be no clear relationship.

4.7 Create a scatterplot, with an added layer

Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()

Quiz / Data exercise

How can we revitalize a region’s economy? Your task is to examine whether federal spending is positively related to the standard of living. Use the county data set in the openintro package. Examine the relationship between fed_spend and income by following the instructions below.

# Randomly sample 150 counties in the US.
county_srs <- county %>%
  sample_n(size = 150)

# What type of variables are they?
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name          <fct> Dodge County, Colonial Heights city, Baraga Coun...
## $ state         <fct> Wisconsin, Virginia, Michigan, Texas, North Caro...
## $ pop2000       <dbl> 85897, 16897, 8746, 16361, 130454, 7304, 14422, ...
## $ pop2010       <dbl> 88759, 17411, 8860, 16921, 141752, 7818, 18395, ...
## $ fed_spend     <dbl> 4.285087, 21.815404, 9.549097, 7.399149, 5.31524...
## $ poverty       <dbl> 7.8, 7.5, 12.0, 17.7, 17.2, 24.1, 15.9, 10.9, 9....
## $ homeownership <dbl> 73.9, 65.8, 75.5, 71.0, 73.6, 67.5, 75.9, 79.6, ...
## $ multiunit     <dbl> 21.4, 20.4, 9.6, 10.9, 10.1, 11.6, 4.8, 11.5, 5....
## $ income        <dbl> 23663, 26115, 19107, 22424, 21297, 15635, 19497,...
## $ med_income    <dbl> 52571, 50571, 40541, 42401, 40346, 29513, 40455,...

  • Discuss the nature of this study by addressing the following:
    • Is this an observational study or an experiment? Why? It’s an observational study because the data were simply observed; no treatment was imposed on the counties.
    • Does it involve random sampling or random assignment? Why? The 150 counties are randomly sampled. Random assignment was not employed because counties were not randomly assigned to different levels of federal spending.
    • Can you infer causation? Or just association? Why? It only reveals association, not causation, because the data represents simple observations such as…
    • Is your conclusion generalizable to the population as a whole? Why? It’s generalizable to all counties in the U.S. because the 150 counties are randomly selected from a pool of all US counties and thus are a good representation of all US counties.
  • Create a scatterplot with income on the y axis and fed_spend on the x axis. Interpret.

ggplot(county_srs, aes(x = fed_spend, y = income)) +
  geom_point()

  • Can you think of any confounding variable? Briefly discuss.

  • Census API: have students choose these variables and retrieve data on their own?

Alternative Quiz

Real estate data of your neighborhood

  • Have students ask a research question that they want to answer given Zillow data
  • Have students choose a data set of their own interest for their research question
  • Is it an observational study or experiment? And why? Explain in at least 100 words.
  • This may not work b/c the data sets are already pre-tabulated. For example, we can’t calculate mean, standard deviation and such.
  • Zillow real estate data
    • geographic unit: state, metro, county, city, zip code, neighborhood
    • metrics: home types and housing stock (e.g., condo, multifamily unit); types of ZHVI (e.g., median estimated home value for all homes with one bedroom within a given region); rental metrics (e.g., Median Rent List Price Per Sq Ft); other metrics (e.g., Homes Foreclosed)