Introduction to Data

# Install packages
#install.packages("dplyr") #Once it's installed, you won't have to run this code again
#install.packages("ggplot2")
#install.packages("openintro") 

# Load packages
library(openintro) #for the use of email50 and county data
library(dplyr)     #for the use of dplyr functions such as mutate
library(ggplot2)   #for the use of ggplot2 functions such as ggplot()

Chapter 1: Language of data

This chapter introduces the terminology of datasets and data frames in R.

1.1 Loading data into R

# Load data
data(email50) 

# View its structure
str(email50)
## 'data.frame':    50 obs. of  21 variables:
##  $ spam        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct, format: "2012-01-04 13:19:16" "2012-02-16 20:10:06" ...
##  $ image       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num  0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num  21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int  551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num  1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

1.2 Identify variable types

The glimpse() function from dplyr is an alternative to str() for previewing a dataset. In addition to telling you the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.

# Glimpse email50
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 13:19:16, 2012-02-16 20:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fctr> no, no, no, no, no, no, no, no, no, no, no, no, ...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fctr> small, big, none, small, small, small, small, sm...

1.3 Filtering based on a factor

Categorical data are often stored as factors in R. Here, practice working with a factor variable, number, which tells you what type of number (none, small, or big) an email contains.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 20:10:06, 2012-02-04 23:26:09, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fctr> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fctr> big, big, big, big, big, big, big

1.4 Complete filtering based on a factor

The droplevels() function removes unused levels of factor variables from your dataset. Before calling it, it’s often useful to determine which levels are unused (i.e. have a count of zero) with the table() function.

# Table of number variable
table(email50_big$number)
## 
##  none small   big 
##     0     0     7

# Drop levels
email50_big$number <- droplevels(email50_big$number)

# Another table of number variable
table(email50_big$number)
## 
## big 
##   7
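
As a small self-contained sketch of why the unused levels linger: subsetting a factor keeps its full set of levels until droplevels() is called. The vector `sizes` below is a made-up toy example, not part of email50.

```r
# Toy factor (hypothetical data, not from email50)
sizes <- factor(c("small", "big", "none", "big"),
                levels = c("none", "small", "big"))

# Subsetting keeps all three levels, even those with no values left
big_only <- sizes[sizes == "big"]
levels(big_only)   # still lists none, small, and big
table(big_only)    # zero counts for the unused levels

# droplevels() discards the levels that no longer occur
big_only <- droplevels(big_only)
levels(big_only)
```

This matters mostly for tables and plots, where unused levels would otherwise show up as empty categories.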

1.5 Discretize a different variable

Create a categorical version of the num_char variable in the email50 dataset, which tells you the number of characters in an email, in thousands. This new variable will have two levels—“below median” and “at or above median”—depending on whether an email has less than the median number of characters or equal to or more than that value.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50 <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))

# Count emails in each category
table(email50$num_char_cat)
## 
## at or above median       below median 
##                 25                 25

Exactly half of the emails fall below the median and half fall at or above it, because the median marks the 50th percentile, or midpoint, of a distribution.
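
The even 25/25 split depends on having an even number of observations. As a hedged illustration with made-up values (not taken from email50), an odd number of observations puts the median itself in the "at or above median" group, so the split is uneven:

```r
# Hypothetical character counts (in thousands); five values,
# so the median is one of the observed values
x <- c(0.5, 1.2, 3.4, 7.0, 9.9)
med <- median(x)   # 3.4

x_cat <- ifelse(x < med, "below median", "at or above median")
table(x_cat)
```

Here two values fall below the median and three fall at or above it.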

1.6 Combining levels of a different factor

A different way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels “none”, “small”, and “big”, but suppose you’re only interested in whether an email contains a number.

# Create number_yn column in email50
email50 <- email50 %>%
  mutate(number_yn = ifelse(number == "none", "no", "yes"))

# Visualize number_yn
ggplot(email50, aes(x = number_yn)) +
  geom_bar()
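
Note that ifelse() returns a character vector rather than a factor. As an alternative sketch in base R, levels can be merged directly by assigning a named list to levels(), which keeps the result a factor. The toy vector below is made up for illustration and only mimics the levels of number.

```r
# Toy factor with the same three levels as number (values are made up)
number <- factor(c("small", "big", "none", "small"),
                 levels = c("none", "small", "big"))

# Assigning a named list to levels() merges old levels into new ones
number_yn <- number
levels(number_yn) <- list(no = "none", yes = c("small", "big"))

table(number_yn)
```

Either approach works for the bar plot; the factor version also lets you control the order of the bars through the order of the list.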

1.7 Visualizing numerical and categorical data

Visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam. In the ggplot() function, the first argument gives the dataset, then the aesthetics map the variables to certain features of the plot, and finally the geom_*() layer specifies the type of plot you want to make.

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

Chapter 2: Study types and cautionary tales

This chapter covers observational studies and experiments, scope of inference, and Simpson’s paradox.

Observational studies: Only correlation can be inferred

Experiments: Causation can be inferred

2.1 Identify the type of study

# Load gapminder R package
library(gapminder)

# Load data
data(gapminder)

# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

# Identify type of study
type_of_study <- "observational"

2.2 Random sampling or random assignment

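Random sampling: subjects are selected at random from the population, so the results are generalizable. Random assignment: subjects are assigned at random to treatments, so causation can be inferred. A typical observational study with random sampling can only infer association but is generalizable; a typical experiment with random assignment but no random sampling can infer causation but is not generalizable.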
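
The two ideas answer different questions, which a short base R sketch may make concrete. Everything here is hypothetical: the population of 1,000 numbered people, the sample size of 20, and the seed are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 1,000 people identified by number
population <- 1:1000

# Random sampling: decides WHO is in the study (supports generalizing)
study_units <- sample(population, size = 20)

# Random assignment: decides WHICH treatment each sampled unit gets
# (supports causal conclusions)
treatment <- sample(rep(c("treatment", "control"), each = 10))

design <- data.frame(id = study_units, treatment = treatment)
table(design$treatment)
```

A study can use either step, both, or neither, which is what determines its scope of inference.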

2.3 Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

The results of the study cannot be generalized to all people and a causal link between believing information is stored and memory can be inferred based on these results.

There is no random sampling since the subjects of the study were volunteers, so the results cannot be generalized to all people. However, due to random assignment, a causal link between believing information is stored and memory can be inferred based on these results.

2.4 Number of males and females admitted

Simpson’s Paradox

Calculate the number of males and females admitted

# Import data
ucb_admit <- read.csv("~/resources/rstudio/ucb_admit.csv") 
ucb_admit$Dept <- as.character(ucb_admit$Dept)
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit  <fctr> Admitted, Admitted, Admitted, Admitted, Admitted, Admi...
## $ Gender <fctr> Male, Male, Male, Male, Male, Male, Male, Male, Male, ...
## $ Dept   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...

summary(ucb_admit)
##       Admit         Gender         Dept          
##  Admitted:1755   Female:1835   Length:4526       
##  Rejected:2771   Male  :2691   Class :character  
##                                Mode  :character

# Load packages
library(dplyr)
library(tidyr)

# Count number of male and female applicants admitted
ucb_counts <- ucb_admit %>%
  count(Admit, Gender)

# View result
ucb_counts
## Source: local data frame [4 x 3]
## Groups: Admit [?]
## 
##      Admit Gender     n
##     <fctr> <fctr> <int>
## 1 Admitted Female   557
## 2 Admitted   Male  1198
## 3 Rejected Female  1278
## 4 Rejected   Male  1493

# Spread the output across columns
ucb_counts %>%
  spread(Admit, n)
## # A tibble: 2 × 3
##   Gender Admitted Rejected
## * <fctr>    <int>    <int>
## 1 Female      557     1278
## 2   Male     1198     1493

2.5 Proportion of males admitted overall

Calculate the proportion of males admitted overall (the same table also gives the proportion for females).

ucb_admit %>%
  # Table of counts of admission status and gender
  count(Admit, Gender) %>%
  # Spread output across columns based on admission status
  spread(Admit, n) %>%
  # Create new variable
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 2 × 4
##   Gender Admitted Rejected Perc_Admit
##   <fctr>    <int>    <int>      <dbl>
## 1 Female      557     1278  0.3035422
## 2   Male     1198     1493  0.4451877
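
As a cross-check that needs no CSV file, the same proportions can be sketched in base R with prop.table(). The counts are copied from the table above; the object names `counts` and `props` are just illustrative.

```r
# Counts copied from the table above
counts <- matrix(c(557, 1278,
                   1198, 1493),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Admit  = c("Admitted", "Rejected")))

# margin = 1 divides each cell by its row total
props <- prop.table(counts, margin = 1)
round(props, 3)
```

The Admitted column of `props` should agree with Perc_Admit above, and each row sums to 1.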

2.6 Proportion of males admitted for each department

Make a table similar to the one constructed earlier, except first, group the data by department. Then, use this table to calculate the proportion of males admitted in each department.

# Table of counts of admission status and gender for each department
admit_by_dept <- ucb_admit %>%
  count(Admit, Dept, Gender) %>%
  spread(Admit, n)

# View result
admit_by_dept 
## Source: local data frame [12 x 4]
## Groups: Dept [6]
## 
##     Dept Gender Admitted Rejected
## *  <chr> <fctr>    <int>    <int>
## 1      A Female       89       19
## 2      A   Male      512      313
## 3      B Female       17        8
## 4      B   Male      353      207
## 5      C Female      202      391
## 6      C   Male      120      205
## 7      D Female      131      244
## 8      D   Male      138      279
## 9      E Female       94      299
## 10     E   Male       53      138
## 11     F Female       24      317
## 12     F   Male       22      351

# Proportion admitted within each department, by gender
admit_by_dept %>%
  mutate(Perc_Admit = Admitted  / (Admitted + Rejected))
## Source: local data frame [12 x 5]
## Groups: Dept [6]
## 
##     Dept Gender Admitted Rejected Perc_Admit
##    <chr> <fctr>    <int>    <int>      <dbl>
## 1      A Female       89       19 0.82407407
## 2      A   Male      512      313 0.62060606
## 3      B Female       17        8 0.68000000
## 4      B   Male      353      207 0.63035714
## 5      C Female      202      391 0.34064081
## 6      C   Male      120      205 0.36923077
## 7      D Female      131      244 0.34933333
## 8      D   Male      138      279 0.33093525
## 9      E Female       94      299 0.23918575
## 10     E   Male       53      138 0.27748691
## 11     F Female       24      317 0.07038123
## 12     F   Male       22      351 0.05898123
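
To make Simpson’s paradox explicit, the following base R sketch re-enters the counts from the department table above and compares the two genders department by department and overall. The vector names are illustrative only.

```r
# Admitted / rejected counts copied from the department table above
dept     <- rep(c("A", "B", "C", "D", "E", "F"), each = 2)
gender   <- rep(c("Female", "Male"), times = 6)
admitted <- c(89, 512, 17, 353, 202, 120, 131, 138, 94, 53, 24, 22)
rejected <- c(19, 313, 8, 207, 391, 205, 244, 279, 299, 138, 317, 351)

rate <- admitted / (admitted + rejected)

# Departments where the female admission rate exceeds the male rate
female_higher <- rate[gender == "Female"] > rate[gender == "Male"]
names(female_higher) <- unique(dept)
female_higher

# Overall rates pool the departments, weighting each by its applicants
overall <- tapply(admitted, gender, sum) /
  tapply(admitted + rejected, gender, sum)
round(overall, 3)
```

Females have the higher admission rate in four of the six departments (A, B, D, and F), yet the pooled male rate is higher. The counts show why: most female applicants applied to departments C through F, where admission rates are low for everyone.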

Chapter 3: Sampling strategies and experimental design

Census: Why not collect data from the entire population? It’s cost-prohibitive. It’s nearly impossible to collect data from all individuals. If the individuals who are missed are different from the rest of the population, the results would be biased. Populations constantly change.

Sampling is like cooking. You take a spoonful of soup to get an idea of the dish as a whole, e.g., whether it’s too salty; you wouldn’t eat a whole pot of soup. Tasting the spoonful is an exploratory analysis. If you then generalize and conclude that the entire soup needs more salt, that’s making an inference. For your inference to be valid, the spoonful you tasted (your sample) should be representative of the entire pot (your population).

Sampling methods:

simple random sampling: we randomly select a sample such that each case is equally likely to be selected.

stratified sampling: we first divide the population into homogeneous groups called strata, and then we randomly sample from each stratum. For example, stratified sampling may be used if we want to make sure that low-, medium-, and high-income classes are equally represented in a study.

cluster sampling: we divide the population into clusters, randomly sample a few clusters, and use all observations within these clusters. While clusters are heterogeneous within themselves, each cluster is similar to the other clusters, so we can get away with sampling just a few clusters.

multistage sampling: we add another step to cluster sampling. Just like in cluster sampling, we divide the population into clusters and randomly sample a few clusters, but instead of using all observations within these clusters, we randomly sample observations from within those clusters. Multistage sampling and cluster sampling are often used for economic reasons. For example, one might divide a city into geographical regions that on average are similar to each other and then sample randomly from within a few randomly picked regions in order to avoid traveling to all regions.
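
The last two methods described above can be sketched in base R. This is only an illustration under stated assumptions: the population of 10 regions with 20 units each, the sample sizes, and the seed are all made up.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical population: 10 clusters (e.g. city regions) of 20 units each
pop <- data.frame(id = 1:200,
                  cluster = rep(paste0("region_", 1:10), each = 20))

# Cluster sampling: pick 3 clusters at random, keep every unit in them
chosen <- sample(unique(pop$cluster), size = 3)
cluster_sample <- pop[pop$cluster %in% chosen, ]

# Two-stage version: within each chosen cluster, sample 5 units at random
picked <- lapply(split(cluster_sample, cluster_sample$cluster),
                 function(d) d[sample(nrow(d), size = 5), ])
multistage_sample <- do.call(rbind, picked)

nrow(cluster_sample)      # 3 clusters x 20 units
nrow(multistage_sample)   # 3 clusters x 5 units
```

In both cases only three regions need to be visited, which is the practical appeal of these designs.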

Sampling in R:
Suppose we want to collect data from counties in the United States, but we don’t have the resources to collect data from all the counties. Conveniently, however, the list of all counties is contained in the openintro R package.

# Load county data
data(county) 

# Remove DC
county_noDC <- county %>%
  filter(state != "District of Columbia") %>%
  droplevels()

Simple random sample

# Simple random sample of 150 counties
county_srs <- county_noDC %>%
  sample_n(size = 150)

glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name          <fctr> Staunton city, Washington County, Titus County,...
## $ state         <fctr> Virginia, Nebraska, Texas, Wisconsin, Kansas, U...
## $ pop2000       <dbl> 23853, 18780, 28118, 18643, 59482, 33779, 32080,...
## $ pop2010       <dbl> 23746, 20234, 32334, 20875, 65880, 46163, 34273,...
## $ fed_spend     <dbl> 10.939948, 5.484531, 6.161533, 7.388216, 5.04629...
## $ poverty       <dbl> 15.2, 4.4, 17.9, 12.6, 7.3, 20.9, 14.8, 14.1, 15...
## $ homeownership <dbl> 60.4, 81.5, 70.2, 82.2, 77.8, 63.2, 79.3, 66.4, ...
## $ multiunit     <dbl> 27.9, 12.0, 10.0, 4.7, 8.6, 26.3, 3.5, 4.2, 12.0...
## $ income        <dbl> 24077, 27884, 17520, 21917, 26436, 16898, 20774,...
## $ med_income    <dbl> 42724, 61940, 39423, 39885, 56290, 42247, 42282,...

# State distribution of SRS counties
county_srs %>%
  group_by(state) %>%
  count()
## # A tibble: 40 × 2
##         state     n
##        <fctr> <int>
## 1     Alabama     2
## 2      Alaska     1
## 3    Arkansas     4
## 4  California     4
## 5    Colorado     5
## 6     Florida     6
## 7     Georgia    10
## 8       Idaho     2
## 9    Illinois     3
## 10    Indiana     2
## # ... with 30 more rows

Stratified Sampling

# Stratified sample of 150 counties, each state is a stratum
county_str <- county_noDC %>%
  group_by(state) %>%
  sample_n(size = 3)    # 3 counties from each of the 50 states

glimpse(county_str)
## Observations: 150
## Variables: 10
## $ name          <fctr> Mobile County, Cleburne County, Clarke County, ...
## $ state         <fctr> Alabama, Alabama, Alabama, Alaska, Alaska, Alas...
## $ pop2000       <dbl> 399843, 14123, 27867, 6146, 8835, 5465, 155032, ...
## $ pop2010       <dbl> 412992, 14972, 25833, 5559, 8881, 5561, 200186, ...
## $ fed_spend     <dbl> 10.605181, 6.840035, 9.781442, 10.248966, 11.194...
## $ poverty       <dbl> 19.2, 17.1, 29.2, 14.0, 7.0, 12.6, 16.1, 13.9, 1...
## $ homeownership <dbl> 68.4, 74.9, 80.0, 69.0, 55.9, 36.3, 71.5, 66.3, ...
## $ multiunit     <dbl> 17.7, 5.3, 6.3, 9.7, 24.4, 30.9, 9.8, 25.1, 6.1,...
## $ income        <dbl> 21548, 17490, 17372, 24193, 29982, 29920, 21523,...
## $ med_income    <dbl> 40996, 36077, 27439, 45728, 62024, 72917, 39785,...
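
Two practical notes, offered as a sketch rather than as part of the original analysis. First, the samples above change on every run unless a seed is set with set.seed(). Second, in dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), which also respects groups. The data frame `toy` below is a hypothetical stand-in for county_noDC, and the seed value is arbitrary.

```r
library(dplyr)

set.seed(2017)  # arbitrary seed; fixes which rows are drawn

# Hypothetical stand-in for county_noDC: 4 states with 5 counties each
toy <- data.frame(state = rep(c("AL", "AK", "AZ", "AR"), each = 5),
                  county_id = 1:20)

# Stratified sample: 2 counties per state
toy_str <- toy %>%
  group_by(state) %>%
  slice_sample(n = 2) %>%
  ungroup()

count(toy_str, state)
```

Every state contributes exactly two rows, mirroring the three-per-state design used for county_str.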

3.1 Simple random sample in R

Suppose you want to collect some data from a sample of eight states:

# Import us_regions
us_regions <- read.csv("~/resources/rstudio/us_regions.csv")

# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)

# Count states by region
states_srs %>%
  group_by(region) %>%
  count()
## # A tibble: 3 × 2
##      region     n
##      <fctr> <int>
## 1 Northeast     2
## 2     South     5
## 3      West     1

3.2 Stratified sample in R

With stratified sampling, select an equal number of states from each region:

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  group_by(region) %>%
  count()
## # A tibble: 4 × 2
##      region     n
##      <fctr> <int>
## 1   Midwest     2
## 2 Northeast     2
## 3     South     2
## 4      West     2

Principles of experimental design

Control: compare treatment of interest to a control group.

Randomize: randomly assign subjects to treatments.

Replicate: collect a sufficiently large sample within a study, or replicate the entire study.

Block: account for the potential impact of confounding variables. Group subjects into blocks based on these variables, then randomize within each block to treatment groups.

3.3 Identifying Components of a study

Example: A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

2 explanatory variables: light and noise

1 blocking variable: gender

1 response variable: exam performance
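
One way such a design could be carried out is blocked random assignment, sketched below in base R. The roster of 16 students, the two levels chosen for each of light and noise, and the seed are hypothetical; the study description does not specify them.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical roster: 8 female and 8 male students
students <- data.frame(id = 1:16,
                       gender = rep(c("female", "male"), each = 8))

# The four treatments: every combination of light and noise level
treatments <- c("bright/quiet", "bright/loud", "dim/quiet", "dim/loud")

# Within each gender block, shuffle two copies of each treatment
students$treatment <- NA_character_
for (g in unique(students$gender)) {
  in_block <- students$gender == g
  students$treatment[in_block] <- sample(rep(treatments, times = 2))
}

# Each treatment appears twice in each block
table(students$gender, students$treatment)
```

Randomizing separately inside each block guarantees that both genders are equally represented under every light and noise condition.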

3.4 Experimental design terminology

Explanatory variables (factors) are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

3.5 Connect blocking and stratifying

In random sampling, you use stratifying to control for a variable. In random assignment, you use blocking to achieve the same goal.

Chapter 4: Case study

4.1 Inspect the data

# Import data
evals <- read.csv("~/resources/rstudio/evals.csv") 

# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity     <fctr> minority, minority, minority, minority, not min...
## $ gender        <fctr> female, female, female, female, male, male, mal...
## $ language      <fctr> english, english, english, english, english, en...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs     <fctr> single, single, single, single, multiple, multi...
## $ cls_credits   <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color     <fctr> color, color, color, color, color, color, color...

4.2 Identify type of study

What type of study is this? It’s an observational study.

4.3 Sampling/experimental attributes

The data from this study were gathered by randomly selecting classes.

4.4 Identify variable types

# Inspect variable types
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity     <fctr> minority, minority, minority, minority, not min...
## $ gender        <fctr> female, female, female, female, male, male, mal...
## $ language      <fctr> english, english, english, english, english, en...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs     <fctr> single, single, single, single, multiple, multi...
## $ cls_credits   <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color     <fctr> color, color, color, color, color, color, color...
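
The <fctr> columns in the output above are the categorical variables. As a hedged sketch, they can also be picked out programmatically rather than by reading the output. The data frame `mini` is a hypothetical miniature of evals built only to keep the example self-contained.

```r
# Hypothetical miniature of evals with mixed column types
mini <- data.frame(score = c(4.7, 4.1, 3.9),
                   age = c(36L, 59L, 51L),
                   gender = factor(c("female", "male", "male")),
                   cls_level = factor(c("upper", "upper", "lower")))

# Keep the names of the columns that are stored as factors
cat_vars <- names(mini)[sapply(mini, is.factor)]
cat_vars
```

One caveat: since R 4.0.0, read.csv() no longer converts strings to factors by default, so on a recent R version the text columns would be character unless stringsAsFactors = TRUE is passed (or is.character is tested instead).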

# Names of the categorical (factor) variables in evals
cat_vars <- c("rank", "ethnicity", "gender", "language", "cls_level", "cls_profs",
              "cls_credits", "pic_outfit", "pic_color")

4.5 Recode a variable

The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is small, midsize, or large.

# Recode cls_students as cls_type: evals
evals <- evals %>%
  # Create new variable
  mutate(cls_type = ifelse(cls_students <= 18, "small", 
                      ifelse(cls_students >= 60, "large", "midsize")))
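
Nested ifelse() calls become hard to read as the number of categories grows. As an alternative sketch in base R, cut() expresses the same rule with explicit break points and returns an ordered set of factor levels. The class sizes below are made up to exercise both cut points.

```r
# Hypothetical class sizes covering both cut points
cls_students <- c(10L, 18L, 19L, 59L, 60L, 125L)

# Intervals are right-closed by default: (-Inf, 18], (18, 59], (59, Inf]
cls_type <- cut(cls_students,
                breaks = c(-Inf, 18, 59, Inf),
                labels = c("small", "midsize", "large"))

# Same rule written with nested ifelse(), for comparison
nested <- ifelse(cls_students <= 18, "small",
                 ifelse(cls_students >= 60, "large", "midsize"))

data.frame(cls_students, cls_type, nested)
```

The two versions agree here because class sizes are whole numbers; for a non-integer value such as 59.5 the break at 59 would need adjusting.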

4.6 Create a scatterplot

The bty_avg variable shows the average beauty rating of the professor by the six students who were asked to rate the attractiveness of these faculty. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.

# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point()

4.7 Create a scatterplot, with an added layer

Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()

Quiz 3

1

library(openintro)
library(dplyr)

county_srs <- county_noDC %>% 
  sample_n(size=150)

2

glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name          <fctr> Lamar County, Merrick County, Mason County, Was...
## $ state         <fctr> Alabama, Nebraska, West Virginia, Wisconsin, Ok...
## $ pop2000       <dbl> 15904, 8204, 25957, 16036, 12623, 31839, 21139, ...
## $ pop2010       <dbl> 14564, 7845, 27324, 15911, 13488, 31499, 22185, ...
## $ fed_spend     <dbl> 9.965394, 8.996048, 8.014786, 10.611212, 9.55516...
## $ poverty       <dbl> 18.5, 10.7, 18.9, 13.1, 15.5, 20.6, 17.7, 8.5, 1...
## $ homeownership <dbl> 75.1, 73.3, 78.1, 82.1, 79.8, 79.4, 70.1, 82.2, ...
## $ multiunit     <dbl> 9.0, 5.0, 6.9, 6.7, 4.7, 6.9, 7.3, 8.0, 6.4, 11....
## $ income        <dbl> 19789, 21819, 19609, 23221, 20634, 18538, 16345,...
## $ med_income    <dbl> 33887, 46116, 36027, 41641, 40870, 36750, 36606,...

name
state
pop2000
pop2010
fed_spend
poverty
homeownership
multiunit
income
med_income

3

Is this an observational study or an experiment? Why?

This is an observational study, because there is no treatment being imposed on the subjects.

Does it involve random sampling or random assignment?

It involves random sampling: the sample above is a simple random sample of counties. There is no random assignment.

Can you infer causation? Or just association? Why?

You can only infer association because, in an observational study, there could be other factors that are relevant. You can only infer causation from an experiment.

Is your conclusion generalizable to the population as a whole? Why?

Yes. Because the counties were randomly sampled, identifiable trends in the data can be generalized to the population of counties as a whole, with confidence in the generalization depending on the size of the sample.

4


ggplot(county_srs, aes(x = income, y = fed_spend)) +
  geom_point()

Federal spending seemed to increase steadily with income, despite a few outliers, especially one county whose federal spending is over twice that of most counties with the same income.

5

Can you think of any confounding variable? Briefly discuss.

A confounding variable is a variable that is not taken into account but that could have an impact on the results. A confounding variable for this situation could be the employment rate: counties with higher employment rates may have higher incomes than counties with low employment rates, even if the latter received more federal spending.

Intro to Data

Gabriel Nichols

7/2/2017