• Introduction
    • What's Covered
    • Libraries and Data
  • Language of data
    • Welcome to the course!
      • – Loading data into R
    • Types of variables
      • – Identify variable types
    • Categorical data in R: factors
      • – Filtering based on a factor
      • – Complete filtering based on a factor
    • Discretize a variable
      • – Discretize a different variable
      • – Combining levels of a different factor
    • Visualizing numerical data
      • – Visualizing numerical and categorical data
  • Study types and cautionary tales
    • Observational studies and experiments
      • – Identify study type
      • – Identify the type of study
    • Random sampling and random assignment
      • – Random sampling or random assignment?
      • – Identify the scope of inference of study
    • Simpson’s paradox
      • – Number of males and females admitted
      • – Proportion of males admitted overall
      • – Proportion of males admitted for each department
      • – Contingency table results by group
    • Recap: Simpson’s paradox
  • Sampling strategies and experimental design
    • Sampling strategies
      • – Sampling strategies, determine which
    • Sampling in R
      • – Simple random sample in R
      • – Stratified sample in R
    • Principles of experimental design
      • – Identifying components of a study
      • – Experimental design terminology
      • – Connect blocking and stratifying
  • Case Study
    • Beauty in the classroom
      • – Inspect the data
      • – Identify type of study
      • – Sampling / experimental attributes
    • Variables in the data
      • – Identify variable types
      • – Recode a variable
      • – Create a scatterplot
      • – Create a scatterplot, with an added layer
    • Conclusion

Introduction

What's Covered

  • Language of data
  • Study types and cautionary tales
  • Sampling strategies and experimental design
  • Case study

Libraries and Data

# source("create_datasets.R")
# load('data/test_datasets.RData')

library(dplyr)
library(tidyr)
library(ggplot2)
library(openintro)
library(gapminder)

   


Language of data


Welcome to the course!

– Loading data into R

## email50 dataset is in the openintro library which has been loaded

# Load data
data(email50)

# View its structure
str(email50)
## 'data.frame':    50 obs. of  21 variables:
##  $ spam        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct, format: "2012-01-04 08:19:16" "2012-02-16 15:10:06" ...
##  $ image       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num  0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num  21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int  551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num  1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

Types of variables

  • Numerical (quantitative): numerical values
    • Continuous: infinite number of values within a given range, often measured
    • Discrete: specific set of numeric values that can be counted or enumerated, often counted
  • Categorical (qualitative): limited number of distinct categories
    • Ordinal: categories that have a natural ordering
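
A minimal sketch of how these variable types typically show up in R. The data frame `toy` and its columns are made up for illustration and are not part of the course data.

```r
# Hypothetical toy data, one column per variable type
toy <- data.frame(
  height_cm  = c(171.2, 165.8, 180.4),                 # numerical, continuous
  n_siblings = c(0L, 2L, 1L),                          # numerical, discrete
  eye_color  = factor(c("brown", "blue", "brown")),    # categorical, unordered
  size       = factor(c("small", "large", "medium"),
                      levels = c("small", "medium", "large"),
                      ordered = TRUE)                  # categorical, ordinal
)

# First storage class of each column
var_classes <- sapply(toy, function(col) class(col)[1])
print(var_classes)

# Ordered factors support comparisons between levels
is_below_large <- toy$size < "large"
print(is_below_large)
```

Storing an ordinal variable as an ordered factor keeps the level order available to summaries and plots, which a plain character column would lose.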

– Identify variable types

glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 08:19:16, 2012-02-16 15:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fctr> no, no, no, no, no, no, no, no, no, no, no, no, ...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fctr> small, big, none, small, small, small, small, sm...

Categorical data in R: factors

  • Often stored as factors in R
    • Important use: statistical modeling
    • Sometimes undesirable, sometimes essential
  • Common in subgroup analysis
    • Only interested in a subset of the data
    • Filter for specific levels of categorical variable

– Filtering based on a factor

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 15:10:06, 2012-02-04 18:26:09, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fctr> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fctr> big, big, big, big, big, big, big

– Complete filtering based on a factor

# Table of number variable
table(email50_big$number)
## 
##  none small   big 
##     0     0     7
# Drop levels
email50_big$number <- droplevels(email50_big$number)

# Another table of number variable
table(email50_big$number)
## 
## big 
##   7

Discretize a variable

  • Use ifelse() to convert a numerical variable to a categorical variable based on a set criterion

– Discretize a different variable

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50 <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
  
# Count emails in each category
table(email50$num_char_cat)
## 
## at or above median       below median 
##                 25                 25

– Combining levels of a different factor

# Create number_yn column in email50
email50 <- email50 %>%
  mutate(number_yn = ifelse(number == "none", "no", "yes"))

# Visualize number_yn
ggplot(email50, aes(x = number_yn)) +
  geom_bar()

Visualizing numerical data

– Visualizing numerical and categorical data

# ggplot2 is already loaded

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

   


Study types and cautionary tales


Observational studies and experiments

  • Observational study:
    • Collect data in a way that does not directly interfere with how the data arise
    • Only correlation can be inferred
  • Experiment:
    • Randomly assign subjects to various treatments
    • Causation can be inferred
  • In experiments, the decision to do something or not is not left up to the participants but is made by the researchers

– Identify study type

A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.

What type of study is this?

  • Experiment

– Identify the type of study

# Load data
data(gapminder)

# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
  • What type of study is this data from?
    • observational

Random sampling and random assignment

  • Random sampling:
    • Happens when subjects are selected from the population
    • Helps generalizability of results
  • Random assignment:
    • Assignment of subjects to various treatments
    • Helps infer causation from results
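
A minimal sketch of the two steps in base R. The population of 1,000 subject IDs, the sample size of 20, and the treatment labels "A" and "B" are all hypothetical.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 1000 subject IDs
population <- 1:1000

# Random sampling: choose which subjects take part in the study
study_subjects <- sample(population, size = 20)

# Random assignment: shuffle 10 "A" and 10 "B" labels across the chosen subjects
treatment <- sample(rep(c("A", "B"), each = 10))

study <- data.frame(id = study_subjects, treatment = treatment)
table(study$treatment)
```

The first `sample()` call decides who is in the study (generalizability); the second decides which treatment each subject receives (causation).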

– Random sampling or random assignment?

One of the early studies linking smoking and lung cancer compared patients who are already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, proportions of smokers for patients with and without lung cancer were compared.

Does this study employ random sampling and/or random assignment?

  • Neither random sampling
    • The study dealt only with patients who were already hospitalized, so it would not be appropriate to generalize the findings to the population as a whole.
  • nor random assignment
    • The conditions are not imposed on the patients by the people conducting the study
    • Had the researchers randomly chosen which subjects would smoke and which would not, that would have been random assignment.

– Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

  • The results of the study cannot be generalized to all people
  • A causal link between believing information is stored and memory can be inferred based on these results.

Simpson’s paradox

  • When the relationship between two variables is reversed once a third variable is introduced
  • e.g. the overall trend looks positive, but once a grouping variable is added it is clear that the trend is negative within each group.
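
A minimal sketch of such a reversal, using made-up numbers chosen only to make the effect obvious (the data frame `simpson` is not from the course):

```r
# Made-up data: two groups, each with a perfectly linear negative trend
simpson <- data.frame(
  group = rep(c("g1", "g2"), each = 4),
  x = c(1, 2, 3, 4, 6, 7, 8, 9),
  y = c(4, 3, 2, 1, 9, 8, 7, 6)
)

# Ignoring the group, x and y are positively correlated
overall_cor <- cor(simpson$x, simpson$y)

# Within each group, the correlation is negative
within_cor <- sapply(split(simpson, simpson$group),
                     function(d) cor(d$x, d$y))

print(overall_cor)
print(within_cor)
```

The pooled correlation is positive only because group g2 sits higher on both x and y than group g1; inside either group, y falls as x rises.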

– Number of males and females admitted

# dplyr and tidyr are already loaded

load('data/ucb_admit.Rdata')
str(ucb_admit)
## 'data.frame':    4526 obs. of  3 variables:
##  $ Admit : Factor w/ 2 levels "Admitted","Rejected": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender: Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept  : chr  "A" "A" "A" "A" ...
# Count number of male and female applicants admitted
ucb_counts <- ucb_admit %>%
  count(Gender, Admit)

# View result
ucb_counts
## # A tibble: 4 x 3
##   Gender    Admit     n
##   <fctr>   <fctr> <int>
## 1   Male Admitted  1198
## 2   Male Rejected  1493
## 3 Female Admitted   557
## 4 Female Rejected  1278
# Spread the output across columns
ucb_counts %>%
  spread(Admit, n)
## # A tibble: 2 x 3
##   Gender Admitted Rejected
## * <fctr>    <int>    <int>
## 1   Male     1198     1493
## 2 Female      557     1278

– Proportion of males admitted overall

str(ucb_admit)
## 'data.frame':    4526 obs. of  3 variables:
##  $ Admit : Factor w/ 2 levels "Admitted","Rejected": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender: Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept  : chr  "A" "A" "A" "A" ...
ucb_admit %>%
  # Table of counts of admission status and gender
  count(Gender, Admit) %>%
  # Spread output across columns based on admission status
  spread(Admit, n) %>%
  # Create new variable
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 2 x 4
##   Gender Admitted Rejected Perc_Admit
##   <fctr>    <int>    <int>      <dbl>
## 1   Male     1198     1493  0.4451877
## 2 Female      557     1278  0.3035422

– Proportion of males admitted for each department

str(ucb_admit)
## 'data.frame':    4526 obs. of  3 variables:
##  $ Admit : Factor w/ 2 levels "Admitted","Rejected": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender: Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept  : chr  "A" "A" "A" "A" ...
# Table of counts of admission status and gender for each department
admit_by_dept <- ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  spread(Admit, n)

# View result
admit_by_dept
## # A tibble: 12 x 4
##     Dept Gender Admitted Rejected
##  * <chr> <fctr>    <int>    <int>
##  1     A   Male      512      313
##  2     A Female       89       19
##  3     B   Male      353      207
##  4     B Female       17        8
##  5     C   Male      120      205
##  6     C Female      202      391
##  7     D   Male      138      279
##  8     D Female      131      244
##  9     E   Male       53      138
## 10     E Female       94      299
## 11     F   Male       22      351
## 12     F Female       24      317
# Proportion admitted, by department and gender
admit_by_dept %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 12 x 5
##     Dept Gender Admitted Rejected Perc_Admit
##    <chr> <fctr>    <int>    <int>      <dbl>
##  1     A   Male      512      313 0.62060606
##  2     A Female       89       19 0.82407407
##  3     B   Male      353      207 0.63035714
##  4     B Female       17        8 0.68000000
##  5     C   Male      120      205 0.36923077
##  6     C Female      202      391 0.34064081
##  7     D   Male      138      279 0.33093525
##  8     D Female      131      244 0.34933333
##  9     E   Male       53      138 0.27748691
## 10     E Female       94      299 0.23918575
## 11     F   Male       22      351 0.05898123
## 12     F Female       24      317 0.07038123

– Contingency table results by group

  • Within most departments, female applicants are more likely to be admitted.

Recap: Simpson’s paradox

  • Overall: males are more likely to be admitted
  • But within most departments: females more likely
  • When controlling for department, relationship between gender and admission status is reversed
  • Potential reason:
    • Women tended to apply to competitive departments with low admission rates
    • Men tended to apply to less competitive departments with high admission rates
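
The reversal can be checked arithmetically: each gender's overall admission rate is a weighted average of the department rates, weighted by where that gender applied. A sketch in base R, with the counts copied from the `admit_by_dept` table above (the variable names are mine):

```r
# Counts copied from the admit_by_dept table above
dept <- c("A", "B", "C", "D", "E", "F")
male_admitted   <- c(512, 353, 120, 138, 53, 22)
male_rejected   <- c(313, 207, 205, 279, 138, 351)
female_admitted <- c(89, 17, 202, 131, 94, 24)
female_rejected <- c(19, 8, 391, 244, 299, 317)

male_apps   <- male_admitted + male_rejected
female_apps <- female_admitted + female_rejected

# Share of each gender's applications that went to each department
male_share   <- male_apps / sum(male_apps)
female_share <- female_apps / sum(female_apps)

# Overall rate = sum over departments of (application share x department rate)
male_overall   <- sum(male_share * male_admitted / male_apps)
female_overall <- sum(female_share * female_admitted / female_apps)

data.frame(dept,
           male_share   = round(male_share, 2),
           female_share = round(female_share, 2))
c(male = male_overall, female = female_overall)
```

Roughly half of the male applications went to departments A and B, which have the highest admission rates, against under a tenth of the female applications. That difference in weights produces the overall gap even though women have the higher rate in four of the six departments.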

   


Sampling strategies and experimental design


Sampling strategies

  • Why not take a census?
    • Conducting a census is very resource intensive
    • (Nearly) impossible to collect data from all individuals, hence no guarantee of unbiased results. Some types of people may have more reason to avoid your survey.
    • Populations constantly change
  • Sampling is like tasting your soup as you make it to see if it's salty.
    • Stir well and then you can infer the taste of the whole pot from a small sample.
    • But there are many sampling strategies in the real world…

Sampling strategies:

  • Simple Random sample
    • each case is equally likely to be selected
  • Stratified sample
    • Divide the population into homogeneous groups and then randomly sample from within each group
    • e.g. using zipcode or income level as a stratum and sampling equal numbers of people from each.
  • Cluster sample
    • Divide the population into clusters, randomly pick a few clusters, then sample every observation within the chosen clusters.
    • Each cluster is heterogeneous internally and the clusters are similar to one another, so we can get away with sampling just a few of them.
    • e.g. cities could be clusters
  • Multistage sample
    • Like a cluster sample, but observations are randomly sampled within each chosen cluster rather than all being taken
    • Often used for economical reasons
    • e.g. divide a city into similar geographical regions and then sample from some of them to avoid having to travel to every region.
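
A minimal base R sketch of the cluster and multistage strategies. The population of 10 cities with 100 residents each, and the sample sizes, are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population: 10 cities (clusters) with 100 residents each
pop <- data.frame(
  city = rep(paste0("city_", 1:10), each = 100),
  resident = rep(1:100, times = 10),
  stringsAsFactors = FALSE
)

# Cluster sample: pick 3 cities at random, keep every resident in them
chosen_cities <- sample(unique(pop$city), size = 3)
cluster_sample <- pop[pop$city %in% chosen_cities, ]

# Multistage sample: within each chosen city, randomly keep 10 residents
multistage_sample <- do.call(rbind, lapply(
  split(cluster_sample, cluster_sample$city),
  function(d) d[sample(nrow(d), size = 10), ]
))

nrow(cluster_sample)     # 3 cities x 100 residents
nrow(multistage_sample)  # 3 cities x 10 residents
```

The only difference between the two is the second stage: the cluster sample takes everyone in the chosen cities, while the multistage sample draws again inside each one.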

– Sampling strategies, determine which

A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.

What sampling strategy has this company used?

  • Stratified sample

Sampling in R

  • simple random sample
    • dplyr: sample_n
  • stratified sample
    • dplyr: group_by the stratifying variable (region in the exercise below), then sample_n

– Simple random sample in R

load('data/us_regions.RData')
str(us_regions)
## 'data.frame':    51 obs. of  2 variables:
##  $ state : Factor w/ 51 levels "Alabama","Alaska",..: 7 20 22 30 40 46 31 33 39 14 ...
##  $ region: Factor w/ 4 levels "Midwest","Northeast",..: 2 2 2 2 2 2 2 2 2 1 ...
# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(8)

# Count states by region
states_srs %>%
  count(region)
## # A tibble: 4 x 2
##      region     n
##      <fctr> <int>
## 1   Midwest     2
## 2 Northeast     2
## 3     South     2
## 4      West     2

– Stratified sample in R

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(2)

# Count states by region
states_str %>%
  count(region)
## # A tibble: 4 x 2
## # Groups:   region [4]
##      region     n
##      <fctr> <int>
## 1   Midwest     2
## 2 Northeast     2
## 3     South     2
## 4      West     2
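
The two exercises above use `sample_n()`. In dplyr 1.0.0 and later that function is marked as superseded in favour of `slice_sample()`. A sketch of both strategies with the newer verb follows. It uses the `state.name` and `state.region` vectors that ship with base R as a stand-in for `us_regions`; that substitution is an assumption, and the built-in data has 50 states and a "North Central" region rather than "Midwest".

```r
library(dplyr)

set.seed(123)  # arbitrary seed for reproducibility

# Stand-in for us_regions, built from base R's datasets
states_df <- data.frame(state = state.name, region = state.region)

# Simple random sample of 8 states
states_srs <- states_df %>%
  slice_sample(n = 8)

# Stratified sample: 2 states from each region
states_str <- states_df %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

region_counts <- count(states_str, region)
region_counts
```

Only the stratified version guarantees two states per region; the simple random sample can over- or under-represent a region by chance.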

Principles of experimental design

  • Control: compare treatment of interest to a control group
  • Randomize: randomly assign subjects to treatments
  • Replicate: collect a sufficiently large sample within a study, or replicate the entire study
  • Block: account for the potential effect of confounding variables
    • Group subjects into blocks based on these variables
    • Randomize within each block to treatment group
    • e.g. male and female, or prior programming experience
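
A minimal sketch of randomizing within blocks in base R. The 12 subjects, the blocking variable `experience`, and the treatment labels are hypothetical.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical subjects with one blocking variable
subjects <- data.frame(
  id = 1:12,
  experience = rep(c("none", "some"), each = 6),
  stringsAsFactors = FALSE
)

# Within each block, shuffle an equal number of each treatment label
subjects$treatment <- NA_character_
for (block in unique(subjects$experience)) {
  in_block <- subjects$experience == block
  labels <- rep(c("control", "treatment"), length.out = sum(in_block))
  subjects$treatment[in_block] <- sample(labels)
}

block_table <- table(subjects$experience, subjects$treatment)
block_table
```

Because the shuffle happens separately inside each block, both treatment groups end up with the same mix of experience levels, so experience cannot explain a difference between them.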

– Identifying components of a study

A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

What variables are involved in this study?

  • 2 explanatory variables (light and noise)
  • 1 blocking variable (gender)
  • 1 response variable (exam performance)

– Experimental design terminology

Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

– Connect blocking and stratifying

  • In random sampling, you use stratifying to control for a variable.
  • In random assignment, you use blocking to achieve the same goal.

   


Case Study


Beauty in the classroom

– Inspect the data

# Inspect evals
load('data/evals.RData')
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity     <fctr> minority, minority, minority, minority, not min...
## $ gender        <fctr> female, female, female, female, male, male, mal...
## $ language      <fctr> english, english, english, english, english, en...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs     <fctr> single, single, single, single, multiple, multi...
## $ cls_credits   <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color     <fctr> color, color, color, color, color, color, color...

– Identify type of study

  • This is an observational study

– Sampling / experimental attributes

  • The data from this study were gathered by randomly sampling classes

Variables in the data

  • score - ranges from 1 to 5, with 1 being a poor evaluation
  • rank - the type of position of the professor
  • cls_ - summary information about the class
  • bty_ - beauty rating from females and males, lower and upper level (junior or senior)
  • pic_ - data on the outfit in the photo and whether the photo was in color or black and white

– Identify variable types

# Inspect variable types
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity     <fctr> minority, minority, minority, minority, not min...
## $ gender        <fctr> female, female, female, female, male, male, mal...
## $ language      <fctr> english, english, english, english, english, en...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs     <fctr> single, single, single, single, multiple, multi...
## $ cls_credits   <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color     <fctr> color, color, color, color, color, color, color...
# Remove non-factor variables from this vector
cat_vars <- c("rank", "ethnicity", "gender", "language",
              "cls_level", "cls_profs", "cls_credits",
              "pic_outfit", "pic_color")

– Recode a variable

# Recode cls_students as cls_type: evals
evals <- evals %>%
  # Create new variable
  mutate(cls_type = ifelse(cls_students <= 18, 'small', 
                      ifelse(cls_students <= 59, 'midsize', 'large')))

table(evals$cls_type)
## 
##   large midsize   small 
##     117     233     113
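
Nested `ifelse()` calls get hard to read as the number of categories grows. A sketch of the same recoding with base R's `cut()`, offered as an alternative rather than as the course's method. The vector below reuses the first few `cls_students` values from the `glimpse()` output above and adds hypothetical boundary cases to check the cut points.

```r
# First few cls_students values from the glimpse() output above,
# plus hypothetical boundary cases (18, 19, 59, 60)
cls_students <- c(43, 125, 125, 123, 20, 40, 18, 19, 59, 60)

# Nested ifelse(), as in the exercise
by_ifelse <- ifelse(cls_students <= 18, "small",
               ifelse(cls_students <= 59, "midsize", "large"))

# cut() with right-closed intervals: (-Inf, 18], (18, 59], (59, Inf]
by_cut <- cut(cls_students,
              breaks = c(-Inf, 18, 59, Inf),
              labels = c("small", "midsize", "large"))

# Check that the two approaches agree element by element
all(by_ifelse == as.character(by_cut))
table(by_cut)
```

Unlike `ifelse()`, which returns a character vector that `table()` sorts alphabetically, `cut()` returns a factor whose levels stay in size order.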

– Create a scatterplot

# Scatterplot of score vs. bty_avg
ggplot(evals, aes(bty_avg, score)) +
  geom_point()

– Create a scatterplot, with an added layer

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(bty_avg, score, color = cls_type)) +
  geom_point()

Conclusion

  • This was a pretty simple class.
  • It could have used a lot more examples and actual work
  • It was more like a university lecture than a real online class with examples and hands-on work.
    • This was taught by a university professor so maybe that is why.