# Install packages
#install.packages("dplyr") #Once it's installed, you won't have to run this code again
#install.packages("ggplot2")
#install.packages("openintro")
# Load packages
library(openintro) #for the use of email50 and county data
library(dplyr) #for the use of dplyr functions such as mutate
library(ggplot2) #for use of ggplot2 functions such as ggplot()
This chapter introduces the terminology of datasets and data frames in R.
# Load data
data(email50)
# View its structure
str(email50)
## 'data.frame': 50 obs. of 21 variables:
## $ spam : num 0 0 1 0 0 0 0 0 0 0 ...
## $ to_multiple : num 0 0 0 0 0 0 0 0 0 0 ...
## $ from : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int 0 0 4 0 0 0 0 0 1 0 ...
## $ sent_email : num 1 0 0 0 0 0 0 1 1 0 ...
## $ time : POSIXct, format: "2012-01-04 13:19:16" "2012-02-16 20:10:06" ...
## $ image : num 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num 0 0 2 0 0 0 0 0 0 0 ...
## $ dollar : num 0 0 0 0 9 0 0 0 0 23 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ viagra : num 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num 0 0 0 0 1 0 0 0 0 0 ...
## $ num_char : num 21.705 7.011 0.631 2.454 41.623 ...
## $ line_breaks : int 551 183 28 61 1088 5 17 88 242 578 ...
## $ format : num 1 1 0 0 1 0 0 1 1 1 ...
## $ re_subj : num 1 0 0 0 0 0 0 1 1 0 ...
## $ exclaim_subj: num 0 0 0 0 0 0 0 0 1 0 ...
## $ urgent_subj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num 8 1 2 1 43 0 0 2 22 3 ...
## $ number : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...
The glimpse() function from dplyr is an alternative to str() for previewing a dataset. In addition to telling you the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.
# Glimpse email50
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time <dttm> 2012-01-04 13:19:16, 2012-02-16 20:10:06, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner <fctr> no, no, no, no, no, no, no, no, no, no, no, no, ...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number <fctr> small, big, none, small, small, small, small, sm...
Categorical data are often stored as factors in R. Here we practice working with a factor variable, number, which tells you what type of number (none, small, or big) an email contains.
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
filter(number == "big")
# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time <dttm> 2012-02-16 20:10:06, 2012-02-04 23:26:09, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner <fctr> no, no, yes, no, no, no, no
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks <int> 183, 198, 712, 692, 140, 512, 225
## $ format <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number <fctr> big, big, big, big, big, big, big
The droplevels() function removes unused levels of factor variables from your dataset. Before dropping levels, it’s often useful to determine which levels are unused (i.e., have zero observations) with the table() function.
# Table of number variable
table(email50_big$number)
##
## none small big
## 0 0 7
# Drop levels
email50_big$number <- droplevels(email50_big$number)
# Another table of number variable
table(email50_big$number)
##
## big
## 7
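The same behaviour can be reproduced on a tiny factor without loading any package. The sketch below uses a made-up five-element vector (not the email50 data) to show that subsetting keeps all original levels and that droplevels() removes the empty ones.

```r
# Hypothetical toy factor with the same three levels as email50$number
sizes <- factor(c("small", "big", "none", "small", "big"),
                levels = c("none", "small", "big"))

# Subsetting keeps every original level, even those with no observations left
kept <- sizes[sizes == "big"]
levels(kept)
table(kept)

# droplevels() discards the levels that no longer occur
dropped <- droplevels(kept)
levels(dropped)
table(dropped)
```

Unused levels matter because functions such as table() and many plotting functions still show them as empty categories.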
Create a categorical version of the num_char variable in the email50 dataset, which tells you the number of characters in an email, in thousands. This new variable will have two levels—“below median” and “at or above median”—depending on whether an email has less than the median number of characters or equal to or more than that value.
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
# Create num_char_cat variable in email50
email50 <- email50 %>%
mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
# Count emails in each category
table(email50$num_char_cat)
##
## at or above median below median
## 25 25
Here exactly half of the emails fall below the median and half fall at or above it, because the median marks the 50th percentile, or midpoint, of a distribution.
A different way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels “none”, “small”, and “big”, but suppose you’re only interested in whether an email contains a number.
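As a small check on the median split, the sketch below applies the same ifelse() rule to two short vectors. The first uses the first six num_char values shown in the output above. The second is a made-up vector with ties at the median, included to illustrate that the split is not always even, because values equal to the median all land in the "at or above median" group.

```r
# First six num_char values from the glimpse() output above
num_char <- c(21.705, 7.011, 0.631, 2.454, 41.623, 0.057)
med <- median(num_char)
split_even <- ifelse(num_char < med, "below median", "at or above median")
table(split_even)

# Hypothetical vector with several values tied at the median
tied <- c(1, 2, 2, 2, 3)
split_tied <- ifelse(tied < median(tied), "below median", "at or above median")
table(split_tied)
```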
# Create number_yn column in email50
email50 <- email50 %>%
mutate(number_yn = ifelse(number == "none", "no", "yes"))
# Visualize number_yn
ggplot(email50, aes(x = number_yn)) +
geom_bar()
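The ifelse() approach above returns a character column. If a factor is preferred, base R can merge levels directly by assigning a named list to levels(). The sketch below is an alternative, not the method used in these notes, and it runs on a toy vector built from the first five number values shown earlier.

```r
# First five values of email50$number, as printed by glimpse() above
number <- factor(c("small", "big", "none", "small", "small"),
                 levels = c("none", "small", "big"))

# Merge levels: "none" becomes "no"; "small" and "big" both become "yes"
number_yn <- number
levels(number_yn) <- list(no = "none", yes = c("small", "big"))

# Cross-tabulate old and new variables to confirm the mapping
table(number, number_yn)

# Same labels as the ifelse() version, but stored as a factor
from_ifelse <- ifelse(number == "none", "no", "yes")
```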
Visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam. In the ggplot() function, the first argument gives the dataset, then the aesthetics map the variables to certain features of the plot, and finally the geom_*() layer specifies the type of plot you want to make.
# Load ggplot2
library(ggplot2)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
geom_point()
This chapter covers observational studies and experiments, scope of inference, and Simpson’s paradox.
Observational studies: Only association (correlation), not causation, can be inferred
Experiments: Causation can be inferred
# Load gapminder R package
library(gapminder)
# Load data
data(gapminder)
# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Identify type of study
type_of_study <- "observational"
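Random sampling: the results can be generalized to the population, but without random assignment only association can be inferred. Random assignment (experiments): causation can be inferred, but without random sampling the results cannot be generalized to the population.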
Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.
Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.
The results of the study cannot be generalized to all people and a causal link between believing information is stored and memory can be inferred based on these results.
There is no random sampling since the subjects of the study were volunteers, so the results cannot be generalized to all people. However, due to random assignment, a causal link between believing information is stored and memory can be inferred based on these results.
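Random assignment of this kind can be sketched in a few lines of base R. The number of volunteers (20) and the group labels below are assumptions made for illustration; the study description does not give the actual sample size.

```r
set.seed(2024)  # fixed seed so the assignment is reproducible

# Hypothetical roster of 20 volunteers
volunteers <- data.frame(id = 1:20)

# Build a vector with 10 "saved" and 10 "erased" labels, then shuffle it,
# so each volunteer is equally likely to end up in either group
labels <- rep(c("saved", "erased"), each = nrow(volunteers) / 2)
volunteers$group <- sample(labels)

table(volunteers$group)
```

Shuffling a balanced label vector guarantees equal group sizes, whereas drawing each label independently would only balance the groups on average.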
Simpson’s Paradox
Calculate the number of males and females admitted.
# Import data
ucb_admit <- read.csv("~/resources/rstudio/ucb_admit.csv")
ucb_admit$Dept <- as.character(ucb_admit$Dept)
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit <fctr> Admitted, Admitted, Admitted, Admitted, Admitted, Admi...
## $ Gender <fctr> Male, Male, Male, Male, Male, Male, Male, Male, Male, ...
## $ Dept <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
summary(ucb_admit)
## Admit Gender Dept
## Admitted:1755 Female:1835 Length:4526
## Rejected:2771 Male :2691 Class :character
## Mode :character
# Load packages
library(dplyr)
library(tidyr)
# Count number of male and female applicants admitted
ucb_counts <- ucb_admit %>%
count(Admit, Gender)
# View result
ucb_counts
## Source: local data frame [4 x 3]
## Groups: Admit [?]
##
## Admit Gender n
## <fctr> <fctr> <int>
## 1 Admitted Female 557
## 2 Admitted Male 1198
## 3 Rejected Female 1278
## 4 Rejected Male 1493
# Spread the output across columns
ucb_counts %>%
spread(Admit, n)
## # A tibble: 2 × 3
## Gender Admitted Rejected
## * <fctr> <int> <int>
## 1 Female 557 1278
## 2 Male 1198 1493
Calculate the proportion of males (and females) admitted.
ucb_admit %>%
# Table of counts of admission status and gender
count(Admit, Gender) %>%
# Spread output across columns based on admission status
spread(Admit, n) %>%
# Create new variable
mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 2 × 4
## Gender Admitted Rejected Perc_Admit
## <fctr> <int> <int> <dbl>
## 1 Female 557 1278 0.3035422
## 2 Male 1198 1493 0.4451877
Make a table similar to the one constructed earlier, except first, group the data by department. Then, use this table to calculate the proportion of males admitted in each department.
# Table of counts of admission status and gender for each department
admit_by_dept <- ucb_admit %>%
count(Admit, Dept, Gender) %>%
spread(Admit, n)
# View result
admit_by_dept
## Source: local data frame [12 x 4]
## Groups: Dept [6]
##
## Dept Gender Admitted Rejected
## * <chr> <fctr> <int> <int>
## 1 A Female 89 19
## 2 A Male 512 313
## 3 B Female 17 8
## 4 B Male 353 207
## 5 C Female 202 391
## 6 C Male 120 205
## 7 D Female 131 244
## 8 D Male 138 279
## 9 E Female 94 299
## 10 E Male 53 138
## 11 F Female 24 317
## 12 F Male 22 351
# Percentage of those admitted to each department
admit_by_dept %>%
mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## Source: local data frame [12 x 5]
## Groups: Dept [6]
##
## Dept Gender Admitted Rejected Perc_Admit
## <chr> <fctr> <int> <int> <dbl>
## 1 A Female 89 19 0.82407407
## 2 A Male 512 313 0.62060606
## 3 B Female 17 8 0.68000000
## 4 B Male 353 207 0.63035714
## 5 C Female 202 391 0.34064081
## 6 C Male 120 205 0.36923077
## 7 D Female 131 244 0.34933333
## 8 D Male 138 279 0.33093525
## 9 E Female 94 299 0.23918575
## 10 E Male 53 138 0.27748691
## 11 F Female 24 317 0.07038123
## 12 F Male 22 351 0.05898123
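To see the paradox in one place, the sketch below rebuilds the department table from the counts printed above (so it does not need the ucb_admit.csv file) and compares the overall admission rates with the per-department rates, using base R only.

```r
# Counts copied from the admit_by_dept output above
admit_by_dept <- data.frame(
  Dept     = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Gender   = rep(c("Female", "Male"), times = 6),
  Admitted = c(89, 512, 17, 353, 202, 120, 131, 138, 94, 53, 24, 22),
  Rejected = c(19, 313, 8, 207, 391, 205, 244, 279, 299, 138, 317, 351)
)
admit_by_dept$Perc_Admit <- with(admit_by_dept, Admitted / (Admitted + Rejected))

# Overall admission rate by gender, pooling all departments
overall <- aggregate(cbind(Admitted, Rejected) ~ Gender,
                     data = admit_by_dept, FUN = sum)
overall$Perc_Admit <- with(overall, Admitted / (Admitted + Rejected))
overall

# Per-department comparison: TRUE where females have the higher rate
female <- admit_by_dept$Perc_Admit[admit_by_dept$Gender == "Female"]
male   <- admit_by_dept$Perc_Admit[admit_by_dept$Gender == "Male"]
names(female) <- unique(admit_by_dept$Dept)
names(male)   <- unique(admit_by_dept$Dept)
female > male
```

Pooled over departments, males have the higher admission rate, yet within most departments females do. The counts show why: most male applications went to departments A and B, which admit a large share of applicants, while most female applications went to departments C through F, which admit far fewer.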
Census: Why not collect data from the whole population? It’s cost-prohibitive. It can be impossible to collect data from all individuals, and if the individuals who are missed differ from the rest of the population, the results would be biased. Populations also change constantly.
Sampling is like cooking. You take a spoonful of soup to get an idea of the dish as a whole, e.g., whether it’s too salty. You wouldn’t eat a whole pot of soup. Tasting the spoonful is an exploratory analysis. If you then generalize and conclude that the entire soup needs more salt, that’s making an inference. For your inference to be valid, the spoonful you tasted (your sample) should be representative of the entire pot (your population).
Sampling methods:
simple random sampling: we randomly select a sample such that each case is equally likely to be selected.
stratified sampling: we first divide the population into homogeneous groups called strata, and then randomly sample from each stratum. For example, stratified sampling may be used if we want to make sure that low-, medium-, and high-income classes are equally represented in a study.
cluster sampling: we divide the population into clusters, randomly sample a few clusters, and use all observations within these clusters. While clusters are heterogeneous within themselves, each cluster is similar to the other clusters, so we can get away with sampling just a few clusters.
multistage sampling: we add another step to cluster sampling. Just like in cluster sampling, we divide the population into clusters and randomly sample a few clusters, but instead of using all observations within these clusters, we randomly sample observations from within them. Multistage sampling and cluster sampling are often used for economic reasons. For example, one might divide a city into geographical regions that on average are similar to each other and then sample randomly from within a few randomly picked regions in order to avoid traveling to all regions.
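The four methods can be compared side by side on a toy population. The sketch below is base R only; the population of 120 units in four regions and the sample sizes are made up for illustration, and the same regions stand in for strata and for clusters purely to keep the example short.

```r
set.seed(1)

# Hypothetical population: 120 units, 30 in each of four regions
pop <- data.frame(id = 1:120,
                  region = rep(c("North", "South", "East", "West"), each = 30))

# Simple random sample: 12 units, each equally likely to be chosen
srs <- pop[sample(nrow(pop), 12), ]

# Stratified sample: 3 units drawn at random from every region
strat_ids <- unlist(lapply(split(pop$id, pop$region),
                           function(ids) sample(ids, 3)))
strat <- pop[pop$id %in% strat_ids, ]

# Cluster sample: pick 2 regions at random and keep all of their units
picked <- sample(unique(pop$region), 2)
clus <- pop[pop$region %in% picked, ]

# Multistage sample: within the 2 picked regions, draw 6 units each
multi_ids <- unlist(lapply(picked,
                           function(r) sample(pop$id[pop$region == r], 6)))
multi <- pop[pop$id %in% multi_ids, ]

table(strat$region)
table(multi$region)
```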
Sampling in R:
Suppose we want to collect data from counties in the United States, but we don’t have the resources to collect data from all of them. Conveniently, however, the list of all counties is contained in the openintro R package.
# Load county data
data(county)
# Remove DC
county_noDC <- county %>%
filter(state != "District of Columbia") %>%
droplevels()
Simple random sample
# Simple random sample of 150 counties
county_srs <- county_noDC %>%
sample_n(size = 150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name <fctr> Staunton city, Washington County, Titus County,...
## $ state <fctr> Virginia, Nebraska, Texas, Wisconsin, Kansas, U...
## $ pop2000 <dbl> 23853, 18780, 28118, 18643, 59482, 33779, 32080,...
## $ pop2010 <dbl> 23746, 20234, 32334, 20875, 65880, 46163, 34273,...
## $ fed_spend <dbl> 10.939948, 5.484531, 6.161533, 7.388216, 5.04629...
## $ poverty <dbl> 15.2, 4.4, 17.9, 12.6, 7.3, 20.9, 14.8, 14.1, 15...
## $ homeownership <dbl> 60.4, 81.5, 70.2, 82.2, 77.8, 63.2, 79.3, 66.4, ...
## $ multiunit <dbl> 27.9, 12.0, 10.0, 4.7, 8.6, 26.3, 3.5, 4.2, 12.0...
## $ income <dbl> 24077, 27884, 17520, 21917, 26436, 16898, 20774,...
## $ med_income <dbl> 42724, 61940, 39423, 39885, 56290, 42247, 42282,...
# State distribution of SRS counties
county_srs %>%
group_by(state) %>%
count()
## # A tibble: 40 × 2
## state n
## <fctr> <int>
## 1 Alabama 2
## 2 Alaska 1
## 3 Arkansas 4
## 4 California 4
## 5 Colorado 5
## 6 Florida 6
## 7 Georgia 10
## 8 Idaho 2
## 9 Illinois 3
## 10 Indiana 2
## # ... with 30 more rows
Stratified Sampling
# Stratified sample of 150 counties, each state is a stratum
county_str <- county_noDC %>%
group_by(state) %>%
sample_n(size = 3) # 3 counties from each of the 50 states
glimpse(county_str)
## Observations: 150
## Variables: 10
## $ name <fctr> Mobile County, Cleburne County, Clarke County, ...
## $ state <fctr> Alabama, Alabama, Alabama, Alaska, Alaska, Alas...
## $ pop2000 <dbl> 399843, 14123, 27867, 6146, 8835, 5465, 155032, ...
## $ pop2010 <dbl> 412992, 14972, 25833, 5559, 8881, 5561, 200186, ...
## $ fed_spend <dbl> 10.605181, 6.840035, 9.781442, 10.248966, 11.194...
## $ poverty <dbl> 19.2, 17.1, 29.2, 14.0, 7.0, 12.6, 16.1, 13.9, 1...
## $ homeownership <dbl> 68.4, 74.9, 80.0, 69.0, 55.9, 36.3, 71.5, 66.3, ...
## $ multiunit <dbl> 17.7, 5.3, 6.3, 9.7, 24.4, 30.9, 9.8, 25.1, 6.1,...
## $ income <dbl> 21548, 17490, 17372, 24193, 29982, 29920, 21523,...
## $ med_income <dbl> 40996, 36077, 27439, 45728, 62024, 72917, 39785,...
Collect some data from a sample of eight states:
# Import us_regions
us_regions <- read.csv("~/resources/rstudio/us_regions.csv")
# Simple random sample: states_srs
states_srs <- us_regions %>%
sample_n(size = 8)
# Count states by region
states_srs %>%
group_by(region) %>%
count()
## # A tibble: 3 × 2
## region n
## <fctr> <int>
## 1 Northeast 2
## 2 South 5
## 3 West 1
With stratified sampling, select an equal number of states from each region:
# Stratified sample
states_str <- us_regions %>%
group_by(region) %>%
sample_n(size = 2)
# Count states by region
states_str %>%
group_by(region) %>%
count()
## # A tibble: 4 × 2
## region n
## <fctr> <int>
## 1 Midwest 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
Principles of experimental design
Control: compare treatment of interest to a control group.
Randomize: randomly assign subjects to treatments.
Replicate: collect a sufficiently large sample within a study, or replicate the entire study.
Block: account for the potential impact of confounding variables. Group subjects into blocks based on these variables, then randomize within each block to treatment groups.
Example: A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.
2 explanatory variables: light and noise
1 blocking variable: gender
1 response variable: exam performance
Control variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.
In random sampling, you use stratification to control for a variable. In random assignment, you use blocking to achieve the same goal.
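Blocking can be sketched as "randomize separately within each block". The example below assumes 16 hypothetical students (8 per gender) and four made-up light/noise conditions; none of these numbers come from the study description.

```r
set.seed(42)

# Hypothetical subjects: 8 females and 8 males
subjects <- data.frame(id = 1:16,
                       gender = rep(c("female", "male"), each = 8))

# Hypothetical treatment conditions (combinations of light and noise)
conditions <- c("bright/quiet", "bright/noisy", "dim/quiet", "dim/noisy")

# Randomize within each gender block so every condition gets
# the same number of females and the same number of males
subjects$condition <- NA_character_
for (g in unique(subjects$gender)) {
  in_block <- subjects$gender == g
  subjects$condition[in_block] <-
    sample(rep(conditions, length.out = sum(in_block)))
}

table(subjects$gender, subjects$condition)
```

Because the shuffle happens inside each block, gender is balanced across conditions by construction rather than by chance.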
# Import data
evals <- read.csv("~/resources/rstudio/evals.csv")
# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity <fctr> minority, minority, minority, minority, not min...
## $ gender <fctr> female, female, female, female, male, male, mal...
## $ language <fctr> english, english, english, english, english, en...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs <fctr> single, single, single, single, multiple, multi...
## $ cls_credits <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color <fctr> color, color, color, color, color, color, color...
What type of study is this? It’s an observational study.
The data from this study were gathered by randomly selecting classes.
# Inspect variable types
glimpse(evals)
# Names of the categorical (factor) variables in evals
cat_vars <- c("rank", "ethnicity", "gender", "language", "cls_level", "cls_profs",
"cls_credits", "pic_outfit", "pic_color")
The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is small, midsize, or large.
# Recode cls_students as cls_type: evals
evals <- evals %>%
# Create new variable
mutate(cls_type = ifelse(cls_students <= 18, "small",
ifelse(cls_students >= 60, "large", "midsize")))
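Nested ifelse() calls get hard to read as the number of categories grows. One alternative, shown here as a sketch rather than as the approach used in these notes, is base R's cut(). The vector below takes the first eleven cls_students values from the output above and adds the two boundary values 18 and 60 as hypothetical extra cases. The breaks assume class sizes are whole numbers.

```r
# First eleven cls_students values from glimpse(evals), plus boundary cases
cls_students <- c(43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 18, 60)

# Nested ifelse(), as in the notes
nested <- ifelse(cls_students <= 18, "small",
                 ifelse(cls_students >= 60, "large", "midsize"))

# cut() with right-closed intervals: (-Inf, 18], (18, 59], (59, Inf)
binned <- cut(cls_students,
              breaks = c(-Inf, 18, 59, Inf),
              labels = c("small", "midsize", "large"))

# Cross-tabulate to confirm both recodings agree
table(nested, binned)
```

cut() returns a factor whose levels are ordered small, midsize, large, which also gives plots a sensible category order.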
The bty_avg variable shows the average beauty rating of the professor given by the six students who were asked to rate the attractiveness of these faculty members. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point()
Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).
# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(x = bty_avg, y = score, color = cls_type)) +
geom_point()
library(openintro)
library(dplyr)
county_srs <- county_noDC %>%
sample_n(size=150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name <fctr> Lamar County, Merrick County, Mason County, Was...
## $ state <fctr> Alabama, Nebraska, West Virginia, Wisconsin, Ok...
## $ pop2000 <dbl> 15904, 8204, 25957, 16036, 12623, 31839, 21139, ...
## $ pop2010 <dbl> 14564, 7845, 27324, 15911, 13488, 31499, 22185, ...
## $ fed_spend <dbl> 9.965394, 8.996048, 8.014786, 10.611212, 9.55516...
## $ poverty <dbl> 18.5, 10.7, 18.9, 13.1, 15.5, 20.6, 17.7, 8.5, 1...
## $ homeownership <dbl> 75.1, 73.3, 78.1, 82.1, 79.8, 79.4, 70.1, 82.2, ...
## $ multiunit <dbl> 9.0, 5.0, 6.9, 6.7, 4.7, 6.9, 7.3, 8.0, 6.4, 11....
## $ income <dbl> 19789, 21819, 19609, 23221, 20634, 18538, 16345,...
## $ med_income <dbl> 33887, 46116, 36027, 41641, 40870, 36750, 36606,...
name
state
pop2000
pop2010
fed_spend
poverty
homeownership
multiunit
income
med_income
This is an observational study, because there is no treatment being imposed on the subjects.
The sample above is a random sample.
You can only infer association because, in an observational study, there could be other (confounding) factors that are relevant. Causation can only be inferred from an experiment.
Yes. Because the counties were randomly sampled, identifiable trends in the data can be generalized to the population as a whole, with confidence in the generalization depending on the size of the sample.
ggplot(county_srs, aes(x = income, y = fed_spend)) +
geom_point()
Federal spending seemed to increase steadily with income, despite a few outliers, especially one county whose federal spending is more than twice that of most counties with a similar income.
Can you think of any confounding variable? Briefly discuss. A confounding variable is a variable that is not taken into account but is related to both the explanatory and the response variable, so it could distort the results. A confounding variable for this situation could be the employment rate: counties with higher employment rates tend to have higher incomes than counties with low employment rates, even if a low-employment county received more federal spending.
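A short simulation can make the idea of confounding concrete. Everything below is simulated: z is a hypothetical confounder that drives both x and y, and x has no direct effect on y. The variables do not represent the county data.

```r
set.seed(123)
n <- 5000

# Hypothetical confounder that influences both variables
z <- rnorm(n)
x <- z + rnorm(n)  # depends on z only
y <- z + rnorm(n)  # depends on z only, not on x

# x and y are clearly correlated even though neither causes the other
raw_cor <- cor(x, y)

# After removing the part of each variable explained by z,
# the remaining association is close to zero
adj_cor <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = raw_cor, adjusted_for_z = adj_cor)
```

In an observational study the confounder is often unmeasured, so this adjustment is not available, which is why only association can be inferred.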