You can load a dataset that ships with a package using data(), and inspect its structure (dimensions, column names, and column types) with str().
# Load packages
library(openintro) # provides the email50 and county datasets
library(dplyr) # provides glimpse(), filter(), mutate(), and the %>% pipe
# Load data
data(email50)
# View its structure
str(email50)
## 'data.frame': 50 obs. of 21 variables:
## $ spam : num 0 0 1 0 0 0 0 0 0 0 ...
## $ to_multiple : num 0 0 0 0 0 0 0 0 0 0 ...
## $ from : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int 0 0 4 0 0 0 0 0 1 0 ...
## $ sent_email : num 1 0 0 0 0 0 0 1 1 0 ...
## $ time : POSIXct, format: "2012-01-04 13:19:16" "2012-02-16 20:10:06" ...
## $ image : num 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num 0 0 2 0 0 0 0 0 0 0 ...
## $ dollar : num 0 0 0 0 9 0 0 0 0 23 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ viagra : num 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num 0 0 0 0 1 0 0 0 0 0 ...
## $ num_char : num 21.705 7.011 0.631 2.454 41.623 ...
## $ line_breaks : int 551 183 28 61 1088 5 17 88 242 578 ...
## $ format : num 1 1 0 0 1 0 0 1 1 1 ...
## $ re_subj : num 1 0 0 0 0 0 0 1 1 0 ...
## $ exclaim_subj: num 0 0 0 0 0 0 0 0 1 0 ...
## $ urgent_subj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num 8 1 2 1 43 0 0 2 22 3 ...
## $ number : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...
The glimpse() function from dplyr reports the number of observations and variables, the name and type of each column, and a compact preview of each column's values.
# Glimpse email50
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time <dttm> 2012-01-04 13:19:16, 2012-02-16 20:10:06, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner <fctr> no, no, no, no, no, no, no, no, no, no, no, no, ...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number <fctr> small, big, none, small, small, small, small, sm...
You can keep only the rows that match a given level of a factor variable with filter().
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time <dttm> 2012-02-16 20:10:06, 2012-02-04 23:26:09, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner <fctr> no, no, yes, no, no, no, no
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks <int> 183, 198, 712, 692, 140, 512, 225
## $ format <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number <fctr> big, big, big, big, big, big, big
You can use the droplevels() function to remove unused levels of factor variables from your dataset. You can determine which levels are unused (i.e., have zero observations) with the table() function.
# Table of number variable
table(email50_big$number)
##
## none small big
## 0 0 7
# Drop levels
email50_big$number <- droplevels(email50_big$number)
# Another table of number variable
table(email50_big$number)
##
## big
## 7
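droplevels() also accepts a whole data frame, in which case it drops unused levels from every factor column at once. The sketch below uses a small made-up data frame (emails, big, and big_clean are hypothetical names, not part of email50) to show the difference:

```r
# Made-up data frame with two factor columns
emails <- data.frame(
  number = factor(c("small", "big", "none", "big"),
                  levels = c("none", "small", "big")),
  winner = factor(c("no", "no", "yes", "no"),
                  levels = c("no", "yes"))
)

# Subsetting keeps all original levels, even those with no rows left
big <- emails[emails$number == "big", ]
levels(big$number)        # none, small, big

# droplevels() on the data frame cleans every factor column
big_clean <- droplevels(big)
levels(big_clean$number)  # big
levels(big_clean$winner)  # no
```

Dropping levels on the whole data frame avoids repeating the call for each factor column.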
You can create a categorical version of a numerical variable with mutate() and ifelse(), here by splitting num_char at its median.
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
# Create num_char_cat variable in email50
email50 <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char,
                               "below median", "at or above median"))
# Count emails in each category
table(email50$num_char_cat)
##
## at or above median below median
## 25 25
You can create a new variable based on an existing one by combining levels of a categorical variable.
# Load package ggplot2
library(ggplot2)
# Create number_yn column in email50
email50 <- email50 %>%
  mutate(number_yn = ifelse(number == "none", "no", "yes"))
# Visualize number_yn
ggplot(email50, aes(x = number_yn)) +
  geom_bar()
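When collapsing levels, a cross-tabulation of the original and recoded variables is a quick way to confirm that every level landed where intended. A minimal sketch with a made-up factor (not the email50 data):

```r
# Made-up factor with the same levels as email50$number
number <- factor(c("small", "big", "none", "small", "none"),
                 levels = c("none", "small", "big"))

# Recode: "none" becomes "no", every other level becomes "yes"
number_yn <- ifelse(number == "none", "no", "yes")

# Rows are the original levels, columns are the recoded values
check <- table(number, number_yn)
check
```

Each row of the table should have all of its count in a single column; a row split across both columns would indicate a recoding mistake.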
You can create scatterplots with ggplot() and geom_point(); mapping color to factor(spam) distinguishes spam from non-spam messages.
# Load ggplot2
library(ggplot2)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()
A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.
What type of study is this?
Experiment, because the researchers randomly assign each volunteer to a font.
You can identify the type of study by viewing the data it generated.
# Load gapminder package
library(gapminder)
# Load data
data(gapminder)
# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Identify type of study
type_of_study <- "observational"
This exercise asks you to identify whether random sampling and/or random assignment was used in a study that compared the smoking habits of patients who were already hospitalized with lung cancer to similar patients without lung cancer.
This study did not employ either random assignment or random sampling. Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study; random sampling is not employed because the study records only patients who are already hospitalized, so it wouldn’t be appropriate to generalize the findings to the population as a whole.
In a study using volunteer subjects who were randomly assigned to two groups, the results cannot be generalized to all people (the subjects were not randomly sampled), but a causal link between believing information is stored and memory can be inferred (the subjects were randomly assigned).
You can use count() to group data by certain variables and then count the number of observations in each category. These counts are available under a new variable called n. You can use spread() to reorganize the output across columns based on a key-value pair, where a pair contains a key that explains what the information describes and a value that contains the actual information. spread() takes the name of the dataset as its first argument, the name of the key column as its second argument, and the name of the value column as its third argument, all specified without quotation marks.
# Load packages
library(tidyr)
# Import data
ucb_admit <- read.csv("ucb_admit.csv", stringsAsFactors = FALSE)
ucb_admit$Dept <- as.character(ucb_admit$Dept)
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit <chr> "Admitted", "Admitted", "Admitted", "Admitted", "Admitt...
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",...
## $ Dept <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
summary(ucb_admit)
## Admit Gender Dept
## Length:4526 Length:4526 Length:4526
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
# Count number of male and female applicants admitted
ucb_counts <- ucb_admit %>%
  count(Gender, Admit)
# View result
ucb_counts
## # A tibble: 4 x 3
## Gender Admit n
## <chr> <chr> <int>
## 1 Female Admitted 557
## 2 Female Rejected 1278
## 3 Male Admitted 1198
## 4 Male Rejected 1493
# Spread the output across columns
ucb_counts %>%
  spread(Admit, n)
## # A tibble: 2 x 3
## Gender Admitted Rejected
## * <chr> <int> <int>
## 1 Female 557 1278
## 2 Male 1198 1493
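In current versions of tidyr, spread() is marked as superseded in favor of pivot_wider(). The sketch below shows what is assumed to be the equivalent call, rebuilding the four counts shown above by hand so that it runs without the CSV file (ucb_wide is a hypothetical name):

```r
library(tidyr)

# Counts copied from the ucb_counts output above
ucb_counts <- data.frame(
  Gender = c("Female", "Female", "Male", "Male"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Rejected"),
  n      = c(557L, 1278L, 1198L, 1493L)
)

# names_from plays the role of spread()'s key, values_from the role of its value
ucb_wide <- pivot_wider(ucb_counts, names_from = Admit, values_from = n)
ucb_wide
```

The result has one row per gender and one column per admission status, the same shape that spread(Admit, n) produces.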
You can calculate the proportion of applicants admitted within each gender by creating a new variable with mutate() from the dplyr package.
ucb_admit %>%
  # Table of counts of admission status and gender
  count(Admit, Gender) %>%
  # Spread output across columns based on admission status
  spread(Admit, n) %>%
  # Create new variable
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 2 x 4
## Gender Admitted Rejected Perc_Admit
## <chr> <int> <int> <dbl>
## 1 Female 557 1278 0.3035422
## 2 Male 1198 1493 0.4451877
You can make a table that groups the data by department and gender. Then, you can use this table to calculate the proportion of each gender admitted in each department.
# Table of counts of admission status and gender for each department
admit_by_dept <- ucb_admit %>%
  count(Admit, Dept, Gender) %>%
  spread(Admit, n)
# View result
admit_by_dept
## # A tibble: 12 x 4
## Dept Gender Admitted Rejected
## * <chr> <chr> <int> <int>
## 1 A Female 89 19
## 2 A Male 512 313
## 3 B Female 17 8
## 4 B Male 353 207
## 5 C Female 202 391
## 6 C Male 120 205
## 7 D Female 131 244
## 8 D Male 138 279
## 9 E Female 94 299
## 10 E Male 53 138
## 11 F Female 24 317
## 12 F Male 22 351
# Proportion admitted, by gender, within each department
admit_by_dept %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 12 x 5
## Dept Gender Admitted Rejected Perc_Admit
## <chr> <chr> <int> <int> <dbl>
## 1 A Female 89 19 0.82407407
## 2 A Male 512 313 0.62060606
## 3 B Female 17 8 0.68000000
## 4 B Male 353 207 0.63035714
## 5 C Female 202 391 0.34064081
## 6 C Male 120 205 0.36923077
## 7 D Female 131 244 0.34933333
## 8 D Male 138 279 0.33093525
## 9 E Female 94 299 0.23918575
## 10 E Male 53 138 0.27748691
## 11 F Female 24 317 0.07038123
## 12 F Male 22 351 0.05898123
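The admit_by_dept results show that in four of the six departments (A, B, D, and F), females are more likely to be admitted than males, even though the overall admission rate is higher for males (about 45%) than for females (about 30%).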
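As a cross-check, the same comparison can be sketched with the UCBAdmissions table that ships with base R's datasets package, whose counts match the ones shown above. The object names overall_rate, by_dept, and female_higher are illustrative choices, not from the original analysis:

```r
# UCBAdmissions is a 3-way table with dimensions Admit x Gender x Dept

# Collapse over departments to get an Admit x Gender table,
# then convert each gender's column to proportions
overall <- apply(UCBAdmissions, c(1, 2), sum)
overall_rate <- prop.table(overall, margin = 2)["Admitted", ]

# Proportions within each Gender x Dept cell, keeping the "Admitted" slice
by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

round(overall_rate, 3)
round(by_dept, 3)

# Departments where the female admission rate exceeds the male rate
female_higher <- names(which(by_dept["Female", ] > by_dept["Male", ]))
female_higher
```

Comparing overall_rate with by_dept makes the reversal visible: the aggregate favors males while most individual departments favor females.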
A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.
What sampling strategy has this company used? A stratified sample.
A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools.
Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing.
Which approach would likely be the least effective for selecting the schools where the survey will be conducted?
Cluster sampling where each cluster is a neighborhood. This sampling strategy would be a bad idea because each neighborhood has a unique socioeconomic status, so a sample of only a few neighborhoods would leave whole socioeconomic groups out. A good study would collect information about every neighborhood.
Suppose you want to collect some data from a sample of eight states. A list of all states and the region each belongs to (Northeast, Midwest, South, West) is given in the us_regions data frame.
# Import data
us_regions <- read.csv("us_regions.csv", stringsAsFactors = FALSE)
# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)
# Count states by region
states_srs %>%
  group_by(region) %>%
  count()
## # A tibble: 3 x 2
## # Groups: region [3]
## region n
## <chr> <int>
## 1 Midwest 3
## 2 Northeast 3
## 3 West 2
A simple random sample is unlikely to select an equal number of states from each region, and it can miss a region entirely (the South in the sample above). Stratified sampling, with region as the stratum, guarantees an equal number of states from each region.
# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)
# Count states by region
states_str %>%
  group_by(region) %>%
  count()
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <chr> <int>
## 1 Midwest 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
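A self-contained sketch of the same idea, using a made-up sampling frame (frame, strat, and strat_counts are hypothetical names). It uses slice_sample(), which the dplyr documentation lists as the successor to sample_n(); the assumption is dplyr 1.0.0 or later:

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the draw is reproducible

# Hypothetical sampling frame: 12 made-up units, 3 per region
frame <- data.frame(
  unit   = paste0("unit_", 1:12),
  region = rep(c("Midwest", "Northeast", "South", "West"), each = 3)
)

# Stratified sample: draw 2 units at random within each region
strat <- frame %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Every region contributes exactly 2 units, whatever the seed
strat_counts <- count(strat, region)
strat_counts
```

Because the sampling happens inside each group, the per-region count is fixed by design rather than left to chance.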
Which method, simple random sampling or stratified sampling, ensures an equal number of states from each region?
Stratified sampling.
A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.
Which of the below is correct?
There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).
Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.
In random sampling, you use stratifying to control for a variable. In random assignment, you use blocking to achieve the same goal.
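Blocked random assignment can be sketched as random assignment carried out separately within each block. The example below assumes 16 hypothetical subjects and four hypothetical light/noise conditions; the data and all object names are made up for illustration:

```r
library(dplyr)

set.seed(2024)  # arbitrary seed so the assignment is reproducible

# Hypothetical subjects: 8 per gender
subjects <- data.frame(
  id     = 1:16,
  gender = rep(c("female", "male"), each = 8)
)

# Hypothetical combinations of the two explanatory variables
conditions <- c("bright/quiet", "bright/noisy", "dim/quiet", "dim/noisy")

# Shuffle the conditions separately within each gender (the block),
# so every condition appears equally often for females and males
assigned <- subjects %>%
  group_by(gender) %>%
  mutate(condition = sample(rep(conditions, length.out = n()))) %>%
  ungroup()

balance <- table(assigned$gender, assigned$condition)
balance
```

Assigning conditions without the group_by() step would still be random, but it would not guarantee that both genders are equally represented under each condition.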
Use a technique you have learned to inspect the data in evals.
# Import data
evals <- read.csv("evals.csv", stringsAsFactors = FALSE)
# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank <chr> "tenure track", "tenure track", "tenure track", ...
## $ ethnicity <chr> "minority", "minority", "minority", "minority", ...
## $ gender <chr> "female", "female", "female", "female", "male", ...
## $ language <chr> "english", "english", "english", "english", "eng...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level <chr> "upper", "upper", "upper", "upper", "upper", "up...
## $ cls_profs <chr> "single", "single", "single", "single", "multipl...
## $ cls_credits <chr> "multi credit", "multi credit", "multi credit", ...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit <chr> "not formal", "not formal", "not formal", "not f...
## $ pic_color <chr> "color", "color", "color", "color", "color", "co...
What type of study is this?
Observational study
The data from this study were gathered by randomly sampling classes.
It’s always useful to start your exploration of a dataset by identifying variable types. You can do this using either the glimpse() or str() tool.
# Inspect variable types
str(evals)
## 'data.frame': 463 obs. of 21 variables:
## $ score : num 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
## $ rank : chr "tenure track" "tenure track" "tenure track" "tenure track" ...
## $ ethnicity : chr "minority" "minority" "minority" "minority" ...
## $ gender : chr "female" "female" "female" "female" ...
## $ language : chr "english" "english" "english" "english" ...
## $ age : int 36 36 36 36 59 59 59 51 51 40 ...
## $ cls_perc_eval: num 55.8 68.8 60.8 62.6 85 ...
## $ cls_did_eval : int 24 86 76 77 17 35 39 55 111 40 ...
## $ cls_students : int 43 125 125 123 20 40 44 55 195 46 ...
## $ cls_level : chr "upper" "upper" "upper" "upper" ...
## $ cls_profs : chr "single" "single" "single" "single" ...
## $ cls_credits : chr "multi credit" "multi credit" "multi credit" "multi credit" ...
## $ bty_f1lower : int 5 5 5 5 4 4 4 5 5 2 ...
## $ bty_f1upper : int 7 7 7 7 4 4 4 2 2 5 ...
## $ bty_f2upper : int 6 6 6 6 2 2 2 5 5 4 ...
## $ bty_m1lower : int 2 2 2 2 2 2 2 2 2 3 ...
## $ bty_m1upper : int 4 4 4 4 3 3 3 3 3 3 ...
## $ bty_m2upper : int 6 6 6 6 3 3 3 3 3 2 ...
## $ bty_avg : num 5 5 5 5 3 ...
## $ pic_outfit : chr "not formal" "not formal" "not formal" "not formal" ...
## $ pic_color : chr "color" "color" "color" "color" ...
# Names of the categorical variables in evals (non-categorical variables removed)
cat_vars <- c("rank", "ethnicity", "gender", "language", "cls_level", "cls_profs", "cls_credits","pic_outfit", "pic_color")
The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is “small” (18 students or fewer), “midsize” (19 - 59 students), or “large” (60 students or more). You can do this with a nested call to ifelse(), which means that you’ll call ifelse() a second time from within your first call to ifelse().
# Recode cls_students as cls_type: evals
evals <- evals %>%
  # Create new variable
  mutate(cls_type = ifelse(cls_students <= 18, "small",
                    ifelse(cls_students >= 19 & cls_students <= 59, "midsize",
                           "large")))
# The cls_type variable is a categorical variable, stored as a character vector.
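An alternative to nesting ifelse() is base R's cut(), which bins a numeric vector in one call. The sketch below uses made-up class sizes; with the default right = TRUE the intervals are (-Inf, 18], (18, 59], and (59, Inf), which match the size definitions above for whole-number class sizes:

```r
# Made-up class sizes, chosen to include the boundary values
cls_students <- c(10, 18, 19, 45, 59, 60, 125)

# Bin into three labeled intervals; each interval is closed on the right
cls_type <- cut(cls_students,
                breaks = c(-Inf, 18, 59, Inf),
                labels = c("small", "midsize", "large"))

data.frame(cls_students, cls_type)
```

Unlike the ifelse() version, cut() returns a factor whose levels are ordered small, midsize, large, which also controls the order of legend entries in a plot.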
You can visualize the relationship between the variables for score and bty_avg by using a scatter plot.
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) + geom_point()
You can evaluate how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large) by coloring the points by class type.
# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()
You’re tasked to examine whether federal spending is positively related to the standard of living. Use the county data set in the openintro package. Examine the relationship between fed_spend and income by following instructions below.
data(county)
# Sample 150 counties
US_states <- county %>%
  sample_n(size = 150)
# Glimpse sample
glimpse(US_states)
## Observations: 150
## Variables: 10
## $ name <fctr> Boone County, Beaver County, Imperial County, B...
## $ state <fctr> Arkansas, Pennsylvania, California, Georgia, Oh...
## $ pop2000 <dbl> 33948, 181412, 142361, 23417, 73894, 425257, 741...
## $ pop2010 <dbl> 36903, 170539, 174528, 30233, 69709, 437994, 713...
## $ fed_spend <dbl> 8.032707, 9.037974, 7.641674, 68.863130, 10.6088...
## $ poverty <dbl> 16.0, 11.1, 21.4, 11.0, 17.7, 6.8, 13.5, 13.5, 1...
## $ homeownership <dbl> 72.7, 75.2, 56.6, 74.2, 72.9, 66.5, 76.0, 80.2, ...
## $ multiunit <dbl> 12.1, 17.3, 21.4, 8.1, 14.9, 22.4, 8.2, 4.3, 12....
## $ income <dbl> 20507, 24168, 16395, 28365, 20470, 30873, 19114,...
## $ med_income <dbl> 36977, 46190, 38685, 63244, 37527, 64618, 38133,...
# The variables for name and state are categorical; all of the other variables are numerical.
This is an observational study. An experiment would require that you impose a treatment on the subjects, whereas this study just looks at existing data.
The sample above is a random sample.
You can only infer association because in an observational study there could be other relevant factors that are not accounted for. You can only infer causation from an experiment.
Yes, if you see identifiable trends in the data you could generalize them to the population as a whole, because the counties were randomly sampled. The larger the sample and the stronger the trend, the more reliable the generalization should be. For example, if the sample shows that the rate of home ownership is highest in counties where income is above a certain level, that conclusion could be generalized to all US counties.
# Scatterplot of fed_spend vs. income
ggplot(US_states, aes(x = income, y = fed_spend)) +
  geom_point()
Analysis: most counties received federal spending between 5 and 15, and most incomes were between 15000 and 25000. The amount of federal spending did not appear to be associated with income levels: the counties with the highest incomes had widely varying federal spending, and the county with the highest federal spending had roughly average income. My conclusion is that, in this sample, federal spending shows no clear relationship with income levels.
A confounding variable is a variable that is not taken into account but that is related to both the explanatory and the response variable, so it could distort the results. There are a number of possible confounding variables in this analysis. For example, a county with a high employment rate might have higher incomes than a county with a low employment rate, even if the low-employment county received more federal money. The same could be said for education level.
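The effect of a confounder can be sketched with simulated data. Everything below is hypothetical: the variables are standardized random numbers, not county data, and the assumption that employment lowers federal spending while raising income is made up purely to illustrate the mechanism:

```r
set.seed(123)  # arbitrary seed for reproducibility
n <- 2000

# Hypothetical standardized variables; none of this is real county data
employment <- rnorm(n)                # the confounder
fed_spend  <- -employment + rnorm(n)  # driven by employment, not by income
income     <-  employment + rnorm(n)  # driven by employment, not by fed_spend

# Raw association between fed_spend and income
raw_cor <- cor(fed_spend, income)

# Association after removing the part of each variable explained by employment
adj_cor <- cor(resid(lm(fed_spend ~ employment)),
               resid(lm(income ~ employment)))

round(c(raw = raw_cor, adjusted = adj_cor), 2)
```

In this simulation fed_spend has no effect on income, yet the raw correlation is clearly negative; it shrinks toward zero once employment is accounted for.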