Assignment 3 - Introduction to Data

Chapter 1 Language of Data

1.1 Loading data into R

You can load a dataset that ships with a package using the data() function, and you can inspect its structure with the str() function.




# Load packages
library(openintro) #for the use of email50 and county data
library(dplyr)

# Load data
data(email50)

# View its structure
str(email50)
## 'data.frame':    50 obs. of  21 variables:
##  $ spam        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct, format: "2012-01-04 13:19:16" "2012-02-16 20:10:06" ...
##  $ image       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num  0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num  21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int  551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num  1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

1.2 Identifying variable types

The glimpse() function from dplyr reports the number of observations and variables, the name and type of each column, and a compact preview of each column's values.

# Glimpse email50
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 13:19:16, 2012-02-16 20:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fctr> no, no, no, no, no, no, no, no, no, no, no, no, ...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fctr> small, big, none, small, small, small, small, sm...
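If you want the column types as a value you can work with, rather than as printed output, one option is sketched below in base R. The toy data frame and its values are hypothetical stand-ins, not rows taken from email50.

```r
# Hypothetical toy data (illustrative values only, not taken from email50)
toy <- data.frame(
  spam   = c(0, 1, 0),                    # numeric (double)
  cc     = c(0L, 4L, 0L),                 # integer
  winner = factor(c("no", "no", "yes"))   # factor (categorical)
)

# First class of each column; class(x)[1] keeps the result a plain character
# vector even for columns that carry several classes, such as date-times
col_types <- vapply(toy, function(x) class(x)[1], character(1))
print(col_types)

# Names of the categorical (factor) columns
factor_cols <- names(col_types)[col_types == "factor"]
print(factor_cols)
```

The same two lines should work on any data frame, which makes it easy to pull out all factor columns at once.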

1.3 Filtering based on a factor

You can keep only the rows that match a particular level of a factor variable with the filter() function from dplyr.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Glimpse the subset
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 20:10:06, 2012-02-04 23:26:09, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fctr> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fctr> big, big, big, big, big, big, big

1.4 Complete filtering based on a factor

Filtering removes rows, but the factor still remembers all of its original levels. You can use the droplevels() function to remove unused levels of factor variables from your dataset. You can determine which levels are unused (i.e., have zero observations) with the table() function.

# Table of number variable
table(email50_big$number)
## 
##  none small   big 
##     0     0     7

# Drop levels
email50_big$number <- droplevels(email50_big$number)

# Another table of number variable
table(email50_big$number)
## 
## big 
##   7
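droplevels() also accepts a whole data frame, in which case it drops unused levels from every factor column at once. A minimal base R sketch, using a hypothetical toy data frame rather than email50:

```r
# Hypothetical toy data (values are illustrative, not taken from email50)
toy <- data.frame(
  id     = 1:6,
  number = factor(c("none", "small", "big", "small", "big", "none"),
                  levels = c("none", "small", "big"))
)

# Keep only the "big" rows; the factor still remembers all three levels
toy_big <- toy[toy$number == "big", ]
levels_before <- levels(toy_big$number)
print(levels_before)

# Applied to a data frame, droplevels() cleans every factor column
toy_big <- droplevels(toy_big)
levels_after <- levels(toy_big$number)
print(levels_after)
```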

1.5 Discretize a different variable

You can create a categorical version of a numerical variable.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50 <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
  
# Count emails in each category
table(email50$num_char_cat)
## 
## at or above median       below median 
##                 25                 25
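The table above lists "at or above median" first because the new variable is a character vector, and table() sorts character values alphabetically. If the order matters, one option is to convert the result to a factor with explicit levels. The sketch below uses a hypothetical vector of values, not the real num_char column.

```r
# Hypothetical character counts in thousands (illustrative values only)
num_char <- c(0.5, 2.1, 7.0, 10.4, 21.7, 41.6)

# With an even number of values, the median is the mean of the middle two
med_num_char <- median(num_char)

# Median split: values exactly equal to the median fall in "at or above median"
num_char_cat <- ifelse(num_char < med_num_char,
                       "below median", "at or above median")

# Explicit levels control the order used by table() and by plots
num_char_cat <- factor(num_char_cat,
                       levels = c("below median", "at or above median"))
cat_counts <- table(num_char_cat)
print(cat_counts)
```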

1.6 Combining levels of a different factor

You can create a new variable based on an existing one by combining levels of a categorical variable.

# Load package ggplot2
library(ggplot2)

# Create number_yn column in email50
email50 <- email50 %>%
  mutate(number_yn = ifelse(number == "none", "no", "yes"))

# Visualize number_yn
ggplot(email50, aes(x = number_yn)) +
  geom_bar()
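A quick way to check a recode like this is to cross-tabulate the original variable against the new one: every original level should land in exactly one new category. The sketch below uses a hypothetical toy factor rather than the number column from email50.

```r
# Hypothetical toy factor (illustrative values only)
number <- factor(c("none", "small", "big", "small", "none"),
                 levels = c("none", "small", "big"))

# Collapse "small" and "big" into "yes"; "none" becomes "no"
number_yn <- ifelse(number == "none", "no", "yes")

# Rows are the original levels, columns are the recoded values
recode_check <- table(number, number_yn)
print(recode_check)
```

Each row of the cross-tabulation should have a single nonzero cell if the recode did what was intended.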

1.7 Visualizing numerical and categorical data

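You can create a scatterplot of two numerical variables with the ggplot() function, and you can show a categorical variable on the same plot by mapping it to the color of the points.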

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

Chapter 2 Study types and cautionary tales

2.1 Identify study type

A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.

What type of study is this?

Experiment

2.2 Identify the type of study

You can identify the type of study by viewing the data it generated.

# Load gapminder package
library(gapminder)

# Load data
data(gapminder)

# Glimpse data
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

# Identify type of study
type_of_study <- "observational"

2.3 Random sampling or random assignment?

This exercise asks you to identify whether random sampling and/or random assignment was used in a study that compared the smoking habits of patients who were already hospitalized with lung cancer to similar patients without lung cancer.

This study did not employ either random assignment or random sampling. Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study; random sampling is not employed because the study records only patients who are already hospitalized, so it wouldn’t be appropriate to apply the findings back to the population as a whole.

2.4 Identify the scope of inference of study

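In a study that used volunteer subjects who were randomly assigned to one of two groups:

The results of the study cannot be generalized to all people, because the subjects were volunteers rather than a random sample. However, a causal link between believing information is stored and memory can be inferred from these results, because the subjects were randomly assigned to the groups.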

2.5 Number of males and females admitted

You can use count() to group data by certain variables and then count the number of observations in each category. These counts are available under a new variable called n. You can use spread() to reorganize the output across columns based on a key-value pair, where a pair contains a key that explains what the information describes and a value that contains the actual information. spread() takes the name of the dataset as its first argument, the name of the key column as its second argument, and the name of the value column as its third argument, all specified without quotation marks.

# Load packages
library(tidyr)

# Import data
ucb_admit <- read.csv("ucb_admit.csv", stringsAsFactors = FALSE) 
ucb_admit$Dept <- as.character(ucb_admit$Dept)
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit  <chr> "Admitted", "Admitted", "Admitted", "Admitted", "Admitt...
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",...
## $ Dept   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
summary(ucb_admit)
##     Admit              Gender              Dept          
##  Length:4526        Length:4526        Length:4526       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

# Count number of male and female applicants admitted
ucb_counts <- ucb_admit %>%
  count(Gender, Admit)

# View result
ucb_counts
## # A tibble: 4 x 3
##   Gender    Admit     n
##    <chr>    <chr> <int>
## 1 Female Admitted   557
## 2 Female Rejected  1278
## 3   Male Admitted  1198
## 4   Male Rejected  1493

# Spread the output across columns
ucb_counts %>%
  spread(Admit, n)
## # A tibble: 2 x 3
##   Gender Admitted Rejected
## *  <chr>    <int>    <int>
## 1 Female      557     1278
## 2   Male     1198     1493
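In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which names the key and value columns explicitly. A minimal sketch of the same reshaping, assuming a recent tidyr is installed; the counts are typed in by hand from the output above so that the example does not depend on ucb_admit.csv.

```r
library(tidyr)

# Counts copied from the ucb_counts output above
ucb_counts_demo <- data.frame(
  Gender = c("Female", "Female", "Male", "Male"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Rejected"),
  n      = c(557L, 1278L, 1198L, 1493L),
  stringsAsFactors = FALSE
)

# names_from plays the role of spread()'s key, values_from the role of its value
ucb_wide <- pivot_wider(ucb_counts_demo, names_from = Admit, values_from = n)
print(ucb_wide)
```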

2.6 Proportion of males admitted overall

You can calculate the proportion of applicants of each gender who were admitted by creating a new variable with mutate() from the dplyr package. The result is a proportion between 0 and 1, not a percentage.

ucb_admit %>%
  # Table of counts of admission status and gender
  count(Admit, Gender) %>%
  # Spread output across columns based on admission status
  spread(Admit,n) %>%
  # Create new variable
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 2 x 4
##   Gender Admitted Rejected Perc_Admit
##    <chr>    <int>    <int>      <dbl>
## 1 Female      557     1278  0.3035422
## 2   Male     1198     1493  0.4451877
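As a check on the arithmetic, the same proportions can be reproduced in base R directly from the four counts in the table. This is only a verification sketch; the variable names are made up for the example.

```r
# Counts copied from the table above
admitted <- c(Female = 557, Male = 1198)
rejected <- c(Female = 1278, Male = 1493)

# Proportion admitted = admitted / total applicants, computed per gender
prop_admit <- admitted / (admitted + rejected)
print(round(prop_admit, 4))
```

The totals in the denominators are 1835 female applicants and 2691 male applicants.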

2.7 Proportion of males admitted for each department

You can make a table that groups the data by department. Then, you can use this table to calculate the proportion of males admitted in each department.

# Table of counts of admission status and gender for each department
admit_by_dept <- ucb_admit %>%
  count(Admit, Dept, Gender) %>%
  spread(Admit, n)

# View result
admit_by_dept
## # A tibble: 12 x 4
##     Dept Gender Admitted Rejected
##  * <chr>  <chr>    <int>    <int>
##  1     A Female       89       19
##  2     A   Male      512      313
##  3     B Female       17        8
##  4     B   Male      353      207
##  5     C Female      202      391
##  6     C   Male      120      205
##  7     D Female      131      244
##  8     D   Male      138      279
##  9     E Female       94      299
## 10     E   Male       53      138
## 11     F Female       24      317
## 12     F   Male       22      351

# Proportion of applicants admitted, by department and gender
admit_by_dept %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))
## # A tibble: 12 x 5
##     Dept Gender Admitted Rejected Perc_Admit
##    <chr>  <chr>    <int>    <int>      <dbl>
##  1     A Female       89       19 0.82407407
##  2     A   Male      512      313 0.62060606
##  3     B Female       17        8 0.68000000
##  4     B   Male      353      207 0.63035714
##  5     C Female      202      391 0.34064081
##  6     C   Male      120      205 0.36923077
##  7     D Female      131      244 0.34933333
##  8     D   Male      138      279 0.33093525
##  9     E Female       94      299 0.23918575
## 10     E   Male       53      138 0.27748691
## 11     F Female       24      317 0.07038123
## 12     F   Male       22      351 0.05898123

2.8 Contingency table results by group

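The admit_by_dept results show that in most departments (A, B, D, and F, four of the six), females are more likely to be admitted than males. This is the reverse of the overall comparison in 2.6, where males had the higher admission rate.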
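The department-level comparison can be verified with a few lines of base R. The counts below are copied from the admit_by_dept table; the variable names are made up for this sketch. A reversal between grouped and overall rates like this one is often called Simpson's paradox.

```r
dept <- c("A", "B", "C", "D", "E", "F")

# Counts copied from the admit_by_dept table above
f_adm <- c(89, 17, 202, 131, 94, 24)
f_rej <- c(19, 8, 391, 244, 299, 317)
m_adm <- c(512, 353, 120, 138, 53, 22)
m_rej <- c(313, 207, 205, 279, 138, 351)

# Admission rate within each department
f_rate <- f_adm / (f_adm + f_rej)
m_rate <- m_adm / (m_adm + m_rej)

# Departments where the female rate is higher than the male rate
female_higher <- f_rate > m_rate
print(dept[female_higher])

# Overall rates, pooling all departments
overall_f <- sum(f_adm) / sum(f_adm + f_rej)
overall_m <- sum(m_adm) / sum(m_adm + m_rej)
print(round(c(Female = overall_f, Male = overall_m), 4))
```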

Chapter 3 Sampling strategies and experimental design

3.1 Sampling strategies, determine which

A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.

What sampling strategy has this company used? A stratified sample.

3.2 Sampling strategies, choose worst

A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools.

Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing.

Which approach would likely be the least effective for selecting the schools where the survey will be conducted?

Cluster sampling where each cluster is a neighborhood. This would be a poor strategy because each neighborhood has its own distinct socioeconomic profile, so surveying only a few sampled neighborhoods would miss the others. A good study would collect information from every neighborhood.

3.3 Simple random sample in R

Suppose you want to collect some data from a sample of eight states. A list of all states and the region they belong to (Northeast, Midwest, South, West) are given in the us_regions data frame.


# Import data
us_regions <- read.csv("us_regions.csv", stringsAsFactors = FALSE) 

# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)
  
# Count states by region
states_srs %>%
  group_by(region) %>%
  count()
## # A tibble: 3 x 2
## # Groups:   region [3]
##      region     n
##       <chr> <int>
## 1   Midwest     3
## 2 Northeast     3
## 3      West     2
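Because sample_n() draws at random, rerunning the code above will usually give a different set of states. Calling set.seed() before sampling makes the draw repeatable. The sketch below is an illustration only: us_regions.csv is not reproduced here, so it builds a stand-in from R's built-in state.name and state.region vectors (which label the Midwest "North Central"), and the seed value is arbitrary.

```r
library(dplyr)

# Stand-in for us_regions, built from R's built-in state.name and state.region
# (assumption: the real us_regions.csv has comparable state and region columns)
us_regions_demo <- data.frame(
  state  = state.name,
  region = as.character(state.region),
  stringsAsFactors = FALSE
)

# Fixing the seed makes the random draw repeatable (the seed value is arbitrary)
set.seed(2024)
srs_a <- us_regions_demo %>% sample_n(size = 8)

set.seed(2024)
srs_b <- us_regions_demo %>% sample_n(size = 8)

# Same seed, same sample
print(identical(srs_a$state, srs_b$state))

# Regions are usually represented unevenly in a simple random sample
print(srs_a %>% count(region))
```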

3.4 Stratified sample in R

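A simple random sample is unlikely to select an equal number of states from each region. In this exercise, stratified sampling is used to guarantee the same number of states (two) from each region: the data are grouped by region, and a random sample is drawn within each group.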

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  group_by(region)%>%
  count()
## # A tibble: 4 x 2
## # Groups:   region [4]
##      region     n
##       <chr> <int>
## 1   Midwest     2
## 2 Northeast     2
## 3     South     2
## 4      West     2
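In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), which respects grouping in the same way. The sketch below shows the same stratified draw with the newer function. It is an illustration under assumptions: the data frame is a stand-in built from R's built-in state.name and state.region vectors (which use "North Central" for the Midwest), not the real us_regions.csv, and the seed is arbitrary.

```r
library(dplyr)

# Stand-in for us_regions (assumption: comparable state and region columns)
us_regions_demo <- data.frame(
  state  = state.name,
  region = as.character(state.region),
  stringsAsFactors = FALSE
)

set.seed(2024)  # arbitrary seed so the draw can be repeated

# Stratified sample: two states drawn at random within each region
states_str_demo <- us_regions_demo %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Every region contributes exactly two states
region_counts <- states_str_demo %>% count(region)
print(region_counts)
```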

3.5 Compare SRS vs. stratified sample

Which method, simple random sampling or stratified sampling, ensures an equal number of states from each region?

Stratified sampling.

3.6 Identifying components of a study

A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

Which of the below is correct?

There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).

3.7 Experimental design terminology

Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

3.8 Connect blocking and stratifying

In random sampling, you use stratifying to control for a variable. In random assignment, you use blocking to achieve the same goal.

Chapter 4 Case study

4.1 Inspect the data

Use a technique you have learned to inspect the data in evals.

# Import data
evals <- read.csv("evals.csv", stringsAsFactors = FALSE) 

# Inspect evals
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <chr> "tenure track", "tenure track", "tenure track", ...
## $ ethnicity     <chr> "minority", "minority", "minority", "minority", ...
## $ gender        <chr> "female", "female", "female", "female", "male", ...
## $ language      <chr> "english", "english", "english", "english", "eng...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <chr> "upper", "upper", "upper", "upper", "upper", "up...
## $ cls_profs     <chr> "single", "single", "single", "single", "multipl...
## $ cls_credits   <chr> "multi credit", "multi credit", "multi credit", ...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <chr> "not formal", "not formal", "not formal", "not f...
## $ pic_color     <chr> "color", "color", "color", "color", "color", "co...

4.2 Identify type of study

What type of study is this?

Observational study

4.3 Sampling / experimental attributes

The data from this study were gathered by randomly sampling classes.

4.4 Identify variable types

It’s always useful to start your exploration of a dataset by identifying variable types. You can do this using either the glimpse() or the str() function.

# Inspect variable types
str(evals)
## 'data.frame':    463 obs. of  21 variables:
##  $ score        : num  4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
##  $ rank         : chr  "tenure track" "tenure track" "tenure track" "tenure track" ...
##  $ ethnicity    : chr  "minority" "minority" "minority" "minority" ...
##  $ gender       : chr  "female" "female" "female" "female" ...
##  $ language     : chr  "english" "english" "english" "english" ...
##  $ age          : int  36 36 36 36 59 59 59 51 51 40 ...
##  $ cls_perc_eval: num  55.8 68.8 60.8 62.6 85 ...
##  $ cls_did_eval : int  24 86 76 77 17 35 39 55 111 40 ...
##  $ cls_students : int  43 125 125 123 20 40 44 55 195 46 ...
##  $ cls_level    : chr  "upper" "upper" "upper" "upper" ...
##  $ cls_profs    : chr  "single" "single" "single" "single" ...
##  $ cls_credits  : chr  "multi credit" "multi credit" "multi credit" "multi credit" ...
##  $ bty_f1lower  : int  5 5 5 5 4 4 4 5 5 2 ...
##  $ bty_f1upper  : int  7 7 7 7 4 4 4 2 2 5 ...
##  $ bty_f2upper  : int  6 6 6 6 2 2 2 5 5 4 ...
##  $ bty_m1lower  : int  2 2 2 2 2 2 2 2 2 3 ...
##  $ bty_m1upper  : int  4 4 4 4 3 3 3 3 3 3 ...
##  $ bty_m2upper  : int  6 6 6 6 3 3 3 3 3 2 ...
##  $ bty_avg      : num  5 5 5 5 3 ...
##  $ pic_outfit   : chr  "not formal" "not formal" "not formal" "not formal" ...
##  $ pic_color    : chr  "color" "color" "color" "color" ...
# Names of the categorical variables in evals (the non-categorical variables have been removed from this vector)
cat_vars <- c("rank", "ethnicity", "gender", "language", "cls_level", "cls_profs", "cls_credits", "pic_outfit", "pic_color")
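Because the file was read with stringsAsFactors = FALSE, the categorical variables are stored as character vectors rather than factors. If factors are wanted, a vector of names like cat_vars can be used to convert just those columns. The sketch below uses a hypothetical three-row data frame, not the real evals data.

```r
# Hypothetical miniature of evals (illustrative values only)
evals_demo <- data.frame(
  score  = c(4.7, 4.1, 3.9),
  rank   = c("tenure track", "tenured", "teaching"),
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# Names of the columns to treat as categorical
cat_vars_demo <- c("rank", "gender")

# Convert only the named columns to factors; other columns are untouched
evals_demo[cat_vars_demo] <- lapply(evals_demo[cat_vars_demo], factor)
str(evals_demo)
```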

4.5 Recode a variable

The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is “small” (18 students or fewer), “midsize” (19 - 59 students), or “large” (60 students or more). You can do this with a nested call to ifelse(), which means that you’ll call ifelse() a second time from within your first call to ifelse().

# Recode cls_students as cls_type: evals
evals <- evals %>%
  # Create new variable
  mutate(cls_type = ifelse(cls_students <= 18, "small",
                      ifelse(cls_students >= 19 & cls_students <= 59, "midsize", 
                        "large")))
# The cls_type variable is a categorical variable, stored as a character vector.
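As an alternative to nesting ifelse(), base R's cut() can bin a numeric variable in one call. The sketch below compares the two approaches on a hypothetical vector of class sizes chosen to include the boundary values; it is not the real cls_students column. By default cut() uses intervals that are closed on the right, so the breaks 18 and 59 fall in "small" and "midsize", matching the definitions above.

```r
# Hypothetical class sizes, including the boundary values 18, 19, 59 and 60
cls_students <- c(10, 18, 19, 43, 59, 60, 125)

# Nested ifelse(), as in the recode above (second test simplified)
nested <- ifelse(cls_students <= 18, "small",
                 ifelse(cls_students <= 59, "midsize", "large"))

# cut() with right-closed intervals: (-Inf, 18], (18, 59], (59, Inf)
binned <- cut(cls_students,
              breaks = c(-Inf, 18, 59, Inf),
              labels = c("small", "midsize", "large"))

# Both approaches assign every class to the same category
print(identical(nested, as.character(binned)))
print(table(binned))
```

Unlike the nested ifelse(), cut() returns a factor whose levels are already in size order, which can be convenient for plots and tables.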

4.6 Create a scatterplot

You can visualize the relationship between the variables for score and bty_avg by using a scatter plot.

# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) + geom_point()

4.7 Create a scatterplot, with an added layer

You can evaluate how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large) by coloring the points by class type.

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()

Quiz 3

Your task is to examine whether federal spending is positively related to the standard of living. Use the county data set in the openintro package. Examine the relationship between fed_spend and income by following the instructions below.

1. Randomly sample 150 counties in the US.

data(county)

# Sample 150 counties
US_states <- county %>%
  sample_n(size = 150)

2. What type of variables are they? Use the glimpse function.

#glimpse sample

glimpse(US_states)
## Observations: 150
## Variables: 10
## $ name          <fctr> Boone County, Beaver County, Imperial County, B...
## $ state         <fctr> Arkansas, Pennsylvania, California, Georgia, Oh...
## $ pop2000       <dbl> 33948, 181412, 142361, 23417, 73894, 425257, 741...
## $ pop2010       <dbl> 36903, 170539, 174528, 30233, 69709, 437994, 713...
## $ fed_spend     <dbl> 8.032707, 9.037974, 7.641674, 68.863130, 10.6088...
## $ poverty       <dbl> 16.0, 11.1, 21.4, 11.0, 17.7, 6.8, 13.5, 13.5, 1...
## $ homeownership <dbl> 72.7, 75.2, 56.6, 74.2, 72.9, 66.5, 76.0, 80.2, ...
## $ multiunit     <dbl> 12.1, 17.3, 21.4, 8.1, 14.9, 22.4, 8.2, 4.3, 12....
## $ income        <dbl> 20507, 24168, 16395, 28365, 20470, 30873, 19114,...
## $ med_income    <dbl> 36977, 46190, 38685, 63244, 37527, 64618, 38133,...

# The variables name and state are categorical; all of the other variables are numerical.

3. Discuss the nature of this study by addressing the following:

Is this an observational study or an experiment? Why?

This is an observational study. An experiment would require that you impose a treatment on the subjects, whereas this study only looks at existing data.

Does it involve random sampling or random assignment?

The sample above is a random sample.

Can you infer causation? Or just association? Why?

You can infer only association. In an observational study, other factors could be relevant to the relationship, so an observed pattern cannot be attributed to federal spending alone. Causation can be inferred only from an experiment that uses random assignment.

Is your conclusion generalizable to the population as a whole? Why?

Yes, to the population of US counties. Because the 150 counties were selected by random sampling, trends identified in the sample can be generalized to all US counties, and a larger sample would make that generalization more precise. For example, if the sample shows that the rate of home ownership is highest in counties where income is above a certain level, that conclusion could be generalized to US counties as a whole.

4. Create a scatter plot of fed_spend on the y axis and income on the x axis. Interpret.

# Scatterplot of fed_spend vs. income 
ggplot(US_states, aes(x = income, y = fed_spend)) +
  geom_point()

Analysis: Most counties in the sample received federal spending between 5 and 15, and most incomes were between 15000 and 25000. Federal spending did not appear to be related to income. The counties with the highest incomes had widely varying levels of federal spending, and in the county where federal spending was highest, income was about average. My conclusion is that, in this sample, there is no clear positive association between federal spending and income. Because this is an observational study, the plot cannot show whether federal spending raises income.

5. Can you think of any confounding variable? Briefly discuss.

A confounding variable is a variable that is not taken into account in the analysis but is related to both the explanatory and the response variable, so it can distort the apparent relationship between them. There are a number of possible confounding variables in this analysis. For example, a county with a high employment rate might have higher incomes than a county with a low employment rate, even if the low-employment county received more federal money. The same could be said for education level.