Instructions

Exercises: 2,3 (Pg. 151); 2,4 (Pg. 156); 1,2 (Pgs. 160-161); 2 (Pg. 163); 2,3,4 (Pg. 168), Open Response

Submission: Submit via an electronic document on Sakai. Must be submitted as an HTML file generated in RStudio. All assigned problems are chosen from the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display the R code. If the question requires a figure, make sure you display the figure. Many of the questions can be answered in a written response but require R code and/or figures for understanding and explaining.

Chapter 9 (Pg. 151)

Exercise 2

#rate for table2

cases <- table2 %>% filter(year %in% c(1999, 2000), type == "cases")
population <- table2 %>% filter(year %in% c(1999, 2000), type == "population")
table2new <- tibble(
  country = cases$country,
  year = cases$year,
  cases = cases$count,
  population = population$count) %>% mutate(rate = cases / population * 10000)
head(table2new)
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67

#rate for table4a + table4b

table4new <- tibble(
  country = table4a$country) %>% 
  mutate(rate1999 = table4a[['1999']] / table4b[['1999']] * 10000,
  rate2000 = table4a[['2000']]  / table4b[['2000']] * 10000)
head(table4new)
## # A tibble: 3 x 3
##   country     rate1999 rate2000
##   <chr>          <dbl>    <dbl>
## 1 Afghanistan    0.373     1.29
## 2 Brazil         2.19      4.61
## 3 China          1.67      1.67

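I think the second representation (table4a + table4b) was easiest to work with because cases and population were already separated into their own tables, so computing the rate took a single mutate(). I think the first representation (table2) was hardest to work with because cases and population share one count column, so I had to filter each type out before I could mutate.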
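
As a hedged alternative for table2, the filtering step can be avoided by spreading the type column first. This is only a sketch: it uses spread(), which the chapter introduces after this exercise, and the name table2_rates is my own.

```r
library(dplyr)
library(tidyr)

# Spread the type column into separate cases and population columns,
# then compute the rate per 10,000 people in one mutate().
table2_rates <- table2 %>%
  spread(key = type, value = count) %>%
  mutate(rate = cases / population * 10000)

table2_rates
```

The result has one row per country and year, which matches the shape of table2new above.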

Exercise 3

library(ggplot2)
ggplot(table2new, aes(year, cases)) + geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country))

I did not do much - I simply made table2new my new data. The plot shows the change in cases over time for each country.

Chapter 9 (Pg. 156)

Exercise 2

table4a %>% gather("1999", "2000", key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

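The original code fails because 1999 and 2000 are not in quotation marks. Column names that start with a number are non-syntactic, so they must be wrapped in quotation marks or backticks. Otherwise gather() reads the bare numbers as column positions, and table4a does not have a 1999th or 2000th column.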
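
A small sketch of the point about quoting, assuming the table4a data that ships with tidyr. The object names by_string, by_backtick, and bare_numbers are my own; the exact wording of the error may differ between tidyr versions, so only the fact that an error occurs is checked.

```r
library(dplyr)
library(tidyr)

# Quoted strings and backticks both refer to the columns by name.
by_string   <- table4a %>% gather("1999", "2000", key = "year", value = "cases")
by_backtick <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
isTRUE(all.equal(by_string, by_backtick))

# Bare numbers are read as column positions, and table4a has only 3 columns,
# so this call is expected to raise an error, which tryCatch() captures.
bare_numbers <- tryCatch(
  table4a %>% gather(1999, 2000, key = "year", value = "cases"),
  error = function(e) e
)
inherits(bare_numbers, "error")
```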

Exercise 4

preg <- tribble(~pregnant, ~male, ~female, "yes", NA, 10, "no", 20, 12)
head(preg)
## # A tibble: 2 x 3
##   pregnant  male female
##   <chr>    <dbl>  <dbl>
## 1 yes         NA     10
## 2 no          20     12
preg %>% gather("male", "female", key = "gender", value = "cases")
## # A tibble: 4 x 3
##   pregnant gender cases
##   <chr>    <chr>  <dbl>
## 1 yes      male      NA
## 2 no       male      20
## 3 yes      female    10
## 4 no       female    12

I needed to gather the tibble to tidy it because the column names male and female in the original tibble were values of a variable rather than names of variables. The three variables are whether the person is pregnant, gender, and the number of cases.
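
One possible refinement, shown only as a sketch: the NA for pregnant males is a combination that cannot occur rather than a lost measurement, so it could be dropped while gathering with na.rm = TRUE. Whether to drop it is my assumption, not something the exercise requires, and preg_tidy is my own name.

```r
library(dplyr)
library(tidyr)

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

# na.rm = TRUE removes the row for the impossible pregnant/male combination.
preg_tidy <- preg %>%
  gather(male, female, key = "gender", value = "cases", na.rm = TRUE)

preg_tidy
```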

Chapter 9 (Pgs. 160-161)

Exercise 1

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "warn")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

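The extra argument controls what happens when a row has more pieces than there are new columns. The default, "warn", drops the extra pieces and emits a warning; "drop" drops them silently; "merge" splits only as many times as there are columns, so the leftover text stays in the last column. The fill argument controls what happens when a row has too few pieces. The default, "warn", fills the missing columns on the right with NA and emits a warning; "right" and "left" fill with NA on the chosen side without a warning.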
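
The options I did not run above can be sketched on the same two toy tibbles. The object names merged, filled_left, and filled_right are my own.

```r
library(dplyr)
library(tidyr)

# extra = "merge": split at most three times, so "f,g" stays together.
merged <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "merge")

# fill = "left" pads with NA on the left, fill = "right" pads on the right.
short <- tibble(x = c("a,b,c", "d,e", "f,g,i"))
filled_left  <- short %>% separate(x, c("one", "two", "three"), fill = "left")
filled_right <- short %>% separate(x, c("one", "two", "three"), fill = "right")

merged
filled_left
filled_right
```

Neither fill = "left" nor fill = "right" emits the warning that fill = "warn" produced above.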

Exercise 2

table5 %>% unite(new, century, year, sep = "", remove = TRUE)
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583
table5 %>% unite(new, century, year, sep = "", remove = FALSE)
## # A tibble: 6 x 5
##   country     new   century year  rate             
##   <chr>       <chr> <chr>   <chr> <chr>            
## 1 Afghanistan 1999  19      99    745/19987071     
## 2 Afghanistan 2000  20      00    2666/20595360    
## 3 Brazil      1999  19      99    37737/172006362  
## 4 Brazil      2000  20      00    80488/174504898  
## 5 China       1999  19      99    212258/1272915272
## 6 China       2000  20      00    213766/1280428583

The remove argument controls whether the input columns are dropped from the result. It defaults to TRUE, so unite() and separate() replace the old columns with the new ones. You would set it to FALSE to keep the old columns alongside the new columns.
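
The same argument on separate() can be sketched with the rate column of table5. The names kept and dropped are my own, and convert = TRUE is an extra choice of mine so that the new columns come out as numbers.

```r
library(dplyr)
library(tidyr)

# remove = FALSE keeps the original rate column next to the new columns.
kept <- table5 %>%
  separate(rate, into = c("cases", "population"), remove = FALSE, convert = TRUE)

# The default (remove = TRUE) drops rate after splitting it.
dropped <- table5 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)

names(kept)
names(dropped)
```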

Chapter 9 (Pg. 163)

Exercise 2

stocks <- tibble(year = c(2015,2015,2015,2015,2016,2016,2016), qtr = c( 1, 2, 3, 4, 2, 3, 4), return = c(1.88,0.59,0.35, NA,0.92,0.17,2.66))
stocks %>% spread(year, return, fill = 15)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88  15   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  15      2.66
stocks %>% complete(year, qtr, fill = list(return = 15))
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  15   
## 5  2016     1  15   
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

The fill argument in spread() takes a single value and uses it for every missing cell in the spread columns, both explicit NAs and combinations that were absent from the data. The fill argument in complete() takes a named list, so you can specify a different replacement value for each column. For example, fill = list(return = 15) replaces the missing return values with 15, while complete() itself adds the row for the year and qtr combination that was absent from the data (2016, quarter 1).
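
To separate the two jobs that complete() does, here is a sketch that first runs it without fill and then with a named list. The objects completed and filled and the replacement value 0 are my own choices for illustration.

```r
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Without fill, complete() only adds the absent 2016 Q1 row, with return = NA.
completed <- stocks %>% complete(year, qtr)
sum(is.na(completed$return))

# fill is a named list keyed by column, so each column can get its own value.
filled <- stocks %>% complete(year, qtr, fill = list(return = 0))
filled
```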

Chapter 9 (Pg. 168)

Exercise 2

who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
##    country      year var   sex   age   value
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # … with 76,036 more rows
who %>% gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  separate(key, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [73467,
## 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476, 73477, 73478,
## 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
## # A tibble: 76,046 x 6
##    country      year var   sex   age   value
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # … with 76,036 more rows

When the mutate() step is skipped, the newrel codes (for example newrel_m014) contain only one underscore, so separate() splits them into two pieces instead of three. The third column, sexage, is then filled with NA for those rows, which is why the warning reports 2580 rows with missing pieces.
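
The number of pieces can be checked directly on the column names, as in this sketch. It uses base R strsplit() and sub() in place of the stringr call, and the names codes, pieces, two_piece, and fixed are my own.

```r
library(tidyr)

# The first four columns of who are country, iso2, iso3 and year;
# the rest are the case-count codes.
codes  <- names(who)[-(1:4)]
pieces <- lengths(strsplit(codes, "_"))
table(pieces)

# The codes that split into only two pieces are the newrel ones.
two_piece <- codes[pieces == 2]
two_piece

# After inserting the underscore, every code splits into three pieces.
fixed <- sub("newrel", "new_rel", codes)
all(lengths(strsplit(fixed, "_")) == 3)
```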

Exercise 3

who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
##    country      year var   sex   age   value
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # … with 76,036 more rows
who %>% count(country)
## # A tibble: 219 x 2
##    country                 n
##    <chr>               <int>
##  1 Afghanistan            34
##  2 Albania                34
##  3 Algeria                34
##  4 American Samoa         34
##  5 Andorra                34
##  6 Angola                 34
##  7 Anguilla               34
##  8 Antigua and Barbuda    34
##  9 Argentina              34
## 10 Armenia                34
## # … with 209 more rows
who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%
  separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 9
##    country     iso2  iso3   year new   var   sex   age   value
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m     014       0
##  2 Afghanistan AF    AFG    1998 new   sp    m     014      30
##  3 Afghanistan AF    AFG    1999 new   sp    m     014       8
##  4 Afghanistan AF    AFG    2000 new   sp    m     014      52
##  5 Afghanistan AF    AFG    2001 new   sp    m     014     129
##  6 Afghanistan AF    AFG    2002 new   sp    m     014      90
##  7 Afghanistan AF    AFG    2003 new   sp    m     014     127
##  8 Afghanistan AF    AFG    2004 new   sp    m     014     139
##  9 Afghanistan AF    AFG    2005 new   sp    m     014     151
## 10 Afghanistan AF    AFG    2006 new   sp    m     014     193
## # … with 76,036 more rows
who %>% count(country, iso2, iso3)
## # A tibble: 219 x 4
##    country             iso2  iso3      n
##    <chr>               <chr> <chr> <int>
##  1 Afghanistan         AF    AFG      34
##  2 Albania             AL    ALB      34
##  3 Algeria             DZ    DZA      34
##  4 American Samoa      AS    ASM      34
##  5 Andorra             AD    AND      34
##  6 Angola              AO    AGO      34
##  7 Anguilla            AI    AIA      34
##  8 Antigua and Barbuda AG    ATG      34
##  9 Argentina           AR    ARG      34
## 10 Armenia             AM    ARM      34
## # … with 209 more rows

I agree with the claim that iso2 and iso3 are redundant. If we do not remove them, they simply report two- and three-letter abbreviations of the country names. I counted the rows per country with and without iso2 and iso3, and both counts returned the same 219 groups with the same n, so each country has exactly one iso2 and iso3 pair and the variables themselves add no new information.
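
A more direct check, offered as a sketch: count how many distinct iso2 and iso3 combinations each country has. The names codes_per_country and n_codes are my own.

```r
library(dplyr)
library(tidyr)

# One row per distinct (country, iso2, iso3) combination,
# then the number of combinations per country.
codes_per_country <- who %>%
  distinct(country, iso2, iso3) %>%
  count(country, name = "n_codes")

# Countries with more than one code combination would show up here.
filter(codes_per_country, n_codes > 1)
```

If the filter returns no rows, every country maps to a single pair of codes.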

Exercise 4

who1 <- who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases",na.rm = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)
who1 %>% group_by(country, year, sex) %>% summarize(total = sum(cases)) %>%
ggplot(aes(year, total, group = country)) + geom_line(alpha = 1) + facet_wrap(~ sex)

Open Response

For this exercise, I want you to use the dataset your group selected for your final project and create 4 plots (bar plot, box plot, histogram, scatter plot). You should use the ggplot() function with different geoms for all figures. All the plots should use different variables (4 total variables), and these plots can be used in your exploratory data analysis.

library(readr)
crime <- read_csv("crime.csv")
## Parsed with column specification:
## cols(
##   INCIDENT_NUMBER = col_character(),
##   OFFENSE_CODE = col_character(),
##   OFFENSE_CODE_GROUP = col_character(),
##   OFFENSE_DESCRIPTION = col_character(),
##   DISTRICT = col_character(),
##   REPORTING_AREA = col_double(),
##   SHOOTING = col_logical(),
##   OCCURRED_ON_DATE = col_datetime(format = ""),
##   YEAR = col_double(),
##   MONTH = col_double(),
##   DAY_OF_WEEK = col_character(),
##   HOUR = col_double(),
##   UCR_PART = col_character(),
##   STREET = col_character(),
##   Lat = col_double(),
##   Long = col_double(),
##   Location = col_character()
## )
## Warning: 1019 parsing failures.
##  row      col           expected actual        file
## 1296 SHOOTING 1/0/T/F/TRUE/FALSE      Y 'crime.csv'
## 1861 SHOOTING 1/0/T/F/TRUE/FALSE      Y 'crime.csv'
## 3260 SHOOTING 1/0/T/F/TRUE/FALSE      Y 'crime.csv'
## 3261 SHOOTING 1/0/T/F/TRUE/FALSE      Y 'crime.csv'
## 4108 SHOOTING 1/0/T/F/TRUE/FALSE      Y 'crime.csv'
## .... ........ .................. ...... ...........
## See problems(...) for more details.
head(crime)
## # A tibble: 6 x 17
##   INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GR… OFFENSE_DESCRIP… DISTRICT
##   <chr>           <chr>        <chr>            <chr>            <chr>   
## 1 I182070945      00619        Larceny          LARCENY ALL OTH… D14     
## 2 I182070943      01402        Vandalism        VANDALISM        C11     
## 3 I182070941      03410        Towed            TOWED MOTOR VEH… D4      
## 4 I182070940      03114        Investigate Pro… INVESTIGATE PRO… D4      
## 5 I182070938      03114        Investigate Pro… INVESTIGATE PRO… B3      
## 6 I182070936      03820        Motor Vehicle A… M/V ACCIDENT IN… C11     
## # … with 12 more variables: REPORTING_AREA <dbl>, SHOOTING <lgl>,
## #   OCCURRED_ON_DATE <dttm>, YEAR <dbl>, MONTH <dbl>, DAY_OF_WEEK <chr>,
## #   HOUR <dbl>, UCR_PART <chr>, STREET <chr>, Lat <dbl>, Long <dbl>,
## #   Location <chr>
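
The parsing failures above come from readr guessing SHOOTING as logical and then meeting the value Y. One way to avoid them, sketched here on a tiny made-up sample rather than the real crime.csv, is to read the column as character and convert it afterwards. The sample rows, the name crime_sample, and the rule that a blank means no shooting are all my assumptions, and passing literal data with I() assumes readr 2.0 or later.

```r
library(readr)
library(dplyr)

# Hypothetical rows that mimic the SHOOTING column; not taken from crime.csv.
sample_csv <- I("INCIDENT_NUMBER,SHOOTING,HOUR\nI001,,13\nI002,Y,2\n")

crime_sample <- read_csv(
  sample_csv,
  col_types = cols(
    INCIDENT_NUMBER = col_character(),
    SHOOTING = col_character(),  # read as text so "Y" is not a parsing failure
    HOUR = col_double()
  )
) %>%
  # Assumption: "Y" marks a shooting and a blank (read as NA) means none.
  mutate(SHOOTING = !is.na(SHOOTING) & SHOOTING == "Y")

crime_sample
```

For the real file, the same col_types idea would be passed to read_csv("crime.csv"), with the remaining columns either listed or left to be guessed.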

Step 1: Bar plot

ggplot(data = crime) +
      geom_bar(mapping = aes(x = OFFENSE_CODE_GROUP))

Step 2: Box plot

ggplot(data = crime, mapping = aes(x = DAY_OF_WEEK, y = HOUR)) +
      geom_boxplot()

Step 3: Histogram

ggplot(data = crime) +
      geom_histogram(mapping = aes(x = HOUR), binwidth = 1)

Step 4: Scatterplot

  ggplot(data = crime) +
      geom_point(mapping = aes(x = OFFENSE_CODE, y = HOUR))