Exercises: 2,3 (Pg. 151); 2,4 (Pg. 156); 1,2 (Pgs. 160-161); 2 (Pg. 163); 2,3,4 (Pg. 168), Open Response
Submission: Submit via an electronic document on Sakai. Must be submitted as an HTML file generated in RStudio. All assigned problems are chosen according to the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display R code. If the question requires a figure, make sure you display a figure. A lot of the questions can be answered in written response, but require R code and/or figures for understanding and explaining.
#rate for table2
cases <- table2 %>% filter(year %in% c(1999, 2000), type == "cases")
population <- table2 %>% filter(year %in% c(1999, 2000), type == "population")
table2new <- tibble(
country = cases$country,
year = cases$year,
cases = cases$count,
population = population$count) %>% mutate(rate = cases / population * 10000)
head(table2new)
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
#rate for table4a + table4b
table4new <- tibble(
country = table4a$country) %>%
mutate(rate1999 = table4a[['1999']] / table4b[['1999']] * 10000,
rate2000 = table4a[['2000']] / table4b[['2000']] * 10000)
head(table4new)
## # A tibble: 3 x 3
## country rate1999 rate2000
## <chr> <dbl> <dbl>
## 1 Afghanistan 0.373 1.29
## 2 Brazil 2.19 4.61
## 3 China 1.67 1.67
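The second representation (table4a + table4b) was the easiest to work with: cases and population already sit in separate tables with matching rows and year columns, so the rate is a single column-wise division inside mutate(). The first representation (table2) was the hardest, because cases and population share one count column and I had to filter them apart before I could mutate.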
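As a rough alternative for table2, and assuming the tidyr and dplyr packages are installed, spread() can move cases and population into their own columns so that no manual filtering is needed. The name table2_rate is only illustrative.

```r
library(dplyr)
library(tidyr)

# table2 ships with tidyr. spread() turns the values of `type`
# ("cases", "population") into columns filled from `count`.
table2_rate <- table2 %>%
  spread(key = type, value = count) %>%
  mutate(rate = cases / population * 10000)

table2_rate
```

This sketch does not rely on the cases and population rows being in the same order, which the filter-based version does.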
library(ggplot2)
ggplot(table2new, aes(year, cases)) + geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country))
I did not change much - I simply used table2new as the data. The plot shows the change in cases over time for each country.
table4a %>% gather("1999", "2000", key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
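The original code fails because 1999 and 2000 are not quoted. Bare numbers are interpreted as column positions, so gather() looks for the 1999th and 2000th columns, which do not exist. Because the column names are non-syntactic (they start with a digit), they must be wrapped in quotation marks or backticks.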
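A small sketch of the difference, assuming tidyr is installed. The object names tidy4a and bare are only illustrative, and try() is used here so the failing call does not stop the script.

```r
library(tidyr)

# Backticks (or quotes) refer to the columns literally named 1999 and 2000.
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

# Bare numbers are treated as column positions, so this call errors;
# try() captures the error instead of halting.
bare <- try(
  table4a %>% gather(1999, 2000, key = "year", value = "cases"),
  silent = TRUE
)
inherits(bare, "try-error")
```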
preg <- tribble(~pregnant, ~male, ~female, "yes", NA, 10, "no", 20, 12)
head(preg)
## # A tibble: 2 x 3
## pregnant male female
## <chr> <dbl> <dbl>
## 1 yes NA 10
## 2 no 20 12
preg %>% gather("male", "female", key = "gender", value = "cases")
## # A tibble: 4 x 3
## pregnant gender cases
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 no male 20
## 3 yes female 10
## 4 no female 12
I needed to gather the tibble to tidy it because the column names male and female in the original tibble are values of a variable rather than names of variables. The variables are pregnancy status (pregnant), gender, and the number of cases.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "warn")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
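The extra argument controls what happens when a value splits into more pieces than there are new columns. By default separate() warns and drops the extra pieces; extra = "drop" drops them silently, and extra = "merge" keeps them by merging them into the last column. The fill argument controls what happens when a value splits into too few pieces. By default separate() warns and fills the missing pieces with NA on the right; fill = "right" and fill = "left" fill with NA on the chosen side without a warning.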
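The other options can be sketched on the same toy tibbles, assuming tidyr, tibble, and dplyr are installed. The names merged and filled are only illustrative.

```r
library(dplyr)
library(tibble)
library(tidyr)

# extra = "merge": the surplus piece stays attached to the last column.
merged <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "merge")
merged$three   # second value keeps both "f" and "g"

# fill = "left": the NA is placed in the first column instead of the last.
filled <- tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
  separate(x, c("one", "two", "three"), fill = "left")
filled
```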
table5 %>% unite(new, century, year, sep = "", remove = TRUE)
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table5 %>% unite(new, century, year, sep = "", remove = FALSE)
## # A tibble: 6 x 5
## country new century year rate
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 19 99 745/19987071
## 2 Afghanistan 2000 20 00 2666/20595360
## 3 Brazil 1999 19 99 37737/172006362
## 4 Brazil 2000 20 00 80488/174504898
## 5 China 1999 19 99 212258/1272915272
## 6 China 2000 20 00 213766/1280428583
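The remove argument controls whether the input columns are dropped from the output of unite() and separate(). It defaults to TRUE, which removes them. You would set it to FALSE to keep the old columns alongside the new ones, for example to check the result against the original values.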
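The same argument can be tried with separate(). This is a minimal sketch on table5, assuming tidyr is installed; the name table5_split is only illustrative.

```r
library(tidyr)

# remove = FALSE keeps the original rate column next to the new
# cases and population columns; convert = TRUE turns the pieces into numbers.
table5_split <- table5 %>%
  separate(rate, into = c("cases", "population"),
           sep = "/", remove = FALSE, convert = TRUE)

table5_split
```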
stocks <- tibble(year = c(2015,2015,2015,2015,2016,2016,2016), qtr = c( 1, 2, 3, 4, 2, 3, 4), return = c(1.88,0.59,0.35, NA,0.92,0.17,2.66))
stocks %>% spread(year, return, fill = 15)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 15
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 15 2.66
stocks %>% complete(year, qtr, fill = list(return = 15))
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 15
## 5 2016 1 15
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
The fill argument in spread() takes a single value and uses it to replace every missing value in the spread columns, both the explicit NA (2015, qtr 4) and the implicit missing combination (2016, qtr 1). The fill argument in complete() takes a named list instead, so each column can get its own replacement value. complete() first adds a row for every missing combination of year and qtr, then replaces the missing values in return with 15, so (2015, qtr 4) and the newly added (2016, qtr 1) row both end up with 15.
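To illustrate the per-column behaviour of the list, here is a small sketch with a made-up tibble. The tibble stocks2, its note column, and the replacement values are hypothetical and not part of the exercise.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical data: one explicit NA row and one implicit missing
# combination (2016, qtr 2 is simply absent).
stocks2 <- tibble(
  year   = c(2015, 2015, 2016),
  qtr    = c(1, 2, 1),
  return = c(1.88, NA, 0.92),
  note   = c("ok", NA, "ok")
)

# Each column named in the list gets its own replacement value.
completed <- stocks2 %>%
  complete(year, qtr, fill = list(return = 0, note = "missing"))

completed
```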
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
## country year var sex age value
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 76,036 more rows
who %>% gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [73467,
## 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476, 73477, 73478,
## 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
## # A tibble: 76,046 x 6
## country year var sex age value
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 76,036 more rows
When the mutate() step is skipped, the newrel codes keep only one underscore (for example newrel_f65), so separate() splits them into two pieces instead of three. The third column, sexage, is filled with NA for those rows, and the sex/age code ends up in the var column. This is why the warning reports 2,580 rows with missing pieces.
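The effect can be seen on a single code, assuming the stringr package is installed. The names raw_pieces and fixed_pieces are only illustrative.

```r
library(stringr)

# Without the fix, "newrel_f65" has one underscore and yields two pieces.
raw_pieces <- str_split("newrel_f65", "_")[[1]]
raw_pieces

# After str_replace() it matches the new_<type>_<sexage> pattern: three pieces.
fixed_pieces <- str_split(str_replace("newrel_f65", "newrel", "new_rel"), "_")[[1]]
fixed_pieces
```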
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
## country year var sex age value
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 76,036 more rows
who %>% count(country)
## # A tibble: 219 x 2
## country n
## <chr> <int>
## 1 Afghanistan 34
## 2 Albania 34
## 3 Algeria 34
## 4 American Samoa 34
## 5 Andorra 34
## 6 Angola 34
## 7 Anguilla 34
## 8 Antigua and Barbuda 34
## 9 Argentina 34
## 10 Armenia 34
## # … with 209 more rows
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 9
## country iso2 iso3 year new var sex age value
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 014 0
## 2 Afghanistan AF AFG 1998 new sp m 014 30
## 3 Afghanistan AF AFG 1999 new sp m 014 8
## 4 Afghanistan AF AFG 2000 new sp m 014 52
## 5 Afghanistan AF AFG 2001 new sp m 014 129
## 6 Afghanistan AF AFG 2002 new sp m 014 90
## 7 Afghanistan AF AFG 2003 new sp m 014 127
## 8 Afghanistan AF AFG 2004 new sp m 014 139
## 9 Afghanistan AF AFG 2005 new sp m 014 151
## 10 Afghanistan AF AFG 2006 new sp m 014 193
## # … with 76,036 more rows
who %>% count(country, iso2, iso3)
## # A tibble: 219 x 4
## country iso2 iso3 n
## <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 34
## 2 Albania AL ALB 34
## 3 Algeria DZ DZA 34
## 4 American Samoa AS ASM 34
## 5 Andorra AD AND 34
## 6 Angola AO AGO 34
## 7 Anguilla AI AIA 34
## 8 Antigua and Barbuda AG ATG 34
## 9 Argentina AR ARG 34
## 10 Armenia AM ARM 34
## # … with 209 more rows
I agree with the claim that iso2 and iso3 are redundant: they are just two- and three-letter codes for the country. Counting by country alone and by country, iso2, and iso3 together gives the same 219 groups with the same counts, so each country has exactly one iso2 and one iso3 value and the codes add no information beyond the country name.
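A more direct check, sketched under the assumption that tidyr and dplyr are installed, is to look for any country that maps to more than one code combination. The name multi_code is only illustrative.

```r
library(dplyr)
library(tidyr)

# Keep one row per (country, iso2, iso3) combination, then flag any
# country that appears with more than one combination.
multi_code <- who %>%
  distinct(country, iso2, iso3) %>%
  count(country) %>%
  filter(n > 1)

nrow(multi_code)   # zero rows means the codes are fully determined by country
```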
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases",na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "type", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
who1 %>% group_by(country, year, sex) %>% summarize(total = sum(cases)) %>%
ggplot(aes(year, total, group = country)) + geom_line(alpha = 1) + facet_wrap(~ sex)
For this exercise, I want you to use the dataset your group selected for your final project and create 4 plots (bar plot, box plot, histogram, scatter plot). You should use the ggplot() function with different geoms for all figures. All the plots should use different variables (4 total variables), and these plots can be used in your exploratory data analysis.
library(readr)
crime <- read_csv("crime.csv")
## Parsed with column specification:
## cols(
## INCIDENT_NUMBER = col_character(),
## OFFENSE_CODE = col_character(),
## OFFENSE_CODE_GROUP = col_character(),
## OFFENSE_DESCRIPTION = col_character(),
## DISTRICT = col_character(),
## REPORTING_AREA = col_double(),
## SHOOTING = col_logical(),
## OCCURRED_ON_DATE = col_datetime(format = ""),
## YEAR = col_double(),
## MONTH = col_double(),
## DAY_OF_WEEK = col_character(),
## HOUR = col_double(),
## UCR_PART = col_character(),
## STREET = col_character(),
## Lat = col_double(),
## Long = col_double(),
## Location = col_character()
## )
## Warning: 1019 parsing failures.
## row col expected actual file
## 1296 SHOOTING 1/0/T/F/TRUE/FALSE Y 'crime.csv'
## 1861 SHOOTING 1/0/T/F/TRUE/FALSE Y 'crime.csv'
## 3260 SHOOTING 1/0/T/F/TRUE/FALSE Y 'crime.csv'
## 3261 SHOOTING 1/0/T/F/TRUE/FALSE Y 'crime.csv'
## 4108 SHOOTING 1/0/T/F/TRUE/FALSE Y 'crime.csv'
## .... ........ .................. ...... ...........
## See problems(...) for more details.
head(crime)
## # A tibble: 6 x 17
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GR… OFFENSE_DESCRIP… DISTRICT
## <chr> <chr> <chr> <chr> <chr>
## 1 I182070945 00619 Larceny LARCENY ALL OTH… D14
## 2 I182070943 01402 Vandalism VANDALISM C11
## 3 I182070941 03410 Towed TOWED MOTOR VEH… D4
## 4 I182070940 03114 Investigate Pro… INVESTIGATE PRO… D4
## 5 I182070938 03114 Investigate Pro… INVESTIGATE PRO… B3
## 6 I182070936 03820 Motor Vehicle A… M/V ACCIDENT IN… C11
## # … with 12 more variables: REPORTING_AREA <dbl>, SHOOTING <lgl>,
## # OCCURRED_ON_DATE <dttm>, YEAR <dbl>, MONTH <dbl>, DAY_OF_WEEK <chr>,
## # HOUR <dbl>, UCR_PART <chr>, STREET <chr>, Lat <dbl>, Long <dbl>,
## # Location <chr>
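The parsing failures above come from SHOOTING being guessed as logical while the file stores "Y". One possible way around this, sketched on a tiny stand-in file rather than the real crime.csv, is to read the column as character and convert it afterwards. The rows below are made up, and treating a blank as "no shooting" is an assumption about the dataset.

```r
library(dplyr)
library(readr)

# Tiny stand-in for crime.csv (hypothetical rows, only two of the columns).
tmp <- tempfile(fileext = ".csv")
writeLines(c("INCIDENT_NUMBER,SHOOTING", "A1,", "A2,Y"), tmp)

# Read SHOOTING as text so "Y" is not rejected, then turn it into a logical.
# Assumption: "Y" marks a shooting and a blank means none was recorded.
demo <- read_csv(
  tmp,
  col_types = cols(
    INCIDENT_NUMBER = col_character(),
    SHOOTING = col_character()
  )
) %>%
  mutate(SHOOTING = !is.na(SHOOTING) & SHOOTING == "Y")

demo
```

The same col_types argument could be passed when reading the full file, with the remaining columns left to be guessed.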
ggplot(data = crime) +
geom_bar(mapping = aes(x = OFFENSE_CODE_GROUP))
ggplot(data = crime, mapping = aes(x = DAY_OF_WEEK, y = HOUR)) +
geom_boxplot()
ggplot(data = crime) +
geom_histogram(mapping = aes(x = HOUR), binwidth = 1)
ggplot(data = crime) +
geom_point(mapping = aes(x = OFFENSE_CODE, y = HOUR))