Preparation Instructions

EXERCISES

Tidy Data

The United Nations collects estimates of international migration by age, sex, and origin. In this Challenge Problem, you are going to examine the 2015 data on migrants by destination country and origin country.

For Questions 1 to 3, refer to the United Nations’ migration spreadsheet (worksheet titled Table 16) found on the Canvas page (titled UN_MigrantStockByOriginAndDestination_2015.xlsx).

  1. Identify at least three things that make this spreadsheet untidy.
  • The column headers are values (origin countries) rather than variable names.
  • Variables are stored in both rows and columns: destinations run down the rows while origins run across the columns.
  • Multiple types of observational units are stored in a single table, for example individual countries alongside regional totals.
  2. Describe how you would make this spreadsheet into a tidy dataset. Be sure to clearly define the cases (or observational units) and the variables in your tidy dataset.
  • Each case would be one pairing of an origin country with a destination country (or region) in 2015. The variables would be Country of Origin, Destination Country/Region, and Migrant Stock. In the tidy dataset, each row represents one case and each column represents one variable, so the origin columns would be gathered into a single Country of Origin column with the counts in Migrant Stock.
  3. While tidy data is useful for analyzing the data within software, when might a spreadsheet or presentation like this one be relevant?
  • The wide layout is easier for a person to scan. For example, if an infectious disease outbreak occurred, public health professionals could read across a destination's row to see which countries migrants are arriving from and assess whether the disease may have been introduced from one of them.

Extra challenge (not required): Re-write the spreadsheet in a tidy form for the first 10 cases using Google Sheets and provide a link to the data in this document. Be sure to set the Share settings so that “Anyone with the link” can view, so that the link works.

R programming

Note: The following questions are adapted from the exercises presented in Chapter 3 of the Data Computing textbook.

  4. Explain why the following sentence would result in an error message:
Result <- %>% filter(BabyNames, name == "Prince")
  • The pipe (%>%) is a binary operator that needs an expression on its left, but here it directly follows the assignment arrow (<-), so R cannot parse the statement. Moving the data table (BabyNames) in front of the pipe, and out of filter(), fixes it: Result <- BabyNames %>% filter(name == "Prince").
  5. Consider these R expressions. (You don’t have to know what the various functions do to solve this problem.)
# Wrangling the data: to count the number of babies named Prince, grouped by year and sex
Princes <-
  BabyNames %>%
  filter(name == "Prince") %>%
  group_by(year, sex) %>%
  summarise(yearlyTotal = sum(count))

# Graphing the results
ggplot(data = Princes, aes(x = year, y = yearlyTotal)) + 
  geom_point(aes(color = sex)) + 
  geom_vline(xintercept = 1978)

There are several kinds of components in the above expressions.

  a. function name

  b. data table name

  c. variable name

  d. argument name

  e. constant

Match each of the following to what kind of component ( (a) through (e) ) it is.

  • ggplot: (a)

  • data =: (d)

  • Princes: (b)

  • aes: (a)

  • x =: (d)

  • year: (c)

  • geom_point: (a)

  • color =: (d)

  • xintercept =: (d)

  • 1978: (e)

Putting It All Together

Now let’s put the topics of R Markdown, R programming, and tidy data together. You are going to do this by using the msleep data table from the {ggplot2} package.

  6. Load the {ggplot2} package and the msleep dataset. Be sure to show your work in the code chunk below.
library(ggplot2)
data(msleep)
  7. Examine the msleep dataset in at least two ways (using at least two functions). Also, modify the code chunk options so that only the output appears in the knitted document and the code itself is hidden. Be sure to show your work.
## tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
##  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
##  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
##  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...
## [1] 83
##  [1] "name"         "genus"        "vore"         "order"        "conservation"
##  [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
## [11] "bodywt"
## [1] "carni"   "omni"    "herbi"   NA        "insecti"
  8. How many cases are there in the msleep dataset? How many variables?
  • There are 83 cases and 11 variables in the msleep dataset.
  9. Define what a case is (in the context of the dataset).
  • Each case is one mammal species, recorded with its sleep measurements and related traits.
  10. Describe the attribute or measurement that is contained in the brainwt variable.
  • The brainwt variable records each animal’s brain weight in kilograms.
  11. How many categories are in the vore variable and what do they represent? Note: the documentation in the help file for the msleep data set is not consistent with the number of groups in the data. Please base your answer on the number of categories in the actual data.
  • The data contain 4 named categories in vore, plus missing values (NA), which gives 5 distinct values in total.

    • carni: carnivores (meat-eating animals)

    • omni: omnivores (animals that eat both plants and animals)

    • herbi: herbivores (plant-eating animals)

    • insecti: insectivores (insect-eating animals)

    • NA: the diet was not recorded (missing value)


  12. Tweak each of the following R commands so that they run correctly [Note: Take out eval=FALSE in the options of the code chunk so that the code executes in your assignment]:
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(data = msleep) + 
  geom_point(mapping = aes(x = bodywt, y = sleep_total))

omnivore <- msleep %>%
  filter(vore == "omni")
omnivore
---
title: "Week 2 Challenge Problem"
author: "Bilsuma Adema"
output:
  html_document:
    df_print: paged
    code_download: true
---

### Preparation Instructions

- Reminder: Make sure to save this file (File > Save As ...) in the folder you created for this class (NOT in the Downloads folder!).

- Knit your document to HTML frequently so you can more easily find your mistakes. [Things to consider when you knit your documents: Does everything look as you expected? If not, try to figure out the problem and fix it.]

- Be sure to change the author in the YAML to your name. Remember to keep it inside the quotes.


#### EXERCISES

**Tidy Data**

The [United Nations](https://www.un.org/development/desa/pd/content/international-migrant-stock) collects estimates of international migration by age, sex, and origin. In this Challenge Problem, you are going to examine the 2015 data on migrants by destination country and origin country.

For Questions 1 to 3, refer to the United Nations' migration spreadsheet (worksheet titled Table 16) found on the Canvas page (titled *UN_MigrantStockByOriginAndDestination_2015.xlsx*). 

(@) Identify at least three things that make this spreadsheet untidy. 
* The column headers are values (origin countries) rather than variable names.
* Variables are stored in both rows and columns: destinations run down the rows while origins run across the columns.
* Multiple types of observational units are stored in a single table, for example individual countries alongside regional totals.

(@) Describe how you would make this spreadsheet into a tidy dataset. Be sure to clearly define the cases (or observational units) and the variables in your tidy dataset. 
* Each case would be one pairing of an origin country with a destination country (or region) in 2015. The variables would be Country of Origin, Destination Country/Region, and Migrant Stock. In the tidy dataset, each row represents one case and each column represents one variable, so the origin columns would be gathered into a single Country of Origin column with the counts in Migrant Stock.

(@) While tidy data is useful for analyzing the data within software, when might a spreadsheet or presentation like this one be relevant?
* The wide layout is easier for a person to scan. For example, if an infectious disease outbreak occurred, public health professionals could read across a destination's row to see which countries migrants are arriving from and assess whether the disease may have been introduced from one of them.
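
The reshaping described in Question 2 can be sketched in R. This is only a minimal illustration, assuming the `{tidyr}` package is installed; the `wide` table, its column names, and its numbers are invented stand-ins for the spreadsheet layout, not values from the UN data.

```{r}
library(tidyr)

# Made-up wide table shaped like the spreadsheet: one row per destination,
# one column per origin country (all numbers are invented)
wide <- data.frame(
  Destination = c("Canada", "Kenya"),
  Mexico      = c(100, 5),
  Somalia     = c(30, 400)
)

# Gather the origin columns into (Origin, MigrantStock) pairs,
# giving one row per destination-origin case
tidy <- pivot_longer(
  wide,
  cols      = -Destination,
  names_to  = "Origin",
  values_to = "MigrantStock"
)
tidy
```

Each row of `tidy` is now one case (a destination paired with an origin), and each column is one variable.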

**Extra challenge (not required):** Re-write the spreadsheet in a tidy form for the first 10 cases using Google Sheets and provide a link to the data in this document. Be sure to set the Share settings so that "Anyone with the link" can view, so that the link works.

**R programming**

Note: The following questions are adapted from the exercises presented in Chapter 3 of the [Data Computing](https://dtkaplan.github.io/DataComputingEbook/chap-basic-r-commands.html#chap:basic-r-commands) textbook.

(@) Explain why the following sentence would result in an error message:

```{r eval=FALSE}
Result <- %>% filter(BabyNames, name == "Prince")
```
* The pipe (`%>%`) is a binary operator that needs an expression on its left, but here it directly follows the assignment arrow (`<-`), so R cannot parse the statement. Moving the data table (`BabyNames`) in front of the pipe, and out of `filter()`, fixes it: `Result <- BabyNames %>% filter(name == "Prince")`.

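
A runnable sketch of the corrected statement, assuming `{dplyr}` is installed. `BabyNamesToy` is a hypothetical stand-in for `BabyNames` with invented values, used so that the chunk runs without the course data.

```{r}
library(dplyr)

# Hypothetical stand-in for BabyNames with the same column names (values invented)
BabyNamesToy <- data.frame(
  name  = c("Prince", "Mary", "Prince"),
  sex   = c("M", "F", "M"),
  count = c(10, 200, 12),
  year  = c(1978, 1978, 1979)
)

# The data table sits to the left of the pipe and becomes filter()'s first argument
Result <- BabyNamesToy %>%
  filter(name == "Prince")
Result
```
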
(@) Consider these R expressions. (You don’t have to know what the various functions do to solve this problem.)

```{r eval=FALSE}
# Wrangling the data: to count the number of babies named Prince, grouped by year and sex
Princes <-
  BabyNames %>%
  filter(name == "Prince") %>%
  group_by(year, sex) %>%
  summarise(yearlyTotal = sum(count))

# Graphing the results
ggplot(data = Princes, aes(x = year, y = yearlyTotal)) + 
  geom_point(aes(color = sex)) + 
  geom_vline(xintercept = 1978)
```

There are several kinds of components in the above expressions.

a. function name

#. data table name

#. variable name

#. argument name

#. constant

Match each of the following to what kind of component ( (a) through (e) ) it is.

- `ggplot`: (a)

- `data =`: (d)

- `Princes`: (b)

- `aes`: (a)

- `x =`: (d)

- `year`: (c)

- `geom_point`: (a)

- `color =`: (d)

- `xintercept =`: (d)

- `1978`: (e)

**Putting It All Together**

Now let's put the topics of R Markdown, R programming, and tidy data together. You are going to do this by using the `msleep` data table from the `{ggplot2}` package. 

(@) Load the `{ggplot2}` package and the `msleep` dataset. Be sure to show your work in the code chunk below.

```{r}
library(ggplot2)
data(msleep)
```

(@) Examine the `msleep` dataset in at least two ways (using at least two functions). Also, modify the code chunk options so that only the output appears in the knitted document and the code itself is hidden. Be sure to show your work.

```{r echo=FALSE}
str(msleep)
nrow(msleep)
names(msleep)
unique(msleep$vore)
```

(@) How many cases are there in the `msleep` dataset? How many variables?
* There are 83 cases and 11 variables in the `msleep` dataset.
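
These counts can be checked directly. A small sketch using base R's `dim()`, which returns the number of rows (cases) followed by the number of columns (variables); the result should agree with the 83 × 11 reported by `str(msleep)`:

```{r}
library(ggplot2)  # provides the msleep data table

# Rows (cases) first, then columns (variables)
dim(msleep)
```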

(@) Define what a case is (in the context of the dataset).
* Each case is one mammal species, recorded with its sleep measurements and related traits.

(@) Describe the attribute or measurement that is contained in the `brainwt` variable.
* The `brainwt` variable records each animal's brain weight in kilograms.

(@) How many categories are in the `vore` variable and what do they represent? Note: the documentation in the help file for the `msleep` data set is not consistent with the number of groups in the data. Please base your answer on the number of categories in the actual data. 
* The data contain 4 named categories in `vore`, plus missing values (`NA`), which gives 5 distinct values in total.
  * `carni`: carnivores (meat-eating animals)
  * `omni`: omnivores (animals that eat both plants and animals)
  * `herbi`: herbivores (plant-eating animals)
  * `insecti`: insectivores (insect-eating animals)
  * `NA`: the diet was not recorded (missing value)
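
One way to confirm the categories from the data itself, sketched with base R functions only. `table()` drops `NA` by default, so `useNA = "ifany"` is passed to keep the missing values visible in the tally:

```{r}
library(ggplot2)  # provides the msleep data table

# Tally each value of vore, keeping NA as its own entry
table(msleep$vore, useNA = "ifany")

# Number of categories once missing values are set aside
length(unique(na.omit(msleep$vore)))
```

Whether `NA` is counted as a category is a judgment call; the tally shows both readings side by side.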

(@) Tweak each of the following R commands so that they run correctly [Note: Take out `eval=FALSE` in the options of the code chunk so that the code executes in your assignment]:

```{r}
library(ggplot2)
library(dplyr)

ggplot(data = msleep) + 
  geom_point(mapping = aes(x = bodywt, y = sleep_total))

omnivore <- msleep %>%
  filter(vore == "omni")
omnivore
```

