Question 1. Wrong file extension

As the instructor mentioned in class, the file extensions does not dictate the contents of a file. We receive student grade file which is supposed to be a csv file. We are interested in students with letter_grade “A”. But our filtering does not work. Please fix the code below:

download.file("https://s3-us-west-2.amazonaws.com/veri-analizi/student_grades.csv","student_grades.csv")
grades <- read_csv("student_grades.csv")

grades %>% 
  filter(letter_grade=="A")

The file is located at: https://s3-us-west-2.amazonaws.com/veri-analizi/student_grades.csv

Solution 1

The file is not a csv file, thus we should not be using read.csv or read_csv functions. As you remember from the discussion in class, there are custom functions for specific file types (csv, tsv) and then there’s generic functions read.table and read_delim. The delimeter is | character and there are two comment line on top. As you can see, both functions have completely different options.

library(dplyr)
# this is base R solution and it automatically skips first 2 commnents ilne, due to `comment.char = "#"` default option
grades <- read.table("student_grades.csv", sep = "|", header = TRUE, stringsAsFactors = FALSE)  
                                                                                                
grades %>% 
  filter(letter_grade=="A")
# OR
library(readr)
grades2 <- read_delim("student_grades.csv",  delim = "|", skip=2)  #this is tidyverse solution
grades2 %>% 
  filter(letter_grade=="A")

Below is the related slide from “Importing Data in R” course, Chapter 2.

Import with delim or sep options

Import with delim or sep options

Question 2. Genes and phenotypes

A. Genes without phenotype

Using human_genes and phenotypes tables (which are loaded for you), please calculate the following: The number of genes which does not have any phenotype associatied with them. (please use join)

Solution 2.A

# gene_id column is the key
# let's use anti_join first
anti_join(human_genes,phenotypes) %>% 
  count()
Joining, by = "gene_id"
# OR we can use left_join
left_join(human_genes, phenotypes) %>% 
  filter(is.na(phenotype)) %>% 
#  summarise(count=n_distinct(gene_id))
  count()
Joining, by = "gene_id"

B. Bad chromosomes

Using human_genes and phenotypes tables, please calculate how many genes each chromosome carry which are associated with a phenotype. Please do not count a gene multiple time if it’s associated with multiple phenotypes.

Expected output:

Chromosome count
.. ..

Solution 2.B

inner_join(human_genes,phenotypes) %>% 
  group_by(chr) %>% 
  summarise(count=n_distinct(gene_id))
Joining, by = "gene_id"

Question 3. A Bad Day of Month

Figure: Total and Delayed number of flights for days of month with highest percentage

Figure: Total and Delayed number of flights for days of month with highest percentage

The plot describes top 10 days of month which have highest percentage of delays (percentage= number of delayed flights on that Day of Month / total number of flights on that Day of Month). You can consider a flight as “delayed” if Departure delay is equal to or greater than 5 minutes. Blue dots show total flights and red dots show delayed flight counts. Please write down the code below which can produce the plot above.

Solution 3

library(ggplot2)
hflights %>% 
  group_by(DayofMonth) %>% 
  summarise(n_flights=n(),
            n_delays=sum(DepDelay > 5, na.rm=TRUE),  # counting delays via logical vector
            p_delay=n_delays/n_flights*100) %>%      # percentage calculation
  filter(rank(-p_delay) < 11 ) %>%    # extracting top 10 with rank(), minus means 
                                      # according to reverse order, highest p_delay value
#  arrange(-p_delay) %>% 
  ggplot(aes(x=DayofMonth)) +
  geom_point(aes(y=n_flights),color="blue") +
  geom_point(aes(y=n_delays),color="red") +
  ylab("Number of \ndelayed or total flights")  # this labeling was not required

By the way, DayOfMonth should not be treated as number in this graph, it should be treated as factor. If you add mutate(DayofMonth=as.factor(DayofMonth)) just before group_by line, then you get the following plot, which is more accurate.

Here’s the screenshot from DataCamp exercise which uses percentage calculation and counting with sum() over logical vector technique. “Data Manipulation in R with dplyr” lecture, Chapter 5, Exercise title “Unite and conquer using group_by”

dplyr percentage and sum

dplyr percentage and sum

Question 4. A Bad Day for Everyone

You already studied and are aware of hflights table from DataCamp courses. Two more tables are already defined for you (when you run the first chunk). Instead of using lookup tables as you did DataCamp, let’s use join to combine all data. Use airline_codes and airport_codes tables (already loaded for you) to find out the worst day for each airline. The worst day is defined as the maximum departure delay, so please find out max departure delay of each airline and then join with necessary tables to get the output below:

Expected output:

Airline Destination Max Dep Delay
.. .. ..

“Airline” column should have carrier’s full name and “Destination” column should have full name of airport.

Hint: please consider using rank() function

Solution 4

hflights %>% 
  inner_join(airline_codes, by=c("UniqueCarrier"="name")) %>%   # the keys have different names
  inner_join(airport_codes, by=c("Dest"="faa")) %>%             # the keys have different names
  group_by(UniqueCarrier) %>%        # for using rank() within airline groups, we need group_by
  filter(rank(-DepDelay)==1) %>%     # select lines whose rank is 1 in reverse (highest DepDelay)
  ungroup() %>%                      # this is not required but generates cleaner result
  select(Airline=value,Destination=name,Max_Dep_Delay=DepDelay)  # select and rename at the same time

Here’s the screenshot from DataCamp exercise which uses mutate(rank=rank(desc(n))) %>% filter(rank==1) to pick rows with highest rank. In our solution we used filter(rank(-DepDelay)==1) to skip mutate() step. “Data Manipulation in R with dplyr” lecture, Chapter 5, Exercise title “Advanced group_by exercises”

dplyr rank and filter

dplyr rank and filter

---
title: "Gr1 Midterm Solutions"
author: "alper yilmaz"
date: "November 15, 2017"
output: html_notebook
---

```{r setup, include=FALSE}
# there are missing/necessary libraries please load them yourself
library(hflights)

files_to_download <- c("airline_codes","phenotypes","airport_codes","human_genes")
url <- "https://s3-us-west-2.amazonaws.com/veri-analizi/"
# we should not be downloading files at each run - for single run this is better: readRDS(gzcon(url("remote-url-rds-file")))
downloads <- lapply(files_to_download, function(x) {
  if (!file.exists(paste0(x , ".rds"))) { download.file(paste0( url , x , ".rds" ), paste0(x , ".rds"))
  }  
  assign( x, readRDS( paste0( x , ".rds" )), envir = parent.frame(2) ) 
NULL
})
```

# Question 1. Wrong file extension

As the instructor mentioned in class, the file extensions does not dictate the contents of a file. We receive student grade file which is supposed to be a csv file. We are interested in students with letter_grade "A". But our filtering does not work. Please fix the code below:

```{r answer1, eval=FALSE}
download.file("https://s3-us-west-2.amazonaws.com/veri-analizi/student_grades.csv","student_grades.csv")
grades <- read_csv("student_grades.csv")

grades %>% 
  filter(letter_grade=="A")
```

The file is located at: https://s3-us-west-2.amazonaws.com/veri-analizi/student_grades.csv

# Solution 1

The file is not a csv file, thus we should not be using `read.csv` or `read_csv` functions. As you remember from the discussion in class, there are custom functions for specific file types (csv, tsv) and then there's generic functions `read.table` and `read_delim`. The delimeter is `|` character and there are two comment line on top. As you can see, both functions have completely different options. 

```{r solution1, warning= FALSE, message=FALSE}
library(dplyr)
# this is base R solution and it automatically skips first 2 commnents ilne, due to `comment.char = "#"` default option
grades <- read.table("student_grades.csv", sep = "|", header = TRUE, stringsAsFactors = FALSE)  
                                                                                                
grades %>% 
  filter(letter_grade=="A")

# OR

library(readr)
grades2 <- read_delim("student_grades.csv",  delim = "|", skip=2)  #this is tidyverse solution
grades2 %>% 
  filter(letter_grade=="A")

```

Below is the related slide from "Importing Data in R" course, Chapter 2.

![Import with delim or sep options](import_delim.png)

# Question 2. Genes and phenotypes

## A. Genes without phenotype

Using `human_genes` and `phenotypes` tables (which are loaded for you), please calculate the following: The number of genes which does not have any phenotype associatied with them. (please use `join`)

## Solution 2.A

```{r answer2a}
# gene_id column is the key
# let's use anti_join first
anti_join(human_genes,phenotypes) %>% 
  count()

# OR we can use left_join
left_join(human_genes, phenotypes) %>% 
  filter(is.na(phenotype)) %>% 
#  summarise(count=n_distinct(gene_id))
  count()
```


## B. Bad chromosomes

Using `human_genes` and `phenotypes` tables, please calculate how many genes each chromosome carry **which are associated with a phenotype**. Please do not count a gene multiple time if it's associated with multiple phenotypes.

Expected output:

| Chromosome | count |
|------------|-------|
|     ..     |  ..   |

## Solution 2.B

```{r answer2b}
inner_join(human_genes,phenotypes) %>% 
  group_by(chr) %>% 
  summarise(count=n_distinct(gene_id))
```

# Question 3. A Bad Day of Month

![Figure: Total and Delayed number of flights for days of month with highest percentage](https://s3-us-west-2.amazonaws.com/veri-analizi/percent_delay.png)

The plot describes **top 10** days of month which have **highest percentage** of delays (percentage= *number of delayed flights on that Day of Month* / *total number of flights on that Day of Month*). You can consider a flight as "delayed" if Departure delay is equal to or greater than 5 minutes. Blue dots show total flights and red dots show delayed flight counts. Please write down the code below which can produce the plot above.

# Solution 3

```{r answer3}
library(ggplot2)
hflights %>% 
  group_by(DayofMonth) %>% 
  summarise(n_flights=n(),
            n_delays=sum(DepDelay > 5, na.rm=TRUE),  # counting delays via logical vector
            p_delay=n_delays/n_flights*100) %>%      # percentage calculation
  filter(rank(-p_delay) < 11 ) %>%    # extracting top 10 with rank(), minus means 
                                      # according to reverse order, highest p_delay value
#  arrange(-p_delay) %>% 
  ggplot(aes(x=DayofMonth)) +
  geom_point(aes(y=n_flights),color="blue") +
  geom_point(aes(y=n_delays),color="red") +
  ylab("Number of \ndelayed or total flights")  # this labeling was not required
```

By the way, DayOfMonth should not be treated as number in this graph, it should be treated as factor. If you add `mutate(DayofMonth=as.factor(DayofMonth))` just before `group_by` line, then you get the following plot, which is more accurate.

```{r answer3_alternate, echo=FALSE}
library(ggplot2)
hflights %>% 
  mutate(DayofMonth=as.factor(DayofMonth)) %>% 
  group_by(DayofMonth) %>% 
  summarise(n_flights=n(),
            n_delays=sum(DepDelay > 5, na.rm=TRUE),  # counting delays via logical vector
            p_delay=n_delays/n_flights*100) %>%      # percentage calculation
  filter(rank(-p_delay) < 11 ) %>%    # extracting top 10 with rank(), minus means 
                                      # according to reverse order, highest p_delay value
  ggplot(aes(x=DayofMonth)) +
  geom_point(aes(y=n_flights),color="blue") +
  geom_point(aes(y=n_delays),color="red") +
  ylab("Number of \ndelayed or total flights")  # this labeling was not required
```


Here's the screenshot from DataCamp exercise which uses percentage calculation and counting with `sum()` over logical vector technique. "Data Manipulation in R with dplyr" lecture, Chapter 5, Exercise title "Unite and conquer using group_by" 

![dplyr percentage and sum](dplyr_percentage.png)

# Question 4. A Bad Day for Everyone

You already studied and are aware of `hflights` table from DataCamp courses. Two more tables are already defined for you (when you run the first chunk). Instead of using lookup tables as you did DataCamp, let's use `join` to combine all data. Use `airline_codes` and `airport_codes` tables (already loaded for you) to find out the worst day for each airline. *The worst day* is defined as the maximum departure delay, so please find out max departure delay of **each airline** and then join with necessary tables to get the output below:

Expected output:

| Airline | Destination | Max Dep Delay |
|---------|-------------|---------------|
|   ..    |   ..        |       ..      |

"Airline" column should have carrier's full name and "Destination" column should have full name of airport.

*Hint:* please consider using `rank()` function
 
# Solution 4

```{r answer4}
hflights %>% 
  inner_join(airline_codes, by=c("UniqueCarrier"="name")) %>%   # the keys have different names
  inner_join(airport_codes, by=c("Dest"="faa")) %>%             # the keys have different names
  group_by(UniqueCarrier) %>%        # for using rank() within airline groups, we need group_by
  filter(rank(-DepDelay)==1) %>%     # select lines whose rank is 1 in reverse (highest DepDelay)
  ungroup() %>%                      # this is not required but generates cleaner result
  select(Airline=value,Destination=name,Max_Dep_Delay=DepDelay)  # select and rename at the same time
```

Here's the screenshot from DataCamp exercise which uses `mutate(rank=rank(desc(n))) %>% filter(rank==1)` to pick rows with highest rank. In our solution we used `filter(rank(-DepDelay)==1)` to skip `mutate()` step. "Data Manipulation in R with dplyr" lecture, Chapter 5, Exercise title "Advanced group_by exercises" 

![dplyr rank and filter](dplyr_rank_filter.png)
