Question 1. File import

a problematic file has been received. here’s the first 5 lines of the file:

#human gene information
gene_id;gene_name;chr;start;end;strand;tx_count;gc_perc;biotype
ENSG00000210049;MT-TF;MT;577;647;1;1;40.85;Mt_tRNA
ENSG00000211459;MT-RNR1;MT;648;1601;1;1;45.49;Mt_rRNA
ENSG00000210077;MT-TV;MT;1602;1670;1;1;42.03;Mt_tRNA

The file is located at: https://s3-us-west-2.amazonaws.com/veri-analizi/problem_file.txt

Please write a code to import this file. (Import means can you successfully get the data into data frame structure. For example, does filter(gc_perc > 42) work on imported table?)

Solution 1.

The file is not a csv file, thus we should not be using read.csv or read_csv functions. As you remember from the discussion in class, there are custom functions for specific file types (csv, tsv) and then there’s generic functions read.table and read_delim. The delimeter is | character and there are two comment line on top. As you can see, both functions have completely different options.

# this is base R solution and it automatically skips first commnent line, 
# due to `comment.char = "#"` default option
read.table("problem_file.txt", sep = ";", header = TRUE, stringsAsFactors = FALSE)  
                                                                                                
# OR
library(readr)
read_delim("problem_file.txt",  delim = ";", skip=1)  #this is tidyverse solution

Below is the related slide from “Importing Data in R” course, Chapter 2.

Import with delim or sep options

Import with delim or sep options

Question 2. A simple question - Summary of columns

A. general summary

Using human_genes table (which is loaded for you), please calculate the following summary

  • total number of genes
  • total number of chromosomes
  • total number of biotypes

Expected output:

n_genes n_chr n_biotypes
.. .. ..

Solution 2.A

library(dplyr)
human_genes %>%
  summarise(n_genes=n(),
            n_chr=n_distinct(chr),
            n_biotypes=n_distinct(biotype))

B. protein coding genes not located on mitochondria chromosome

Using the same table please calculate the following summary, considering only protein_coding genes which are not on mitochandrial chromosome (MT in table)

  • total number of genes (after filtering)
  • average gene length
  • average GC percentage

Expected output:

n_genes avg_length_genes avg_gc_perc
.. .. ..

Solution 2.B

human_genes %>%
  filter(biotype=="protein_coding" & chr != "MT") %>% 
  summarise(n_genes=n(),
            avg_length_genes=mean(end-start),
            avg_gc_perc=mean(gc_perc))

Here’s the screenshot from DataCamp exercise which uses aggregate functions with summarise(). “Data Manipulation in R with dplyr” lecture, Chapter 4, Exercise title “dplyr aggregate functions”

dplyr summarise aggregate functions

dplyr summarise aggregate functions

Question 3. Airline carrier with better delay stats

Average Delays per Carrier

Average Delays per Carrier

Please write down the code below which can produce the plot above The plot describes average departure delay (Red) and arrival delay (Blue) for each carrier.

Solution 3

library(ggplot2)
hflights %>% 
  group_by(UniqueCarrier) %>% 
  summarise(avg_arr_delay=mean(ArrDelay,na.rm = TRUE),
            avg_dep_delay=mean(DepDelay,na.rm = TRUE)) %>% 
  ggplot() +
  # it would be difficult to plot two separate delay data as scatter plot. remember, ggplot draws in layers, 
  geom_point(aes(x=UniqueCarrier,y=avg_arr_delay,col="arr_delay")) +   # this is first point layer
  geom_point(aes(x=UniqueCarrier,y=avg_dep_delay,col="dep_delay")) +   # second point layer for departure delays
  ylab("Average \nArr or Dep Delay") +                                 # this is not required
  scale_colour_manual(name = "Delays",                     # this part is completely cosmetic please ignore
                       values =c("arr_delay" = 'blue',"dep_delay" = 'red'),      # theming - you can ignore
                       labels = c('Arr Delay','Dep Delay'),                      # theming - you can ignore
                       breaks=c("arr_delay","dep_delay")) +                      # theming - you can ignore
  theme_light()                                                                  # theming - you can ignore

Question 4. A troublesome join

You already studied and are aware of hflights table from DataCamp courses. Three more tables are already defined for you (when you run the first chunk). Instead of using lookup tables as you did DataCamp, let’s use join to combine all data. Use airline_codes, cancellation_codes and airport_codes tables (already loaded for you) to find out most frequent weather related cancellations for airline and destination.

“airline” column should have carriers full name, “destination” column should have full name of airport and “cancellation” column should have the explanation of cancellation.

Expected output:

airline destination cancellation count
.. .. .. ..

Solution 4

hflights %>% 
  inner_join(airline_codes, by=c("UniqueCarrier"="name")) %>%           # the keys have different names 
  inner_join(airport_codes, by=c("Dest"="faa")) %>%                     # the keys have different names
  inner_join(cancellation_codes, by=c("CancellationCode"="name")) %>%   # the keys have different names 
  # you can use suffix option in joins to keep track of which column comes from which table
  select(airline=value.x,destination=name,cancellation=value.y) %>%     # selecting and renaming columns
  filter(cancellation=="weather") %>% 
  group_by(airline,destination,cancellation) %>%  # to count per airline and destination pairs, we group them
  count() %>% 
  arrange(-n)
---
title: "Gr2 Midterm Solutions"
author: "alper yilmaz"
date: "November 15, 2017"
output: html_notebook
---

```{r setup, include=FALSE}
# there are missing/necessary libraries please load them yourself
library(hflights)

files_to_download <- c("airline_codes","cancellation_codes","airport_codes","human_genes")
url <- "https://s3-us-west-2.amazonaws.com/veri-analizi/"
# we should not be downloading files at each run - for single run this is better: readRDS(gzcon(url("remote-url-rds-file")))
downloads <- lapply(files_to_download, function(x) {
  if (!file.exists(paste0(x , ".rds"))) { download.file(paste0( url , x , ".rds" ), paste0(x , ".rds"))
  }  
  assign( x, readRDS( paste0( x , ".rds" )), envir = parent.frame(2) ) 
NULL
})
```

# Question 1. File import

a problematic file has been received. here's the first 5 lines of the file:

```
#human gene information
gene_id;gene_name;chr;start;end;strand;tx_count;gc_perc;biotype
ENSG00000210049;MT-TF;MT;577;647;1;1;40.85;Mt_tRNA
ENSG00000211459;MT-RNR1;MT;648;1601;1;1;45.49;Mt_rRNA
ENSG00000210077;MT-TV;MT;1602;1670;1;1;42.03;Mt_tRNA
```

The file is located at: https://s3-us-west-2.amazonaws.com/veri-analizi/problem_file.txt

Please write a code to import this file. (**Import** means can you successfully get the data into data frame structure. For example, does `filter(gc_perc > 42)` work on imported table?)

# Solution 1. 

The file is not a csv file, thus we should not be using `read.csv` or `read_csv` functions. As you remember from the discussion in class, there are custom functions for specific file types (csv, tsv) and then there's generic functions `read.table` and `read_delim`. The delimeter is `|` character and there are two comment line on top. As you can see, both functions have completely different options. 

```{r solution1, warning= FALSE, message=FALSE}
# this is base R solution and it automatically skips first commnent line, 
# due to `comment.char = "#"` default option
read.table("problem_file.txt", sep = ";", header = TRUE, stringsAsFactors = FALSE)  
                                                                                                
# OR

library(readr)
read_delim("problem_file.txt",  delim = ";", skip=1)  #this is tidyverse solution
```

Below is the related slide from "Importing Data in R" course, Chapter 2.

![Import with delim or sep options](import_delim.png)

# Question 2. A simple question - Summary of columns

## A. general summary

Using `human_genes` table (which is loaded for you), please calculate the following summary

* total number of genes
* total number of chromosomes
* total number of biotypes

Expected output:

| n_genes | n_chr | n_biotypes |
|---------|-------|------------|
|   ..    |   ..  |     ..     |

## Solution 2.A

```{r answer2a, warning=FALSE, message=FALSE}
library(dplyr)
human_genes %>%
  summarise(n_genes=n(),
            n_chr=n_distinct(chr),
            n_biotypes=n_distinct(biotype))
```


## B. protein coding genes not located on mitochondria chromosome

Using the same table please calculate the following summary, considering only `protein_coding` genes which are not on mitochandrial chromosome (`MT` in table)

* total number of genes (after filtering)
* average gene length
* average GC percentage

Expected output:

| n_genes | avg_length_genes | avg_gc_perc | 
|---------|------------------|-------------|
|   ..    |       ..         |      ..     |

## Solution 2.B

```{r answer2b}
human_genes %>%
  filter(biotype=="protein_coding" & chr != "MT") %>% 
  summarise(n_genes=n(),
            avg_length_genes=mean(end-start),
            avg_gc_perc=mean(gc_perc))
```

Here's the screenshot from DataCamp exercise which uses aggregate functions with `summarise()`. "Data Manipulation in R with dplyr" lecture, Chapter 4, Exercise title "dplyr aggregate functions" 

![dplyr summarise aggregate functions](dplyr_summarise_aggregate.png)



# Question 3. Airline carrier with better delay stats

![Average Delays per Carrier](https://s3-us-west-2.amazonaws.com/veri-analizi/average_delay.png)

Please write down the code below which can produce the plot above The plot describes average departure delay (Red) and arrival delay (Blue) for each carrier.

# Solution 3

```{r answer3}
library(ggplot2)
hflights %>% 
  group_by(UniqueCarrier) %>% 
  summarise(avg_arr_delay=mean(ArrDelay,na.rm = TRUE),
            avg_dep_delay=mean(DepDelay,na.rm = TRUE)) %>% 
  ggplot() +
  # it would be difficult to plot two separate delay data as scatter plot. remember, ggplot draws in layers, 
  geom_point(aes(x=UniqueCarrier,y=avg_arr_delay,col="arr_delay")) +   # this is first point layer
  geom_point(aes(x=UniqueCarrier,y=avg_dep_delay,col="dep_delay")) +   # second point layer for departure delays
  ylab("Average \nArr or Dep Delay") +                                 # this is not required
  scale_colour_manual(name = "Delays",                     # this part is completely cosmetic please ignore
                       values =c("arr_delay" = 'blue',"dep_delay" = 'red'),      # theming - you can ignore
                       labels = c('Arr Delay','Dep Delay'),                      # theming - you can ignore
                       breaks=c("arr_delay","dep_delay")) +                      # theming - you can ignore
  theme_light()                                                                  # theming - you can ignore
```

# Question 4. A troublesome join

You already studied and are aware of `hflights` table from DataCamp courses. Three more tables are already defined for you (when you run the first chunk). Instead of using lookup tables as you did DataCamp, let's use `join` to combine all data. Use `airline_codes`, `cancellation_codes` and `airport_codes` tables (already loaded for you) to find out most frequent `weather` related cancellations for airline and destination.

"airline" column should have carriers full name, "destination" column should have full name of airport and "cancellation" column should have the explanation of cancellation.

Expected output:

| airline | destination | cancellation |  count |
|---------|-------------|--------------|--------|
|   ..    |   ..        |       ..     |  ..    |

# Solution 4

```{r answer4}
hflights %>% 
  inner_join(airline_codes, by=c("UniqueCarrier"="name")) %>%           # the keys have different names 
  inner_join(airport_codes, by=c("Dest"="faa")) %>%                     # the keys have different names
  inner_join(cancellation_codes, by=c("CancellationCode"="name")) %>%   # the keys have different names 
  # you can use suffix option in joins to keep track of which column comes from which table
  select(airline=value.x,destination=name,cancellation=value.y) %>%     # selecting and renaming columns
  filter(cancellation=="weather") %>% 
  group_by(airline,destination,cancellation) %>%  # to count per airline and destination pairs, we group them
  count() %>% 
  arrange(-n)
```

