Question 1. File import
a problematic file has been received. here’s the first 5 lines of the file:
#human gene information
gene_id;gene_name;chr;start;end;strand;tx_count;gc_perc;biotype
ENSG00000210049;MT-TF;MT;577;647;1;1;40.85;Mt_tRNA
ENSG00000211459;MT-RNR1;MT;648;1601;1;1;45.49;Mt_rRNA
ENSG00000210077;MT-TV;MT;1602;1670;1;1;42.03;Mt_tRNA
The file is located at: https://s3-us-west-2.amazonaws.com/veri-analizi/problem_file.txt
Please write a code to import this file. (Import means can you successfully get the data into data frame structure. For example, does filter(gc_perc > 42)
work on imported table?)
Solution 1.
The file is not a csv file, thus we should not be using read.csv
or read_csv
functions. As you remember from the discussion in class, there are custom functions for specific file types (csv, tsv) and then there’s generic functions read.table
and read_delim
. The delimeter is |
character and there are two comment line on top. As you can see, both functions have completely different options.
# this is base R solution and it automatically skips first commnent line,
# due to `comment.char = "#"` default option
read.table("problem_file.txt", sep = ";", header = TRUE, stringsAsFactors = FALSE)
# OR
library(readr)
read_delim("problem_file.txt", delim = ";", skip=1) #this is tidyverse solution
Below is the related slide from “Importing Data in R” course, Chapter 2.
Question 2. A simple question - Summary of columns
A. general summary
Using human_genes
table (which is loaded for you), please calculate the following summary
- total number of genes
- total number of chromosomes
- total number of biotypes
Expected output:
Solution 2.A
library(dplyr)
human_genes %>%
summarise(n_genes=n(),
n_chr=n_distinct(chr),
n_biotypes=n_distinct(biotype))
B. protein coding genes not located on mitochondria chromosome
Using the same table please calculate the following summary, considering only protein_coding
genes which are not on mitochandrial chromosome (MT
in table)
- total number of genes (after filtering)
- average gene length
- average GC percentage
Expected output:
Solution 2.B
human_genes %>%
filter(biotype=="protein_coding" & chr != "MT") %>%
summarise(n_genes=n(),
avg_length_genes=mean(end-start),
avg_gc_perc=mean(gc_perc))
Here’s the screenshot from DataCamp exercise which uses aggregate functions with summarise()
. “Data Manipulation in R with dplyr” lecture, Chapter 4, Exercise title “dplyr aggregate functions”
Question 3. Airline carrier with better delay stats
Please write down the code below which can produce the plot above The plot describes average departure delay (Red) and arrival delay (Blue) for each carrier.
Solution 3
library(ggplot2)
hflights %>%
group_by(UniqueCarrier) %>%
summarise(avg_arr_delay=mean(ArrDelay,na.rm = TRUE),
avg_dep_delay=mean(DepDelay,na.rm = TRUE)) %>%
ggplot() +
# it would be difficult to plot two separate delay data as scatter plot. remember, ggplot draws in layers,
geom_point(aes(x=UniqueCarrier,y=avg_arr_delay,col="arr_delay")) + # this is first point layer
geom_point(aes(x=UniqueCarrier,y=avg_dep_delay,col="dep_delay")) + # second point layer for departure delays
ylab("Average \nArr or Dep Delay") + # this is not required
scale_colour_manual(name = "Delays", # this part is completely cosmetic please ignore
values =c("arr_delay" = 'blue',"dep_delay" = 'red'), # theming - you can ignore
labels = c('Arr Delay','Dep Delay'), # theming - you can ignore
breaks=c("arr_delay","dep_delay")) + # theming - you can ignore
theme_light() # theming - you can ignore

Question 4. A troublesome join
You already studied and are aware of hflights
table from DataCamp courses. Three more tables are already defined for you (when you run the first chunk). Instead of using lookup tables as you did DataCamp, let’s use join
to combine all data. Use airline_codes
, cancellation_codes
and airport_codes
tables (already loaded for you) to find out most frequent weather
related cancellations for airline and destination.
“airline” column should have carriers full name, “destination” column should have full name of airport and “cancellation” column should have the explanation of cancellation.
Expected output:
Solution 4
hflights %>%
inner_join(airline_codes, by=c("UniqueCarrier"="name")) %>% # the keys have different names
inner_join(airport_codes, by=c("Dest"="faa")) %>% # the keys have different names
inner_join(cancellation_codes, by=c("CancellationCode"="name")) %>% # the keys have different names
# you can use suffix option in joins to keep track of which column comes from which table
select(airline=value.x,destination=name,cancellation=value.y) %>% # selecting and renaming columns
filter(cancellation=="weather") %>%
group_by(airline,destination,cancellation) %>% # to count per airline and destination pairs, we group them
count() %>%
arrange(-n)
---
title: "Gr2 Midterm Solutions"
author: "alper yilmaz"
date: "November 15, 2017"
output: html_notebook
---

```{r setup, include=FALSE}
# there are missing/necessary libraries please load them yourself
library(hflights)

files_to_download <- c("airline_codes","cancellation_codes","airport_codes","human_genes")
url <- "https://s3-us-west-2.amazonaws.com/veri-analizi/"
# we should not be downloading files at each run - for single run this is better: readRDS(gzcon(url("remote-url-rds-file")))
downloads <- lapply(files_to_download, function(x) {
  if (!file.exists(paste0(x , ".rds"))) { download.file(paste0( url , x , ".rds" ), paste0(x , ".rds"))
  }  
  assign( x, readRDS( paste0( x , ".rds" )), envir = parent.frame(2) ) 
NULL
})
```

# Question 1. File import

a problematic file has been received. here's the first 5 lines of the file:

```
#human gene information
gene_id;gene_name;chr;start;end;strand;tx_count;gc_perc;biotype
ENSG00000210049;MT-TF;MT;577;647;1;1;40.85;Mt_tRNA
ENSG00000211459;MT-RNR1;MT;648;1601;1;1;45.49;Mt_rRNA
ENSG00000210077;MT-TV;MT;1602;1670;1;1;42.03;Mt_tRNA
```

The file is located at: https://s3-us-west-2.amazonaws.com/veri-analizi/problem_file.txt

Please write a code to import this file. (**Import** means can you successfully get the data into data frame structure. For example, does `filter(gc_perc > 42)` work on imported table?)

# Solution 1. 

The file is not a csv file, thus we should not be using `read.csv` or `read_csv` functions. As you remember from the discussion in class, there are custom functions for specific file types (csv, tsv) and then there's generic functions `read.table` and `read_delim`. The delimeter is `|` character and there are two comment line on top. As you can see, both functions have completely different options. 

```{r solution1, warning= FALSE, message=FALSE}
# this is base R solution and it automatically skips first commnent line, 
# due to `comment.char = "#"` default option
read.table("problem_file.txt", sep = ";", header = TRUE, stringsAsFactors = FALSE)  
                                                                                                
# OR

library(readr)
read_delim("problem_file.txt",  delim = ";", skip=1)  #this is tidyverse solution
```

Below is the related slide from "Importing Data in R" course, Chapter 2.

![Import with delim or sep options](import_delim.png)

# Question 2. A simple question - Summary of columns

## A. general summary

Using `human_genes` table (which is loaded for you), please calculate the following summary

* total number of genes
* total number of chromosomes
* total number of biotypes

Expected output:

| n_genes | n_chr | n_biotypes |
|---------|-------|------------|
|   ..    |   ..  |     ..     |

## Solution 2.A

```{r answer2a, warning=FALSE, message=FALSE}
library(dplyr)
human_genes %>%
  summarise(n_genes=n(),
            n_chr=n_distinct(chr),
            n_biotypes=n_distinct(biotype))
```


## B. protein coding genes not located on mitochondria chromosome

Using the same table please calculate the following summary, considering only `protein_coding` genes which are not on mitochandrial chromosome (`MT` in table)

* total number of genes (after filtering)
* average gene length
* average GC percentage

Expected output:

| n_genes | avg_length_genes | avg_gc_perc | 
|---------|------------------|-------------|
|   ..    |       ..         |      ..     |

## Solution 2.B

```{r answer2b}
human_genes %>%
  filter(biotype=="protein_coding" & chr != "MT") %>% 
  summarise(n_genes=n(),
            avg_length_genes=mean(end-start),
            avg_gc_perc=mean(gc_perc))
```

Here's the screenshot from DataCamp exercise which uses aggregate functions with `summarise()`. "Data Manipulation in R with dplyr" lecture, Chapter 4, Exercise title "dplyr aggregate functions" 

![dplyr summarise aggregate functions](dplyr_summarise_aggregate.png)



# Question 3. Airline carrier with better delay stats

![Average Delays per Carrier](https://s3-us-west-2.amazonaws.com/veri-analizi/average_delay.png)

Please write down the code below which can produce the plot above The plot describes average departure delay (Red) and arrival delay (Blue) for each carrier.

# Solution 3

```{r answer3}
library(ggplot2)
hflights %>% 
  group_by(UniqueCarrier) %>% 
  summarise(avg_arr_delay=mean(ArrDelay,na.rm = TRUE),
            avg_dep_delay=mean(DepDelay,na.rm = TRUE)) %>% 
  ggplot() +
  # it would be difficult to plot two separate delay data as scatter plot. remember, ggplot draws in layers, 
  geom_point(aes(x=UniqueCarrier,y=avg_arr_delay,col="arr_delay")) +   # this is first point layer
  geom_point(aes(x=UniqueCarrier,y=avg_dep_delay,col="dep_delay")) +   # second point layer for departure delays
  ylab("Average \nArr or Dep Delay") +                                 # this is not required
  scale_colour_manual(name = "Delays",                     # this part is completely cosmetic please ignore
                       values =c("arr_delay" = 'blue',"dep_delay" = 'red'),      # theming - you can ignore
                       labels = c('Arr Delay','Dep Delay'),                      # theming - you can ignore
                       breaks=c("arr_delay","dep_delay")) +                      # theming - you can ignore
  theme_light()                                                                  # theming - you can ignore
```

# Question 4. A troublesome join

You already studied and are aware of `hflights` table from DataCamp courses. Three more tables are already defined for you (when you run the first chunk). Instead of using lookup tables as you did DataCamp, let's use `join` to combine all data. Use `airline_codes`, `cancellation_codes` and `airport_codes` tables (already loaded for you) to find out most frequent `weather` related cancellations for airline and destination.

"airline" column should have carriers full name, "destination" column should have full name of airport and "cancellation" column should have the explanation of cancellation.

Expected output:

| airline | destination | cancellation |  count |
|---------|-------------|--------------|--------|
|   ..    |   ..        |       ..     |  ..    |

# Solution 4

```{r answer4}
hflights %>% 
  inner_join(airline_codes, by=c("UniqueCarrier"="name")) %>%           # the keys have different names 
  inner_join(airport_codes, by=c("Dest"="faa")) %>%                     # the keys have different names
  inner_join(cancellation_codes, by=c("CancellationCode"="name")) %>%   # the keys have different names 
  # you can use suffix option in joins to keep track of which column comes from which table
  select(airline=value.x,destination=name,cancellation=value.y) %>%     # selecting and renaming columns
  filter(cancellation=="weather") %>% 
  group_by(airline,destination,cancellation) %>%  # to count per airline and destination pairs, we group them
  count() %>% 
  arrange(-n)
```

