Review of assignment topics

read_csv() (from the readr package) vs. read.csv() (from the utils package)
- easy to use
- returns a tibble, which works well with dplyr functions
read_tsv() and read_csv() wrap read_delim()
- just like read.csv() and read.delim() wrapping read.table()
fread() from the data.table package
- fast, very fast
readxl package
- excel_sheets() lists the sheets in a workbook
- read_excel() imports data into R
- for example: read_excel("cities.xlsx", sheet = 2)
- detailed usage:
```
read_excel(path, sheet = 1,
col_names = TRUE,
col_types = NULL,
skip = 0)
```
read.xls() (from the gdata package) is an alternative that should be used only if absolutely necessary.
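
To make the comparison above concrete, here is a minimal sketch. The file contents are hypothetical sample data written to temporary files, not course data. It only shows what class each reader returns and that read_tsv() behaves like read_delim() with a tab delimiter.

```r
library(readr)

# Hypothetical sample data written to temporary files for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,label", "1,a", "2,b"), csv_path)

tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tlabel", "1\ta", "2\tb"), tsv_path)

# utils::read.csv() returns a plain data.frame.
base_df <- read.csv(csv_path)

# readr::read_csv() returns a tibble; suppressMessages() hides the
# column specification message that readr prints.
readr_tbl <- suppressMessages(read_csv(csv_path))

class(base_df)
class(readr_tbl)

# read_tsv() is read_delim() with a tab delimiter.
tsv_a <- suppressMessages(read_tsv(tsv_path))
tsv_b <- suppressMessages(read_delim(tsv_path, delim = "\t"))
```

Both tsv_a and tsv_b should hold the same two rows and the same column names.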

Let’s practice what we learned last week

# Download the data with the command below,
# then import it with read.csv(). What are the types of columns 1, 2 and 3?
download.file("https://s3-us-west-2.amazonaws.com/veri-analizi/students.txt","students.txt")
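
One possible way to inspect column types after the import is sketched below. To stay self-contained it does not download anything: it writes two rows to a temporary file, assuming students.txt is comma-separated with a header row (as the read_csv output later on this page suggests).

```r
# Stand-in for students.txt: two rows copied from the table shown on this page.
path <- tempfile(fileext = ".txt")
writeLines(c("studentNo,dept,gender,birthYear",
             "100101,bioeng,M,1995",
             "100102,bioeng,F,1996"), path)

# stringsAsFactors is set explicitly because its default changed in R 4.0.0
# (TRUE before, FALSE since), which changes the type of text columns.
students <- read.csv(path, stringsAsFactors = FALSE)

# class() of every column; the first three answer the exercise question.
col_classes <- sapply(students, class)
col_classes[1:3]

# str() gives a similar overview together with the first values.
str(students)
```

With stringsAsFactors = TRUE, the dept and gender columns would be factors instead of character vectors.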

Now, let’s try read_tsv

# Download the data with the command below,
# then import it with read_tsv().
download.file("https://s3-us-west-2.amazonaws.com/veri-analizi/students2.tsv","students2.tsv")

It looks like we have a problem with the column headers. How can we fix it?
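
A possible fix, sketched under an assumption: the code below assumes the problem is that the file has no header row, so read_tsv() uses the first data row as column names. The temporary file is a stand-in for students2.tsv, and the column names are borrowed from students.txt. If the real file has a different problem, the fix will differ.

```r
library(readr)

# Stand-in for students2.tsv: tab-separated data WITHOUT a header row (assumption).
path <- tempfile(fileext = ".tsv")
writeLines(c("100101\tbioeng\tM\t1995",
             "100102\tbioeng\tF\t1996"), path)

# Default behaviour: the first row is consumed as the header.
wrong <- suppressMessages(read_tsv(path))
names(wrong)

# Supplying col_names tells readr that every row is data.
fixed <- suppressMessages(
  read_tsv(path, col_names = c("studentNo", "dept", "gender", "birthYear"))
)
names(fixed)
```

col_names = FALSE is another option; readr then generates names such as X1, X2 and so on.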

Reminder: This exercise would have been more difficult if you had downloaded the file to your Desktop and then tried to import it (the curse of the working directory).
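
A small sketch of why the working directory matters. Here tempdir() stands in for a folder such as the Desktop, which is an assumption made only to keep the example runnable anywhere.

```r
# Relative paths such as "students.txt" are resolved against the working directory.
getwd()

# Building a full path removes the dependence on the working directory.
# tempdir() stands in for a folder such as the Desktop (assumption).
dest <- file.path(tempdir(), "students.txt")
writeLines(c("studentNo,dept,gender,birthYear",
             "100101,bioeng,M,1995"), dest)

# This works no matter what getwd() returns, because dest is a full path.
students <- read.csv(dest)
nrow(students)
```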

On a side note, read_csv() and read_tsv() can read files directly from a URL:

read_csv("https://s3-us-west-2.amazonaws.com/veri-analizi/students.txt")

Daily exercises at DataCamp

Please do daily exercises for:

Introduction to R
~~Importing Data Into R (Part 1)~~ no exercises for importing data

Important points before the assignment for next week

The dplyr package manipulates data with five main verbs, plus the important group_by() function.

dplyr verbs	Description
`select()`	select columns
`filter()`	filter rows
`arrange()`	re-order or arrange rows
`mutate()`	create new columns
`summarise()`	summarise values
`group_by()`	allows for group operations, summarise and mutate on grouped items

Let’s go through the verbs step by step on the students.txt file, which is imported into the students variable.

library(readr)
library(dplyr)  # provides the verbs and the %>% pipe used below
students <- read_csv("students.txt")

## Parsed with column specification:
## cols(
##   studentNo = col_integer(),
##   dept = col_character(),
##   gender = col_character(),
##   birthYear = col_integer()
## )

students

## # A tibble: 10 x 4
##    studentNo   dept gender birthYear
##        <int>  <chr>  <chr>     <int>
##  1    100101 bioeng      M      1995
##  2    100102 bioeng      F      1996
##  3    100103 bioeng      F      1995
##  4    100104 bioeng      F      1997
##  5    100105    mbg      M      1996
##  6    100106    mbg      F      1997
##  7    100107    mbg      F      1995
##  8    100108 bioeng      M      1996
##  9    100109 bioeng      M      1995
## 10    100110 bioeng      F      1997

With select, we can drop the dept column.

students %>%
  select(-dept)

## # A tibble: 10 x 3
##    studentNo gender birthYear
##        <int>  <chr>     <int>
##  1    100101      M      1995
##  2    100102      F      1996
##  3    100103      F      1995
##  4    100104      F      1997
##  5    100105      M      1996
##  6    100106      F      1997
##  7    100107      F      1995
##  8    100108      M      1996
##  9    100109      M      1995
## 10    100110      F      1997
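
As a sketch of the same idea from the other direction, select() can also name the columns to keep. The students table is rebuilt inline from the rows printed above so that the example runs without students.txt.

```r
library(dplyr)

# students rebuilt from the table printed above.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

# Dropping one column ...
dropped <- students %>% select(-dept)

# ... gives the same columns as listing the ones to keep.
kept <- students %>% select(studentNo, gender, birthYear)

names(dropped)
names(kept)
```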

With mutate, we can create a new column.

students %>%
  select(-dept) %>%
  mutate(age = 2017-birthYear)

## # A tibble: 10 x 4
##    studentNo gender birthYear   age
##        <int>  <chr>     <int> <dbl>
##  1    100101      M      1995    22
##  2    100102      F      1996    21
##  3    100103      F      1995    22
##  4    100104      F      1997    20
##  5    100105      M      1996    21
##  6    100106      F      1997    20
##  7    100107      F      1995    22
##  8    100108      M      1996    21
##  9    100109      M      1995    22
## 10    100110      F      1997    20
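
Note that age is printed as <dbl> even though birthYear is <int>. A minimal sketch of why: the literal 2017 is a double in R, while 2017L is an integer. The students table is rebuilt inline from the rows printed above, and the age_int column is a name made up for this illustration.

```r
library(dplyr)

# students rebuilt from the table printed above.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

with_age <- students %>%
  mutate(
    age     = 2017 - birthYear,   # double minus integer gives a double
    age_int = 2017L - birthYear   # integer minus integer stays integer
  )

class(with_age$age)
class(with_age$age_int)
```

The values are the same in both columns; only the storage type differs.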

With filter, we can select certain rows.

students %>%
  select(-dept) %>%
  mutate(age = 2017-birthYear) %>%
  filter(gender == "M")

## # A tibble: 4 x 4
##   studentNo gender birthYear   age
##       <int>  <chr>     <int> <dbl>
## 1    100101      M      1995    22
## 2    100105      M      1996    21
## 3    100108      M      1996    21
## 4    100109      M      1995    22
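
filter() also accepts several conditions, which are combined with a logical AND. The sketch below is an illustration on the same data, rebuilt inline from the table printed earlier; the particular conditions are chosen only as an example.

```r
library(dplyr)

# students rebuilt from the table printed earlier on this page.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

# Comma-separated conditions must all be TRUE for a row to be kept.
young_women <- students %>%
  filter(gender == "F", birthYear >= 1996)

young_women$studentNo
```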

With arrange, we can sort the table by a column, in ascending or descending order. The minus sign in arrange(-age) sorts in descending order.

students %>%
  select(-dept) %>%
  mutate(age = 2017-birthYear) %>%
  filter(gender == "M") %>%
  arrange(-age)

## # A tibble: 4 x 4
##   studentNo gender birthYear   age
##       <int>  <chr>     <int> <dbl>
## 1    100101      M      1995    22
## 2    100109      M      1995    22
## 3    100105      M      1996    21
## 4    100108      M      1996    21
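
As an alternative sketch, desc() expresses descending order explicitly and, unlike the minus sign, also works on character columns. A second column can be added to break ties. The data is rebuilt inline from the table printed earlier.

```r
library(dplyr)

# students rebuilt from the table printed earlier on this page.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

sorted_men <- students %>%
  mutate(age = 2017 - birthYear) %>%
  filter(gender == "M") %>%
  arrange(desc(age), studentNo)   # oldest first, ties broken by studentNo

sorted_men$studentNo
```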

With summarise, we can generate summaries for columns.

students %>%
  select(-dept) %>%
  mutate(age = 2017-birthYear) %>%
  filter(gender == "M") %>%
  arrange(-age) %>%
  summarise(avg=mean(age))

## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  21.5
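
summarise() can compute several summaries in one call. The sketch below adds a row count and the minimum and maximum age to the mean; the column names n, min_age and max_age are chosen here for illustration. The data is rebuilt inline from the table printed earlier.

```r
library(dplyr)

# students rebuilt from the table printed earlier on this page.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

male_summary <- students %>%
  mutate(age = 2017 - birthYear) %>%
  filter(gender == "M") %>%
  summarise(
    n       = n(),        # number of rows
    avg     = mean(age),
    min_age = min(age),
    max_age = max(age)
  )

male_summary
```

The arrange step is left out here because row order does not affect these summaries.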

If group_by is used to group rows by a column, then the subsequent summarise runs within each group and returns one result per group. Below we calculate the mean age per gender.

students %>%
  select(-dept) %>%
  mutate(age = 2017-birthYear) %>%
  group_by(gender) %>%
  summarise(avg=mean(age))

## # A tibble: 2 x 2
##   gender      avg
##    <chr>    <dbl>
## 1      F 20.83333
## 2      M 21.50000

Below we calculate the mean age per department.

students %>%
  mutate(age = 2017-birthYear) %>%
  group_by(dept) %>%
  summarise(avg=mean(age))

## # A tibble: 2 x 2
##     dept      avg
##    <chr>    <dbl>
## 1 bioeng 21.14286
## 2    mbg 21.00000
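
When reading group means, it helps to know how many rows each group contains. A minimal sketch that adds n() to the grouped summary, with the data rebuilt inline from the table printed earlier:

```r
library(dplyr)

# students rebuilt from the table printed earlier on this page.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

by_gender <- students %>%
  mutate(age = 2017 - birthYear) %>%
  group_by(gender) %>%
  summarise(n = n(), avg = mean(age))   # one row per gender

by_dept <- students %>%
  mutate(age = 2017 - birthYear) %>%
  group_by(dept) %>%
  summarise(n = n(), avg = mean(age))   # one row per department

by_gender
by_dept
```

Recent dplyr versions may print a message about the grouping of the result; it is informational and does not change the values.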

group_by accepts multiple columns, creating one group per combination of their values (kind of nested). Below we calculate the mean age per department and gender.

students %>%
  mutate(age = 2017-birthYear) %>%
  group_by(dept,gender) %>%
  summarise(avg=mean(age))

## # A tibble: 4 x 3
## # Groups:   dept [?]
##     dept gender      avg
##    <chr>  <chr>    <dbl>
## 1 bioeng      F 20.75000
## 2 bioeng      M 21.66667
## 3    mbg      F 21.00000
## 4    mbg      M 21.00000
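
The "Groups: dept" line in the output above appears because summarise() removes only the last grouping column (gender), so the result is still grouped by dept. The sketch below checks this with group_vars() and removes the remaining grouping with ungroup(). The data is rebuilt inline from the table printed earlier.

```r
library(dplyr)

# students rebuilt from the table printed earlier on this page.
students <- tibble(
  studentNo = 100101:100110,
  dept      = c(rep("bioeng", 4), rep("mbg", 3), rep("bioeng", 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L)
)

by_dept_gender <- students %>%
  mutate(age = 2017 - birthYear) %>%
  group_by(dept, gender) %>%
  summarise(avg = mean(age))

# Only the last grouping column was dropped by summarise().
group_vars(by_dept_gender)

# ungroup() returns a plain tibble, so later verbs act on all rows at once.
flat <- ungroup(by_dept_gender)
group_vars(flat)
```

Calling ungroup() before further steps avoids surprises such as a later mutate or summarise silently running per department.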

Cheatsheets

The RStudio cheatsheets page has very useful and compact cheatsheets.

You can print them out and carry them with you at all times.

Assignments for next week

This week, the following chapters are assigned:

Data Manipulation in R with dplyr: Chapters 2,3,4 and 5

Please use GitHub Issues if you’re having problems with concepts or code.

About assignments

Last week our survey results were like this:

Case	Number of Students
I don’t understand the videos	6
I passed the exercises without understanding	7
I have no idea what we are doing	3
I really liked the content, everything is great!	8

Hopefully, the situation is better this week. To help students who are stuck, we’ll be experimenting with GitHub Issues. Please visit the dav-assignments repo and browse to the week04 folder. You’ll notice two files there, which list the chapter and section names of the assigned courses.

dav-assignments repo week04 folder contents

When you click on one of the assignment files, you can view its contents. Let’s assume you had a problem with the “Five verbs and their meaning” section during the DataCamp lecture. In that case, click on the line number of that section (number 7 in the example below).

Contents of dplyr assignment file

Then, a menu will appear next to the line number you clicked. In the menu, please select “Open new issue”.

Note: you need to be signed in to GitHub for the following steps.

Opening a new issue

A new page will open in which you can write your issue. The issue will be pre-filled with a link to the line you were referring to; please do not edit that line.

Below the link, please describe your issue. Please be as descriptive as possible. The more information you provide, the more likely it is that the issue will be solved. If your issue is about a piece of code:

- paste the code
- paste the error message, if there is one

After you’re done writing, click on the “Submit new issue” button.

Describing a new issue

GitHub Issues is not only for questions about a specific line. If you have a general concern about the assignments or other course-related material, you can start an issue for that as well. To start a new issue, please go to the Issues tab in the repo and then click on the “New Issue” button.

After you submit the issue, you’ll see it posted in the Issues tab of the repo. When viewing an issue, you can add comments to it, which can turn into a conversation.