- read_csv() (from the readr package) vs. read.csv() (from the utils package)
- readr functions return a tibble, which works directly with dplyr functions
- read_tsv() and read_csv() wrap read_delim()
- read.csv() and read.delim() wrap read.table()
- fread() from the data.table package

readxl package

- excel_sheets() lists the sheets in a workbook
- read_excel() imports data into R, e.g. read_excel("cities.xlsx", sheet = 2)

Detailed usage:

read_excel(path, sheet = 1,
           col_names = TRUE,
           col_types = NULL,
           skip = 0)

read.xls() (from the gdata package) is an alternative that should be used only if absolutely necessary.

# Download the data with the command below,
# then import it with read.csv(). What are the types of columns 1, 2 and 3?
download.file("https://s3-us-west-2.amazonaws.com/veri-analizi/students.txt","students.txt")
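To see how the two readers differ without touching the exercise file, the sketch below reads the same small CSV with both. The file contents and the names `path`, `base_df` and `tidy_df` are made up for illustration; they are not part of the course data.

```r
library(readr)

# Hypothetical three-column file, written to a temporary path for illustration
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,year", "1,ann,1995", "2,bob,1996"), path)

base_df <- read.csv(path)   # utils: returns a plain data.frame
tidy_df <- read_csv(path)   # readr: returns a tibble

class(base_df)              # "data.frame"
class(tidy_df)              # includes "tbl_df"
sapply(base_df, class)      # type of each column as read by read.csv()
sapply(tidy_df, class)      # type of each column as read by read_csv()
```

The same `sapply(x, class)` call can be used on the imported students data to answer the question about column types.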
Now, let’s try read_tsv
# Download the data with the command below,
# then import it with read_tsv().
download.file("https://s3-us-west-2.amazonaws.com/veri-analizi/students2.tsv","students2.tsv")
It looks like we have a problem with the column headers. How can we fix it?
Reminder: This exercise would have been more difficult if you had downloaded the file to your Desktop and then imported it (the curse of the working directory).
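The cause of the header problem is not spelled out here. Assuming the file simply has no header row, one possible fix is to pass the column names through the `col_names` argument of `read_tsv()`. The sketch below uses a made-up two-row file in place of students2.tsv, and the column names are borrowed from students.txt as an assumption.

```r
library(readr)

# Hypothetical headerless, tab-separated file standing in for students2.tsv
path <- tempfile(fileext = ".tsv")
writeLines(c("100101\tbioeng\tM\t1995", "100102\tbioeng\tF\t1996"), path)

# With the default col_names = TRUE, the first data row would be used as the header.
# Supplying the names explicitly keeps every row as data.
students2 <- read_tsv(
  path,
  col_names = c("studentNo", "dept", "gender", "birthYear")
)
students2
```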
On a side note, read_csv and read_tsv can read files directly from a URL:
read_csv("https://s3-us-west-2.amazonaws.com/veri-analizi/students.txt")
Please do daily exercises for:
The dplyr package manipulates data with five main verbs, plus an important helper function, group_by().
| dplyr verbs | Description |
|---|---|
| select() | select columns |
| filter() | filter rows |
| arrange() | re-order or arrange rows |
| mutate() | create new columns |
| summarise() | summarise values |
| group_by() | allows for group operations, summarise and mutate on grouped items |
Let’s go through the verbs step by step using the students.txt file, which is imported into the students variable.
library(readr)
library(dplyr)  # provides %>% and the verbs used below
students <- read_csv("students.txt")
## Parsed with column specification:
## cols(
## studentNo = col_integer(),
## dept = col_character(),
## gender = col_character(),
## birthYear = col_integer()
## )
students
## # A tibble: 10 x 4
## studentNo dept gender birthYear
## <int> <chr> <chr> <int>
## 1 100101 bioeng M 1995
## 2 100102 bioeng F 1996
## 3 100103 bioeng F 1995
## 4 100104 bioeng F 1997
## 5 100105 mbg M 1996
## 6 100106 mbg F 1997
## 7 100107 mbg F 1995
## 8 100108 bioeng M 1996
## 9 100109 bioeng M 1995
## 10 100110 bioeng F 1997
With select, we can drop the dept column.
students %>%
select(-dept)
## # A tibble: 10 x 3
## studentNo gender birthYear
## <int> <chr> <int>
## 1 100101 M 1995
## 2 100102 F 1996
## 3 100103 F 1995
## 4 100104 F 1997
## 5 100105 M 1996
## 6 100106 F 1997
## 7 100107 F 1995
## 8 100108 M 1996
## 9 100109 M 1995
## 10 100110 F 1997
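Dropping a column is only one way to use select. As a minimal sketch, the block below rebuilds the students table inline (so it runs without the file) and shows a few other common forms. The result names `by_name`, `dropped` and `by_prefix` are illustrative only.

```r
library(dplyr)

# Same rows as students.txt above, rebuilt inline so the sketch is self-contained
students <- data.frame(
  studentNo = 100101:100110,
  dept      = rep(c("bioeng", "mbg", "bioeng"), times = c(4, 3, 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L),
  stringsAsFactors = FALSE
)

by_name   <- students %>% select(studentNo, birthYear)   # keep columns by name
dropped   <- students %>% select(-dept, -gender)         # drop several columns at once
by_prefix <- students %>% select(starts_with("b"))       # keep columns whose name starts with "b"

names(by_name)
names(dropped)
names(by_prefix)
```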
With mutate, we can create a new column.
students %>%
select(-dept) %>%
mutate(age = 2017-birthYear)
## # A tibble: 10 x 4
## studentNo gender birthYear age
## <int> <chr> <int> <dbl>
## 1 100101 M 1995 22
## 2 100102 F 1996 21
## 3 100103 F 1995 22
## 4 100104 F 1997 20
## 5 100105 M 1996 21
## 6 100106 F 1997 20
## 7 100107 F 1995 22
## 8 100108 M 1996 21
## 9 100109 M 1995 22
## 10 100110 F 1997 20
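The year 2017 is hard-coded in the example above. A possible variation, sketched below with the same data rebuilt inline, takes the year from the system date and creates two columns in one mutate call. The names `current_year`, `with_age` and `adult` are illustrative and not part of the course material.

```r
library(dplyr)

# Same rows as students.txt above, rebuilt inline so the sketch is self-contained
students <- data.frame(
  studentNo = 100101:100110,
  dept      = rep(c("bioeng", "mbg", "bioeng"), times = c(4, 3, 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L),
  stringsAsFactors = FALSE
)

# Year taken from the system clock instead of a fixed number
current_year <- as.integer(format(Sys.Date(), "%Y"))

with_age <- students %>%
  mutate(
    age   = current_year - birthYear,  # age reached in the current year
    adult = age >= 18                  # a later column can use an earlier one
  )
with_age
```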
With filter, we can select certain rows.
students %>%
select(-dept) %>%
mutate(age = 2017-birthYear) %>%
filter(gender == "M")
## # A tibble: 4 x 4
## studentNo gender birthYear age
## <int> <chr> <int> <dbl>
## 1 100101 M 1995 22
## 2 100105 M 1996 21
## 3 100108 M 1996 21
## 4 100109 M 1995 22
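filter also accepts several conditions. The sketch below, again on an inline copy of the students data, shows conditions combined with AND (comma), OR (`|`) and `%in%`. The result names are illustrative.

```r
library(dplyr)

# Same rows as students.txt above, rebuilt inline so the sketch is self-contained
students <- data.frame(
  studentNo = 100101:100110,
  dept      = rep(c("bioeng", "mbg", "bioeng"), times = c(4, 3, 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L),
  stringsAsFactors = FALSE
)

# Comma-separated conditions must all hold (logical AND)
bioeng_women <- students %>% filter(gender == "F", dept == "bioeng")

# | keeps rows where at least one condition holds (logical OR)
born_95_or_97 <- students %>% filter(birthYear == 1995 | birthYear == 1997)

# %in% is a shorter way to test membership in a set of values
same_with_in <- students %>% filter(birthYear %in% c(1995, 1997))

nrow(bioeng_women)
nrow(born_95_or_97)
```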
With arrange, we can order the table by a column, in ascending or descending order.
students %>%
select(-dept) %>%
mutate(age = 2017-birthYear) %>%
filter(gender == "M") %>%
arrange(-age)
## # A tibble: 4 x 4
## studentNo gender birthYear age
## <int> <chr> <int> <dbl>
## 1 100101 M 1995 22
## 2 100109 M 1995 22
## 3 100105 M 1996 21
## 4 100108 M 1996 21
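The minus sign in `arrange(-age)` works because age is numeric. dplyr also provides `desc()`, which works for character columns as well, and arrange accepts several columns to break ties. A minimal sketch on an inline copy of the data (result names are illustrative):

```r
library(dplyr)

# Same rows as students.txt above, rebuilt inline so the sketch is self-contained
students <- data.frame(
  studentNo = 100101:100110,
  dept      = rep(c("bioeng", "mbg", "bioeng"), times = c(4, 3, 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L),
  stringsAsFactors = FALSE
)

# Latest birth year first; ties are broken by ascending student number
youngest_first <- students %>% arrange(desc(birthYear), studentNo)

# desc() also works on character columns, where a minus sign would fail
dept_reversed <- students %>% arrange(desc(dept), studentNo)

head(youngest_first, 3)
head(dept_reversed, 3)
```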
With summarise, we can generate summaries for columns.
students %>%
select(-dept) %>%
mutate(age = 2017-birthYear) %>%
filter(gender == "M") %>%
arrange(-age) %>%
summarise(avg=mean(age))
## # A tibble: 1 x 1
## avg
## <dbl>
## 1 21.5
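summarise can compute several summaries in one call. The sketch below counts the rows with `n()` and reports the range and mean of birthYear on an inline copy of the data. The output column names are illustrative.

```r
library(dplyr)

# Same rows as students.txt above, rebuilt inline so the sketch is self-contained
students <- data.frame(
  studentNo = 100101:100110,
  dept      = rep(c("bioeng", "mbg", "bioeng"), times = c(4, 3, 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L),
  stringsAsFactors = FALSE
)

overview <- students %>%
  summarise(
    n_students     = n(),              # number of rows
    earliest_birth = min(birthYear),   # smallest birth year
    latest_birth   = max(birthYear),   # largest birth year
    avg_birth      = mean(birthYear)   # mean birth year
  )
overview
```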
If group_by is used to group rows by one or more columns, a subsequent summarise runs within each group and produces one result per group. Below we calculate the mean age per gender.
students %>%
select(-dept) %>%
mutate(age = 2017-birthYear) %>%
group_by(gender) %>%
summarise(avg=mean(age))
## # A tibble: 2 x 2
## gender avg
## <chr> <dbl>
## 1 F 20.83333
## 2 M 21.50000
Below we calculate mean age per department.
students %>%
mutate(age = 2017-birthYear) %>%
group_by(dept) %>%
summarise(avg=mean(age))
## # A tibble: 2 x 2
## dept avg
## <chr> <dbl>
## 1 bioeng 21.14286
## 2 mbg 21.00000
group_by accepts multiple columns, creating one group for each combination of their values (a kind of nesting). Below we calculate the mean age per department and per gender.
students %>%
mutate(age = 2017-birthYear) %>%
group_by(dept,gender) %>%
summarise(avg=mean(age))
## # A tibble: 4 x 3
## # Groups: dept [?]
## dept gender avg
## <chr> <chr> <dbl>
## 1 bioeng F 20.75000
## 2 bioeng M 21.66667
## 3 mbg F 21.00000
## 4 mbg M 21.00000
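The `# Groups: dept` line in the output above shows that the result is still grouped: with its default behaviour, summarise removes only the last grouping column. The sketch below, on an inline copy of the data, inspects the remaining grouping with `group_vars()`, removes it with `ungroup()`, and uses `count()` as a shortcut for counting rows per group. The result names are illustrative.

```r
library(dplyr)

# Same rows as students.txt above, rebuilt inline so the sketch is self-contained
students <- data.frame(
  studentNo = 100101:100110,
  dept      = rep(c("bioeng", "mbg", "bioeng"), times = c(4, 3, 3)),
  gender    = c("M", "F", "F", "F", "M", "F", "F", "M", "M", "F"),
  birthYear = c(1995L, 1996L, 1995L, 1997L, 1996L, 1997L, 1995L, 1996L, 1995L, 1997L),
  stringsAsFactors = FALSE
)

by_dept_gender <- students %>%
  mutate(age = 2017 - birthYear) %>%
  group_by(dept, gender) %>%
  summarise(avg = mean(age))

group_vars(by_dept_gender)     # "dept": the gender grouping was dropped

# ungroup() removes the remaining grouping before further steps
ungrouped <- by_dept_gender %>% ungroup()
group_vars(ungrouped)          # character(0)

# count() groups, counts rows, and returns an ungrouped result in one step
counts <- students %>% count(dept, gender)
counts
```

Recent dplyr versions may print a message about the `.groups` argument when summarise is called on several grouping columns; the message is informational.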
The RStudio cheatsheets page has very useful and compact cheatsheets for
You can print them out and carry them with you at all times.
This week, the following chapters are the assignment.
Please use GitHub Issues if you’re having problems with concepts or code.
Last week our survey results were like this:
| Case | Number of Students |
|---|---|
| I don’t understand the videos | 6 |
| I passed the exercises without understanding | 7 |
| I have no idea what we are doing | 3 |
| I really liked the content, everything is great! | 8 |
Hopefully, the situation is better this week. To assist students who are stuck, we’ll be experimenting with GitHub Issues. Please visit the dav-assignments repo and browse to the week04 folder. You’ll notice two files that contain the chapter and section names from the assigned courses.
dav-assignments repo week04 folder contents
When you click on one of the assignment files, you can view its contents. Let’s assume you had a problem with the “Five verbs and their meaning” section during the DataCamp lecture. In that case, click on the corresponding line number (number 7 in the example below).
Contents of dplyr assignment file
A menu will then appear next to the line number you clicked. In the menu, please select “Open new issue”.
Note: you need to be signed in to GitHub for the following steps.
Opening a new issue
A new page will open in which you can write your issue. The issue will be pre-filled with a link to the line number you were referring to; please do not edit that line.
Below the link, please describe your issue. Please be as descriptive as possible. The more information you provide, the more likely it is that the issue will be resolved. If your issue is about a piece of code:
After you’re done writing, click on the “Submit new issue” button.
Describing a new issue
GitHub Issues is not only for line numbers. If you have a general concern about the assignments or other course-related material, you can start an issue for that as well. To start a new issue, please go to the Issues tab in the repo and then click on the “New Issue” button.
After you submit the issue, you’ll see it posted in the Issues tab of the repo. When viewing an issue, you can add comments to it, which can turn into a conversation.
Opening a new issue