Reading Raw Data

The first task in any data analysis is importing the data into the software you are using, and this often means reading a comma-separated values (CSV) file. This tutorial gives a brief introduction to importing a CSV file in R. More than one package can do this, but here we focus on readr.

Let us assume we have a CSV file (test.csv) with four variables. Note that the first row typically holds the variable names.

test.csv

Patient ID, Gender, Date of Treatment, Time of Treatment
1, F, 2020-12-01, 09:01
2, F, 2020-12-02, 10:02
3, M, 2020-12-03, 12:03

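To import this file we simply use the read_csv function as shown here. Note the skip = n option, which tells read_csv to skip the first n lines, so reading starts from line n + 1. With skip = 0 nothing is skipped and the first row supplies the variable names.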

library(readr)

df <- read_csv(file = "test.csv", skip = 0)

df

## # A tibble: 3 x 4
##   `Patient ID` Gender `Date of Treatment` `Time of Treatment`
##          <dbl> <chr>  <date>              <time>             
## 1            1 F      2020-12-01          09:01              
## 2            2 F      2020-12-02          10:02              
## 3            3 M      2020-12-03          12:03
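The skip option described above matters most when a file carries extra lines before the header. The following is a minimal sketch under that assumption: the title line and the object names csv_text and df_skip are made up for illustration, and the data is passed as a quoted string rather than a file so that the example is self-contained.

```r
library(readr)

# Hypothetical export: line 1 is a free-text title, line 2 is the header row.
csv_text <- "Exported on 2020-12-05
ID,SEX
1,F
2,M"

# skip = 1 drops the title line, so the variable names are read from line 2.
df_skip <- read_csv(csv_text, skip = 1)

names(df_skip)
nrow(df_skip)
```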

By default read_csv takes the variable names from the first row and guesses each variable’s type by scanning up to guess_max records (1000 by default). Relying on this isn’t advisable, especially for a file with many records, because values beyond the scanned records may not fit the guessed type. You can increase guess_max, but that slows down reading. Alternatively, you can explicitly specify the type of each variable, and read_csv has a very compact way of doing this.

df <- read_csv(file      = "test.csv",  
               col_types = "dcDt",  
               col_names = c("ID", "SEX", "TRTDT", "TRTTM"),   
               skip      = 1  
               )  

df

## # A tibble: 3 x 4
##      ID SEX   TRTDT      TRTTM 
##   <dbl> <chr> <date>     <time>
## 1     1 F     2020-12-01 09:01 
## 2     2 F     2020-12-02 10:02 
## 3     3 M     2020-12-03 12:03

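Note that dcDt means the four variables are imported as double, character, date and time, in that order. Other types are available too and are listed below. The variable names are assigned through the col_names parameter, and because the names are supplied this way, skip = 1 is needed so that the header row in the file is not read as data.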

c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip
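Each letter in the list above is shorthand for a col_*() function, and the same specification can be written out by name with cols(). The sketch below is one way to do that. It uses inline data with short column names (an assumption made for brevity, not the test.csv file itself) and imports SEX as a factor to show one of the other types.

```r
library(readr)

# Inline copy of the example records, with short column names for illustration.
csv_text <- "ID,SEX,TRTDT,TRTTM
1,F,2020-12-01,09:01
2,F,2020-12-02,10:02
3,M,2020-12-03,12:03"

# One col_*() function per variable, matched by column name.
df_cols <- read_csv(
  csv_text,
  col_types = cols(
    ID    = col_double(),
    SEX   = col_factor(levels = c("F", "M")),
    TRTDT = col_date(format = "%Y-%m-%d"),
    TRTTM = col_time(format = "%H:%M")
  )
)

str(df_cols)
```

The long form is more verbose than the compact string, but it does not depend on column order and lets you pass options such as factor levels or a date format.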

A word of caution when specifying variable types: with a large number of records it is not always clear-cut whether a variable is numeric or alphanumeric. In that case the safest bet is to import all variables as characters, which is easily achieved as in the example below.

df <- read_csv(file      = "test.csv",  
               col_types = cols(.default = col_character()),  
               col_names = c("ID", "SEX", "TRTDT", "TRTTM"),    
               skip      = 1  
               )  
df

## # A tibble: 3 x 4
##   ID    SEX   TRTDT      TRTTM
##   <chr> <chr> <chr>      <chr>
## 1 1     F     2020-12-01 09:01
## 2 2     F     2020-12-02 10:02
## 3 3     M     2020-12-03 12:03
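To see why the all-character import is the safer choice, it may help to look at what happens when a numeric type is forced onto a variable that contains a stray alphanumeric value. The sketch below uses made-up data (the VALUE column and the "abc" entry are illustrative only) and the problems function from readr to list the values that could not be parsed.

```r
library(readr)

# Made-up records: the last VALUE is not a number.
csv_text <- "ID,VALUE
1,10
2,20
3,abc"

# Forcing both variables to double: the unparseable value becomes NA,
# and read_csv issues a warning about parsing issues.
df_forced <- read_csv(csv_text, col_types = "dd")
df_forced$VALUE

# problems() lists the row, column, expected type and actual value.
problems(df_forced)

# Importing everything as character keeps the original value intact.
df_chr <- read_csv(csv_text, col_types = cols(.default = col_character()))
df_chr$VALUE
```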

If you have adopted the strategy of importing all variables as characters, the type_convert function is useful afterwards: it re-guesses the type of each character variable and converts it accordingly.

df <- type_convert(df) 

df

## # A tibble: 3 x 4
##      ID SEX   TRTDT      TRTTM 
##   <dbl> <chr> <date>     <time>
## 1     1 F     2020-12-01 09:01 
## 2     2 F     2020-12-02 10:02 
## 3     3 M     2020-12-03 12:03
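type_convert also accepts a col_types argument, so you can fix the types you are sure about and leave the rest to be guessed. A minimal sketch, assuming an all-character data frame like the one above (rebuilt here by hand as df_chr so the example runs on its own):

```r
library(readr)

# All-character data frame, standing in for the result of an all-character import.
df_chr <- data.frame(
  ID    = c("1", "2", "3"),
  SEX   = c("F", "F", "M"),
  TRTDT = c("2020-12-01", "2020-12-02", "2020-12-03"),
  TRTTM = c("09:01", "10:02", "12:03"),
  stringsAsFactors = FALSE
)

# ID and TRTDT are converted as specified; the other variables are guessed.
df_typed <- type_convert(
  df_chr,
  col_types = cols(
    ID    = col_integer(),
    TRTDT = col_date(format = "%Y-%m-%d")
  )
)

str(df_typed)
```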

Including CSV Inline

This is unlikely in practice but very useful for testing, where you just want a few records to try out your functions. All you need to do is replace the file parameter with a block of data enclosed in quotes.

df <- read_csv(
"Patient ID, Gender, Date of Treatment, Time of Treatment
1, F, 2020-12-01, 09:01
2, F, 2020-12-02, 10:02
3, M, 2020-12-03, 12:03",
col_types = cols(.default = col_character()), 
col_names = c("ID", "SEX", "TRTDT", "TRTTM"), 
skip      = 1
)

df

## # A tibble: 3 x 4
##   ID    SEX   TRTDT      TRTTM
##   <chr> <chr> <chr>      <chr>
## 1 1     F     2020-12-01 09:01
## 2 2     F     2020-12-02 10:02
## 3 3     M     2020-12-03 12:03
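In more recent versions of readr (2.0 or later, which is an assumption about your installation), inline data can also be marked explicitly by wrapping the string in I(). This makes the intent clear and works for strings that contain no line break. A short sketch, with df_inline as an illustrative name:

```r
library(readr)

# I() tells read_csv that the string is literal data, not a file path.
# Requires readr 2.0 or later.
df_inline <- read_csv(
  I("ID,SEX,TRTDT,TRTTM\n1,F,2020-12-01,09:01\n2,F,2020-12-02,10:02"),
  col_types = "dcDt"
)

df_inline
```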

Reference

https://readr.tidyverse.org/reference/read_delim.html

Contact

Email: trand000@aol.com

Duong Tran

05 Dec 2020