
Data Input and Output


Data Management in R

Data Management plays a crucial role in the entire data analysis lifecycle.

In this class, we will explore how R can be a powerful tool for importing, exporting, cleaning, and managing your data efficiently.

Understanding these aspects is essential for any data scientist, analyst, or researcher working with R.


Reading in Data

Several functions assist with reading data into R, including read.table() and read.csv().


Writing Data to File

There are analogous functions for writing data to files, such as write.table() and write.csv().


read.table()

To read an entire data frame directly, the external file will need to have the following form:

The first line of the file should have the column names in the data frame.

Each additional line should have values for each column.

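Note: By default, numeric items are read as numeric variables. Non-numeric items are read as character vectors in R 4.0.0 and later; in earlier versions, or when stringsAsFactors = TRUE, they are read as factors. Specifying header = TRUE tells read.table() to treat the first line as the column names rather than as data.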
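As a minimal sketch of the file layout described above, the following writes a small made-up table to a temporary file and reads it back. The file contents and the names tmp and houses are illustrative assumptions, not part of the course data.

```r
# Hypothetical example file: the first line holds the column names,
# and each later line holds one value per column.
tmp <- tempfile(fileext = ".txt")
writeLines(c("price floor area",
             "52.00 111 830",
             "54.75 128 710",
             "57.50 101 1000"), tmp)

# header = TRUE: treat the first line as column names, not as data
houses <- read.table(tmp, header = TRUE)

str(houses)    # three numeric columns named price, floor, area
nrow(houses)   # 3 rows of data
```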

For small to moderately sized datasets

You can call read.table without specifying any other arguments.

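By default, read.table() will skip any lines that begin with # and figure out the type of variable in each column of the table. read.csv() operates similarly, but it assumes comma-separated values with a header line and does not treat # as a comment character by default.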
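The default behaviour described above can be checked on a tiny file. This is only a sketch: the file contents and the object name scores are made up for illustration.

```r
# Hypothetical file with a leading comment line
tmp <- tempfile(fileext = ".txt")
writeLines(c("# this comment line is skipped by read.table()",
             "id name score",
             "1 ann 9.5",
             "2 bob 7.0"), tmp)

scores <- read.table(tmp, header = TRUE)

# One class per column, inferred by R from the values it read
sapply(scores, class)
nrow(scores)   # 2 data rows; the comment line was not read as data
```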

For large datasets

Use the colClasses argument to tell R the class of each column in the table. Supplying the classes up front saves R from having to infer them, which can make reading faster.
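Besides speed, colClasses also controls how values are interpreted. The sketch below uses a made-up three-column file (the column names and values are assumptions) to show the difference between guessed and declared classes.

```r
# Hypothetical CSV in which 'zip' should stay text
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,zip,amount",
             "1,02139,10.5",
             "2,60614,3.25"), tmp)

# Without colClasses, R guesses: zip becomes an integer and loses its leading zero
guessed <- read.csv(tmp)

# With colClasses, each column's class is fixed up front
typed <- read.csv(tmp, colClasses = c("integer", "character", "numeric"))

guessed$zip   # 2139 60614
typed$zip     # "02139" "60614"
```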


Writing to a File

You can write data (e.g., a data frame called ‘mydata’) to a CSV file “my.data.csv” in your working directory using write.csv().

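Note: write.table() and write.csv() will overwrite an existing file of the same name. write.table() adds to the end of the file instead when you specify append = TRUE; write.csv() ignores attempts to change append and issues a warning.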

Omit Column Names: write.table() with col.names = FALSE
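The writing functions mentioned above can be sketched as follows. The data frame mydata is made up for illustration, and the files go to a temporary directory here; in practice you would usually give a path relative to your working directory, such as "my.data.csv".

```r
# 'mydata' is a small made-up data frame used only for illustration
mydata <- data.frame(id = 1:3, group = c("a", "b", "a"))

# Write a CSV file (row names suppressed so the file round-trips cleanly)
csv_path <- file.path(tempdir(), "my.data.csv")
write.csv(mydata, csv_path, row.names = FALSE)

# write.table() can append rows and omit the column names
txt_path <- file.path(tempdir(), "my.data.txt")
write.table(mydata, txt_path, sep = ",", row.names = FALSE)
write.table(mydata, txt_path, sep = ",", row.names = FALSE,
            col.names = FALSE, append = TRUE)

readLines(csv_path)
length(readLines(txt_path))   # 7: one header line plus 3 + 3 data rows
```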


Resources for the More Advanced User

CRAN (Comprehensive R Archive Network) provides an excellent, succinct user manual for importing and exporting data in R:

https://cran.r-project.org/doc/manuals/r-release/R-data.pdf


Data Management


Built-In Datasets

The R package datasets contains several datasets that can be loaded into R. Others are available in packages that can be downloaded from CRAN.
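To see what is available, data() can list the datasets in a package. A minimal sketch, assuming only the base datasets package (the object name available is illustrative):

```r
# List the datasets shipped in the 'datasets' package
available <- data(package = "datasets")$results

# 'results' is a character matrix; Item is the dataset name, Title its description
head(available[, c("Item", "Title")])

# Datasets in other installed packages can be listed the same way,
# e.g. data(package = "faraway") if that package is installed
```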


Example for data()

head() returns the first six rows of data

tail() returns the last six rows of data

data(mtcars)

head(mtcars) #Will return the first 6 rows
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#cyl will cause an object not found error because the data frame has only been loaded, not attached

attach(mtcars) # all the columns are now on the search path and can be referenced by name

cyl #Now cyl can be used
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
data(chredlin)
## Warning in data(chredlin): data set 'chredlin' not found
data(chredlin, package = "faraway")

tail(chredlin) #Will return the last 6 rows, but can be customized
##       race fire theft  age involact income side
## 60655  1.0  4.8    19 15.2      0.0 13.323    s
## 60643 42.5 10.4    25 40.8      0.5 12.960    s
## 60628 35.1 15.6    28 57.8      1.0 11.260    s
## 60627 47.4  7.0     3 11.4      0.2 10.080    s
## 60633 34.0  7.1    23 49.2      0.3 11.428    s
## 60645  3.1  4.9    27 46.6      0.0 13.731    n
tail(chredlin, n=3) ## last 3 rows
##       race fire theft  age involact income side
## 60627 47.4  7.0     3 11.4      0.2 10.080    s
## 60633 34.0  7.1    23 49.2      0.3 11.428    s
## 60645  3.1  4.9    27 46.6      0.0 13.731    n
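attach() is convenient, but attached columns can be masked by other objects of the same name. As a hedged alternative sketch, columns can be reached without attaching anything; the name mpg_by_cyl is illustrative.

```r
data(mtcars)

# Refer to a column explicitly with $
mean(mtcars$mpg)

# Or evaluate an expression with the columns in scope, without attach()
mpg_by_cyl <- with(mtcars, tapply(mpg, cyl, mean))
mpg_by_cyl   # mean mpg for 4-, 6- and 8-cylinder cars
```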

Reshaping Data

Sometimes data will come in “wide format”, where repeated measurements on an experimental unit (e.g., a subject) are represented by several variables in the dataset. For example, a clinical trial with 50 timepoints would have 50 measurement columns per subject.

To reshape it to “long” or “stacked” format, you can apply the stack() or reshape() functions.

Note: You can also use reshape to go from long to wide. Packages reshape, reshape2, and plyr have useful tools for reshaping data.
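The two base R approaches can be sketched on a small made-up dataset with three time points. The data frame wide and its column names are assumptions for illustration only.

```r
# Made-up wide data: one row per subject, one column per time point
wide <- data.frame(id = 1:3,
                   t1 = c(5.1, 4.8, 6.0),
                   t2 = c(5.4, 4.9, 6.3),
                   t3 = c(5.9, 5.2, 6.1))

# stack() puts the selected columns into 'values' with an 'ind' label
stacked <- stack(wide, select = c(t1, t2, t3))

# reshape() keeps the subject id alongside each measurement
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2", "t3"), v.names = "value",
                timevar = "time", times = 1:3, idvar = "id")

dim(stacked)   # 9 rows, 2 columns (values, ind)
dim(long)      # 9 rows, 3 columns (id, time, value)

# And back from long to wide
wide_again <- reshape(long, direction = "wide",
                      idvar = "id", timevar = "time", v.names = "value")
dim(wide_again)   # 3 rows, 4 columns
```

Note that stack() drops the subject identifier, so reshape() is usually the better fit when each measurement must stay linked to its subject.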


Removing Missing Data

A common task in data analysis is identifying and (sometimes) removing missing values (NAs).

x <- c(1, 2, NA, 4, NA, 5) 

bad <- is.na(x) 

print(bad) 
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
x[!bad] #"!" means not
## [1] 1 2 4 5

You can also create a subset of multiple objects:

x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f") 

#complete.cases() checks x and y position by position; in this example the NAs happen to fall in the same positions
good <- complete.cases(x,y) #Returns a logical vector indicating which cases are complete, i.e., have no missing values.

x[good]
## [1] 1 2 4 5
y[good]
## [1] "a" "b" "d" "f"
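When the missing values fall in different positions, complete.cases() keeps only the positions where every object is observed. A small sketch with made-up vectors:

```r
x <- c(1, NA, 3, 4)
y <- c("a", "b", NA, "d")

# TRUE only where both x and y are observed
good <- complete.cases(x, y)
good      # TRUE FALSE FALSE TRUE

x[good]   # 1 4
y[good]   # "a" "d"
```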

complete.cases() also works with data frames:

data(chmiss, package = "faraway")
sub.chmiss <- head(chmiss)
sub.chmiss
##       race fire theft  age involact income
## 60626 10.0  6.2    29 60.4       NA 11.744
## 60640 22.2  9.5    44 76.5      0.1  9.323
## 60613 19.6 10.5    36   NA      1.2  9.948
## 60657 17.3  7.7    37   NA      0.5 10.656
## 60614 24.5  8.6    53 81.4      0.7  9.730
## 60610 54.0 34.1    68 52.6      0.3  8.231
good <- complete.cases(sub.chmiss)
sub.chmiss[good,]
##       race fire theft  age involact income
## 60640 22.2  9.5    44 76.5      0.1  9.323
## 60614 24.5  8.6    53 81.4      0.7  9.730
## 60610 54.0 34.1    68 52.6      0.3  8.231
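For data frames, base R's na.omit() gives the same rows as subsetting with complete.cases(). The sketch below uses a made-up data frame rather than chmiss so that it runs without the faraway package.

```r
# Made-up data frame with missing values in different columns
df <- data.frame(age = c(60.4, NA, 81.4),
                 income = c(11.7, 9.9, NA))

kept <- df[complete.cases(df), ]   # keep rows with no NA in any column
same <- na.omit(df)                # drops the same rows in one call

nrow(kept)                                       # 1
all.equal(kept, same, check.attributes = FALSE)  # TRUE
```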

Demonstration Code

Clear your current variables

rm(list = ls())

table1 <- read.table("My_Sample_Sheet.txt", row.names = 1, skip = 12, header = TRUE)

#Export the Table as CSV
write.csv(table1, "new_sample_sheet.csv")

#Export the table as a .txt
write.table(table1, "new_sample_sheet.txt", sep=",", row.names = FALSE)

#Reading a SAS file
library(haven)
tumor.data <- read_sas("Tumor_Data")
#View(tumor.data)

#Column Classes
system.time(read.csv("5000000_BT_Records.csv"))
##    user  system elapsed 
##  10.151   0.204  10.365
initial <- read.csv("5000000_BT_Records.csv", nrows = 10)
classed <- sapply(initial, class)
classed
##        Date Description    Deposits  Withdrawls     Balance 
## "character" "character" "character" "character" "character"
#Providing the column classes ahead of time can help R read the file faster
system.time(data <- read.csv("5000000_BT_Records.csv", colClasses = classed))
##    user  system elapsed 
##   6.447   0.137   6.602

Saving and Loading Files

x <- list(a = 1,
          b = 2)

y <- list(a = 3,
          b = 4)

#Save a single object in binary (RDS) format
saveRDS(x, "test.rds")
readRDS("test.rds")
## $a
## [1] 1
## 
## $b
## [1] 2
#Save the entire workspace as an image file
save.image("test2.RData")
load("test2.RData")
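The difference between the two approaches can be sketched as follows. Temporary file paths are used here as an assumption so the example leaves nothing in the working directory; save() is shown in place of save.image() to store only chosen objects.

```r
x <- list(a = 1, b = 2)
y <- list(a = 3, b = 4)

# saveRDS() stores one object; readRDS() returns it, so you choose the new name
rds_path <- tempfile(fileext = ".rds")
saveRDS(x, rds_path)
x_copy <- readRDS(rds_path)
identical(x, x_copy)   # TRUE

# save() stores named objects; load() restores them under their original names
rdata_path <- tempfile(fileext = ".RData")
save(x, y, file = rdata_path)
restored <- new.env()
load(rdata_path, envir = restored)
ls(restored)           # "x" "y"
```

Loading into a separate environment, as above, avoids silently overwriting objects of the same name in your current workspace.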