Data management plays a crucial role throughout the data analysis lifecycle.
In this class, we will explore how R can be a powerful tool for importing, exporting, cleaning, and managing your data efficiently.
Understanding these aspects is essential for any data scientist, analyst, or researcher working with R.
Several functions read data into R, such as read.table() and read.csv().
Analogous functions, such as write.table() and write.csv(), write data to files.
To read an entire data frame directly, the external file will need to have the following form:
The first line of the file should have the column names in the data frame.
Each additional line should have values for each column.
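Note: By default, numeric items are read as numeric variables. Non-numeric items are read as character variables in R 4.0.0 and later (as factors in earlier versions, or when stringsAsFactors = TRUE). Specifying header = TRUE tells R that the first line of the file contains the column names.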
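As a minimal sketch of that file layout, the following writes a small made-up table to a temporary file and reads it back. The file contents and the object names (tmp, df, raw) are illustrative assumptions, not part of the course data.

```r
# Write a small whitespace-delimited file: the first line holds the column
# names, each later line holds one value per column (made-up data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id name score",
             "1 ann 90.5",
             "2 bob 85.0",
             "3 cat 77.25"), tmp)

# header = TRUE tells read.table() to take column names from the first line.
df <- read.table(tmp, header = TRUE)
str(df)

# Without header = TRUE, the first line of this file is treated as data and
# the columns get the default names V1, V2, V3.
raw <- read.table(tmp)
names(raw)

unlink(tmp)  # remove the temporary file
```

Comparing str(df) with names(raw) shows why the header argument matters when every line has the same number of fields.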
You can call read.table() without specifying any arguments other than the file name.
By default, read.table() skips any lines that begin with # and works out the type of variable in each column of the table. read.csv() operates similarly, but it assumes comma-separated values with a header line and does not treat # as a comment character.
Use the colClasses argument to tell R the class of object for each column in the table.
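The effect of colClasses can be sketched with a few lines of made-up CSV text passed through the text argument, so no file is needed. The column names and values here are assumptions chosen to show one common pitfall: identifiers with leading zeros.

```r
# Made-up CSV content supplied through the text argument instead of a file.
csv_text <- "zip,count,rate
01234,10,0.5
60626,12,1.25"

# Default: R guesses the types, so the zip codes become integers
# and the leading zero in 01234 is lost.
guessed <- read.csv(text = csv_text)
sapply(guessed, class)

# colClasses fixes the class of each column, in column order.
typed <- read.csv(text = csv_text,
                  colClasses = c("character", "integer", "numeric"))
sapply(typed, class)
typed$zip
```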
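You can write data (e.g., a data frame called ‘mydata’) to a CSV file “my.data.csv” in your working directory using write.csv(mydata, "my.data.csv").
Note: The writing functions overwrite an existing file of the same name in the working directory. write.table() appends instead if you specify append = TRUE; write.csv() ignores that argument with a warning.
Omit Column Names: write.table() with col.names = FALSE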
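A short sketch of writing, appending, and omitting column names follows. The data frames mydata and more are hypothetical stand-ins, and the file is written to the session's temporary directory rather than the working directory so that nothing is overwritten.

```r
# Hypothetical data frame standing in for 'mydata'.
mydata <- data.frame(id = 1:3, group = c("a", "b", "a"))

out_csv <- file.path(tempdir(), "my.data.csv")

# write.csv() writes a header line and, by default, a column of row names;
# row.names = FALSE leaves the row names out.
write.csv(mydata, out_csv, row.names = FALSE)
readLines(out_csv)

# write.table() can append to an existing file. col.names = FALSE avoids
# writing the header a second time.
more <- data.frame(id = 4:5, group = c("b", "b"))
write.table(more, out_csv, sep = ",", append = TRUE,
            col.names = FALSE, row.names = FALSE)

# Read the combined file back to check the result.
combined <- read.csv(out_csv)
nrow(combined)
```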
CRAN (Comprehensive R Archive Network) provides an excellent, succinct user manual for importing and exporting data in R:
https://cran.r-project.org/doc/manuals/r-release/R-data.pdf
The R package datasets contains several datasets that can be loaded into R. Others are available in packages that can be downloaded from CRAN.
head() returns the first six rows of data by default
tail() returns the last six rows of data by default
data(mtcars)
head(mtcars) #Will return the first 6 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#cyl will cause an object not found error because the data frame has been loaded but not attached
attach(mtcars) # the columns of mtcars are now on the search path and can be referenced by name
cyl #Now cyl can be used
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
data(chredlin)
## Warning in data(chredlin): data set 'chredlin' not found
data(chredlin, package = "faraway")
tail(chredlin) #Will return the last 6 rows, but can be customized
## race fire theft age involact income side
## 60655 1.0 4.8 19 15.2 0.0 13.323 s
## 60643 42.5 10.4 25 40.8 0.5 12.960 s
## 60628 35.1 15.6 28 57.8 1.0 11.260 s
## 60627 47.4 7.0 3 11.4 0.2 10.080 s
## 60633 34.0 7.1 23 49.2 0.3 11.428 s
## 60645 3.1 4.9 27 46.6 0.0 13.731 n
tail(chredlin, n=3) ## last 3 rows
## race fire theft age involact income side
## 60627 47.4 7.0 3 11.4 0.2 10.080 s
## 60633 34.0 7.1 23 49.2 0.3 11.428 s
## 60645 3.1 4.9 27 46.6 0.0 13.731 n
Sometimes data will come in “wide format”, where repeated measurements on an experimental unit (such as a subject) are represented by several variables in the dataset. For example, a clinical trial with 50 timepoints might store each timepoint in its own column.
To reshape it to “long” or “stacked” format, you can apply the stack() or reshape() functions.
Note: You can also use reshape to go from long to wide. Packages reshape, reshape2, and plyr have useful tools for reshaping data.
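A minimal sketch of both approaches, using a made-up data frame with three timepoints instead of 50. The object and column names (wide, t1 to t3, value, time) are assumptions for illustration only.

```r
# Made-up wide data: one row per subject, one column per timepoint.
wide <- data.frame(id = 1:3,
                   t1 = c(5.1, 6.0, 4.8),
                   t2 = c(5.4, 6.3, 5.0),
                   t3 = c(5.9, 6.1, 5.5))

# stack() concatenates the chosen columns into 'values' and records the
# source column in 'ind'; the subject id has to be added back by hand.
stacked <- stack(wide, select = c(t1, t2, t3))
stacked$id <- rep(wide$id, times = 3)
head(stacked)

# reshape() keeps the id and builds a time variable in one step.
long <- reshape(wide,
                direction = "long",
                varying   = c("t1", "t2", "t3"),
                v.names   = "value",
                timevar   = "time",
                times     = 1:3,
                idvar     = "id")
long[order(long$id, long$time), ]

# The same function converts back from long to wide.
wide_again <- reshape(long, direction = "wide",
                      idvar = "id", timevar = "time", v.names = "value")
dim(wide_again)
```

stack() is the quicker choice when only the measurement columns matter; reshape() takes more arguments but carries the identifier and time variables along.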
A common task in data analysis is identifying and (sometimes) removing missing values (NAs).
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
print(bad)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
x[!bad] #"!" means not
## [1] 1 2 4 5
You can also create a subset of multiple objects:
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
#The NAs happen to be in the same positions here; complete.cases() drops any position that is missing in either x or y
good <- complete.cases(x,y) #Return a logical vector indicating which cases are complete, i.e., have no missing values.
x[good]
## [1] 1 2 4 5
y[good]
## [1] "a" "b" "d" "f"
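To see how complete.cases() behaves when the missing values do not line up, here is a small sketch with made-up vectors (x2 and y2 are hypothetical names chosen so that x and y above are not overwritten).

```r
# Made-up vectors whose missing values fall in different positions.
x2 <- c(1, 2, NA, 4, 5, 6)
y2 <- c("a", NA, "c", "d", NA, "f")

# A position counts as complete only if it is non-missing in every argument.
good2 <- complete.cases(x2, y2)
good2

# Both subsets keep the same positions, so the vectors stay aligned.
x2[good2]
y2[good2]
```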
complete.cases() also works with data frames:
data(chmiss, package = "faraway")
sub.chmiss <- head(chmiss)
sub.chmiss
## race fire theft age involact income
## 60626 10.0 6.2 29 60.4 NA 11.744
## 60640 22.2 9.5 44 76.5 0.1 9.323
## 60613 19.6 10.5 36 NA 1.2 9.948
## 60657 17.3 7.7 37 NA 0.5 10.656
## 60614 24.5 8.6 53 81.4 0.7 9.730
## 60610 54.0 34.1 68 52.6 0.3 8.231
good <- complete.cases(sub.chmiss)
sub.chmiss[good,]
## race fire theft age involact income
## 60640 22.2 9.5 44 76.5 0.1 9.323
## 60614 24.5 8.6 53 81.4 0.7 9.730
## 60610 54.0 34.1 68 52.6 0.3 8.231
Clear your current variables
rm(list = ls())
table1 <- read.table("My_Sample_Sheet.txt", row.names = 1, skip = 12, header = TRUE)
#Export the Table as CSV
write.csv(table1, "new_sample_sheet.csv")
#Export the table as a .txt
write.table(table1, "new_sample_sheet.txt", sep=",", row.names = FALSE)
#Reading a SAS file
library(haven)
tumor.data <- read_sas("Tumor_Data")
#View(tumor.data)
#Column Classes
system.time(read.csv("5000000_BT_Records.csv"))
## user system elapsed
## 10.151 0.204 10.365
initial <- read.csv("5000000_BT_Records.csv", nrows = 10)
classed <- sapply(initial, class)
classed
## Date Description Deposits Withdrawls Balance
## "character" "character" "character" "character" "character"
#Providing the column classes ahead of time can help R read the file faster
system.time(data <- read.csv("5000000_BT_Records.csv", colClasses = classed))
## user system elapsed
## 6.447 0.137 6.602
Saving and Loading Files
x <- list(a = 1,
b = 2)
y <- list(a = 3,
b = 4)
#Save a single object in binary (RDS) format
saveRDS(x, "test.rds")
readRDS("test.rds")
## $a
## [1] 1
##
## $b
## [1] 2
#Save the entire workspace to an .RData file
save.image("test2.RData")
load("test2.RData")
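The difference between the two approaches can be sketched as follows. This example uses save() on named objects rather than save.image() on the whole workspace, and writes to temporary files; the file and object names are assumptions for illustration.

```r
x <- list(a = 1, b = 2)
y <- list(a = 3, b = 4)

rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

# saveRDS() stores one object without its name; readRDS() returns it,
# so it can be assigned to any name.
saveRDS(x, rds_file)
x_copy <- readRDS(rds_file)

# save() stores the named objects; load() recreates them under their
# original names and returns those names invisibly.
save(x, y, file = rdata_file)
rm(x, y)
restored <- load(rdata_file)
restored
```

Because load() silently recreates objects under their saved names, it can overwrite objects already in the workspace; readRDS() avoids that by leaving the assignment to you.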