Getting and Cleaning Data

Week 1: Basic tips

The goal of this course: Raw data -> Processing script -> tidy data -> …
Components of tidy data: the raw data, a tidy data set, a code book describing the tidy data set, and the processing code that turns the raw data into the tidy data.

if(!file.exists("data")){dir.create("data")}

download.file(url, destfile = "./data/filename", method = "curl")
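The two lines above can be folded into a small helper that skips the download when the file is already on disk. This is only a sketch: fetch_once is a hypothetical name (not from the course), and it leaves method at its default instead of forcing "curl", on the assumption that the default works on the current platform.

```r
# Hypothetical helper (not part of the course code): download a file only once.
fetch_once <- function(url, destfile) {
  dest_dir <- dirname(destfile)
  # Create the target directory if it does not exist yet
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  # Skip the download when a copy is already on disk
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}
```

A call would look like fetch_once(url, "./data/filename"), with url defined as in the line above.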

Read local files: read.table() and read.csv() read rectangular data. header, sep, etc. are important parameters.
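A small sketch of how header and sep change the result. The file contents and the names tmp, scores and scores2 are made up for illustration; the file is written to a temporary location so nothing has to be downloaded.

```r
# Write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;score", "1;Li;89", "2;Gong;90"), tmp)

# header = TRUE: the first line holds the column names
# sep = ";": fields are separated by semicolons
scores <- read.table(tmp, header = TRUE, sep = ";", stringsAsFactors = FALSE)

# read.csv() has header = TRUE and sep = "," among its defaults,
# so only sep needs to be overridden here
scores2 <- read.csv(tmp, sep = ";", stringsAsFactors = FALSE)

str(scores)
```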

Week 2: Reading data from different source types

This week covers the packages for reading data from MySQL, HDF5, websites, XML, and APIs, and how to use them.

Week 3

3-1 Subsetting and Sorting

set.seed(1)
X <- data.frame("var1" = sample(1:5), "var2" = sample(6:10, replace = TRUE), "var3" = rnorm(5))
sort(X$var2, decreasing = TRUE, na.last = TRUE)

## [1] 10 10  9  9  6

X[order(X$var2, X$var1),]

##   var1 var2       var3
## 5    1    6 -0.3053884
## 4    3    9  0.5757814
## 3    4    9  0.7383247
## 1    2   10 -0.8204684
## 2    5   10  0.4874291

## or we can use the plyr package
library(plyr)
arrange(X, desc(var1))

##   var1 var2       var3
## 1    5   10  0.4874291
## 2    4    9  0.7383247
## 3    3    9  0.5757814
## 4    2   10 -0.8204684
## 5    1    6 -0.3053884
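The section title also mentions subsetting, which the notes above do not show. A minimal sketch, using a small hand-made data frame (the name df, the values, and the NAs are made up for illustration) instead of the random X above:

```r
df <- data.frame(var1 = c(2, 5, 4, 3, 1),
                 var2 = c(10, NA, 9, NA, 6))

# Subset by row index and column name
first_two <- df[1:2, "var2"]

# Subset rows with a logical condition (& is AND, | is OR)
middle <- df[df$var1 > 1 & df$var1 <= 4, ]

# A comparison against NA yields NA, and an NA index yields an all-NA row
with_na <- df[df$var2 > 8, ]

# which() keeps only the positions that are TRUE, so the NA rows drop out
no_na <- df[which(df$var2 > 8), ]
```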

There are several functions for summarizing data, including table, str, and summary.

Make cross tabs: xtabs(Freq ~ Gender + Admit, data = DF) builds a contingency table, an object of class "xtabs" that inherits from "table".
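DF is not defined in these notes; one data set that fits the formula is the built-in UCBAdmissions table converted to a data frame. The sketch below assumes that is the intended data and compares xtabs with table (xt and tb are illustrative names).

```r
# UCBAdmissions ships with R; as.data.frame() gives one row per cell,
# with the counts in a column named Freq
DF <- as.data.frame(UCBAdmissions)

# Sum Freq over every combination of Gender and Admit
xt <- xtabs(Freq ~ Gender + Admit, data = DF)
class(xt)

# table() counts rows instead of summing a frequency column,
# so here it only reports how many departments fall in each cell
tb <- table(DF$Gender, DF$Admit)
```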

3-2 Data processing

Cut by quantiles: x <- 1:20; cut(x, breaks = quantile(x))
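One thing to watch with this call: cut builds intervals that are open on the left, so the smallest value falls outside the first interval and becomes NA. A short sketch of the difference include.lowest makes (q, g1 and g2 are illustrative names):

```r
x <- 1:20
q <- quantile(x)   # the 0%, 25%, 50%, 75% and 100% quantiles

# Default intervals are (a, b], so min(x) is not covered and becomes NA
g1 <- cut(x, breaks = q)

# include.lowest = TRUE closes the first interval on the left: [a, b]
g2 <- cut(x, breaks = q, include.lowest = TRUE)

table(g2, useNA = "ifany")
```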

Reshaping data

library(reshape2)
mtcars$carname <- rownames(mtcars)
carMelt <- melt(mtcars, id = c("carname", "gear", "cyl"), measure.vars = c("mpg", "hp"))
cylData <- dcast(carMelt, cyl ~ variable)

## Aggregation function missing: defaulting to length

cylData <- dcast(carMelt, cyl ~ variable, mean)

## This is great! I finally get it!!!
## An example to understand melt and dcast
x <- data.frame("name" = c("Li", "Gong", "Ma", "Hong"), "num" = c(35, 36, 38, 9), "sex" = as.factor(c("M", "M", "F", "F")), "math" = c(89, 90, 68, 73), "english" = c(98, 89, 78, 76))
meltx <- melt(x, id = c("name", "num"), measure.vars = c("math", "english"))
namecast <- dcast(meltx, name ~ variable)
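To see what melt and dcast do to the shape of the data, the same two steps can be mimicked by hand in base R. This is a rough analogue for illustration, not how reshape2 is implemented, and it uses only the name, math and english columns of the example (long and wide are illustrative names).

```r
x <- data.frame(name = c("Li", "Gong", "Ma", "Hong"),
                math = c(89, 90, 68, 73),
                english = c(98, 89, 78, 76))

# "Melt": stack the measure columns into a single value column, with a
# variable column recording which column each value came from (long format)
long <- data.frame(name = rep(x$name, times = 2),
                   variable = rep(c("math", "english"), each = nrow(x)),
                   value = c(x$math, x$english))

# "Cast": spread the long data back out, one row per name and one column
# per variable, aggregating the values in each cell with mean
wide <- tapply(long$value, list(long$name, long$variable), mean)
```

Because each name/variable pair occurs exactly once here, the aggregation function does not change the values; it only matters when several rows land in the same cell, as in the mtcars example.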



tapply(InsectSprays$count, InsectSprays$spray, sum)

##   A   B   C   D   E   F 
## 174 184  25  59  42 200

## Following is another way to do this
spIns <- split(InsectSprays$count, InsectSprays$spray)
sprCount <- sapply(spIns, sum)
## Or we can use the plyr package
sprCount <- ddply(InsectSprays, .(spray), summarize, sum = sum(count))

3-3 The dplyr package

This is the most convenient package!!!

## The dplyr package
## select, filter, arrange, rename, mutate, summarise
## The projects for this class give a very good explanation of how to use it.
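A minimal sketch of the six verbs on the built-in mtcars data, assuming dplyr is installed. The names cars4, by_cyl, horsepower and mean_mpg are made up for this example; group_by and the %>% pipe are not in the list above but are what summarise is normally paired with.

```r
library(dplyr)

cars4 <- mtcars %>%
  mutate(carname = rownames(mtcars)) %>%   # mutate: add a column
  filter(cyl == 4) %>%                     # filter: keep matching rows
  select(carname, mpg, hp) %>%             # select: keep these columns
  rename(horsepower = hp) %>%              # rename: new_name = old_name
  arrange(desc(mpg))                       # arrange: sort, here descending

# summarise: collapse each group to one row
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())
```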

Week 4

4-1 Editing text variables

tolower(names(cameraData)) converts the variable names to lower case.

splitNames <- strsplit(names(cameraData), "\\.")  # split each variable name at "."
firstElement <- function(x){x[1]}  # keep the part before the first "."
sapply(splitNames, firstElement)
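cameraData is not defined in these notes, so here is the same idea on a made-up vector of column names (varNames and cleanNames are hypothetical stand-ins):

```r
# Hypothetical column names standing in for names(cameraData)
varNames <- c("Location.1", "street", "zip.code")

# "." is a regex metacharacter (any character), so it is escaped as "\\."
splitNames <- strsplit(varNames, "\\.")

# Keep only the piece before the first "."
firstElement <- function(x) x[1]
cleanNames <- sapply(splitNames, firstElement)
```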

sub("_","","this_is_a_test_name") replaces the first _ with nothing; gsub replaces every _.
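The difference between the two is easiest to see side by side (s, first_only and all_gone are illustrative names):

```r
s <- "this_is_a_test_name"

first_only <- sub("_", "", s)    # only the first match is replaced
all_gone   <- gsub("_", "", s)   # every match is replaced

first_only   # "thisis_a_test_name"
all_gone     # "thisisatestname"
```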

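# search for "a" in a character vector
grep("a", c("am", "this", "what", "shit","go"))   # indices of the matching elements
grepl("a", c("am", "this", "what", "shit","go"))  # one TRUE/FALSE per element
# some other functions
library(stringr)                       # only needed for str_trim
nchar("what life will be like")        # number of characters
substr("what life will be like",1,7)   # characters 1 to 7
paste("what","life")                   # join with a space
paste0("what", "life")                 # join with no separator
str_trim("what    ")                   # strip leading and trailing whitespace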

4-2 Regular expressions

Regular expressions are combinations of literals and metacharacters. ^i think matches lines that start with "i think"; morning$ matches lines that end with "morning". [^?.]$ matches lines whose last character is neither "." nor "?". "." matches any single character. "|" means or, and "?" marks the preceding expression as optional.

They are usually used with the functions grep, grepl, sub, and gsub.
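Each of the metacharacters above can be tried out with grep and grepl. The test strings below are made up for illustration:

```r
txt <- c("i think so", "well i think", "good morning", "morning all",
         "is it done?", "it is done.", "no punctuation")

starts  <- grep("^i think", txt, value = TRUE)      # starts with "i think"
ends    <- grep("morning$", txt, value = TRUE)      # ends with "morning"
no_punc <- grep("[^?.]$", txt, value = TRUE)        # last character is not "?" or "."
either  <- grep("morning|done", txt, value = TRUE)  # contains "morning" or "done"

# "." stands for any single character
any_chr <- grepl("9.11", c("9-11", "9/11", "911"))

# "?" makes the preceding "u" optional
opt <- grepl("^colou?r$", c("color", "colour", "colr"))
```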

4-3 Dealing with time

d1 <- date(); d2 <- Sys.Date()
class(d1); class(d2)

## [1] "character"

## [1] "Date"

Format codes for format(): %d day of the month as a number, %a abbreviated weekday, %m month as a number, %b abbreviated month name, %y two-digit year, %Y four-digit year.
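These codes work in both directions: format() turns a Date into a string, and as.Date() uses the same codes to parse a string. A sketch with a made-up date, sticking to the numeric codes because %a and %b depend on the locale:

```r
d <- as.Date("2014-01-12")

# Date -> string
slashed  <- format(d, "%d/%m/%Y")   # "12/01/2014"
short_yr <- format(d, "%d.%m.%y")   # "12.01.14"

# string -> Date, using the same codes to describe the input
parsed <- as.Date("12/01/2014", format = "%d/%m/%Y")

# %a and %b return names, so their output depends on the locale
format(d, "%a %b %d")
```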

We can do basic arithmetic on dates

d2+4
weekdays(d2)
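Because Sys.Date() changes every day, a fixed date makes the arithmetic easier to follow. A small sketch with made-up dates (later and gap are illustrative names):

```r
d <- as.Date("2014-01-12")

later <- d + 4                     # adding a number adds that many days
gap <- as.Date("2014-02-01") - d   # subtracting two dates gives a difftime

later             # "2014-01-16"
as.numeric(gap)   # 20 (days)
later > d         # dates can be compared directly
```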

Using the lubridate package

library(lubridate)
d <- Sys.Date()
ymd(d)
d + 4
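The example above passes an existing Date to ymd(); the lubridate parsers are more often given strings, with the function name spelling out the order of year, month and day. A sketch with made-up dates, assuming lubridate is installed:

```r
library(lubridate)

# The function name gives the order of the components in the string
d1 <- ymd("20140108")     # year, month, day
d2 <- mdy("08/04/2013")   # month, day, year
d3 <- dmy("03-04-2013")   # day, month, year

class(d1)   # "Date"
wday(d1)    # day of the week as a number
d1 + 4      # ordinary date arithmetic still works
```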

Questions

data.table: how to use this data type and the data.table package.