if(!file.exists("data")){dir.create("data")}
download.file(url, destfile = "./data/filename", method = "curl")
read.table() and read.csv() read rectangular data. header, sep and similar arguments are important parameters.
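As a small sketch of how header and sep change the result, the block below writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents and the names tmp, people_csv, people_tbl and no_header are made up for illustration.

```r
# Write a small semicolon-separated file to a temporary path (illustrative data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;score", "Li;89", "Gong;90", "Ma;68"), tmp)

# read.csv defaults to sep = "," and header = TRUE, so sep is overridden here.
people_csv <- read.csv(tmp, sep = ";")

# read.table defaults to header = FALSE and whitespace separators,
# so both parameters are set explicitly.
people_tbl <- read.table(tmp, header = TRUE, sep = ";")

# Without header = TRUE the first line is read as data, giving 4 rows.
no_header <- read.table(tmp, sep = ";")

unlink(tmp)
```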
This week covers the packages used for reading data from MySQL, HDF5, websites, XML and APIs.
set.seed(1)
X <- data.frame("var1" = sample(1:5), "var2" = sample(6:10, replace = T), "var3" = rnorm(5))
sort(X$var2, decreasing = T, na.last = T)
## [1] 10 10 9 9 6
X[order(X$var2, X$var1),]
## var1 var2 var3
## 5 1 6 -0.3053884
## 4 3 9 0.5757814
## 3 4 9 0.7383247
## 1 2 10 -0.8204684
## 2 5 10 0.4874291
## or we can use the plyr package
library(plyr)
arrange(X, desc(var1))
## var1 var2 var3
## 1 5 10 0.4874291
## 2 4 9 0.7383247
## 3 3 9 0.5757814
## 4 2 10 -0.8204684
## 5 1 6 -0.3053884
There are several functions for summarizing the data, including table, str and summary.
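A minimal sketch of these summary functions on the built-in mtcars data set. The choice of data set and the name cyl_counts are assumptions made here, not part of the notes.

```r
# mtcars ships with R, so this runs without any download.
str(mtcars)           # compact view of column types and first values
summary(mtcars$mpg)   # min, quartiles, mean and max of one column

# Counts per distinct value; useNA = "ifany" adds an NA count if any are present.
cyl_counts <- table(mtcars$cyl, useNA = "ifany")
cyl_counts

quantile(mtcars$mpg, probs = c(0.5, 0.9))   # selected quantiles
```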
xtabs(Freq ~ Gender + Admit, data = DF)
creates a cross-tabulation: an object of class xtabs, which also inherits from class table.
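DF is not defined in these notes. A minimal sketch, assuming it is the built-in UCBAdmissions table converted to a data frame (which has the Freq, Gender and Admit columns the formula needs):

```r
# Assumption: DF is UCBAdmissions as a data frame (columns Admit, Gender, Dept, Freq).
DF <- as.data.frame(UCBAdmissions)

# Sum Freq for every Gender x Admit combination, collapsing over Dept.
xt <- xtabs(Freq ~ Gender + Admit, data = DF)
xt
class(xt)   # "xtabs" "table"
```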
x<- 1:20; cut(x, breaks = quantile(x))
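One detail worth checking in the cut call: by default the intervals are open on the left, so the smallest value falls outside the first interval. The sketch below shows this and the include.lowest argument that avoids it. The names groups_default and groups_all are made up here.

```r
x <- 1:20

# Default: intervals are (lo, hi], so the minimum value 1 gets NA.
groups_default <- cut(x, breaks = quantile(x))
sum(is.na(groups_default))   # 1

# include.lowest = TRUE closes the first interval, so every value is assigned.
groups_all <- cut(x, breaks = quantile(x), include.lowest = TRUE)
table(groups_all)            # 5 values in each of the 4 groups
```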
Reshaping data
library(reshape2)
mtcars$carname <- rownames(mtcars)
carMelt <- melt(mtcars, id = c("carname", "gear", "cyl"), measure.vars = c("mpg", "hp"))
cylData <- dcast(carMelt, cyl ~ variable)
## Aggregation function missing: defaulting to length
cylData <- dcast(carMelt, cyl ~ variable, mean)
## This is great! I finally get it!!!
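One way to check what the dcast call with mean computes is to get the same per-cylinder means with base R. This sketch swaps in aggregate, which is not part of reshape2. The name cylMeans is made up here.

```r
# Mean mpg and hp for each value of cyl, using base R only.
cylMeans <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
cylMeans
```

The result has one row per cyl value and one column per measured variable, which is the same layout as cyl ~ variable in dcast.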
## An example to understand melt and dcast
x <- data.frame("name" = c("Li", "Gong", "Ma", "Hong"), "num" = c(35, 36, 38, 9), "sex" = as.factor(c("M", "M", "F", "F")), "math" = c(89, 90, 68, 73), "english" = c(98, 89, 78, 76))
meltx <- melt(x, id = c("name", "num"), measure.vars = c("math", "english"))
namecast <- dcast(meltx, name ~ variable)
tapply(InsectSprays$count, InsectSprays$spray, sum)
## A B C D E F
## 174 184 25 59 42 200
## Following is another way to do this
spIns <- split(InsectSprays$count, InsectSprays$spray); sprCount <- sapply(spIns, sum)
## Or we can use the plyr package
sprCount <- ddply(InsectSprays, .(spray), summarize, sum = sum(count))
This is the most convenient package!!!
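A quick base R check that the tapply and split/sapply routes give the same totals, with aggregate as a base R way to get a data frame similar in shape to the ddply result. aggregate is swapped in here so the sketch runs without plyr. The names by_tapply, by_split and by_aggregate are made up.

```r
by_tapply <- tapply(InsectSprays$count, InsectSprays$spray, sum)
by_split  <- sapply(split(InsectSprays$count, InsectSprays$spray), sum)

# Both are named by spray level and hold the same totals.
all(by_tapply == by_split)

# aggregate returns a data frame with one row per spray.
by_aggregate <- aggregate(count ~ spray, data = InsectSprays, FUN = sum)
by_aggregate
```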
## The dplyr package
## select, filter, arrange, rename, mutate, summarise
## The projects for this class give a very good explanation of how to use it.
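A minimal sketch of the six verbs on mtcars, assuming dplyr is installed. The calls are written with the dplyr:: prefix because plyr, loaded earlier in these notes, exports functions with the same names (arrange, rename, mutate, summarise). The data set, the 0.425 conversion factor column and the names cars and by_cyl are chosen here for illustration.

```r
library(dplyr)

cars <- mtcars
cars$carname <- rownames(mtcars)

# select: keep only some columns
cars <- dplyr::select(cars, carname, mpg, cyl, hp)
# filter: keep rows that satisfy a condition
cars <- dplyr::filter(cars, hp > 100)
# rename: new_name = old_name
cars <- dplyr::rename(cars, horsepower = hp)
# mutate: add a derived column (1 mpg is about 0.425 km per litre)
cars <- dplyr::mutate(cars, kpl = mpg * 0.425)
# arrange: sort rows, here by descending mpg
cars <- dplyr::arrange(cars, dplyr::desc(mpg))

# summarise: collapse each cyl group to one row
by_cyl <- dplyr::summarise(dplyr::group_by(cars, cyl), mean_mpg = mean(mpg))
by_cyl
```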
tolower(names(cameraData))
converts the variable names to lower case.
splitNames = strsplit(names(cameraData),"\\.") # split the variable names at each "."
firstElement <- function(x){x[1]}
sapply(splitNames,firstElement)
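cameraData is not available in these notes, so the sketch below runs the same steps on a hypothetical vector of column names (colNames) to show what comes out.

```r
# Hypothetical column names standing in for names(cameraData).
colNames <- c("address", "Location.1", "crossStreet")

# "\\." is a literal dot; a bare "." would match any character.
splitNames <- strsplit(tolower(colNames), "\\.")
firstElement <- function(x) { x[1] }

# Keep only the part before the first dot in every name.
cleanNames <- sapply(splitNames, firstElement)
cleanNames   # "address" "location" "crossstreet"
```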
sub("_","","this_is_a_test_name ")
substitutes the first _ with nothing (gsub replaces all of them).
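The difference between sub and gsub is easiest to see side by side. The variable names below are made up for this sketch.

```r
testName <- "this_is_a_test_name"

first_only <- sub("_", "", testName)    # only the first "_" is removed
all_gone   <- gsub("_", "", testName)   # every "_" is removed

first_only   # "thisis_a_test_name"
all_gone     # "thisisatestname"
```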
# search "a" in some expressions
grep("a", c("am", "this", "what", "shit","go"))
grepl("a", c("am", "this", "what", "shit","go"))
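A short sketch of what the two functions return and how the results are typically used. The vector words is a made-up example.

```r
words <- c("am", "this", "what", "go")

grep("a", words)                 # positions of matches: 1 3
grep("a", words, value = TRUE)   # the matching elements: "am" "what"
grepl("a", words)                # TRUE FALSE TRUE FALSE

# A logical vector from grepl can be used directly to subset or count.
words[!grepl("a", words)]        # "this" "go"
sum(grepl("a", words))           # 2
```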
# some other functions
library(stringr)
nchar("what life will be like")
substr("what life will be like",1,7)
paste("what","life")
paste0("what", "life")
str_trim("what ")
Regular expressions are combinations of literals and metacharacters. ^i think matches lines that start with "i think", and morning$ matches lines that end with "morning". [^?.]$ matches lines whose last character is neither "." nor "?". "." stands for any single character, "|" means or, and "?" marks the preceding item as optional.
These are usually used with the functions grep, grepl, sub and gsub.
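The patterns above can be tried out with grepl. The test strings in txt and the patterns morning|going and colou?r are made up for this sketch.

```r
txt <- c("i think we all rule", "good morning", "is it over?", "done.", "still going")

starts_think <- grepl("^i think", txt)        # TRUE only for the first string
ends_morning <- grepl("morning$", txt)        # TRUE only for the second string
no_end_punct <- grepl("[^?.]$", txt)          # TRUE for strings 1, 2 and 5
either_word  <- grepl("morning|going", txt)   # TRUE for strings 2 and 5

# "?" makes the preceding "u" optional.
optional_u <- grepl("colou?r", c("color", "colour", "colr"))   # TRUE TRUE FALSE
```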
d1 = date(); d2= Sys.Date()
class(d1);class(d2)
## [1] "character"
## [1] "Date"
%d day as number, %a abbreviated weekday, %m month number, %b abbreviated month, %y two-digit year, %Y four-digit year.
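A minimal sketch of these format codes with format and as.Date. The date used is arbitrary, and the name parsed is made up here.

```r
d <- as.Date("2014-01-12")

format(d, "%d/%m/%y")   # "12/01/14"
format(d, "%Y")         # "2014"
format(d, "%a %b %d")   # weekday and month names depend on the locale

# The same codes describe the layout when parsing text into a Date.
parsed <- as.Date("12/01/2014", format = "%d/%m/%Y")
parsed == d             # TRUE
```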
We can do basic arithmetic on dates.
d2+4
weekdays(d2)
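Because Sys.Date() changes every day, a sketch with fixed dates makes the arithmetic easier to follow. The dates and the names first_day and last_day are made up for illustration.

```r
first_day <- as.Date("2014-01-01")
last_day  <- as.Date("2014-01-12")

last_day - first_day               # Time difference of 11 days
as.numeric(last_day - first_day)   # 11, as a plain number
first_day + 4                      # "2014-01-05"

# Dates can also be stepped through in regular intervals.
weekly <- seq(first_day, by = "week", length.out = 3)
weekly                             # "2014-01-01" "2014-01-08" "2014-01-15"
```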
Using the lubridate package:
library(lubridate)
d= Sys.Date()
ymd(d)
d+4
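A small sketch of the lubridate parsers, assuming the package is installed. The function name tells lubridate the order of year, month and day in the input. The example strings are made up.

```r
library(lubridate)

ymd("20140108")      # "2014-01-08"
mdy("08/04/2013")    # "2013-08-04"
dmy("03-04-2013")    # "2013-04-03"

d <- ymd("2014-01-08")
d + days(4)          # "2014-01-12"

# Accessors pull out single components of a date.
year(d); month(d); day(d)   # 2014, 1, 8
```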
data.table: how to use this data type, which comes from the data.table package.
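A minimal sketch of basic data.table usage, assuming the package is installed. The table DT and its columns g and v are made up for illustration.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# The first argument filters rows with an expression on the columns.
big <- DT[v > 2]

# The second argument is evaluated inside the table.
overall_mean <- DT[, mean(v)]   # 3

# by = gives one result per group: a = 1.5, b = 4.
means <- DT[, .(mean_v = mean(v)), by = g]

# := adds a column by reference, without copying the table.
DT[, v2 := v^2]
```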