This is an R Markdown document created with examples of how to handle data using R. Based on Coursera’s Getting and Cleaning Data.
Let’s start with a sample dataset and see what it looks like:
set.seed(13435)
X =data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))
X$var2[c(1,3)] = NA
X
## var1 var2 var3
## 1 2 NA 15
## 2 3 7 12
## 3 5 NA 14
## 4 1 10 11
## 5 4 9 13
If we only want to extract a couple of elements (of a column or some rows), this is pretty self-explanatory:
X[,1]
## [1] 2 3 5 1 4
X[,"var1"]
## [1] 2 3 5 1 4
X[1:2,"var2"]
## [1] NA 7
Now imagine we only want the rows of the matrix for which the first column is smaller or equal to 3 and the third column greater than 11:
X[(X$var1<=3 & X$var3>11),]
## var1 var2 var3
## 1 2 NA 15
## 2 3 7 12
As expected, if instead of & we use the symbol | we get “or” instead of “and”. Now the NA values might get in the way if we only want the rows for which the second column is greater than 8:
X[X$var2>8, ]
## var1 var2 var3
## NA NA NA NA
## NA.1 NA NA NA
## 4 1 10 11
## 5 4 9 13
There’s an easy fix:
X[which(X$var2>8), ]
## var1 var2 var3
## 4 1 10 11
## 5 4 9 13
or we could use complete.cases, though it looks way more complicated:
Y=X[X$var2>8, ]
aux=complete.cases(Y)
Y[aux, ]
## var1 var2 var3
## 4 1 10 11
## 5 4 9 13
Other stuff we may want to do is sort out a vector:
sort(X$var1)
## [1] 1 2 3 4 5
sort(X$var1,decreasing=TRUE)
## [1] 5 4 3 2 1
sort(X$var2,na.last=TRUE)
## [1] 7 9 10 NA NA
or order a dataframe by sorting one of its columns:
X[order(X$var1),]
## var1 var2 var3
## 4 1 10 11
## 1 2 NA 15
## 2 3 7 12
## 5 4 9 13
## 3 5 NA 14
But now imagine there are repeated numbers on the first column, which we want to use to sort the dataframe, i.e., we have a situation like:
Y=X
Y$var1[2]=2
Y
## var1 var2 var3
## 1 2 NA 15
## 2 2 7 12
## 3 5 NA 14
## 4 1 10 11
## 5 4 9 13
Note that there are two occurences of the number 2 showing up in the first column. We can order it again:
Y[order(Y$var1),]
## var1 var2 var3
## 4 1 10 11
## 1 2 NA 15
## 2 2 7 12
## 5 4 9 13
## 3 5 NA 14
but now suppose we want to order it first by its first column and then, if there are repeated elements in that first column, we want to use the third column to sort the dataframe:
Y[order(Y$var1, Y$var3),]
## var1 var2 var3
## 4 1 10 11
## 2 2 7 12
## 1 2 NA 15
## 5 4 9 13
## 3 5 NA 14
Other stuff to take into account:
When we have data, it is important that we are able to look at it, see what it contains. Take for example data about restaurants taken from a website and stored as the variable restData:
if(!file.exists("./data")){dir.create("./data")}
fileUrl = "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/restaurants.csv",method="curl")
## Warning: download had nonzero exit status
restData = read.csv("./data/restaurants.csv")
We may want to look at the first elements of the data we have downloaded or maybe just at the last 3 elements:
head(restData)
## name zipCode neighborhood councilDistrict
## 1 410 21206 Frankford 2
## 2 1919 21231 Fells Point 1
## 3 SAUTE 21224 Canton 1
## 4 #1 CHINESE KITCHEN 21211 Hampden 14
## 5 #1 chinese restaurant 21223 Millhill 9
## 6 19TH HOLE 21218 Clifton Park 14
## policeDistrict Location.1
## 1 NORTHEASTERN 4509 BELAIR ROAD\nBaltimore, MD\n
## 2 SOUTHEASTERN 1919 FLEET ST\nBaltimore, MD\n
## 3 SOUTHEASTERN 2844 HUDSON ST\nBaltimore, MD\n
## 4 NORTHERN 3998 ROLAND AVE\nBaltimore, MD\n
## 5 SOUTHWESTERN 2481 frederick ave\nBaltimore, MD\n
## 6 NORTHEASTERN 2722 HARFORD RD\nBaltimore, MD\n
tail(restData, n=3)
## name zipCode neighborhood councilDistrict policeDistrict
## 1325 ZINK'S CAFÂ 21213 Belair-Edison 13 NORTHEASTERN
## 1326 ZISSIMOS BAR 21211 Hampden 7 NORTHERN
## 1327 ZORBAS 21224 Greektown 2 SOUTHEASTERN
## Location.1
## 1325 3300 LAWNVIEW AVE\nBaltimore, MD\n
## 1326 1023 36TH ST\nBaltimore, MD\n
## 1327 4710 EASTERN Ave\nBaltimore, MD\n
We can also get a summary on the data, which gives us the frequency of text-based or factor variables and min, max, average and some quantiles of quantitative variables:
summary(restData)
## name zipCode neighborhood
## MCDONALD'S : 8 Min. :-21226 Downtown :128
## POPEYES FAMOUS FRIED CHICKEN: 7 1st Qu.: 21202 Fells Point : 91
## SUBWAY : 6 Median : 21218 Inner Harbor: 89
## KENTUCKY FRIED CHICKEN : 5 Mean : 21185 Canton : 81
## BURGER KING : 4 3rd Qu.: 21226 Federal Hill: 42
## DUNKIN DONUTS : 4 Max. : 21287 Mount Vernon: 33
## (Other) :1293 (Other) :863
## councilDistrict policeDistrict
## Min. : 1.00 SOUTHEASTERN:385
## 1st Qu.: 2.00 CENTRAL :288
## Median : 9.00 SOUTHERN :213
## Mean : 7.19 NORTHERN :157
## 3rd Qu.:11.00 NORTHEASTERN: 72
## Max. :14.00 EASTERN : 67
## (Other) :145
## Location.1
## 1101 RUSSELL ST\nBaltimore, MD\n: 9
## 201 PRATT ST\nBaltimore, MD\n : 8
## 2400 BOSTON ST\nBaltimore, MD\n : 8
## 300 LIGHT ST\nBaltimore, MD\n : 5
## 300 CHARLES ST\nBaltimore, MD\n : 4
## 301 LIGHT ST\nBaltimore, MD\n : 4
## (Other) :1289
Another possibility is to use srt, which gives you the class of each variable and some extra information:
str(restData)
## 'data.frame': 1327 obs. of 6 variables:
## $ name : Factor w/ 1277 levels "#1 CHINESE KITCHEN",..: 9 3 992 1 2 4 5 6 7 8 ...
## $ zipCode : int 21206 21231 21224 21211 21223 21218 21205 21211 21205 21231 ...
## $ neighborhood : Factor w/ 173 levels "Abell","Arlington",..: 53 52 18 66 104 33 98 133 98 157 ...
## $ councilDistrict: int 2 1 1 14 9 14 13 7 13 1 ...
## $ policeDistrict : Factor w/ 9 levels "CENTRAL","EASTERN",..: 3 6 6 4 8 3 6 4 6 6 ...
## $ Location.1 : Factor w/ 1210 levels "1 BIDDLE ST\nBaltimore, MD\n",..: 835 334 554 755 492 537 505 530 507 569 ...
We can also get quantiles of a column of the data:
quantile(restData$councilDistrict, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 1 2 9 11 14
You can make tables of a specific variable to check out how many observations are within each category:
table(restData$zipCode, useNA="ifany") # useNA="ifany" -> if there are missing values, there will be a column, in the table, counting them
##
## -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211
## 1 136 201 27 30 4 1 8 23 41
## 21212 21213 21214 21215 21216 21217 21218 21220 21222 21223
## 28 31 17 54 10 32 69 1 7 56
## 21224 21225 21226 21227 21229 21230 21231 21234 21237 21239
## 199 19 18 4 13 156 127 7 1 3
## 21251 21287
## 2 1
The table can also check how many observations are within each pair of categories of two variables, e.g.:
table(restData$councilDistrict, restData$zipCode)
##
## -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212
## 1 0 0 37 0 0 0 0 0 0 0 0
## 2 0 0 0 3 27 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 27
## 5 0 0 0 0 0 3 0 6 0 0 0
## 6 0 0 0 0 0 0 0 1 19 0 0
## 7 0 0 0 0 0 0 0 1 0 27 0
## 8 0 0 0 0 0 1 0 0 0 0 0
## 9 0 1 0 0 0 0 0 0 0 0 0
## 10 1 0 1 0 0 0 0 0 0 0 0
## 11 0 115 139 0 0 0 1 0 0 0 1
## 12 0 20 24 4 0 0 0 0 0 0 0
## 13 0 0 0 20 3 0 0 0 0 0 0
## 14 0 0 0 0 0 0 0 0 4 14 0
##
## 21213 21214 21215 21216 21217 21218 21220 21222 21223 21224 21225
## 1 2 0 0 0 0 0 0 7 0 140 1
## 2 0 0 0 0 0 0 0 0 0 54 0
## 3 2 17 0 0 0 3 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 31 0 0 0 0 0 0 0 0
## 6 0 0 15 1 0 0 0 0 0 0 0
## 7 0 0 6 7 15 6 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 2 0 0
## 9 0 0 0 2 8 0 0 0 53 0 0
## 10 0 0 0 0 0 0 1 0 0 0 18
## 11 0 0 0 0 9 0 0 0 1 0 0
## 12 13 0 0 0 0 26 0 0 0 0 0
## 13 13 0 1 0 0 0 0 0 0 5 0
## 14 1 0 1 0 0 34 0 0 0 0 0
##
## 21226 21227 21229 21230 21231 21234 21237 21239 21251 21287
## 1 0 0 0 1 124 0 0 0 0 0
## 2 0 0 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 7 0 0 2 0
## 4 0 0 0 0 0 0 0 3 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 2 13 0 0 0 0 0 0 0
## 9 0 0 0 11 0 0 0 0 0 0
## 10 18 0 0 133 0 0 0 0 0 0
## 11 0 0 0 11 0 0 0 0 0 0
## 12 0 0 0 0 2 0 0 0 0 0
## 13 0 1 0 0 1 0 0 0 0 1
## 14 0 0 0 0 0 0 0 0 0 0
We may also want to do stuff like:
sum(is.na(restData$zipCode))
## [1] 0
or for the whole dataframe:
colSums(is.na(restData))
## name zipCode neighborhood councilDistrict
## 0 0 0 0
## policeDistrict Location.1
## 0 0
all(colSums(is.na(restData))==0)
## [1] TRUE
any(restData$zipCode<0)
## [1] TRUE
all(restData$zipCode>0)
## [1] FALSE
table(restData$zipCode %in% c("21212","21213"))
##
## FALSE TRUE
## 1268 59
sum(restData$zipCode %in% c("21212","21213"))
## [1] 59
Now consider another data frame, containing information about how many students (Freq) of each gender were admitted and rejected by each deparment of a university,:
data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
head(DF)
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
Imagine we want to know, in total, how many male students were admitted, how many were rejected, and how many female students were admitted and how many were rejected. We can do stuff like:
sum(DF[DF$Admit=="Admitted" & DF$Gender=="Male", ]$Freq)
## [1] 1198
to know how many male students were admitted in total, but that’s a bit complicated. It’s simpler to do something like:
xtabs(Freq ~ Admit+Gender, data=DF)
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
Finally, let’s look at another dataset, warpbreaks, to which we will add a new column:
head(warpbreaks)
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
warpbreaks$replicate=rep(1:9, len = 54)
head(warpbreaks)
## breaks wool tension replicate
## 1 26 A L 1
## 2 30 A L 2
## 3 54 A L 3
## 4 25 A L 4
## 5 70 A L 5
## 6 52 A L 6
If we try xtab here of breaks as a function of all other variables, it will be all over the place:
xt=xtabs(breaks ~.,data=warpbreaks)
xt
## , , replicate = 1
##
## tension
## wool L M H
## A 26 18 36
## B 27 42 20
##
## , , replicate = 2
##
## tension
## wool L M H
## A 30 21 21
## B 14 26 21
##
## , , replicate = 3
##
## tension
## wool L M H
## A 54 29 24
## B 29 19 24
##
## , , replicate = 4
##
## tension
## wool L M H
## A 25 17 18
## B 19 16 17
##
## , , replicate = 5
##
## tension
## wool L M H
## A 70 12 10
## B 29 39 13
##
## , , replicate = 6
##
## tension
## wool L M H
## A 52 18 43
## B 31 28 15
##
## , , replicate = 7
##
## tension
## wool L M H
## A 51 35 28
## B 41 21 15
##
## , , replicate = 8
##
## tension
## wool L M H
## A 26 30 15
## B 20 39 16
##
## , , replicate = 9
##
## tension
## wool L M H
## A 67 36 26
## B 44 29 28
We can put it all together by using ftable:
ftable(xt)
## replicate 1 2 3 4 5 6 7 8 9
## wool tension
## A L 26 30 54 25 70 52 51 26 67
## M 18 21 29 17 12 18 35 30 36
## H 36 21 24 18 10 43 28 15 26
## B L 27 14 29 19 29 31 41 20 44
## M 42 26 19 16 39 28 21 39 29
## H 20 21 24 17 13 15 15 16 28
Often you need to create new variables for your data, be it an index, a qualitative version of a quantitative variable or something else. Examples:
s1 <- seq(1,10,by=2) ; s1
## [1] 1 3 5 7 9
s2 <- seq(1,10,length=3); s2
## [1] 1.0 5.5 10.0
x <- c(1,3,8,25,100); seq(along = x)
## [1] 1 2 3 4 5
restData$nearMe = restData$neighborhood %in% c("Roland Park", "Homeland") # true when the neighbourhood is either Roland Park or Homeland
head(restData$nearMe)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
restData$zipWrong = ifelse(restData$zipCode < 0, TRUE, FALSE) # True when the zipcode is negative
head(restData$zipWrong)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
restData$zipGroups = cut2(restData$zipCode,g=4)
head(restData$zipGroups)
## [1] [ 21205,21220) [ 21227,21287] [ 21220,21227) [ 21205,21220)
## [5] [ 21220,21227) [ 21205,21220)
## Levels: [-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]
table(restData$zipGroups)
##
## [-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]
## 338 375 300 314
restData$zcf = factor(restData$zipCode)
Sometimes we want to reshape our data. Let’s load the library reshape2 and take for example the dataframe mtcars (with 32 rows):
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.0.3
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Suppose we want to transform so that it looks like this (basically we want to display just some variables and we want the variables “mpg” and “hp” to now be in the same column; so except for this new column, for the other variables the first two 32 rows will be copy pasted to the final two 32 rows):
## carname gear cyl variable value
## 1 Mazda RX4 4 6 mpg 21.0
## 2 Mazda RX4 Wag 4 6 mpg 21.0
## 3 Datsun 710 4 4 mpg 22.8
## 4 Hornet 4 Drive 3 6 mpg 21.4
## 5 Hornet Sportabout 3 8 mpg 18.7
## 6 Valiant 3 6 mpg 18.1
## 7 Duster 360 3 8 mpg 14.3
## 8 Merc 240D 4 4 mpg 24.4
## 9 Merc 230 4 4 mpg 22.8
## 10 Merc 280 4 6 mpg 19.2
## 11 Merc 280C 4 6 mpg 17.8
## 12 Merc 450SE 3 8 mpg 16.4
## 13 Merc 450SL 3 8 mpg 17.3
## 14 Merc 450SLC 3 8 mpg 15.2
## 15 Cadillac Fleetwood 3 8 mpg 10.4
## 16 Lincoln Continental 3 8 mpg 10.4
## 17 Chrysler Imperial 3 8 mpg 14.7
## 18 Fiat 128 4 4 mpg 32.4
## 19 Honda Civic 4 4 mpg 30.4
## 20 Toyota Corolla 4 4 mpg 33.9
## 21 Toyota Corona 3 4 mpg 21.5
## 22 Dodge Challenger 3 8 mpg 15.5
## 23 AMC Javelin 3 8 mpg 15.2
## 24 Camaro Z28 3 8 mpg 13.3
## 25 Pontiac Firebird 3 8 mpg 19.2
## 26 Fiat X1-9 4 4 mpg 27.3
## 27 Porsche 914-2 5 4 mpg 26.0
## 28 Lotus Europa 5 4 mpg 30.4
## 29 Ford Pantera L 5 8 mpg 15.8
## 30 Ferrari Dino 5 6 mpg 19.7
## 31 Maserati Bora 5 8 mpg 15.0
## 32 Volvo 142E 4 4 mpg 21.4
## 33 Mazda RX4 4 6 hp 110.0
## 34 Mazda RX4 Wag 4 6 hp 110.0
## 35 Datsun 710 4 4 hp 93.0
## 36 Hornet 4 Drive 3 6 hp 110.0
## 37 Hornet Sportabout 3 8 hp 175.0
## 38 Valiant 3 6 hp 105.0
## 39 Duster 360 3 8 hp 245.0
## 40 Merc 240D 4 4 hp 62.0
## 41 Merc 230 4 4 hp 95.0
## 42 Merc 280 4 6 hp 123.0
## 43 Merc 280C 4 6 hp 123.0
## 44 Merc 450SE 3 8 hp 180.0
## 45 Merc 450SL 3 8 hp 180.0
## 46 Merc 450SLC 3 8 hp 180.0
## 47 Cadillac Fleetwood 3 8 hp 205.0
## 48 Lincoln Continental 3 8 hp 215.0
## 49 Chrysler Imperial 3 8 hp 230.0
## 50 Fiat 128 4 4 hp 66.0
## 51 Honda Civic 4 4 hp 52.0
## 52 Toyota Corolla 4 4 hp 65.0
## 53 Toyota Corona 3 4 hp 97.0
## 54 Dodge Challenger 3 8 hp 150.0
## 55 AMC Javelin 3 8 hp 150.0
## 56 Camaro Z28 3 8 hp 245.0
## 57 Pontiac Firebird 3 8 hp 175.0
## 58 Fiat X1-9 4 4 hp 66.0
## 59 Porsche 914-2 5 4 hp 91.0
## 60 Lotus Europa 5 4 hp 113.0
## 61 Ford Pantera L 5 8 hp 264.0
## 62 Ferrari Dino 5 6 hp 175.0
## 63 Maserati Bora 5 8 hp 335.0
## 64 Volvo 142E 4 4 hp 109.0
Note that we left three of the original variables in the new dataframe (carname, gear and cyl). We may now use this new dataframe to check out stuff like what is the mean value of miles per gallon (mpg) or horsepower (hp) for each number-of-gears-in-a-car?
gearData = dcast(carMelt, gear ~ variable,mean)
gearData
## gear mpg hp
## 1 3 16.11 176.1
## 2 4 24.53 89.5
## 3 5 21.38 195.6
Or simply, how many different values of mpg or hp do we have for each number-of-cylinders-in-a-car?
cylData = dcast(carMelt, cyl ~ variable)
## Aggregation function missing: defaulting to length
cylData
## cyl mpg hp
## 1 4 11 11
## 2 6 7 7
## 3 8 14 14
Personally, i find this all too complicate. I’d just rather use the aggregate function:
aggregate(cbind(mpg, hp)~gear, data=mtcars, mean)
## gear mpg hp
## 1 3 16.11 176.1
## 2 4 24.53 89.5
## 3 5 21.38 195.6
I like this function :). Let’s look at another example of how to use it. Say we take the following data, with the counts of insects in agricultural experimental units treated with different insecticides:
head(InsectSprays, n=15)
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
Want to know, in total, how many insects were treated with each spray? That’s easy:
aggregate(count~spray, data=InsectSprays, sum)
## spray count
## 1 A 174
## 2 B 184
## 3 C 25
## 4 D 59
## 5 E 42
## 6 F 200
Now another useful function is split. We might want to split this database by the different sprays. We get a list:
spIns = split(InsectSprays$count,InsectSprays$spray)
spIns
## $A
## [1] 10 7 20 14 14 12 10 23 17 20 14 13
##
## $B
## [1] 11 17 21 11 16 14 17 17 19 21 7 13
##
## $C
## [1] 0 1 7 2 3 1 2 1 3 0 1 4
##
## $D
## [1] 3 5 12 6 4 3 5 5 5 5 2 4
##
## $E
## [1] 3 5 3 5 3 6 1 1 3 2 6 4
##
## $F
## [1] 11 9 15 22 15 16 13 10 26 26 24 13
If we want to count the number of elements per spray, we have to use lapply. Other interesting functions: * unlist * ddply (plyr package) * acast * arrange * mutate
Finally, we might want to go from a short form to a long form (notably for repeated measures), i.e., do something like going from:
example_short=data.frame( measure.1=c(34, 31, 35, 38), measure.2=c(45, 47, 43, 45))
example_short
## measure.1 measure.2
## 1 34 45
## 2 31 47
## 3 35 43
## 4 38 45
to
example_long=reshape(example_short, direction="long", varying=c(1, 2), idvar='id', timevar="TIME", v.names="RESULTS", times =c("measure 1", "measure 2"))
row.names(example_long) <- NULL
example_long
## TIME RESULTS id
## 1 measure 1 34 1
## 2 measure 1 31 2
## 3 measure 1 35 3
## 4 measure 1 38 4
## 5 measure 2 45 1
## 6 measure 2 47 2
## 7 measure 2 43 3
## 8 measure 2 45 4
Sometimes you load more than one dataset into R and you want to merge these two data sets, typically based on an id (basically you can think of this has having linked tables and transforming them into one unique table).
reviews = read.csv("C:/Users/Miguel/Documents/data/reviews.csv")
solutions = read.csv("C:/Users/Miguel/Documents/data/solutions.csv")
head(reviews)
## id solution_id reviewer_id start stop time_left accept
## 1 1 3 27 1304095698 1304095758 1754 1
## 2 2 4 22 1304095188 1304095206 2306 1
## 3 3 5 28 1304095276 1304095320 2192 1
## 4 4 1 26 1304095267 1304095423 2089 1
## 5 5 10 29 1304095456 1304095469 2043 1
## 6 6 2 29 1304095471 1304095513 1999 1
head(solutions)
## id problem_id subject_id start stop time_left answer
## 1 1 156 29 1304095119 1304095169 2343 B
## 2 2 269 25 1304095119 1304095183 2329 C
## 3 3 34 22 1304095127 1304095146 2366 C
## 4 4 19 23 1304095127 1304095150 2362 D
## 5 5 605 26 1304095127 1304095167 2345 A
## 6 6 384 27 1304095131 1304095270 2242 C
In the “reviews” data set, the variable “solution_id” corresponds to the id (first column) that may be seen in the “solutions” data set. So to merge the two data sets:
mergedData = merge(reviews,solutions,by.x="solution_id",by.y="id",all=TRUE)
head(mergedData)
## solution_id id reviewer_id start.x stop.x time_left.x accept
## 1 1 4 26 1304095267 1304095423 2089 1
## 2 2 6 29 1304095471 1304095513 1999 1
## 3 3 1 27 1304095698 1304095758 1754 1
## 4 4 2 22 1304095188 1304095206 2306 1
## 5 5 3 28 1304095276 1304095320 2192 1
## 6 6 16 22 1304095303 1304095471 2041 1
## problem_id subject_id start.y stop.y time_left.y answer
## 1 156 29 1304095119 1304095169 2343 B
## 2 269 25 1304095119 1304095183 2329 C
## 3 34 22 1304095127 1304095146 2366 C
## 4 19 23 1304095127 1304095150 2362 D
## 5 605 26 1304095127 1304095167 2345 A
## 6 384 27 1304095131 1304095270 2242 C
If you have multiple data frames (such that, except for the common id, two dataframes do NOT have commonly named variables… if data frame 1 has a variable called XXX, no other data frame has a variable called XXX), it is easier to use the plyr package:
library(plyr)
df1 = data.frame(id=sample(1:10),x=rnorm(10))
df2 = data.frame(id=sample(1:10),y=rnorm(10))
df3 = data.frame(id=sample(1:10),z=rnorm(10))
dfList = list(df1,df2,df3)
join_all(dfList)
## id x y z
## 1 1 2.54376 -0.43383 -0.1263
## 2 9 1.55453 -1.34955 0.4922
## 3 5 -0.61923 -0.38265 0.6705
## 4 6 -0.92610 -0.55750 -1.5793
## 5 2 -0.66550 -0.36522 -0.5386
## 6 3 -0.02167 -1.57748 0.1347
## 7 8 -0.17412 0.06178 -1.1507
## 8 4 0.23900 -1.53784 0.1094
## 9 10 -1.83246 1.63550 -0.6820
## 10 7 -0.03719 -0.36594 0.5745