Handling data

This is an R Markdown document created with examples of how to handle data using R. Based on Coursera’s Getting and Cleaning Data.

Subsetting and sorting

Let’s start with a sample dataset and see what it looks like:

set.seed(13435)
X =data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))
X$var2[c(1,3)] = NA
X

##   var1 var2 var3
## 1    2   NA   15
## 2    3    7   12
## 3    5   NA   14
## 4    1   10   11
## 5    4    9   13

If we only want to extract a couple of elements (of a column or some rows), this is pretty self-explanatory:

X[,1]

## [1] 2 3 5 1 4

X[,"var1"]

## [1] 2 3 5 1 4

X[1:2,"var2"]

## [1] NA  7

Now imagine we only want the rows of the matrix for which the first column is smaller or equal to 3 and the third column greater than 11:

X[(X$var1<=3 & X$var3>11),]

##   var1 var2 var3
## 1    2   NA   15
## 2    3    7   12

As expected, if instead of & we use the symbol | we get “or” instead of “and”. Now the NA values might get in the way if we only want the rows for which the second column is greater than 8:

X[X$var2>8, ]

##      var1 var2 var3
## NA     NA   NA   NA
## NA.1   NA   NA   NA
## 4       1   10   11
## 5       4    9   13

There’s an easy fix:

X[which(X$var2>8), ]

##   var1 var2 var3
## 4    1   10   11
## 5    4    9   13

or we could use complete.cases, though it looks way more complicated:

Y=X[X$var2>8, ]
aux=complete.cases(Y)
Y[aux, ]

##   var1 var2 var3
## 4    1   10   11
## 5    4    9   13

Other stuff we may want to do is sort out a vector:

sort(X$var1)

## [1] 1 2 3 4 5

sort(X$var1,decreasing=TRUE)

## [1] 5 4 3 2 1

sort(X$var2,na.last=TRUE)

## [1]  7  9 10 NA NA

or order a dataframe by sorting one of its columns:

X[order(X$var1),]

##   var1 var2 var3
## 4    1   10   11
## 1    2   NA   15
## 2    3    7   12
## 5    4    9   13
## 3    5   NA   14

But now imagine there are repeated numbers on the first column, which we want to use to sort the dataframe, i.e., we have a situation like:

Y=X
Y$var1[2]=2
Y

##   var1 var2 var3
## 1    2   NA   15
## 2    2    7   12
## 3    5   NA   14
## 4    1   10   11
## 5    4    9   13

Note that there are two occurences of the number 2 showing up in the first column. We can order it again:

Y[order(Y$var1),]

##   var1 var2 var3
## 4    1   10   11
## 1    2   NA   15
## 2    2    7   12
## 5    4    9   13
## 3    5   NA   14

but now suppose we want to order it first by its first column and then, if there are repeated elements in that first column, we want to use the third column to sort the dataframe:

Y[order(Y$var1, Y$var3),]

##   var1 var2 var3
## 4    1   10   11
## 2    2    7   12
## 1    2   NA   15
## 5    4    9   13
## 3    5   NA   14

Other stuff to take into account:

you can use the plyr library to order (arrange(X, var1), arrange(X, desc(var1)), …)m tiy can use cbind to add columns to a dataframe.

Summarising data

When we have data, it is important that we are able to look at it, see what it contains. Take for example data about restaurants taken from a website and stored as the variable restData:

if(!file.exists("./data")){dir.create("./data")}
fileUrl = "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/restaurants.csv",method="curl")

## Warning: download had nonzero exit status

restData = read.csv("./data/restaurants.csv")

We may want to look at the first elements of the data we have downloaded or maybe just at the last 3 elements:

head(restData)

##                    name zipCode neighborhood councilDistrict
## 1                   410   21206    Frankford               2
## 2                  1919   21231  Fells Point               1
## 3                 SAUTE   21224       Canton               1
## 4    #1 CHINESE KITCHEN   21211      Hampden              14
## 5 #1 chinese restaurant   21223     Millhill               9
## 6             19TH HOLE   21218 Clifton Park              14
##   policeDistrict                          Location.1
## 1   NORTHEASTERN   4509 BELAIR ROAD\nBaltimore, MD\n
## 2   SOUTHEASTERN      1919 FLEET ST\nBaltimore, MD\n
## 3   SOUTHEASTERN     2844 HUDSON ST\nBaltimore, MD\n
## 4       NORTHERN    3998 ROLAND AVE\nBaltimore, MD\n
## 5   SOUTHWESTERN 2481 frederick ave\nBaltimore, MD\n
## 6   NORTHEASTERN    2722 HARFORD RD\nBaltimore, MD\n

tail(restData, n=3)

##              name zipCode  neighborhood councilDistrict policeDistrict
## 1325 ZINK'S CAFÂ   21213 Belair-Edison              13   NORTHEASTERN
## 1326 ZISSIMOS BAR   21211       Hampden               7       NORTHERN
## 1327       ZORBAS   21224     Greektown               2   SOUTHEASTERN
##                              Location.1
## 1325 3300 LAWNVIEW AVE\nBaltimore, MD\n
## 1326      1023 36TH ST\nBaltimore, MD\n
## 1327  4710 EASTERN Ave\nBaltimore, MD\n

We can also get a summary on the data, which gives us the frequency of text-based or factor variables and min, max, average and some quantiles of quantitative variables:

summary(restData)

##                            name         zipCode             neighborhood
##  MCDONALD'S                  :   8   Min.   :-21226   Downtown    :128  
##  POPEYES FAMOUS FRIED CHICKEN:   7   1st Qu.: 21202   Fells Point : 91  
##  SUBWAY                      :   6   Median : 21218   Inner Harbor: 89  
##  KENTUCKY FRIED CHICKEN      :   5   Mean   : 21185   Canton      : 81  
##  BURGER KING                 :   4   3rd Qu.: 21226   Federal Hill: 42  
##  DUNKIN DONUTS               :   4   Max.   : 21287   Mount Vernon: 33  
##  (Other)                     :1293                    (Other)     :863  
##  councilDistrict      policeDistrict
##  Min.   : 1.00   SOUTHEASTERN:385   
##  1st Qu.: 2.00   CENTRAL     :288   
##  Median : 9.00   SOUTHERN    :213   
##  Mean   : 7.19   NORTHERN    :157   
##  3rd Qu.:11.00   NORTHEASTERN: 72   
##  Max.   :14.00   EASTERN     : 67   
##                  (Other)     :145   
##                           Location.1    
##  1101 RUSSELL ST\nBaltimore, MD\n:   9  
##  201 PRATT ST\nBaltimore, MD\n   :   8  
##  2400 BOSTON ST\nBaltimore, MD\n :   8  
##  300 LIGHT ST\nBaltimore, MD\n   :   5  
##  300 CHARLES ST\nBaltimore, MD\n :   4  
##  301 LIGHT ST\nBaltimore, MD\n   :   4  
##  (Other)                         :1289

Another possibility is to use srt, which gives you the class of each variable and some extra information:

str(restData)

## 'data.frame':    1327 obs. of  6 variables:
##  $ name           : Factor w/ 1277 levels "#1 CHINESE KITCHEN",..: 9 3 992 1 2 4 5 6 7 8 ...
##  $ zipCode        : int  21206 21231 21224 21211 21223 21218 21205 21211 21205 21231 ...
##  $ neighborhood   : Factor w/ 173 levels "Abell","Arlington",..: 53 52 18 66 104 33 98 133 98 157 ...
##  $ councilDistrict: int  2 1 1 14 9 14 13 7 13 1 ...
##  $ policeDistrict : Factor w/ 9 levels "CENTRAL","EASTERN",..: 3 6 6 4 8 3 6 4 6 6 ...
##  $ Location.1     : Factor w/ 1210 levels "1 BIDDLE ST\nBaltimore, MD\n",..: 835 334 554 755 492 537 505 530 507 569 ...

We can also get quantiles of a column of the data:

quantile(restData$councilDistrict, na.rm=TRUE)

##   0%  25%  50%  75% 100% 
##    1    2    9   11   14

You can make tables of a specific variable to check out how many observations are within each category:

table(restData$zipCode, useNA="ifany") # useNA="ifany" -> if there are missing values, there will be a column, in the table, counting them

## 
## -21226  21201  21202  21205  21206  21207  21208  21209  21210  21211 
##      1    136    201     27     30      4      1      8     23     41 
##  21212  21213  21214  21215  21216  21217  21218  21220  21222  21223 
##     28     31     17     54     10     32     69      1      7     56 
##  21224  21225  21226  21227  21229  21230  21231  21234  21237  21239 
##    199     19     18      4     13    156    127      7      1      3 
##  21251  21287 
##      2      1

The table can also check how many observations are within each pair of categories of two variables, e.g.:

table(restData$councilDistrict, restData$zipCode)

##     
##      -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212
##   1       0     0    37     0     0     0     0     0     0     0     0
##   2       0     0     0     3    27     0     0     0     0     0     0
##   3       0     0     0     0     0     0     0     0     0     0     0
##   4       0     0     0     0     0     0     0     0     0     0    27
##   5       0     0     0     0     0     3     0     6     0     0     0
##   6       0     0     0     0     0     0     0     1    19     0     0
##   7       0     0     0     0     0     0     0     1     0    27     0
##   8       0     0     0     0     0     1     0     0     0     0     0
##   9       0     1     0     0     0     0     0     0     0     0     0
##   10      1     0     1     0     0     0     0     0     0     0     0
##   11      0   115   139     0     0     0     1     0     0     0     1
##   12      0    20    24     4     0     0     0     0     0     0     0
##   13      0     0     0    20     3     0     0     0     0     0     0
##   14      0     0     0     0     0     0     0     0     4    14     0
##     
##      21213 21214 21215 21216 21217 21218 21220 21222 21223 21224 21225
##   1      2     0     0     0     0     0     0     7     0   140     1
##   2      0     0     0     0     0     0     0     0     0    54     0
##   3      2    17     0     0     0     3     0     0     0     0     0
##   4      0     0     0     0     0     0     0     0     0     0     0
##   5      0     0    31     0     0     0     0     0     0     0     0
##   6      0     0    15     1     0     0     0     0     0     0     0
##   7      0     0     6     7    15     6     0     0     0     0     0
##   8      0     0     0     0     0     0     0     0     2     0     0
##   9      0     0     0     2     8     0     0     0    53     0     0
##   10     0     0     0     0     0     0     1     0     0     0    18
##   11     0     0     0     0     9     0     0     0     1     0     0
##   12    13     0     0     0     0    26     0     0     0     0     0
##   13    13     0     1     0     0     0     0     0     0     5     0
##   14     1     0     1     0     0    34     0     0     0     0     0
##     
##      21226 21227 21229 21230 21231 21234 21237 21239 21251 21287
##   1      0     0     0     1   124     0     0     0     0     0
##   2      0     0     0     0     0     0     1     0     0     0
##   3      0     1     0     0     0     7     0     0     2     0
##   4      0     0     0     0     0     0     0     3     0     0
##   5      0     0     0     0     0     0     0     0     0     0
##   6      0     0     0     0     0     0     0     0     0     0
##   7      0     0     0     0     0     0     0     0     0     0
##   8      0     2    13     0     0     0     0     0     0     0
##   9      0     0     0    11     0     0     0     0     0     0
##   10    18     0     0   133     0     0     0     0     0     0
##   11     0     0     0    11     0     0     0     0     0     0
##   12     0     0     0     0     2     0     0     0     0     0
##   13     0     1     0     0     1     0     0     0     0     1
##   14     0     0     0     0     0     0     0     0     0     0

We may also want to do stuff like:

check for missing values in a column:

sum(is.na(restData$zipCode))

## [1] 0

or for the whole dataframe:

colSums(is.na(restData))

##            name         zipCode    neighborhood councilDistrict 
##               0               0               0               0 
##  policeDistrict      Location.1 
##               0               0

all(colSums(is.na(restData))==0)

## [1] TRUE

check if any zip code has been labeled as negative

any(restData$zipCode<0)

## [1] TRUE

check if they have all been labeled as positive

all(restData$zipCode>0)

## [1] FALSE

check how many restaurants have zip codes 21212 or 21213

table(restData$zipCode %in% c("21212","21213"))

## 
## FALSE  TRUE 
##  1268    59

sum(restData$zipCode %in% c("21212","21213"))

## [1] 59

Now consider another data frame, containing information about how many students (Freq) of each gender were admitted and rejected by each deparment of a university,:

data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
head(DF)

##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207

Imagine we want to know, in total, how many male students were admitted, how many were rejected, and how many female students were admitted and how many were rejected. We can do stuff like:

sum(DF[DF$Admit=="Admitted" & DF$Gender=="Male", ]$Freq)

## [1] 1198

to know how many male students were admitted in total, but that’s a bit complicated. It’s simpler to do something like:

xtabs(Freq ~ Admit+Gender, data=DF)

##           Gender
## Admit      Male Female
##   Admitted 1198    557
##   Rejected 1493   1278

Finally, let’s look at another dataset, warpbreaks, to which we will add a new column:

head(warpbreaks)

##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L

warpbreaks$replicate=rep(1:9, len = 54)
head(warpbreaks)

##   breaks wool tension replicate
## 1     26    A       L         1
## 2     30    A       L         2
## 3     54    A       L         3
## 4     25    A       L         4
## 5     70    A       L         5
## 6     52    A       L         6

If we try xtab here of breaks as a function of all other variables, it will be all over the place:

xt=xtabs(breaks ~.,data=warpbreaks)
xt

## , , replicate = 1
## 
##     tension
## wool  L  M  H
##    A 26 18 36
##    B 27 42 20
## 
## , , replicate = 2
## 
##     tension
## wool  L  M  H
##    A 30 21 21
##    B 14 26 21
## 
## , , replicate = 3
## 
##     tension
## wool  L  M  H
##    A 54 29 24
##    B 29 19 24
## 
## , , replicate = 4
## 
##     tension
## wool  L  M  H
##    A 25 17 18
##    B 19 16 17
## 
## , , replicate = 5
## 
##     tension
## wool  L  M  H
##    A 70 12 10
##    B 29 39 13
## 
## , , replicate = 6
## 
##     tension
## wool  L  M  H
##    A 52 18 43
##    B 31 28 15
## 
## , , replicate = 7
## 
##     tension
## wool  L  M  H
##    A 51 35 28
##    B 41 21 15
## 
## , , replicate = 8
## 
##     tension
## wool  L  M  H
##    A 26 30 15
##    B 20 39 16
## 
## , , replicate = 9
## 
##     tension
## wool  L  M  H
##    A 67 36 26
##    B 44 29 28

We can put it all together by using ftable:

ftable(xt)

##              replicate  1  2  3  4  5  6  7  8  9
## wool tension                                     
## A    L                 26 30 54 25 70 52 51 26 67
##      M                 18 21 29 17 12 18 35 30 36
##      H                 36 21 24 18 10 43 28 15 26
## B    L                 27 14 29 19 29 31 41 20 44
##      M                 42 26 19 16 39 28 21 39 29
##      H                 20 21 24 17 13 15 15 16 28

Creating new variables

Often you need to create new variables for your data, be it an index, a qualitative version of a quantitative variable or something else. Examples:

maybe we just want to create a sequence, to serve as an index:

s1 <- seq(1,10,by=2) ; s1

## [1] 1 3 5 7 9

s2 <- seq(1,10,length=3); s2

## [1]  1.0  5.5 10.0

x <- c(1,3,8,25,100); seq(along = x)

## [1] 1 2 3 4 5

or a boolean-like variable that is true when some condition holds:

restData$nearMe = restData$neighborhood %in% c("Roland Park", "Homeland") # true when the neighbourhood is either Roland Park or Homeland
head(restData$nearMe)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

restData$zipWrong = ifelse(restData$zipCode < 0, TRUE, FALSE) # True when the zipcode is negative
head(restData$zipWrong)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

we may want to cut down a quantitative variable into intervals, which is easier if we use the Hmisc package:

restData$zipGroups = cut2(restData$zipCode,g=4)
head(restData$zipGroups)

## [1] [ 21205,21220) [ 21227,21287] [ 21220,21227) [ 21205,21220)
## [5] [ 21220,21227) [ 21205,21220)
## Levels: [-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287]

table(restData$zipGroups)

## 
## [-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287] 
##            338            375            300            314

sometimes it is useful to work with factors:

restData$zcf = factor(restData$zipCode)

Reshaping data

Sometimes we want to reshape our data. Let’s load the library reshape2 and take for example the dataframe mtcars (with 32 rows):

library(reshape2)

## Warning: package 'reshape2' was built under R version 3.0.3

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Suppose we want to transform so that it looks like this (basically we want to display just some variables and we want the variables “mpg” and “hp” to now be in the same column; so except for this new column, for the other variables the first two 32 rows will be copy pasted to the final two 32 rows):

##                carname gear cyl variable value
## 1            Mazda RX4    4   6      mpg  21.0
## 2        Mazda RX4 Wag    4   6      mpg  21.0
## 3           Datsun 710    4   4      mpg  22.8
## 4       Hornet 4 Drive    3   6      mpg  21.4
## 5    Hornet Sportabout    3   8      mpg  18.7
## 6              Valiant    3   6      mpg  18.1
## 7           Duster 360    3   8      mpg  14.3
## 8            Merc 240D    4   4      mpg  24.4
## 9             Merc 230    4   4      mpg  22.8
## 10            Merc 280    4   6      mpg  19.2
## 11           Merc 280C    4   6      mpg  17.8
## 12          Merc 450SE    3   8      mpg  16.4
## 13          Merc 450SL    3   8      mpg  17.3
## 14         Merc 450SLC    3   8      mpg  15.2
## 15  Cadillac Fleetwood    3   8      mpg  10.4
## 16 Lincoln Continental    3   8      mpg  10.4
## 17   Chrysler Imperial    3   8      mpg  14.7
## 18            Fiat 128    4   4      mpg  32.4
## 19         Honda Civic    4   4      mpg  30.4
## 20      Toyota Corolla    4   4      mpg  33.9
## 21       Toyota Corona    3   4      mpg  21.5
## 22    Dodge Challenger    3   8      mpg  15.5
## 23         AMC Javelin    3   8      mpg  15.2
## 24          Camaro Z28    3   8      mpg  13.3
## 25    Pontiac Firebird    3   8      mpg  19.2
## 26           Fiat X1-9    4   4      mpg  27.3
## 27       Porsche 914-2    5   4      mpg  26.0
## 28        Lotus Europa    5   4      mpg  30.4
## 29      Ford Pantera L    5   8      mpg  15.8
## 30        Ferrari Dino    5   6      mpg  19.7
## 31       Maserati Bora    5   8      mpg  15.0
## 32          Volvo 142E    4   4      mpg  21.4
## 33           Mazda RX4    4   6       hp 110.0
## 34       Mazda RX4 Wag    4   6       hp 110.0
## 35          Datsun 710    4   4       hp  93.0
## 36      Hornet 4 Drive    3   6       hp 110.0
## 37   Hornet Sportabout    3   8       hp 175.0
## 38             Valiant    3   6       hp 105.0
## 39          Duster 360    3   8       hp 245.0
## 40           Merc 240D    4   4       hp  62.0
## 41            Merc 230    4   4       hp  95.0
## 42            Merc 280    4   6       hp 123.0
## 43           Merc 280C    4   6       hp 123.0
## 44          Merc 450SE    3   8       hp 180.0
## 45          Merc 450SL    3   8       hp 180.0
## 46         Merc 450SLC    3   8       hp 180.0
## 47  Cadillac Fleetwood    3   8       hp 205.0
## 48 Lincoln Continental    3   8       hp 215.0
## 49   Chrysler Imperial    3   8       hp 230.0
## 50            Fiat 128    4   4       hp  66.0
## 51         Honda Civic    4   4       hp  52.0
## 52      Toyota Corolla    4   4       hp  65.0
## 53       Toyota Corona    3   4       hp  97.0
## 54    Dodge Challenger    3   8       hp 150.0
## 55         AMC Javelin    3   8       hp 150.0
## 56          Camaro Z28    3   8       hp 245.0
## 57    Pontiac Firebird    3   8       hp 175.0
## 58           Fiat X1-9    4   4       hp  66.0
## 59       Porsche 914-2    5   4       hp  91.0
## 60        Lotus Europa    5   4       hp 113.0
## 61      Ford Pantera L    5   8       hp 264.0
## 62        Ferrari Dino    5   6       hp 175.0
## 63       Maserati Bora    5   8       hp 335.0
## 64          Volvo 142E    4   4       hp 109.0

Note that we left three of the original variables in the new dataframe (carname, gear and cyl). We may now use this new dataframe to check out stuff like what is the mean value of miles per gallon (mpg) or horsepower (hp) for each number-of-gears-in-a-car?

gearData = dcast(carMelt, gear ~ variable,mean)
gearData

##   gear   mpg    hp
## 1    3 16.11 176.1
## 2    4 24.53  89.5
## 3    5 21.38 195.6

Or simply, how many different values of mpg or hp do we have for each number-of-cylinders-in-a-car?

cylData = dcast(carMelt, cyl ~ variable)

## Aggregation function missing: defaulting to length

cylData

##   cyl mpg hp
## 1   4  11 11
## 2   6   7  7
## 3   8  14 14

Personally, i find this all too complicate. I’d just rather use the aggregate function:

aggregate(cbind(mpg, hp)~gear, data=mtcars, mean)

##   gear   mpg    hp
## 1    3 16.11 176.1
## 2    4 24.53  89.5
## 3    5 21.38 195.6

I like this function :). Let’s look at another example of how to use it. Say we take the following data, with the counts of insects in agricultural experimental units treated with different insecticides:

head(InsectSprays, n=15)

##    count spray
## 1     10     A
## 2      7     A
## 3     20     A
## 4     14     A
## 5     14     A
## 6     12     A
## 7     10     A
## 8     23     A
## 9     17     A
## 10    20     A
## 11    14     A
## 12    13     A
## 13    11     B
## 14    17     B
## 15    21     B

Want to know, in total, how many insects were treated with each spray? That’s easy:

aggregate(count~spray, data=InsectSprays, sum)

##   spray count
## 1     A   174
## 2     B   184
## 3     C    25
## 4     D    59
## 5     E    42
## 6     F   200

Now another useful function is split. We might want to split this database by the different sprays. We get a list:

spIns = split(InsectSprays$count,InsectSprays$spray)
spIns

## $A
##  [1] 10  7 20 14 14 12 10 23 17 20 14 13
## 
## $B
##  [1] 11 17 21 11 16 14 17 17 19 21  7 13
## 
## $C
##  [1] 0 1 7 2 3 1 2 1 3 0 1 4
## 
## $D
##  [1]  3  5 12  6  4  3  5  5  5  5  2  4
## 
## $E
##  [1] 3 5 3 5 3 6 1 1 3 2 6 4
## 
## $F
##  [1] 11  9 15 22 15 16 13 10 26 26 24 13

If we want to count the number of elements per spray, we have to use lapply. Other interesting functions: * unlist * ddply (plyr package) * acast * arrange * mutate

Finally, we might want to go from a short form to a long form (notably for repeated measures), i.e., do something like going from:

example_short=data.frame( measure.1=c(34, 31, 35, 38), measure.2=c(45, 47, 43, 45))
example_short

##   measure.1 measure.2
## 1        34        45
## 2        31        47
## 3        35        43
## 4        38        45

example_long=reshape(example_short, direction="long", varying=c(1, 2), idvar='id', timevar="TIME", v.names="RESULTS", times =c("measure 1", "measure 2"))
row.names(example_long) <- NULL
example_long

##        TIME RESULTS id
## 1 measure 1      34  1
## 2 measure 1      31  2
## 3 measure 1      35  3
## 4 measure 1      38  4
## 5 measure 2      45  1
## 6 measure 2      47  2
## 7 measure 2      43  3
## 8 measure 2      45  4

Merging data

Sometimes you load more than one dataset into R and you want to merge these two data sets, typically based on an id (basically you can think of this has having linked tables and transforming them into one unique table).

reviews = read.csv("C:/Users/Miguel/Documents/data/reviews.csv")
solutions = read.csv("C:/Users/Miguel/Documents/data/solutions.csv")
head(reviews)

##   id solution_id reviewer_id      start       stop time_left accept
## 1  1           3          27 1304095698 1304095758      1754      1
## 2  2           4          22 1304095188 1304095206      2306      1
## 3  3           5          28 1304095276 1304095320      2192      1
## 4  4           1          26 1304095267 1304095423      2089      1
## 5  5          10          29 1304095456 1304095469      2043      1
## 6  6           2          29 1304095471 1304095513      1999      1

head(solutions)

##   id problem_id subject_id      start       stop time_left answer
## 1  1        156         29 1304095119 1304095169      2343      B
## 2  2        269         25 1304095119 1304095183      2329      C
## 3  3         34         22 1304095127 1304095146      2366      C
## 4  4         19         23 1304095127 1304095150      2362      D
## 5  5        605         26 1304095127 1304095167      2345      A
## 6  6        384         27 1304095131 1304095270      2242      C

In the “reviews” data set, the variable “solution_id” corresponds to the id (first column) that may be seen in the “solutions” data set. So to merge the two data sets:

mergedData = merge(reviews,solutions,by.x="solution_id",by.y="id",all=TRUE)
head(mergedData)

##   solution_id id reviewer_id    start.x     stop.x time_left.x accept
## 1           1  4          26 1304095267 1304095423        2089      1
## 2           2  6          29 1304095471 1304095513        1999      1
## 3           3  1          27 1304095698 1304095758        1754      1
## 4           4  2          22 1304095188 1304095206        2306      1
## 5           5  3          28 1304095276 1304095320        2192      1
## 6           6 16          22 1304095303 1304095471        2041      1
##   problem_id subject_id    start.y     stop.y time_left.y answer
## 1        156         29 1304095119 1304095169        2343      B
## 2        269         25 1304095119 1304095183        2329      C
## 3         34         22 1304095127 1304095146        2366      C
## 4         19         23 1304095127 1304095150        2362      D
## 5        605         26 1304095127 1304095167        2345      A
## 6        384         27 1304095131 1304095270        2242      C

If you have multiple data frames (such that, except for the common id, two dataframes do NOT have commonly named variables… if data frame 1 has a variable called XXX, no other data frame has a variable called XXX), it is easier to use the plyr package:

library(plyr)
df1 = data.frame(id=sample(1:10),x=rnorm(10))
df2 = data.frame(id=sample(1:10),y=rnorm(10))
df3 = data.frame(id=sample(1:10),z=rnorm(10))
dfList = list(df1,df2,df3)
join_all(dfList)

##    id        x        y       z
## 1   1  2.54376 -0.43383 -0.1263
## 2   9  1.55453 -1.34955  0.4922
## 3   5 -0.61923 -0.38265  0.6705
## 4   6 -0.92610 -0.55750 -1.5793
## 5   2 -0.66550 -0.36522 -0.5386
## 6   3 -0.02167 -1.57748  0.1347
## 7   8 -0.17412  0.06178 -1.1507
## 8   4  0.23900 -1.53784  0.1094
## 9  10 -1.83246  1.63550 -0.6820
## 10  7 -0.03719 -0.36594  0.5745