Data wrangling is a very time-consuming and sometimes, really hard task that you have to perform on your data to get the results you want. Data can come for different sources, in several formats, it is incomplete, it is unstructured… It is basically a mess and in order to extract some valuable information out of it, in order to construct knowledge out of it, you have to tidy it up make it uniform, according to the type of analysis that will follow.

A first look through your data

A first look through your data will start with head(). It allows you to take a look at the first 6 rows of your data. In addition, tail() will give you the last 6 records. First, I will construct a dummy example for this purpose:

tab_data <- data.frame("x"=rnorm(10, 60, 100), "variable1"=seq(1:10))
names(tab_data) <- c("Latitude conventional", "Logitude separator")
tab_data

##    Latitude conventional Logitude separator
## 1               22.45927                  1
## 2              -92.05847                  2
## 3               74.58678                  3
## 4              193.17415                  4
## 5               59.32055                  5
## 6               86.53750                  6
## 7               96.25202                  7
## 8             -242.93970                  8
## 9              239.52399                  9
## 10             136.72346                 10

Now, take a look at its head and tail:

head(tab_data)

##   Latitude conventional Logitude separator
## 1              22.45927                  1
## 2             -92.05847                  2
## 3              74.58678                  3
## 4             193.17415                  4
## 5              59.32055                  5
## 6              86.53750                  6

tail(tab_data)

##    Latitude conventional Logitude separator
## 5               59.32055                  5
## 6               86.53750                  6
## 7               96.25202                  7
## 8             -242.93970                  8
## 9              239.52399                  9
## 10             136.72346                 10

Another function you can apply to your data is the function str(). This function will give you an idea of rows/columns, how many variables you have and their classification, and some data samples of each variable.

str(tab_data)

## 'data.frame':    10 obs. of  2 variables:
##  $ Latitude conventional: num  22.5 -92.1 74.6 193.2 59.3 ...
##  $ Logitude separator   : int  1 2 3 4 5 6 7 8 9 10

Another aspect to have into consideration are the names of the variables you are dealing with. Once you import data, variable names that contain special characters or spaces can completely drag you down with a million problems. They should be “cleaned” - no spaces or special characters.

The variable names on the table have a space that could give you trouble in your analysis. But R has function that will tidy them up, and prevent future possible problems - make.names()

names(tab_data) <- make.names(names(tab_data))
tab_data[1,]

##   Latitude.conventional Logitude.separator
## 1              22.45927                  1

If you’re looking for a pattern, take a look at the function table(). It will count each factor for a variable.
Let’s take as an example the mtcars dataset:

table(mtcars$cyl)

## 
##  4  6  8 
## 11  7 14

table(mtcars$gear)

## 
##  3  4  5 
## 15 12  5

table() can be used as a technique to find duplicates, when appropriate, but you can also use the funtion duplicated(). It returns a logical vector that will point TRUE whenever there is a duplicate. If you intend to remove duplicates directly, you can use the function unique().

Strings and `stringr`

require(stringr)

stringr is a package in R that has “Simple, Consistent Wrappers for Common String Operations” (Hadley_b Wickham 2017). It might seem that you won’t need it, but it is quite important, specially when the data you’re receiving has been inputed by different people, using different methods, and you have to uniformize it.

There are four main families of functions in stringr:

Character Manipulation functions;
Whitespaces manipulation functions;
Locale sensitive functions and
Pattern matching functions.

You can get the lenght of any string using str_length().

a<-"Hello World"
str_length(a)

## [1] 11

You can substitute characters in a string by something your want, or nada de nada. The first argument is your string, the second the starting position and the third is the last position.

str_sub(a, 1, 5) # to keep the string specified within your arguments

## [1] "Hello"

str_sub(a, 7, 11)<-"Pluto" #substitute "world" by "pluto"
a

## [1] "Hello Pluto"

a<-c("Hello World", "Goodbye World")
str_sub(a, 1,6)

## [1] "Hello " "Goodby"

str_sub(a, -5, -1)

## [1] "World" "World"

There is also a function from base that I really like which is gsub. It is quite usefull as well, in case that, for example, you need to supress whitespaces inside your variables and substitute them by “_“.

a<-"I am Angela"
gsub(" ", "_", a)

## [1] "I_am_Angela"

Regarding whitespaces, it is worth to mention str_pad() that adds extra whitespaces either on the left, or right, or on both sides; str_trim() will remove leading and trailing whitespaces. This can become quite usefull in our dataset, where in certain classes there is a leading whitespace at the beginning of each level. E.g.:

df<-data.frame(x=c("  FOO"," baR"), y= c("blUe   ", "GREy "))
str_trim(df$x)

## [1] "FOO" "baR"

Now, let’s apply this function to the rest of the table:

dftrim<-as.data.frame(lapply(df, str_trim))
levels(dftrim$y)

## [1] "blUe" "GREy"

More on the lapply() function will be given further on.

To change cases, there are functions such as str_to_upper(), str_to_title() and str_to_lower(). You can also order strings, str_order() or sort them with str_sort().

For example, in our case data, I believe it would be better to lower all cases. Therefore,

dftrim<-as.data.frame(lapply(dftrim, str_to_lower))
dftrim

##     x    y
## 1 foo blue
## 2 bar grey

More information about string patterns can be found here.

Basic operations on tables

One of the most basic table operations you can do is to transpose the table. You can do so by simply using the function t().

dftrimt<-as.data.frame(t(dftrim))

Selecting data

Selection of data can be done in a couple of ways. Let’s say you want to select rows 7:9, or columns 1 and 2:

mtcars[7:9,]

##             mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Duster 360 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 240D  24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## Merc 230   22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2

mtcars[c(7,8,9),]

##             mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Duster 360 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 240D  24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## Merc 230   22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2

head(mtcars[,1:2])

##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

head(mtcars[, c(1,2)])

##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

You can also subset your data according to the values of your variables. E.g., let’s say you need to select all cars with “cyl” equal to 6. You have several ways to do it:

mtcars[which(mtcars$cyl == 6), ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

subset(mtcars, mtcars$cyl ==6)

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Now, let’s a look at another example. Let’s select cyl=6 and carburator 4 or less. You can also select all cars with 6 cilinders plus all cars with carburator less than 4. Take into consideration that in the first case it is used & and the second |. They mean “AND” and “OR” respectively.

mtcars[which(mtcars$cyl== 6 & mtcars$carb <=4), ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

head(mtcars[which(mtcars$cyl== 6 | mtcars$carb <=4), ])

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

subset(mtcars, mtcars$cyl== 6 & mtcars$carb <=4)

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

head(subset(mtcars, mtcars$cyl== 6 | mtcars$carb <=4))

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

With the subset function, you can also select the columns you want on the output.

subset(mtcars, mtcars$cyl== 6 & mtcars$carb <=4, select = c("cyl", "wt"))

##                cyl    wt
## Mazda RX4        6 2.620
## Mazda RX4 Wag    6 2.875
## Hornet 4 Drive   6 3.215
## Valiant          6 3.460
## Merc 280         6 3.440
## Merc 280C        6 3.440

Excluding data

This process is also referred to as dropping data.

head(mtcars[-1,]) # exclude first row

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

head(mtcars[,-1]) #exclude first column

##                   cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1

head(mtcars[ , -c(1:3)]) #exclude columns 1:3

##                    hp drat    wt  qsec vs am gear carb
## Mazda RX4         110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 175 3.15 3.440 17.02  0  0    3    2
## Valiant           105 2.76 3.460 20.22  1  0    3    1

Adding data

If you want to merge data horizontally, by key-value columns, check the package dplyr in the next sections. If you want to merge two datatables vertically, you can use rbind, as long as the tables have the same variables (columns).

x<-data.frame(x=1:5, y=2:6)
y<-data.frame(x=2:10, y=20)
xy=rbind(x,y)

Tidyverse

Tidyverse (Wickham 2016) is a group of packages that share the same design philosophy, grammar and data structures. The packages that are part of this universe are dplyr (Wickham and Francois 2016), ggplot2 (Wickham and Chang 2016), readr (Wickham, Hester, and Francois 2016), tibble (Wickham, Francois, and Muller 2016), tidyr (Hadley_a Wickham 2017) and purrr (Henry and Wickham 2017).

The wide-long Dillema

A wide dataframe format is a dataframe that has a column for each variable, such as the table we’ve been using so far. A long format dataframe is a dataframe that has a column for all variables “stacked” into a column and another column has the values.

A wide dataframe looks like:

##   SO4  Ca
## 1 0.1 0.5
## 2 0.2 0.6
## 3 0.3 0.7

The same dataframe on the long format looks like:

##   variable value
## 1      SO4   0.1
## 2      SO4   0.2
## 3      SO4   0.3
## 4       Ca   0.5
## 5       Ca   0.6
## 6       Ca   0.7

Now this requires a little thinking about your data, and what are the “x’s” of your data and which are the “y’s”. Sometimes you’ll need long format, and other times, you’ll need wide format. It only depends on what you need to do with your data. ggplot2, plyr, and some modelling functions lm(), glm(), need the long format tables. However, you might find it easy to record your data in the wide-format.

The package reshape helps you to convert between these two formats, and all goes around with two main functions: melt() (from wide to long) and cast (from long to wide).

Let’s take into consideration the mtcars dataset:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The “x” for this dataset is each car model, which is the name for each row. For that reason, I will transfer these names as a new column. Now, to transform a dataset from wide to long, you can use the function melt(). It’s always important that you think about the nature of your data. All datasets are different to a certain extent, and that largely depends on the results you’re trying to accomplish. A deep understanding of your data (what are the variables of interest; what are the questions you’re trying to answer what is the relationship between those questions and the variables you have) might be something that you don’t really want to think about, because you’re just eager to get the results. However, “wasting” this time at the beginning of your data analysis, is something that could avoid future bulk mistakes and misconceptions is your analysis. Therefore, whenever you’re melting your data, always use the argument “id.vars”, so you know exaclty what you’re doing with your data.

mtcars$model<-gsub(" ", "_", row.names(mtcars))
longmtcars<-melt(mtcars, id.vars = "model")
head(longmtcars)

##               model variable value
## 1         Mazda_RX4      mpg  21.0
## 2     Mazda_RX4_Wag      mpg  21.0
## 3        Datsun_710      mpg  22.8
## 4    Hornet_4_Drive      mpg  21.4
## 5 Hornet_Sportabout      mpg  18.7
## 6           Valiant      mpg  18.1

tail(longmtcars)

##              model variable value
## 347  Porsche_914-2     carb     2
## 348   Lotus_Europa     carb     2
## 349 Ford_Pantera_L     carb     4
## 350   Ferrari_Dino     carb     6
## 351  Maserati_Bora     carb     8
## 352     Volvo_142E     carb     2

The onbject we just created, longmtcars is now a “tidy” dataset according to (Wickham and others 2014). Now, you can convert it back again to wide format. You can use the function dcast() to have a table as an output or you can use acast to have an array or vector as an output. To achieve the conversion to wide format, you’ll need a formula of the type \(x_{1} + x_{2} \sim y_{1} + y_{2}\).

wide<-dcast(longmtcars, model ~ variable)
head(wide)

##                model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1        AMC_Javelin 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
## 2 Cadillac_Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## 3         Camaro_Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## 4  Chrysler_Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## 5         Datsun_710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 6   Dodge_Challenger 15.5   8  318 150 2.76 3.520 16.87  0  0    3    2

And there you go, your data is back as a wide table.

Examples of other types of tables you can have; note that the one to choose will always be directly related to the type of analysis you’re trying to do. You have to understand which variable is the cause and which is the effect.

head(melt(mtcars, id.vars = "hp"))

##    hp variable value
## 1 110      mpg    21
## 2 110      mpg    21
## 3  93      mpg  22.8
## 4 110      mpg  21.4
## 5 175      mpg  18.7
## 6 105      mpg  18.1

head(melt(mtcars, id.vars = c("hp", "gear")))

##    hp gear variable value
## 1 110    4      mpg    21
## 2 110    4      mpg    21
## 3  93    4      mpg  22.8
## 4 110    3      mpg  21.4
## 5 175    3      mpg  18.7
## 6 105    3      mpg  18.1

You can also use dcast() to perform aggregation. Let’s say you want to collect the mean of each variable for each gear. You can do so by:

df<-melt(mtcars, id.vars = c("model", "gear"))
head(df)

##               model gear variable value
## 1         Mazda_RX4    4      mpg  21.0
## 2     Mazda_RX4_Wag    4      mpg  21.0
## 3        Datsun_710    4      mpg  22.8
## 4    Hornet_4_Drive    3      mpg  21.4
## 5 Hornet_Sportabout    3      mpg  18.7
## 6           Valiant    3      mpg  18.1

dcast(df, gear ~ variable, mean)

##   gear      mpg      cyl     disp       hp     drat       wt   qsec
## 1    3 16.10667 7.466667 326.3000 176.1333 3.132667 3.892600 17.692
## 2    4 24.53333 4.666667 123.0167  89.5000 4.043333 2.616667 18.965
## 3    5 21.38000 6.000000 202.4800 195.6000 3.916000 2.632600 15.640
##          vs        am     carb
## 1 0.2000000 0.0000000 2.666667
## 2 0.8333333 0.6666667 2.333333
## 3 0.2000000 1.0000000 4.400000

dcast(df, gear~variable) #counts occurences

## Aggregation function missing: defaulting to length

##   gear mpg cyl disp hp drat wt qsec vs am carb
## 1    3  15  15   15 15   15 15   15 15 15   15
## 2    4  12  12   12 12   12 12   12 12 12   12
## 3    5   5   5    5  5    5  5    5  5  5    5

Note that it pops the message that the lenght is used as the aggregation function, because we did not define the function, which means that the output will give you counts.

However, I believe it is easier to do this operation (aggregation) with dplyr.

Now just as a curiosity, take a look on how to produce boxplots with base and with ggplot2 for the mtcars:

require(ggplot2, quietly = TRUE)

boxplot(mpg~gear, data=mtcars)
ggplot(mtcars, aes(x=as.factor(gear), y=mpg)) + 
    geom_boxplot() + xlab("cyl") # one variable against the other
ggplot(longmtcars, aes (x=variable, y=value)) +
  geom_boxplot() #all variables

`tidyr`

The tidyr package is a package that allows you to tidy your data easily (Hadley_a Wickham 2017). It might seem very similar to reshape2, but tidyr is designed mainly to tidy data and not to reshape.

This package offers you four main functions:

gather ()
separate ()
spread ()
unite ()

Let’s take an example. You measured \(SO_4^2\) levels, and collected that data along a river in three locations, from 2005 to 2008.

require(tidyr, quietly = TRUE)

riversulfate<-data.frame(Location= (1:3),
                        y2005 = c(0.1, 0.2, 0.1),
                        y2006= c(0.2, 0.1, 0.1), 
                         y2007= c(0.5, 0.1, 0.2), 
                         y2008 = c(0.2, 0.3, 0.1))
riversulfate

##   Location y2005 y2006 y2007 y2008
## 1        1   0.1   0.2   0.5   0.2
## 2        2   0.2   0.1   0.1   0.3
## 3        3   0.1   0.1   0.2   0.1

If you take a look at the data, your rows are not a single observation for each location, but instead in each row you have four different observations for each location. To tidy these data, and keep each observation on a row, you use gather (), where the first argument is your table, second the name you want to call the variable that will “group” several columns into the same column (referred to as “key”), the third the name of the measurement itself, and lastly, the indication for columns to be gathered.

gather(riversulfate, year, value, y2005:y2008)

##    Location  year value
## 1         1 y2005   0.1
## 2         2 y2005   0.2
## 3         3 y2005   0.1
## 4         1 y2006   0.2
## 5         2 y2006   0.1
## 6         3 y2006   0.1
## 7         1 y2007   0.5
## 8         2 y2007   0.1
## 9         3 y2007   0.2
## 10        1 y2008   0.2
## 11        2 y2008   0.3
## 12        3 y2008   0.1

or more simplisticaly, with the pipe notation:

riversulfate %>%
  gather(year, value, y2005:y2008)

##    Location  year value
## 1         1 y2005   0.1
## 2         2 y2005   0.2
## 3         3 y2005   0.1
## 4         1 y2006   0.2
## 5         2 y2006   0.1
## 6         3 y2006   0.1
## 7         1 y2007   0.5
## 8         2 y2007   0.1
## 9         3 y2007   0.2
## 10        1 y2008   0.2
## 11        2 y2008   0.3
## 12        3 y2008   0.1

Now, let’s take the same example, but let’s consider that for each location you measured both \(Na^+\) and \(SO_4^2\), and merged both variables into “Location”.

set.seed(2)
options(digits = 2)
riversulfate<-data.frame(Location= c("1;Na", "1;SO4", "2;Na", 
                                     "2;SO4", "3;Na", "3;SO4"),
                        y2005 = rnorm(6, 5),
                        y2006= rnorm(6, 0.2), 
                         y2007= rnorm(6, 0.1), 
                         y2008 = rnorm(6, 0.5))
head(riversulfate)

##   Location y2005  y2006 y2007 y2008
## 1     1;Na   4.1  0.908 -0.29  1.51
## 2    1;SO4   5.2 -0.040 -0.94  0.93
## 3     2;Na   6.6  2.184  1.88  2.59
## 4    2;SO4   3.9  0.061 -2.21 -0.70
## 5     3;Na   4.9  0.618  0.98  2.09
## 6    3;SO4   5.1  1.182  0.14  2.45

To tidy this table, you first have to collapse the different years into one column using gather(), and then use separate() to separate the “location” column into two columns. As a result, you’ll have for each column a variable and for each row an observation.

tidy<-gather(riversulfate, year, mgL, y2005:y2008)
head(tidy)

##   Location  year mgL
## 1     1;Na y2005 4.1
## 2    1;SO4 y2005 5.2
## 3     2;Na y2005 6.6
## 4    2;SO4 y2005 3.9
## 5     3;Na y2005 4.9
## 6    3;SO4 y2005 5.1

river<-separate(tidy, Location, c("Location", "element"), sep="\\;")
head(river)

##   Location element  year mgL
## 1        1      Na y2005 4.1
## 2        1     SO4 y2005 5.2
## 3        2      Na y2005 6.6
## 4        2     SO4 y2005 3.9
## 5        3      Na y2005 4.9
## 6        3     SO4 y2005 5.1

If you desire to present an element as a column, you can use spread(). The first argument is your data, second the column you want to separate to several columns (the key), and the third name of the column that holds the values for that key.

tidier<-spread(river, element, mgL)
head(tidier)

##   Location  year    Na    SO4
## 1        1 y2005  4.10  5.185
## 2        1 y2006  0.91 -0.040
## 3        1 y2007 -0.29 -0.940
## 4        1 y2008  1.51  0.932
## 5        2 y2005  6.59  3.870
## 6        2 y2006  2.18  0.061

`dplyr`

dplyr is “A fast, consistent tool for working with data frame like objects”" (Wickham and Francois 2016).

The main functions from this package are: * filter() - selection based on an attribute value * arrange() - reorder your variables by a variable order (ascending or descending) * select() and rename() - selecting variables by name * mutate() - works as a “field calculator” * summarise() - condenses a variable into one value

Under this context, let’s use again the mtcars dataset. dplyr allows you to filter by attributes. In the following example, I want to extract all cars with gear 5:

require(dplyr)

filter(mtcars, gear==5)

##   mpg cyl disp  hp drat  wt qsec vs am gear carb          model
## 1  26   4  120  91  4.4 2.1   17  0  1    5    2  Porsche_914-2
## 2  30   4   95 113  3.8 1.5   17  1  1    5    2   Lotus_Europa
## 3  16   8  351 264  4.2 3.2   14  0  1    5    4 Ford_Pantera_L
## 4  20   6  145 175  3.6 2.8   16  0  1    5    6   Ferrari_Dino
## 5  15   8  301 335  3.5 3.6   15  0  1    5    8  Maserati_Bora

filter(mtcars, gear ==5, 
       carb!=2 & carb!=3) #gear 5, carb isnt 2 and 3

##   mpg cyl disp  hp drat  wt qsec vs am gear carb          model
## 1  16   8  351 264  4.2 3.2   14  0  1    5    4 Ford_Pantera_L
## 2  20   6  145 175  3.6 2.8   16  0  1    5    6   Ferrari_Dino
## 3  15   8  301 335  3.5 3.6   15  0  1    5    8  Maserati_Bora

filter(mtcars, gear>=5) #gear equal or more than 5

##   mpg cyl disp  hp drat  wt qsec vs am gear carb          model
## 1  26   4  120  91  4.4 2.1   17  0  1    5    2  Porsche_914-2
## 2  30   4   95 113  3.8 1.5   17  1  1    5    2   Lotus_Europa
## 3  16   8  351 264  4.2 3.2   14  0  1    5    4 Ford_Pantera_L
## 4  20   6  145 175  3.6 2.8   16  0  1    5    6   Ferrari_Dino
## 5  15   8  301 335  3.5 3.6   15  0  1    5    8  Maserati_Bora

With dplyr, you can also re-arrange your table, by order of a variable you want:

head(arrange(mtcars, gear))

##   mpg cyl disp  hp drat  wt qsec vs am gear carb             model
## 1  21   6  258 110  3.1 3.2   19  1  0    3    1    Hornet_4_Drive
## 2  19   8  360 175  3.1 3.4   17  0  0    3    2 Hornet_Sportabout
## 3  18   6  225 105  2.8 3.5   20  1  0    3    1           Valiant
## 4  14   8  360 245  3.2 3.6   16  0  0    3    4        Duster_360
## 5  16   8  276 180  3.1 4.1   17  0  0    3    3        Merc_450SE
## 6  17   8  276 180  3.1 3.7   18  0  0    3    3        Merc_450SL

head(arrange(mtcars, vs))

##   mpg cyl disp  hp drat  wt qsec vs am gear carb             model
## 1  21   6  160 110  3.9 2.6   16  0  1    4    4         Mazda_RX4
## 2  21   6  160 110  3.9 2.9   17  0  1    4    4     Mazda_RX4_Wag
## 3  19   8  360 175  3.1 3.4   17  0  0    3    2 Hornet_Sportabout
## 4  14   8  360 245  3.2 3.6   16  0  0    3    4        Duster_360
## 5  16   8  276 180  3.1 4.1   17  0  0    3    3        Merc_450SE
## 6  17   8  276 180  3.1 3.7   18  0  0    3    3        Merc_450SL

head(arrange(mtcars, desc(wt))) #descending order

##   mpg cyl disp  hp drat  wt qsec vs am gear carb               model
## 1  10   8  460 215  3.0 5.4   18  0  0    3    4 Lincoln_Continental
## 2  15   8  440 230  3.2 5.3   17  0  0    3    4   Chrysler_Imperial
## 3  10   8  472 205  2.9 5.2   18  0  0    3    4  Cadillac_Fleetwood
## 4  16   8  276 180  3.1 4.1   17  0  0    3    3          Merc_450SE
## 5  19   8  400 175  3.1 3.8   17  0  0    3    2    Pontiac_Firebird
## 6  13   8  350 245  3.7 3.8   15  0  0    3    4          Camaro_Z28

You can also select only variables of your interest, e.g.

head(select(mtcars, model, cyl))

##                               model cyl
## Mazda RX4                 Mazda_RX4   6
## Mazda RX4 Wag         Mazda_RX4_Wag   6
## Datsun 710               Datsun_710   4
## Hornet 4 Drive       Hornet_4_Drive   6
## Hornet Sportabout Hornet_Sportabout   8
## Valiant                     Valiant   6

head(select(mtcars, -cyl, -wt, -model))

##                   mpg disp  hp drat qsec vs am gear carb
## Mazda RX4          21  160 110  3.9   16  0  1    4    4
## Mazda RX4 Wag      21  160 110  3.9   17  0  1    4    4
## Datsun 710         23  108  93  3.8   19  1  1    4    1
## Hornet 4 Drive     21  258 110  3.1   19  1  0    3    1
## Hornet Sportabout  19  360 175  3.1   17  0  0    3    2
## Valiant            18  225 105  2.8   20  1  0    3    1

head(select(mtcars, -(disp:carb)))

##                   mpg cyl             model
## Mazda RX4          21   6         Mazda_RX4
## Mazda RX4 Wag      21   6     Mazda_RX4_Wag
## Datsun 710         23   4        Datsun_710
## Hornet 4 Drive     21   6    Hornet_4_Drive
## Hornet Sportabout  19   8 Hornet_Sportabout
## Valiant            18   6           Valiant

There is other syntaxes you can use (check ?select_helpers ).

If you want to calculate a new field inside your table, you can use mutate or transmute. Transmute will drop variables (unless you use them as arguments), and mutate will keep them. I will use the Titanic dataset from now on.

Tita<-as.data.frame(Titanic)
head(Tita)

##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

Tita<-mutate(Tita, perc=Freq/sum(Freq)*100)
head(Tita)

##   Class    Sex   Age Survived Freq perc
## 1   1st   Male Child       No    0  0.0
## 2   2nd   Male Child       No    0  0.0
## 3   3rd   Male Child       No   35  1.6
## 4  Crew   Male Child       No    0  0.0
## 5   1st Female Child       No    0  0.0
## 6   2nd Female Child       No    0  0.0

head(transmute(Tita, Freq, perc=Freq/sum(Freq)*100))

##   Freq perc
## 1    0  0.0
## 2    0  0.0
## 3   35  1.6
## 4    0  0.0
## 5    0  0.0
## 6    0  0.0

The function summarise() helps you collapsing your variables into a single record, according to a given function - could be sum, or mean, or whatever you feel like.

summarise(Tita, Total=sum(Freq))#How many people were sampled?

##   Total
## 1  2201

A very usefull operation is group_by. With it, you can perform your operations and group your data.

people<-group_by(Tita, Class)
summarise(people, sum(Freq))#How many people were sampled by class?

## # A tibble: 4 × 2
##    Class `sum(Freq)`
##   <fctr>       <dbl>
## 1    1st         325
## 2    2nd         285
## 3    3rd         706
## 4   Crew         885

You can also use pipe notation (from magrittr (Bache and Wickham 2014)), and it simplifies your code:

Tita %>%
  group_by(Survived) %>%
  summarise(sum(Freq)) #How many survived and how many died?

## # A tibble: 2 × 2
##   Survived `sum(Freq)`
##     <fctr>       <dbl>
## 1       No        1490
## 2      Yes         711

Tita %>%
  group_by(Sex, Survived)%>%
  summarise(sum(Freq)) #How many women and men died?

## Source: local data frame [4 x 3]
## Groups: Sex [?]
## 
##      Sex Survived `sum(Freq)`
##   <fctr>   <fctr>       <dbl>
## 1   Male       No        1364
## 2   Male      Yes         367
## 3 Female       No         126
## 4 Female      Yes         344

Tita %>%
  group_by(Age, Survived) %>%
  summarise(sum(Freq))#How many children (or adults) died?

## Source: local data frame [4 x 3]
## Groups: Age [?]
## 
##      Age Survived `sum(Freq)`
##   <fctr>   <fctr>       <dbl>
## 1  Child       No          52
## 2  Child      Yes          57
## 3  Adult       No        1438
## 4  Adult      Yes         654

`dplyr::joins`

Another very usefull feature of dplyr is the joins between tables. They work more or less like the SQL joins. After all, dplyr is a front end language that can be converted to SQL or spark. In order to perform a join between two tables, a key is needed to base that join on. In other words, a common variable between both tables is needed, on which the join will be based on.

So let’s make two dummy tables, with a common key (Name).

A=data.frame(Name = c("Maria", "Joana", "Paulo", "Abu", "Jacinta"),
             age = c("cret", "jur", "mioc", "ord", "neog"),
             refs = c(1:5))
B=data.frame(Name = c("Abu", "Joana", "Maria", "Gonzalo"), 
             comp = c(1:4), 
             sent= c(5:8))

left_join(A, B)

##      Name  age refs comp sent
## 1   Maria cret    1    3    7
## 2   Joana  jur    2    2    6
## 3   Paulo mioc    3   NA   NA
## 4     Abu  ord    4    1    5
## 5 Jacinta neog    5   NA   NA

right_join(A, B)

##      Name  age refs comp sent
## 1     Abu  ord    4    1    5
## 2   Joana  jur    2    2    6
## 3   Maria cret    1    3    7
## 4 Gonzalo <NA>   NA    4    8

inner_join(A, B)

##    Name  age refs comp sent
## 1 Maria cret    1    3    7
## 2 Joana  jur    2    2    6
## 3   Abu  ord    4    1    5

semi_join(A, B)

##    Name  age refs
## 1   Abu  ord    4
## 2 Joana  jur    2
## 3 Maria cret    1

semi_join(B, A)

##    Name comp sent
## 1 Maria    3    7
## 2 Joana    2    6
## 3   Abu    1    5

anti_join(A, B)

##      Name  age refs
## 1   Paulo mioc    3
## 2 Jacinta neog    5

anti_join(B, A)

##      Name comp sent
## 1 Gonzalo    4    8

full_join(A, B)

##      Name  age refs comp sent
## 1   Maria cret    1    3    7
## 2   Joana  jur    2    2    6
## 3   Paulo mioc    3   NA   NA
## 4     Abu  ord    4    1    5
## 5 Jacinta neog    5   NA   NA
## 6 Gonzalo <NA>   NA    4    8

full_join(B, A)

##      Name comp sent  age refs
## 1     Abu    1    5  ord    4
## 2   Joana    2    6  jur    2
## 3   Maria    3    7 cret    1
## 4 Gonzalo    4    8 <NA>   NA
## 5   Paulo   NA   NA mioc    3
## 6 Jacinta   NA   NA neog    5

The figure bellow is a resume on how the joins in dplyr work. Note that the difference of inner_join() and semi_join() is that the inner preserves all columns from both tables, whereas semi will only keep the “A” columns. In addition, if you’re looking for a join that will keep unmatching results from both tables, you have to use anti_join() twice. For that, you can use the code:

full_join(anti_join(A, B), anti_join(B, A))

##      Name  age refs comp sent
## 1   Paulo mioc    3   NA   NA
## 2 Jacinta neog    5   NA   NA
## 3 Gonzalo <NA>   NA    4    8

knitr::include_graphics("figure/12.jpg")

Apply family

apply will apply a function to a matrix
lapply “l” stands for list. It returns a list of the same input as X, where each element is the result of applying a function to the corresponding element of X
sapply “s” stands for simple. It is the same as lapply, but it returns a vector
tapply “t” stands for table. This function will agregate your data into groups and apply the function over those groups.
mapply m stands for multivariate. It is the same as lapply but instead of looping through a vector/list, it loops through each item of multiple lists, or vectors.

apply has three main arguments. First is the array (or matrix), second, the margin parameter which will determine if you want to perform the operation over a row or a column, and third indicates the function.

df<-matrix(1:20, 5)
df

##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

apply(df, 2, sum)

## [1] 15 40 65 90

apply(df, 1, sum)

## [1] 34 38 42 46 50

Note that for the apply function you input an arrow, and what you get is a vector, or a matrix (if your margin argument is c(1,2)).

lapply

data <- list(foo=c(5:10), bar=c(1:50), baz=c(2000:2018))
lapply(data, FUN = sum)

## $foo
## [1] 45
## 
## $bar
## [1] 1275
## 
## $baz
## [1] 38171

lapply(data, FUN = mean)

## $foo
## [1] 7.5
## 
## $bar
## [1] 26
## 
## $baz
## [1] 2009

Note that for the lapply function you input a list, and you get as an output another list.

b <- lapply(data, FUN = sum)
class(b)

## [1] "list"

sapply

As stated before, sapply works exactly as lapply, but will return a vector.

a <- sapply(data, FUN = sum)
a

##   foo   bar   baz 
##    45  1275 38171

class(a)

## [1] "integer"

tapply To understand how this function works, let’s consider a data frame where I have a couple measurements on a plant characteristic - “foo” and “bar” and I measured those in 5 different ecosystems - A, B, C, D, and E. Now let’s say I want the mean over foo and bar for each ecosystem:

data<- data.frame("foo"=(1:100), "bar"=(201:400), 
                    "ecosystem"=rep(paste(letters[1:5]), 40))
tapply(data$foo, data$ecosystem, mean)

##  a  b  c  d  e 
## 48 50 50 52 52

tapply(data$bar, data$ecosystem, mean)

##   a   b   c   d   e 
## 298 300 300 302 302

The first argument corresponds to the object to which you want to apply the function, second how should the data be grouped by, and third is the function to apply.

mapply As stated before, this function loops through each item of multiple lists, or vectors. Therefore, the objects should be the same lenght!!!

a<-list("foo"=c(0:10), "bar"= c(10:20))
mapply(sum, a$foo, a$bar)

##  [1] 10 12 14 16 18 20 22 24 26 28 30

Note that first argument is the function, and the others are whatever objects you want.

mapply(sum, a$foo, c(50:60), a$foo, a$bar)

##  [1]  60  64  68  72  76  80  84  88  92  96 100

mapply(mean, a$foo, a$bar)

##  [1]  0  1  2  3  4  5  6  7  8  9 10

References

Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.

Henry, Lionel, and Hadley Wickham. 2017. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Wickham, Hadley. 2016. Tidyverse: Easily Install and Load ’Tidyverse’ Packages. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, and Winston Chang. 2016. Ggplot2: An Implementation of the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). Foundation for Open Access Statistics: 1–23.

Wickham, Hadley, Romain Francois, and Kirill Muller. 2016. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.

Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.

Wickham, Hadley_a. 2017. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley_b. 2017. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Data Wrangling in R

Angela Rodrigues

2018-03-14

A first look through your data

Strings and `stringr`

Basic operations on tables

Selecting data

Excluding data

Adding data

Tidyverse

The wide-long Dillema

`tidyr`

`dplyr`

`dplyr::joins`

Apply family

References

Data Wrangling in R

Angela Rodrigues

2018-03-14

A first look through your data

Strings and stringr

Basic operations on tables

Selecting data

Excluding data

Adding data

Tidyverse

The wide-long Dillema

tidyr

dplyr

dplyr::joins

Apply family

References

Strings and `stringr`

`tidyr`

`dplyr`

`dplyr::joins`