Data wrangling is a very time-consuming and sometimes, really hard task that you have to perform on your data to get the results you want. Data can come for different sources, in several formats, it is incomplete, it is unstructured… It is basically a mess and in order to extract some valuable information out of it, in order to construct knowledge out of it, you have to tidy it up make it uniform, according to the type of analysis that will follow.
A first look through your data
A first look through your data will start with head(). It allows you to take a look at the first 6 rows of your data. In addition, tail() will give you the last 6 records. First, I will construct a dummy example for this purpose:
tab_data <- data.frame("x"=rnorm(10, 60, 100), "variable1"=seq(1:10))
names(tab_data) <- c("Latitude conventional", "Logitude separator")
tab_data## Latitude conventional Logitude separator
## 1 22.45927 1
## 2 -92.05847 2
## 3 74.58678 3
## 4 193.17415 4
## 5 59.32055 5
## 6 86.53750 6
## 7 96.25202 7
## 8 -242.93970 8
## 9 239.52399 9
## 10 136.72346 10
Now, take a look at its head and tail:
head(tab_data)## Latitude conventional Logitude separator
## 1 22.45927 1
## 2 -92.05847 2
## 3 74.58678 3
## 4 193.17415 4
## 5 59.32055 5
## 6 86.53750 6
tail(tab_data)## Latitude conventional Logitude separator
## 5 59.32055 5
## 6 86.53750 6
## 7 96.25202 7
## 8 -242.93970 8
## 9 239.52399 9
## 10 136.72346 10
Another function you can apply to your data is the function str(). This function will give you an idea of rows/columns, how many variables you have and their classification, and some data samples of each variable.
str(tab_data)## 'data.frame': 10 obs. of 2 variables:
## $ Latitude conventional: num 22.5 -92.1 74.6 193.2 59.3 ...
## $ Logitude separator : int 1 2 3 4 5 6 7 8 9 10
Another aspect to have into consideration are the names of the variables you are dealing with. Once you import data, variable names that contain special characters or spaces can completely drag you down with a million problems. They should be “cleaned” - no spaces or special characters.
The variable names on the table have a space that could give you trouble in your analysis. But R has function that will tidy them up, and prevent future possible problems - make.names()
names(tab_data) <- make.names(names(tab_data))
tab_data[1,]## Latitude.conventional Logitude.separator
## 1 22.45927 1
If you’re looking for a pattern, take a look at the function table(). It will count each factor for a variable.
Let’s take as an example the mtcars dataset:
table(mtcars$cyl)##
## 4 6 8
## 11 7 14
table(mtcars$gear)##
## 3 4 5
## 15 12 5
table() can be used as a technique to find duplicates, when appropriate, but you can also use the funtion duplicated(). It returns a logical vector that will point TRUE whenever there is a duplicate. If you intend to remove duplicates directly, you can use the function unique().
Strings and stringr
require(stringr)stringr is a package in R that has “Simple, Consistent Wrappers for Common String Operations” (Hadley_b Wickham 2017). It might seem that you won’t need it, but it is quite important, specially when the data you’re receiving has been inputed by different people, using different methods, and you have to uniformize it.
There are four main families of functions in stringr:
- Character Manipulation functions;
- Whitespaces manipulation functions;
- Locale sensitive functions and
- Pattern matching functions.
You can get the lenght of any string using str_length().
a<-"Hello World"
str_length(a)## [1] 11
You can substitute characters in a string by something your want, or nada de nada. The first argument is your string, the second the starting position and the third is the last position.
str_sub(a, 1, 5) # to keep the string specified within your arguments## [1] "Hello"
str_sub(a, 7, 11)<-"Pluto" #substitute "world" by "pluto"
a## [1] "Hello Pluto"
a<-c("Hello World", "Goodbye World")
str_sub(a, 1,6)## [1] "Hello " "Goodby"
str_sub(a, -5, -1)## [1] "World" "World"
There is also a function from base that I really like which is gsub. It is quite usefull as well, in case that, for example, you need to supress whitespaces inside your variables and substitute them by “_“.
a<-"I am Angela"
gsub(" ", "_", a)## [1] "I_am_Angela"
Regarding whitespaces, it is worth to mention str_pad() that adds extra whitespaces either on the left, or right, or on both sides; str_trim() will remove leading and trailing whitespaces. This can become quite usefull in our dataset, where in certain classes there is a leading whitespace at the beginning of each level. E.g.:
df<-data.frame(x=c(" FOO"," baR"), y= c("blUe ", "GREy "))
str_trim(df$x)## [1] "FOO" "baR"
Now, let’s apply this function to the rest of the table:
dftrim<-as.data.frame(lapply(df, str_trim))
levels(dftrim$y)## [1] "blUe" "GREy"
More on the lapply() function will be given further on.
To change cases, there are functions such as str_to_upper(), str_to_title() and str_to_lower(). You can also order strings, str_order() or sort them with str_sort().
For example, in our case data, I believe it would be better to lower all cases. Therefore,
dftrim<-as.data.frame(lapply(dftrim, str_to_lower))
dftrim## x y
## 1 foo blue
## 2 bar grey
More information about string patterns can be found here.
Basic operations on tables
One of the most basic table operations you can do is to transpose the table. You can do so by simply using the function t().
dftrimt<-as.data.frame(t(dftrim))Selecting data
Selection of data can be done in a couple of ways. Let’s say you want to select rows 7:9, or columns 1 and 2:
mtcars[7:9,]## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
mtcars[c(7,8,9),]## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
head(mtcars[,1:2])## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
head(mtcars[, c(1,2)])## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
You can also subset your data according to the values of your variables. E.g., let’s say you need to select all cars with “cyl” equal to 6. You have several ways to do it:
mtcars[which(mtcars$cyl == 6), ]## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
subset(mtcars, mtcars$cyl ==6)## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Now, let’s a look at another example. Let’s select cyl=6 and carburator 4 or less. You can also select all cars with 6 cilinders plus all cars with carburator less than 4. Take into consideration that in the first case it is used & and the second |. They mean “AND” and “OR” respectively.
mtcars[which(mtcars$cyl== 6 & mtcars$carb <=4), ]## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
head(mtcars[which(mtcars$cyl== 6 | mtcars$carb <=4), ])## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
subset(mtcars, mtcars$cyl== 6 & mtcars$carb <=4)## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
head(subset(mtcars, mtcars$cyl== 6 | mtcars$carb <=4))## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
With the subset function, you can also select the columns you want on the output.
subset(mtcars, mtcars$cyl== 6 & mtcars$carb <=4, select = c("cyl", "wt"))## cyl wt
## Mazda RX4 6 2.620
## Mazda RX4 Wag 6 2.875
## Hornet 4 Drive 6 3.215
## Valiant 6 3.460
## Merc 280 6 3.440
## Merc 280C 6 3.440
Excluding data
This process is also referred to as dropping data.
head(mtcars[-1,]) # exclude first row## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
head(mtcars[,-1]) #exclude first column## cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1
head(mtcars[ , -c(1:3)]) #exclude columns 1:3## hp drat wt qsec vs am gear carb
## Mazda RX4 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 175 3.15 3.440 17.02 0 0 3 2
## Valiant 105 2.76 3.460 20.22 1 0 3 1
Adding data
If you want to merge data horizontally, by key-value columns, check the package dplyr in the next sections. If you want to merge two datatables vertically, you can use rbind, as long as the tables have the same variables (columns).
x<-data.frame(x=1:5, y=2:6)
y<-data.frame(x=2:10, y=20)
xy=rbind(x,y)Tidyverse
Tidyverse (Wickham 2016) is a group of packages that share the same design philosophy, grammar and data structures. The packages that are part of this universe are dplyr (Wickham and Francois 2016), ggplot2 (Wickham and Chang 2016), readr (Wickham, Hester, and Francois 2016), tibble (Wickham, Francois, and Muller 2016), tidyr (Hadley_a Wickham 2017) and purrr (Henry and Wickham 2017).
The wide-long Dillema
A wide dataframe format is a dataframe that has a column for each variable, such as the table we’ve been using so far. A long format dataframe is a dataframe that has a column for all variables “stacked” into a column and another column has the values.
A wide dataframe looks like:
## SO4 Ca
## 1 0.1 0.5
## 2 0.2 0.6
## 3 0.3 0.7
The same dataframe on the long format looks like:
## variable value
## 1 SO4 0.1
## 2 SO4 0.2
## 3 SO4 0.3
## 4 Ca 0.5
## 5 Ca 0.6
## 6 Ca 0.7
Now this requires a little thinking about your data, and what are the “x’s” of your data and which are the “y’s”. Sometimes you’ll need long format, and other times, you’ll need wide format. It only depends on what you need to do with your data. ggplot2, plyr, and some modelling functions lm(), glm(), need the long format tables. However, you might find it easy to record your data in the wide-format.
The package reshape helps you to convert between these two formats, and all goes around with two main functions: melt() (from wide to long) and cast (from long to wide).
Let’s take into consideration the mtcars dataset:
head(mtcars)## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The “x” for this dataset is each car model, which is the name for each row. For that reason, I will transfer these names as a new column. Now, to transform a dataset from wide to long, you can use the function melt(). It’s always important that you think about the nature of your data. All datasets are different to a certain extent, and that largely depends on the results you’re trying to accomplish. A deep understanding of your data (what are the variables of interest; what are the questions you’re trying to answer what is the relationship between those questions and the variables you have) might be something that you don’t really want to think about, because you’re just eager to get the results. However, “wasting” this time at the beginning of your data analysis, is something that could avoid future bulk mistakes and misconceptions is your analysis. Therefore, whenever you’re melting your data, always use the argument “id.vars”, so you know exaclty what you’re doing with your data.
mtcars$model<-gsub(" ", "_", row.names(mtcars))
longmtcars<-melt(mtcars, id.vars = "model")
head(longmtcars)## model variable value
## 1 Mazda_RX4 mpg 21.0
## 2 Mazda_RX4_Wag mpg 21.0
## 3 Datsun_710 mpg 22.8
## 4 Hornet_4_Drive mpg 21.4
## 5 Hornet_Sportabout mpg 18.7
## 6 Valiant mpg 18.1
tail(longmtcars)## model variable value
## 347 Porsche_914-2 carb 2
## 348 Lotus_Europa carb 2
## 349 Ford_Pantera_L carb 4
## 350 Ferrari_Dino carb 6
## 351 Maserati_Bora carb 8
## 352 Volvo_142E carb 2
The onbject we just created, longmtcars is now a “tidy” dataset according to (Wickham and others 2014). Now, you can convert it back again to wide format. You can use the function dcast() to have a table as an output or you can use acast to have an array or vector as an output. To achieve the conversion to wide format, you’ll need a formula of the type \(x_{1} + x_{2} \sim y_{1} + y_{2}\).
wide<-dcast(longmtcars, model ~ variable)
head(wide)## model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 AMC_Javelin 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
## 2 Cadillac_Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## 3 Camaro_Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## 4 Chrysler_Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## 5 Datsun_710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 6 Dodge_Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
And there you go, your data is back as a wide table.
Examples of other types of tables you can have; note that the one to choose will always be directly related to the type of analysis you’re trying to do. You have to understand which variable is the cause and which is the effect.
head(melt(mtcars, id.vars = "hp"))## hp variable value
## 1 110 mpg 21
## 2 110 mpg 21
## 3 93 mpg 22.8
## 4 110 mpg 21.4
## 5 175 mpg 18.7
## 6 105 mpg 18.1
head(melt(mtcars, id.vars = c("hp", "gear")))## hp gear variable value
## 1 110 4 mpg 21
## 2 110 4 mpg 21
## 3 93 4 mpg 22.8
## 4 110 3 mpg 21.4
## 5 175 3 mpg 18.7
## 6 105 3 mpg 18.1
You can also use dcast() to perform aggregation. Let’s say you want to collect the mean of each variable for each gear. You can do so by:
df<-melt(mtcars, id.vars = c("model", "gear"))
head(df)## model gear variable value
## 1 Mazda_RX4 4 mpg 21.0
## 2 Mazda_RX4_Wag 4 mpg 21.0
## 3 Datsun_710 4 mpg 22.8
## 4 Hornet_4_Drive 3 mpg 21.4
## 5 Hornet_Sportabout 3 mpg 18.7
## 6 Valiant 3 mpg 18.1
dcast(df, gear ~ variable, mean)## gear mpg cyl disp hp drat wt qsec
## 1 3 16.10667 7.466667 326.3000 176.1333 3.132667 3.892600 17.692
## 2 4 24.53333 4.666667 123.0167 89.5000 4.043333 2.616667 18.965
## 3 5 21.38000 6.000000 202.4800 195.6000 3.916000 2.632600 15.640
## vs am carb
## 1 0.2000000 0.0000000 2.666667
## 2 0.8333333 0.6666667 2.333333
## 3 0.2000000 1.0000000 4.400000
dcast(df, gear~variable) #counts occurences## Aggregation function missing: defaulting to length
## gear mpg cyl disp hp drat wt qsec vs am carb
## 1 3 15 15 15 15 15 15 15 15 15 15
## 2 4 12 12 12 12 12 12 12 12 12 12
## 3 5 5 5 5 5 5 5 5 5 5 5
Note that it pops the message that the lenght is used as the aggregation function, because we did not define the function, which means that the output will give you counts.
However, I believe it is easier to do this operation (aggregation) with dplyr.
Now just as a curiosity, take a look on how to produce boxplots with base and with ggplot2 for the mtcars:
require(ggplot2, quietly = TRUE)boxplot(mpg~gear, data=mtcars)
ggplot(mtcars, aes(x=as.factor(gear), y=mpg)) +
geom_boxplot() + xlab("cyl") # one variable against the other
ggplot(longmtcars, aes (x=variable, y=value)) +
geom_boxplot() #all variablestidyr
The tidyr package is a package that allows you to tidy your data easily (Hadley_a Wickham 2017). It might seem very similar to reshape2, but tidyr is designed mainly to tidy data and not to reshape.
This package offers you four main functions:
gather ()separate ()spread ()unite ()
Let’s take an example. You measured \(SO_4^2\) levels, and collected that data along a river in three locations, from 2005 to 2008.
require(tidyr, quietly = TRUE)riversulfate<-data.frame(Location= (1:3),
y2005 = c(0.1, 0.2, 0.1),
y2006= c(0.2, 0.1, 0.1),
y2007= c(0.5, 0.1, 0.2),
y2008 = c(0.2, 0.3, 0.1))
riversulfate## Location y2005 y2006 y2007 y2008
## 1 1 0.1 0.2 0.5 0.2
## 2 2 0.2 0.1 0.1 0.3
## 3 3 0.1 0.1 0.2 0.1
If you take a look at the data, your rows are not a single observation for each location, but instead in each row you have four different observations for each location. To tidy these data, and keep each observation on a row, you use gather (), where the first argument is your table, second the name you want to call the variable that will “group” several columns into the same column (referred to as “key”), the third the name of the measurement itself, and lastly, the indication for columns to be gathered.
gather(riversulfate, year, value, y2005:y2008)## Location year value
## 1 1 y2005 0.1
## 2 2 y2005 0.2
## 3 3 y2005 0.1
## 4 1 y2006 0.2
## 5 2 y2006 0.1
## 6 3 y2006 0.1
## 7 1 y2007 0.5
## 8 2 y2007 0.1
## 9 3 y2007 0.2
## 10 1 y2008 0.2
## 11 2 y2008 0.3
## 12 3 y2008 0.1
or more simplisticaly, with the pipe notation:
riversulfate %>%
gather(year, value, y2005:y2008)## Location year value
## 1 1 y2005 0.1
## 2 2 y2005 0.2
## 3 3 y2005 0.1
## 4 1 y2006 0.2
## 5 2 y2006 0.1
## 6 3 y2006 0.1
## 7 1 y2007 0.5
## 8 2 y2007 0.1
## 9 3 y2007 0.2
## 10 1 y2008 0.2
## 11 2 y2008 0.3
## 12 3 y2008 0.1
Now, let’s take the same example, but let’s consider that for each location you measured both \(Na^+\) and \(SO_4^2\), and merged both variables into “Location”.
set.seed(2)
options(digits = 2)
riversulfate<-data.frame(Location= c("1;Na", "1;SO4", "2;Na",
"2;SO4", "3;Na", "3;SO4"),
y2005 = rnorm(6, 5),
y2006= rnorm(6, 0.2),
y2007= rnorm(6, 0.1),
y2008 = rnorm(6, 0.5))
head(riversulfate)## Location y2005 y2006 y2007 y2008
## 1 1;Na 4.1 0.908 -0.29 1.51
## 2 1;SO4 5.2 -0.040 -0.94 0.93
## 3 2;Na 6.6 2.184 1.88 2.59
## 4 2;SO4 3.9 0.061 -2.21 -0.70
## 5 3;Na 4.9 0.618 0.98 2.09
## 6 3;SO4 5.1 1.182 0.14 2.45
To tidy this table, you first have to collapse the different years into one column using gather(), and then use separate() to separate the “location” column into two columns. As a result, you’ll have for each column a variable and for each row an observation.
tidy<-gather(riversulfate, year, mgL, y2005:y2008)
head(tidy)## Location year mgL
## 1 1;Na y2005 4.1
## 2 1;SO4 y2005 5.2
## 3 2;Na y2005 6.6
## 4 2;SO4 y2005 3.9
## 5 3;Na y2005 4.9
## 6 3;SO4 y2005 5.1
river<-separate(tidy, Location, c("Location", "element"), sep="\\;")
head(river)## Location element year mgL
## 1 1 Na y2005 4.1
## 2 1 SO4 y2005 5.2
## 3 2 Na y2005 6.6
## 4 2 SO4 y2005 3.9
## 5 3 Na y2005 4.9
## 6 3 SO4 y2005 5.1
If you desire to present an element as a column, you can use spread(). The first argument is your data, second the column you want to separate to several columns (the key), and the third name of the column that holds the values for that key.
tidier<-spread(river, element, mgL)
head(tidier)## Location year Na SO4
## 1 1 y2005 4.10 5.185
## 2 1 y2006 0.91 -0.040
## 3 1 y2007 -0.29 -0.940
## 4 1 y2008 1.51 0.932
## 5 2 y2005 6.59 3.870
## 6 2 y2006 2.18 0.061
dplyr
dplyr is “A fast, consistent tool for working with data frame like objects”" (Wickham and Francois 2016).
The main functions from this package are: * filter() - selection based on an attribute value * arrange() - reorder your variables by a variable order (ascending or descending) * select() and rename() - selecting variables by name * mutate() - works as a “field calculator” * summarise() - condenses a variable into one value
Under this context, let’s use again the mtcars dataset. dplyr allows you to filter by attributes. In the following example, I want to extract all cars with gear 5:
require(dplyr)filter(mtcars, gear==5)## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 26 4 120 91 4.4 2.1 17 0 1 5 2 Porsche_914-2
## 2 30 4 95 113 3.8 1.5 17 1 1 5 2 Lotus_Europa
## 3 16 8 351 264 4.2 3.2 14 0 1 5 4 Ford_Pantera_L
## 4 20 6 145 175 3.6 2.8 16 0 1 5 6 Ferrari_Dino
## 5 15 8 301 335 3.5 3.6 15 0 1 5 8 Maserati_Bora
filter(mtcars, gear ==5,
carb!=2 & carb!=3) #gear 5, carb isnt 2 and 3## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 16 8 351 264 4.2 3.2 14 0 1 5 4 Ford_Pantera_L
## 2 20 6 145 175 3.6 2.8 16 0 1 5 6 Ferrari_Dino
## 3 15 8 301 335 3.5 3.6 15 0 1 5 8 Maserati_Bora
filter(mtcars, gear>=5) #gear equal or more than 5## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 26 4 120 91 4.4 2.1 17 0 1 5 2 Porsche_914-2
## 2 30 4 95 113 3.8 1.5 17 1 1 5 2 Lotus_Europa
## 3 16 8 351 264 4.2 3.2 14 0 1 5 4 Ford_Pantera_L
## 4 20 6 145 175 3.6 2.8 16 0 1 5 6 Ferrari_Dino
## 5 15 8 301 335 3.5 3.6 15 0 1 5 8 Maserati_Bora
With dplyr, you can also re-arrange your table, by order of a variable you want:
head(arrange(mtcars, gear))## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 21 6 258 110 3.1 3.2 19 1 0 3 1 Hornet_4_Drive
## 2 19 8 360 175 3.1 3.4 17 0 0 3 2 Hornet_Sportabout
## 3 18 6 225 105 2.8 3.5 20 1 0 3 1 Valiant
## 4 14 8 360 245 3.2 3.6 16 0 0 3 4 Duster_360
## 5 16 8 276 180 3.1 4.1 17 0 0 3 3 Merc_450SE
## 6 17 8 276 180 3.1 3.7 18 0 0 3 3 Merc_450SL
head(arrange(mtcars, vs))## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 21 6 160 110 3.9 2.6 16 0 1 4 4 Mazda_RX4
## 2 21 6 160 110 3.9 2.9 17 0 1 4 4 Mazda_RX4_Wag
## 3 19 8 360 175 3.1 3.4 17 0 0 3 2 Hornet_Sportabout
## 4 14 8 360 245 3.2 3.6 16 0 0 3 4 Duster_360
## 5 16 8 276 180 3.1 4.1 17 0 0 3 3 Merc_450SE
## 6 17 8 276 180 3.1 3.7 18 0 0 3 3 Merc_450SL
head(arrange(mtcars, desc(wt))) #descending order## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 10 8 460 215 3.0 5.4 18 0 0 3 4 Lincoln_Continental
## 2 15 8 440 230 3.2 5.3 17 0 0 3 4 Chrysler_Imperial
## 3 10 8 472 205 2.9 5.2 18 0 0 3 4 Cadillac_Fleetwood
## 4 16 8 276 180 3.1 4.1 17 0 0 3 3 Merc_450SE
## 5 19 8 400 175 3.1 3.8 17 0 0 3 2 Pontiac_Firebird
## 6 13 8 350 245 3.7 3.8 15 0 0 3 4 Camaro_Z28
You can also select only variables of your interest, e.g.
head(select(mtcars, model, cyl))## model cyl
## Mazda RX4 Mazda_RX4 6
## Mazda RX4 Wag Mazda_RX4_Wag 6
## Datsun 710 Datsun_710 4
## Hornet 4 Drive Hornet_4_Drive 6
## Hornet Sportabout Hornet_Sportabout 8
## Valiant Valiant 6
head(select(mtcars, -cyl, -wt, -model))## mpg disp hp drat qsec vs am gear carb
## Mazda RX4 21 160 110 3.9 16 0 1 4 4
## Mazda RX4 Wag 21 160 110 3.9 17 0 1 4 4
## Datsun 710 23 108 93 3.8 19 1 1 4 1
## Hornet 4 Drive 21 258 110 3.1 19 1 0 3 1
## Hornet Sportabout 19 360 175 3.1 17 0 0 3 2
## Valiant 18 225 105 2.8 20 1 0 3 1
head(select(mtcars, -(disp:carb)))## mpg cyl model
## Mazda RX4 21 6 Mazda_RX4
## Mazda RX4 Wag 21 6 Mazda_RX4_Wag
## Datsun 710 23 4 Datsun_710
## Hornet 4 Drive 21 6 Hornet_4_Drive
## Hornet Sportabout 19 8 Hornet_Sportabout
## Valiant 18 6 Valiant
There is other syntaxes you can use (check ?select_helpers ).
If you want to calculate a new field inside your table, you can use mutate or transmute. Transmute will drop variables (unless you use them as arguments), and mutate will keep them. I will use the Titanic dataset from now on.
Tita<-as.data.frame(Titanic)
head(Tita)## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
Tita<-mutate(Tita, perc=Freq/sum(Freq)*100)
head(Tita)## Class Sex Age Survived Freq perc
## 1 1st Male Child No 0 0.0
## 2 2nd Male Child No 0 0.0
## 3 3rd Male Child No 35 1.6
## 4 Crew Male Child No 0 0.0
## 5 1st Female Child No 0 0.0
## 6 2nd Female Child No 0 0.0
head(transmute(Tita, Freq, perc=Freq/sum(Freq)*100))## Freq perc
## 1 0 0.0
## 2 0 0.0
## 3 35 1.6
## 4 0 0.0
## 5 0 0.0
## 6 0 0.0
The function summarise() helps you collapsing your variables into a single record, according to a given function - could be sum, or mean, or whatever you feel like.
summarise(Tita, Total=sum(Freq))#How many people were sampled?## Total
## 1 2201
A very usefull operation is group_by. With it, you can perform your operations and group your data.
people<-group_by(Tita, Class)
summarise(people, sum(Freq))#How many people were sampled by class?## # A tibble: 4 × 2
## Class `sum(Freq)`
## <fctr> <dbl>
## 1 1st 325
## 2 2nd 285
## 3 3rd 706
## 4 Crew 885
You can also use pipe notation (from magrittr (Bache and Wickham 2014)), and it simplifies your code:
Tita %>%
group_by(Survived) %>%
summarise(sum(Freq)) #How many survived and how many died?## # A tibble: 2 × 2
## Survived `sum(Freq)`
## <fctr> <dbl>
## 1 No 1490
## 2 Yes 711
Tita %>%
group_by(Sex, Survived)%>%
summarise(sum(Freq)) #How many women and men died?## Source: local data frame [4 x 3]
## Groups: Sex [?]
##
## Sex Survived `sum(Freq)`
## <fctr> <fctr> <dbl>
## 1 Male No 1364
## 2 Male Yes 367
## 3 Female No 126
## 4 Female Yes 344
Tita %>%
group_by(Age, Survived) %>%
summarise(sum(Freq))#How many children (or adults) died?## Source: local data frame [4 x 3]
## Groups: Age [?]
##
## Age Survived `sum(Freq)`
## <fctr> <fctr> <dbl>
## 1 Child No 52
## 2 Child Yes 57
## 3 Adult No 1438
## 4 Adult Yes 654
dplyr::joins
Another very usefull feature of dplyr is the joins between tables. They work more or less like the SQL joins. After all, dplyr is a front end language that can be converted to SQL or spark. In order to perform a join between two tables, a key is needed to base that join on. In other words, a common variable between both tables is needed, on which the join will be based on.
So let’s make two dummy tables, with a common key (Name).
A=data.frame(Name = c("Maria", "Joana", "Paulo", "Abu", "Jacinta"),
age = c("cret", "jur", "mioc", "ord", "neog"),
refs = c(1:5))
B=data.frame(Name = c("Abu", "Joana", "Maria", "Gonzalo"),
comp = c(1:4),
sent= c(5:8))left_join(A, B)## Name age refs comp sent
## 1 Maria cret 1 3 7
## 2 Joana jur 2 2 6
## 3 Paulo mioc 3 NA NA
## 4 Abu ord 4 1 5
## 5 Jacinta neog 5 NA NA
right_join(A, B)## Name age refs comp sent
## 1 Abu ord 4 1 5
## 2 Joana jur 2 2 6
## 3 Maria cret 1 3 7
## 4 Gonzalo <NA> NA 4 8
inner_join(A, B)## Name age refs comp sent
## 1 Maria cret 1 3 7
## 2 Joana jur 2 2 6
## 3 Abu ord 4 1 5
semi_join(A, B)## Name age refs
## 1 Abu ord 4
## 2 Joana jur 2
## 3 Maria cret 1
semi_join(B, A)## Name comp sent
## 1 Maria 3 7
## 2 Joana 2 6
## 3 Abu 1 5
anti_join(A, B)## Name age refs
## 1 Paulo mioc 3
## 2 Jacinta neog 5
anti_join(B, A)## Name comp sent
## 1 Gonzalo 4 8
full_join(A, B)## Name age refs comp sent
## 1 Maria cret 1 3 7
## 2 Joana jur 2 2 6
## 3 Paulo mioc 3 NA NA
## 4 Abu ord 4 1 5
## 5 Jacinta neog 5 NA NA
## 6 Gonzalo <NA> NA 4 8
full_join(B, A)## Name comp sent age refs
## 1 Abu 1 5 ord 4
## 2 Joana 2 6 jur 2
## 3 Maria 3 7 cret 1
## 4 Gonzalo 4 8 <NA> NA
## 5 Paulo NA NA mioc 3
## 6 Jacinta NA NA neog 5
The figure bellow is a resume on how the joins in dplyr work. Note that the difference of inner_join() and semi_join() is that the inner preserves all columns from both tables, whereas semi will only keep the “A” columns. In addition, if you’re looking for a join that will keep unmatching results from both tables, you have to use anti_join() twice. For that, you can use the code:
full_join(anti_join(A, B), anti_join(B, A))## Name age refs comp sent
## 1 Paulo mioc 3 NA NA
## 2 Jacinta neog 5 NA NA
## 3 Gonzalo <NA> NA 4 8
knitr::include_graphics("figure/12.jpg")Apply family
applywill apply a function to a matrixlapply“l” stands for list. It returns a list of the same input as X, where each element is the result of applying a function to the corresponding element of Xsapply“s” stands for simple. It is the same aslapply, but it returns a vectortapply“t” stands for table. This function will agregate your data into groups and apply the function over those groups.mapplym stands for multivariate. It is the same aslapplybut instead of looping through a vector/list, it loops through each item of multiple lists, or vectors.
apply has three main arguments. First is the array (or matrix), second, the margin parameter which will determine if you want to perform the operation over a row or a column, and third indicates the function.
df<-matrix(1:20, 5)
df## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
apply(df, 2, sum)## [1] 15 40 65 90
apply(df, 1, sum)## [1] 34 38 42 46 50
Note that for the apply function you input an arrow, and what you get is a vector, or a matrix (if your margin argument is c(1,2)).
lapply
data <- list(foo=c(5:10), bar=c(1:50), baz=c(2000:2018))
lapply(data, FUN = sum)## $foo
## [1] 45
##
## $bar
## [1] 1275
##
## $baz
## [1] 38171
lapply(data, FUN = mean)## $foo
## [1] 7.5
##
## $bar
## [1] 26
##
## $baz
## [1] 2009
Note that for the lapply function you input a list, and you get as an output another list.
b <- lapply(data, FUN = sum)
class(b)## [1] "list"
sapply
As stated before, sapply works exactly as lapply, but will return a vector.
a <- sapply(data, FUN = sum)
a## foo bar baz
## 45 1275 38171
class(a)## [1] "integer"
tapply To understand how this function works, let’s consider a data frame where I have a couple measurements on a plant characteristic - “foo” and “bar” and I measured those in 5 different ecosystems - A, B, C, D, and E. Now let’s say I want the mean over foo and bar for each ecosystem:
data<- data.frame("foo"=(1:100), "bar"=(201:400),
"ecosystem"=rep(paste(letters[1:5]), 40))
tapply(data$foo, data$ecosystem, mean)## a b c d e
## 48 50 50 52 52
tapply(data$bar, data$ecosystem, mean)## a b c d e
## 298 300 300 302 302
The first argument corresponds to the object to which you want to apply the function, second how should the data be grouped by, and third is the function to apply.
mapply As stated before, this function loops through each item of multiple lists, or vectors. Therefore, the objects should be the same lenght!!!
a<-list("foo"=c(0:10), "bar"= c(10:20))
mapply(sum, a$foo, a$bar)## [1] 10 12 14 16 18 20 22 24 26 28 30
Note that first argument is the function, and the others are whatever objects you want.
mapply(sum, a$foo, c(50:60), a$foo, a$bar)## [1] 60 64 68 72 76 80 84 88 92 96 100
mapply(mean, a$foo, a$bar)## [1] 0 1 2 3 4 5 6 7 8 9 10
References
Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
Henry, Lionel, and Hadley Wickham. 2017. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Wickham, Hadley. 2016. Tidyverse: Easily Install and Load ’Tidyverse’ Packages. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, and Winston Chang. 2016. Ggplot2: An Implementation of the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). Foundation for Open Access Statistics: 1–23.
Wickham, Hadley, Romain Francois, and Kirill Muller. 2016. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.
Wickham, Hadley_a. 2017. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley_b. 2017. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.