This tutorial desribes some helpful ways to modify data from its original format. We’ll look at what do if you have missing data or need to add empty placeholders, how to subset data, merge multiple datasets, use for loops and logicals, and how to use functions. For this tutorial we’re going to use the state.x77 base dataset in R.
Let’s start by looking at our state.x77 dataset. If we use the str() command to get the structure of this data, we can see that it is currently a matrix with 50 rows and 8 columns. Let’s turn it into a dataframe. Dataframes are easier to work with in R, and allow for easy variable modification. We’ll call this dataframe “states”.
View(state.x77)
str(state.x77)
## num [1:50, 1:8] 3615 365 2212 2110 21198 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
states <- data.frame(state.x77)
str(states)
## 'data.frame': 50 obs. of 8 variables:
## $ Population: num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...
You can now see how the states dataframe is orgnized. We see that each column of data is in the numeric format, which matches the structure of data. Sometimes, data will be in the incorrect format, which you must change before performing any data analysis (otherwise you can get some strange errors).
Lets add a new column of data that has state names. We can do this two different ways: 1) by creating the column on the fly, which is much more efficient, or 2) by manually re-entering existing columns in states dataframe, and adding a new column. We will use the row names from the states dataframe because these are actually the state names.
rownames(states)
# Process 1: use $ to create a new column:
states$Name <- rownames(states)
#Process 2: manually re-make the dataframe:
states2 <- data.frame(states$Population, states$Income, states$Illiteracy, states$Life.Exp, states$Murder, states$HS.Grad, states$Frost, states$Area, rownames(states))
We can see that in this situation, it is much more efficient to use process 1. However, sometimes you may need only a few columns from an existing data frame, in which case process 2 would be more helpful. If you look at the column names in both states and states2, you see that the states columns retained their original names, whereas states2 columns were renamed.
Now let’s clean up our row and column names. This process is essentially the same for both rows and columns, and there are a few different ways to do it. First, let’s use nrows to rename row number:
rownames(states) <- 1:nrow(states) #counts 1-50, and assigns row name as the row number
Next, we’ll rename columns in states2. This will be more a more descriptive approach than simply numbering them 1-9:
colnames(states2) <- c('Pop.', 'Income', 'Illit', 'Lif.Exp', 'Murder', 'Grad', 'Frost', 'Area', 'Name')
# Note: if you put a space between column names, an extra `` will be required around the name when you reference it later in the code.
Now that we know how to set dataframes up, let’s learn how to subset our data.
Lets look at our states dataframe. Say we are interested in seeing which states have a lower life expectancy than others. Let’s set our condition to select any states with a life expectancy below 70 years. We’ll create a dataframe called low_exp to store this data in. To create a subset, the general syntax is subset(x, condition), where x is your dataset and condition is what you wish to find. Let’s give it a try:
low_exp <- subset(states, states$Life.Exp < 70)
View(low_exp)
What do we do when we have missing data? We’re going to look at USGS gage 02068500, which is missing daily flow data for a portion of its record. From 1987-10-12 to 1991-11-30, no data exists. Say we need to merge this gage data with another gage’s data (lets pick a random gage – 03171000) into a single frame for dataframe comparison. This is an issue because we have one gage with missing data and one with complete data. Let’s go ahead and pull our data in.
library(dataRetrieval) #package pulls data from NWIS
library(lfstat) #package includes baseflow calculator tool (createlfobj)
gage_id1 <- '02068500' #gage 1
gage_id2 <- '03171000' #gage 2
startDate <- '1985-10-01'
endDate <- '2005-09-30'
pCode <- '00060' #00060 is discharge in cfs
statCd <- '00003' #00003 is daily mean data
USGS_1 <- readNWISdv(gage_id1, pCode, startDate, endDate, statCd)
USGS_2 <- readNWISdv(gage_id2, pCode, startDate, endDate, statCd)
colnames(USGS_1) <- c('Agency', 'Site No', 'Date', 'Flow', 'Code')
colnames(USGS_2) <- c('Agency', 'Site No', 'Date', 'Flow', 'Code')
nrow(USGS_1); nrow(USGS_2)
## [1] 5795
## [1] 7305
We need to figure out how to merge these two datasets in a way that will not cause data loss. So how do we do it? We’re going to use a command called merge(). Notation is important here: We would like to merge these dataframes based on date, and if no data exists for an entry, we would like to fill that position with NA.
In the example below, see how we specify the column named “Date”, which exists in both USGS_1 and USGS_2. Then we have all.y = TRUE, which tells R that for any non-matching date rows, we want to add the NA value to that row. For more information on merge, see ??merge
USGS_compare <- merge(USGS_1, USGS_2, by="Date", all.y = TRUE)
Now you can see that we have one dataframe that combined USGS_1 and USGS_2 data based on date.
Now we need to learn how to handle data in the reverse situation – what if we have NA values that are affecting data calcuations? We know that USGS_1 flow has NA values, because we just put them there in the line above. What happens if we try to calculate mean for this column?
mean1 <- mean(USGS_compare$Flow.x)
mean1
## [1] NA
Mean can’t be NA though – we know we have values in that column. In these situations, there are two things that can be done depending on what you need from your data.
USGS_noNA <- na.omit(USGS_compare)
nrow(USGS_noNA)
## [1] 5795
What happened here?? We removed entire rows of data for the period in which USGS_1 was missing data. Now both USGS_1 and USGS_2 are missing data! You just removed over 1400 days of data for these gages – think that might have made a difference?
mean(USGS_compare$Flow.y)
## [1] 3894.252
mean(USGS_noNA$Flow.y)
## [1] 3918.437
With a larger dataset such as this, the difference may not look appreciable, but when working with smaller sets, this could cause detrimental results.
## For Loops, If Structures, and Logicals For loops can be extremely helpful when you need to perform one action repeatedly for an entire group of data. Let’s look at a simple example where we’ll again use the state.x77 data. Let’s use population and per capita income to calculate income by state.
To set up a for loop, you must specify a counter, how long you want the loop to repeat, and what you want to happen. Simple structure: for (counter in start:end){}, where your actions are performed within {}. Let’s use a for loop to calculate state income.
i <- 1
states <- data.frame(state.x77)
colnames(states) <- c('Pop.', 'Income', 'Illit', 'Lif.Exp', 'Murder', 'Grad', 'Frost', 'Area')
rownames(states) <- 1:nrow(states)
for (i in 1:nrow(states)){
states$state.income[i] <- states$Pop.[i] * states$Income[i]
i <- i + 1
}
states[1:3,] #look at the first three rows of data to ensure operation worked
## Pop. Income Illit Lif.Exp Murder Grad Frost Area state.income
## 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 13100760
## 2 365 6315 1.5 69.31 11.3 66.7 152 566432 2304975
## 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 10020360
If statements basically have the same structure. Let’s add another column to the states dataframe, stating whether a state is wealthy or not based on per capita income. Note, if statements don’t HAVE to be nested in a for loop – this was just a quick example displaying some of the ways you can nest these structures.
cutoff <- round(mean(states$Income), digits=0) #round mean to the nearest whole #
j <- 1
for (j in 1:nrow(states)){
if (states$Income[j] < cutoff){
states$wealthpercap[j] <- "Poor"
} else
states$wealthpercap[j] <- "Rich"
j <- j + 1
}
states[1:3,]
## Pop. Income Illit Lif.Exp Murder Grad Frost Area state.income
## 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 13100760
## 2 365 6315 1.5 69.31 11.3 66.7 152 566432 2304975
## 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 10020360
## wealthpercap
## 1 Poor
## 2 Rich
## 3 Rich
Now we’re going to look into some of the ways we can use logicals (TRUE and FALSE). We’re going to use test as an example dataframe in which we have multiple types of data.
test <- data.frame(c(1, 2), c(TRUE, FALSE), c(0,FALSE), c(3,4))
test
## c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## 1 1 TRUE 0 3
## 2 2 FALSE 0 4
# Are any of the test entries true?
test==TRUE
## c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## [1,] TRUE TRUE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
# Are any of the test entries false?
test==FALSE
## c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## [1,] FALSE FALSE TRUE FALSE
## [2,] FALSE TRUE TRUE FALSE
# Which entries are 3 -- returns TRUE for entries = 3
test==3
## c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## [1,] FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE
You should take special note that 1 and TRUE are synonymous, as are 0 and FALSE.This means that if you are comparing data with 1’s and 0’s, you need to be aware of how R may evaluate these values. Let’s look at the example below:
test <- data.frame(c(1, 2), c(TRUE, FALSE), c(0,NA), c(3,4))
test[]==TRUE #this will return true even though there is the number 1 in position [1,1]
## c.1..2. c.TRUE..FALSE. c.0..NA. c.3..4.
## [1,] TRUE TRUE FALSE FALSE
## [2,] FALSE FALSE NA FALSE
The last thing this tutorial will briefly cover is functions. Functions can be useful when you need to repeat the same calculation multiple times, or need to store them for later access. Function formats are as follows:
functionname <- function(inputs){
operations
return(outputs)
}
Let’s create a simple function called product that will require two inputs to be multiplied together, and will return the product of the inputs.
product <- function(x,y){
result <- x * y
return(result)
}
a <- 7
b <- 4
product(a,b)
## [1] 28