Data Modification and Analysis

This tutorial desribes some helpful ways to modify data from its original format. We’ll look at what do if you have missing data or need to add empty placeholders, how to subset data, merge multiple datasets, use for loops and logicals, and how to use functions. For this tutorial we’re going to use the state.x77 base dataset in R.

Making Modifications to a dataframe

Let’s start by looking at our state.x77 dataset. If we use the str() command to get the structure of this data, we can see that it is currently a matrix with 50 rows and 8 columns. Let’s turn it into a dataframe. Dataframes are easier to work with in R, and allow for easy variable modification. We’ll call this dataframe “states”.

View(state.x77)
str(state.x77)
##  num [1:50, 1:8] 3615 365 2212 2110 21198 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
states <- data.frame(state.x77)
str(states)
## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...

You can now see how the states dataframe is orgnized. We see that each column of data is in the numeric format, which matches the structure of data. Sometimes, data will be in the incorrect format, which you must change before performing any data analysis (otherwise you can get some strange errors).

Lets add a new column of data that has state names. We can do this two different ways: 1) by creating the column on the fly, which is much more efficient, or 2) by manually re-entering existing columns in states dataframe, and adding a new column. We will use the row names from the states dataframe because these are actually the state names.

rownames(states)
# Process 1: use $ to create a new column: 
states$Name <- rownames(states)

#Process 2: manually re-make the dataframe: 
states2 <- data.frame(states$Population, states$Income, states$Illiteracy, states$Life.Exp, states$Murder, states$HS.Grad, states$Frost, states$Area, rownames(states))

We can see that in this situation, it is much more efficient to use process 1. However, sometimes you may need only a few columns from an existing data frame, in which case process 2 would be more helpful. If you look at the column names in both states and states2, you see that the states columns retained their original names, whereas states2 columns were renamed.

Now let’s clean up our row and column names. This process is essentially the same for both rows and columns, and there are a few different ways to do it. First, let’s use nrows to rename row number:

rownames(states) <- 1:nrow(states) #counts 1-50, and assigns row name as the row number

Next, we’ll rename columns in states2. This will be more a more descriptive approach than simply numbering them 1-9:

colnames(states2) <- c('Pop.', 'Income', 'Illit', 'Lif.Exp', 'Murder', 'Grad', 'Frost', 'Area', 'Name')
# Note: if you put a space between column names, an extra `` will be required around the name when you reference it later in the code.

Creating Subsets of Data

Now that we know how to set dataframes up, let’s learn how to subset our data.

Lets look at our states dataframe. Say we are interested in seeing which states have a lower life expectancy than others. Let’s set our condition to select any states with a life expectancy below 70 years. We’ll create a dataframe called low_exp to store this data in. To create a subset, the general syntax is subset(x, condition), where x is your dataset and condition is what you wish to find. Let’s give it a try:

low_exp <- subset(states, states$Life.Exp < 70)
View(low_exp)

Dealing With Missing Data and Using Merge

What do we do when we have missing data? We’re going to look at USGS gage 02068500, which is missing daily flow data for a portion of its record. From 1987-10-12 to 1991-11-30, no data exists. Say we need to merge this gage data with another gage’s data (lets pick a random gage – 03171000) into a single frame for dataframe comparison. This is an issue because we have one gage with missing data and one with complete data. Let’s go ahead and pull our data in.

library(dataRetrieval) #package pulls data from NWIS
library(lfstat) #package includes baseflow calculator tool (createlfobj)

gage_id1 <- '02068500' #gage 1
gage_id2 <- '03171000' #gage 2
startDate <- '1985-10-01'
endDate <- '2005-09-30'
pCode <- '00060' #00060 is discharge in cfs 
statCd <- '00003' #00003 is daily mean data
USGS_1 <- readNWISdv(gage_id1, pCode, startDate, endDate, statCd)
USGS_2 <- readNWISdv(gage_id2, pCode, startDate, endDate, statCd)

colnames(USGS_1) <- c('Agency', 'Site No', 'Date', 'Flow', 'Code')
colnames(USGS_2) <- c('Agency', 'Site No', 'Date', 'Flow', 'Code')

nrow(USGS_1); nrow(USGS_2)
## [1] 5795
## [1] 7305

We need to figure out how to merge these two datasets in a way that will not cause data loss. So how do we do it? We’re going to use a command called merge(). Notation is important here: We would like to merge these dataframes based on date, and if no data exists for an entry, we would like to fill that position with NA.

In the example below, see how we specify the column named “Date”, which exists in both USGS_1 and USGS_2. Then we have all.y = TRUE, which tells R that for any non-matching date rows, we want to add the NA value to that row. For more information on merge, see ??merge

USGS_compare <- merge(USGS_1, USGS_2, by="Date", all.y = TRUE)

Now you can see that we have one dataframe that combined USGS_1 and USGS_2 data based on date.

Now we need to learn how to handle data in the reverse situation – what if we have NA values that are affecting data calcuations? We know that USGS_1 flow has NA values, because we just put them there in the line above. What happens if we try to calculate mean for this column?

mean1 <- mean(USGS_compare$Flow.x)
mean1
## [1] NA

Mean can’t be NA though – we know we have values in that column. In these situations, there are two things that can be done depending on what you need from your data.

Suppose you don’t need the data to remain in the structure that it is currently in. We can perform an action to remove all rows that may contain NA values from any of the columns (this isn’t a good idea for streamgage data– you’ll see why in a minute):

USGS_noNA <- na.omit(USGS_compare)
nrow(USGS_noNA)
## [1] 5795

What happened here?? We removed entire rows of data for the period in which USGS_1 was missing data. Now both USGS_1 and USGS_2 are missing data! You just removed over 1400 days of data for these gages – think that might have made a difference?

mean(USGS_compare$Flow.y)
## [1] 3894.252
mean(USGS_noNA$Flow.y)
## [1] 3918.437

With a larger dataset such as this, the difference may not look appreciable, but when working with smaller sets, this could cause detrimental results.

## For Loops, If Structures, and Logicals For loops can be extremely helpful when you need to perform one action repeatedly for an entire group of data. Let’s look at a simple example where we’ll again use the state.x77 data. Let’s use population and per capita income to calculate income by state.

To set up a for loop, you must specify a counter, how long you want the loop to repeat, and what you want to happen. Simple structure: for (counter in start:end){}, where your actions are performed within {}. Let’s use a for loop to calculate state income.

i <- 1
states <- data.frame(state.x77)
colnames(states) <- c('Pop.', 'Income', 'Illit', 'Lif.Exp', 'Murder', 'Grad', 'Frost', 'Area')
rownames(states) <- 1:nrow(states)

for (i in 1:nrow(states)){
  states$state.income[i] <- states$Pop.[i] * states$Income[i]
  i <- i + 1
}
states[1:3,] #look at the first three rows of data to ensure operation worked

##   Pop. Income Illit Lif.Exp Murder Grad Frost   Area state.income
## 1 3615   3624   2.1   69.05   15.1 41.3    20  50708     13100760
## 2  365   6315   1.5   69.31   11.3 66.7   152 566432      2304975
## 3 2212   4530   1.8   70.55    7.8 58.1    15 113417     10020360

If statements basically have the same structure. Let’s add another column to the states dataframe, stating whether a state is wealthy or not based on per capita income. Note, if statements don’t HAVE to be nested in a for loop – this was just a quick example displaying some of the ways you can nest these structures.

cutoff <- round(mean(states$Income), digits=0) #round mean to the nearest whole #
j <- 1 
for (j in 1:nrow(states)){
  
if (states$Income[j] < cutoff){
  states$wealthpercap[j] <- "Poor"
  } else
    states$wealthpercap[j] <- "Rich"
  
  j <- j + 1
}

states[1:3,]

##   Pop. Income Illit Lif.Exp Murder Grad Frost   Area state.income
## 1 3615   3624   2.1   69.05   15.1 41.3    20  50708     13100760
## 2  365   6315   1.5   69.31   11.3 66.7   152 566432      2304975
## 3 2212   4530   1.8   70.55    7.8 58.1    15 113417     10020360
##   wealthpercap
## 1         Poor
## 2         Rich
## 3         Rich

Now we’re going to look into some of the ways we can use logicals (TRUE and FALSE). We’re going to use test as an example dataframe in which we have multiple types of data.

test <- data.frame(c(1, 2), c(TRUE, FALSE), c(0,FALSE), c(3,4))
test
##   c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## 1       1           TRUE           0       3
## 2       2          FALSE           0       4

# Are any of the test entries true? 
test==TRUE
##      c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## [1,]    TRUE           TRUE       FALSE   FALSE
## [2,]   FALSE          FALSE       FALSE   FALSE

# Are any of the test entries false?
test==FALSE
##      c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## [1,]   FALSE          FALSE        TRUE   FALSE
## [2,]   FALSE           TRUE        TRUE   FALSE

# Which entries are 3 -- returns TRUE for entries = 3
test==3
##      c.1..2. c.TRUE..FALSE. c.0..FALSE. c.3..4.
## [1,]   FALSE          FALSE       FALSE    TRUE
## [2,]   FALSE          FALSE       FALSE   FALSE

You should take special note that 1 and TRUE are synonymous, as are 0 and FALSE.This means that if you are comparing data with 1’s and 0’s, you need to be aware of how R may evaluate these values. Let’s look at the example below:

test <- data.frame(c(1, 2), c(TRUE, FALSE), c(0,NA), c(3,4))

test[]==TRUE #this will return true even though there is the number 1 in position [1,1]
##      c.1..2. c.TRUE..FALSE. c.0..NA. c.3..4.
## [1,]    TRUE           TRUE    FALSE   FALSE
## [2,]   FALSE          FALSE       NA   FALSE

Functions

The last thing this tutorial will briefly cover is functions. Functions can be useful when you need to repeat the same calculation multiple times, or need to store them for later access. Function formats are as follows:
functionname <- function(inputs){
operations
return(outputs)
}
Let’s create a simple function called product that will require two inputs to be multiplied together, and will return the product of the inputs.

product <- function(x,y){
  result <- x * y
  return(result)
}

a <- 7
b <- 4

product(a,b)

## [1] 28

Data Modification and Analysis

HARP

August 16, 2018

Making Modifications to a dataframe

Creating Subsets of Data

Dealing With Missing Data and Using Merge

Functions