PDM Week 4 Supplement

Author

Collin Paschall

Is this useful? Yes, we promise.

Here’s the good news: we are really close to being able to do useful stuff with R.

Here’s the bad news: we have another week of what seem like pretty esoteric programming concepts.

The seven lectures this week cover two distinct topics: operators and loops. These tools let you perform tests on data and then repeat the same process automatically, subject to conditions you specify.

You might be asking, why would we ever want to do these things? What relevance does this have to data analysis? The answer is that these processes form the foundations of data wrangling. Once you can create data within a program (as we have in previous lessons) and you can perform basic mathematical operations, you are ready to start reshaping and reorganizing data and getting useful insights from it.

Operators

Consider the following. Let’s say you work in the advancement/fundraising division of a university. You want to identify which alumni are likely to give the most money.

You’ve got some information about your alumni, including their income levels, their GPAs, and whether or not they participated in some kind of intervention during their undergraduate years designed to enhance their academic performance.

In the code block that I have hidden here, I create some fake data along these lines. You don’t need to worry about the specifics of how I generated these data, but take a look if you want.

 # some packages you need
library(tidyverse)
library(here)

st_data <- tibble(
  student_id = 1:1000,                      # one row per student
  sex = sample(c(0,1),1000,replace=TRUE),   # a 0/1 indicator
  income = rnorm(1000,100000,25000),        # income in dollars
  grit = rnorm(1000,50,10)                  # a "grit" score
  )
  
st_data$income <- st_data$income - min(st_data$income) + 15000 # shift incomes so the minimum is 15000
st_data$intervention <- rep(0,1000)
st_data$intervention[sample(st_data$student_id,250,replace=FALSE)] <- 1 # 250 randomly chosen students got the intervention

st_data$GPA <- 2+0.000008*st_data$income + 0.008*st_data$grit + .5*st_data$intervention + rnorm(1000,0,1) # GPA depends on income, grit, and the intervention, plus noise

st_data$giving <- 2+0.0008*st_data$income + 0.008*st_data$grit + 100*st_data$intervention + 100*st_data$sex + rnorm(1000,0,50) # so does annual giving

The important thing is that we have some data that look like this:

st_data
# A tibble: 1,000 × 7
   student_id   sex  income  grit intervention   GPA giving
        <int> <dbl>   <dbl> <dbl>        <dbl> <dbl>  <dbl>
 1          1     1 127583.  54.5            0  2.55 243.  
 2          2     0 108146.  64.2            1  4.50 185.  
 3          3     0 118898.  48.9            0  3.79  -1.56
 4          4     0  30227.  65.9            0  3.32  66.3 
 5          5     1 106249.  46.5            0  2.31 197.  
 6          6     1  90223.  65.1            0  3.86 200.  
 7          7     1  76074.  62.6            0  3.55 194.  
 8          8     0  71846.  59.3            1  4.10 152.  
 9          9     0 108074.  38.7            0  2.27 144.  
10         10     0 105747.  38.0            0  3.77  20.5 
# … with 990 more rows

Now, let’s imagine that we were given these data and we were asked by our advancement director, “I want to know how much money we are getting from alumni who graduated with more than a 3.0, are middle income ($50k to $100k), and participated or did not participate in this special intervention program.”

To subset the data, we can chain together relational operators (like > and ==) and logical operators (like &).
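Before applying them to the full dataset, here is a minimal sketch of what these operators return on their own. The vectors below are made up purely for illustration:

gpa <- c(2.5, 3.4, 3.8, 2.9)
income <- c(40000, 60000, 85000, 120000)

gpa > 3                           # a relational test returns a logical vector: FALSE TRUE TRUE FALSE
income > 50000 & income < 100000  # & combines two tests element by element
gpa[gpa > 3 & income > 50000]     # a logical vector inside [ ] keeps only the elements where the test is TRUE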

summary(st_data[st_data$GPA>3 & st_data$income>50000 & st_data$income <100000 & st_data$intervention == 1,]$giving)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  70.09  165.17  202.63  213.27  270.42  364.19 
summary(st_data[st_data$GPA>3 & st_data$income>50000 & st_data$income <100000 & st_data$intervention == 0,]$giving)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -84.61   57.89  108.76  108.34  159.49  258.16 

What do you infer from these contrasting summaries? Holy smokes, that intervention program has a massive effect on annual giving! Let’s target these people, and let’s get this intervention rolled out university wide, ASAP!
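As an aside, since the tidyverse is already loaded, the same subsets can be written with dplyr verbs. This is just an alternative spelling of the bracket notation above, not a different analysis:

st_data %>%
  filter(GPA > 3, income > 50000, income < 100000, intervention == 1) %>%
  pull(giving) %>%   # pull() extracts the giving column as a plain vector
  summary()

st_data %>%
  filter(GPA > 3, income > 50000, income < 100000, intervention == 0) %>%
  pull(giving) %>%
  summary()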

An example of loops in practice

In R, you will use for loops for data wrangling less often than in many other languages, because R relies on vectorization: operations that work on whole vectors at once, which are usually faster and cleaner than an explicit loop.

But the concept of a for loop is still a good thing to learn about. Basically, for loops let you do very repetitive tasks much more quickly.
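To make the idea concrete before the file example, here is a minimal sketch of a for loop, together with the vectorized one-liner that does the same job:

total <- 0
for (i in 1:10) {
  total <- total + i  # the body runs once for each value of i, so total ends up as 55
}
total

sum(1:10)  # the vectorized equivalent: one call, same answer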

I’m going to provide a toy example here that is a little contrived but is based on something I actually encountered once.

Because of a very goofy data processing decision by a government agency, I had a situation where I had something like 10,000 large spreadsheets (like, hundreds of thousands of rows each), and I needed to determine whether each spreadsheet included a certain kind of data. If the answer to that question was “yes”, I needed to extract the mean of a single column of that spreadsheet, and I wanted to put those means together into a vector.

Because I was dealing with 10,000 individual spreadsheets that were all very large, I couldn’t do this by hand, and I didn’t know how to load this kind of data into a single database that would let me manage that much data. I knew there were ways to do that, but my database skills were poor and I would have had to learn a whole new technology.

The solution I adopted was to write a for loop. Perhaps not the most elegant choice, but it worked.

I wrote a loop that iterated over all of the spreadsheet files in a directory on my computer, each time testing whether it was a spreadsheet from which I wanted to extract the mean of one column. I then put all those means in a vector so I could use them in further calculations/reporting.

So, to illustrate, first I am going to use a for loop to create 100 .csv files, some of which will be spreadsheets that I want to collect data from, based on the values in test_col. If the mean of that column is greater than 1000, I want to extract the mean of data_col so I can assemble that vector for future analysis.

Note: this code is going to create a temporary directory within your working directory with 100 .csv files in it. You can delete them when you are done.

if(!("temp" %in% list.files())){dir.create("temp")} # create a subdirectory of the working directory, if it doesn't exist

for (i in 1:100) { # do this 100 times
  
   temp_dat <- tibble( # create a tibble
    test_col = rnorm(1000,1000,100000), # test_col is a bunch of numbers, don't worry about the specifics
    data_col = rnorm(1000,sample(seq(50,125,by=5),1),20) # data_col is a bunch of numbers, don't worry about the specifics
  )
   
   write.csv(temp_dat,paste0(here("temp"),"/","dat_",i,".csv")) # write the .csv file into that temporary directory - don't worry about the specifics of the file location
  
}
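If you want to confirm the loop did what we expect, you can count and peek at the files it wrote (this check is optional and not part of the workflow itself):

length(list.files("temp"))  # should be 100
head(list.files("temp"))    # the files are named dat_1.csv through dat_100.csv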

Now that we have the fake data, let’s write a loop that runs that test and, if the data in a given spreadsheet pass the test, appends the mean of data_col to the (initially empty) vector my_means.

my_means <- vector()

for(i in 1:length(list.files("temp"))){ # loop over every file in the temp directory
  
    temp_dat <- read.csv(here("temp",paste0("dat_",i,".csv"))) # read in file i
    
    if(mean(temp_dat$test_col)>1000){ # does this spreadsheet pass the test?
    
      my_means <- append(my_means,mean(temp_dat$data_col)) # if so, save the mean of data_col
    }
  
}

my_means
 [1] 109.42098  70.75709 100.47882  84.33133  84.67477  65.90836  60.47019
 [8]  70.18916  51.32862 114.10435  91.08851  64.42939 119.57539  69.43624
[15]  84.42273  95.02389  74.51856 109.85704  80.06401  54.48072  80.29540
[22]  55.14962  55.65846 120.51733 120.50080  74.16032 102.20180 105.40018
[29]  69.79062  80.69917 115.29493  70.48833 116.16896 100.36560 115.54644
[36]  90.11283  54.38194 115.53293 100.89882  88.93997 119.26139 124.84124
[43]  90.17188 120.19766  75.43315  68.92971  89.46829  63.83561  64.76760
[50]  70.12460 104.32984 106.07833

Voila! Rather than having to open up hundreds of .csv files and record the mean for each file, I used looping and a little conditional logic to have the computer do the work for me.
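Tying this back to the earlier point about vectorization: the same result can be produced without writing the loop yourself, by letting sapply() apply a small function to each file name. This is just an alternative spelling, not the method used above, and the helper function is made up for this sketch:

files <- list.files(here("temp"), full.names = TRUE)

get_mean_if_qualifies <- function(f) {             # a helper written just for this sketch
  dat <- read.csv(f)
  if (mean(dat$test_col) > 1000) mean(dat$data_col) else NA_real_
}

all_means <- sapply(files, get_mean_if_qualifies)  # one number (or NA) per file
my_means_2 <- all_means[!is.na(all_means)]         # keep only the files that passed the test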

This is the kind of thing you have to do in data wrangling tasks. With a little programming skill you are on your way.