Here’s the good news: we are really close to being able to do useful stuff with R.
Here’s the bad news: we have another week of what seem like pretty esoteric programming concepts.
The seven lectures this week cover two distinct topics: operators and loops. These tools allow you to perform tests on data and then repeat the same process automatically, subject to conditions you specify.
You might be asking: why would we ever want to do these things? What relevance does this have to data analysis? The answer is that these processes form the foundations of data wrangling. Once you can create data within a program (as we have in previous lessons) and perform basic mathematical operations, you are ready to start reshaping and reorganizing data and drawing useful insights from it.
Operators
Consider the following. Let’s say you work in the advancement/fundraising division of a university. You want to identify which alumni are likely to give the most money.
You’ve got some information about your alumni, including their income levels, their GPAs, and whether or not they participated in some kind of intervention during their undergraduate degree designed to enhance their academic performance.
In the code block that I have hidden here, I create some fake data along these lines. You don’t need to worry about the specifics of how I generated these data, but take a look if you want.
Now, let’s imagine that we were given these data and our advancement director asked, “I want to know how much money we are getting from alumni who graduated with a GPA above 3.0, are middle income ($50k to $100k), and either participated or did not participate in this special intervention program.”
To subset the data, we just need a series of relational operators (to express each condition) and logical operators (to combine them).
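Here is a sketch of the kind of subsetting this calls for. The column names gpa, income, intervention, and annual_giving are my guesses at what the hidden chunk creates, so swap in the real names. Each relational test picks out one condition, and the & operator requires all of them to be true at once:

```r
# alumni who graduated with a GPA above 3.0, are middle income,
# and participated in the intervention program
participated <- alumni[alumni$gpa > 3.0 &
                         alumni$income >= 50000 &
                         alumni$income <= 100000 &
                         alumni$intervention == TRUE, ]

summary(participated$annual_giving) # summarize annual giving for this group
```

Running the same code with `alumni$intervention == FALSE` gives the contrasting summary for the non-participants.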
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -84.61   57.89  108.76  108.34  159.49  258.16
```
What do you infer from these contrasting summaries? Holy smokes, that intervention program has a massive effect on annual giving! Let’s target these people, and let’s get this intervention rolled out university-wide, ASAP!
An example of loops in practice
For data wrangling in R, you will use for loops less often than you might expect, because R has a built-in feature called vectorization that handles many repetitive data wrangling tasks more efficiently than an explicit for loop.
But the concept of a for loop is still a good thing to learn. Basically, a for loop lets you hand a very repetitive task over to the computer, which runs the same block of code once for each item in a sequence.
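Before the real example, here is a trivial loop (my own illustration, not part of the lesson’s data) just to show the shape: the body runs once for each value that i takes.

```r
for (i in 1:5) { # i takes the values 1, 2, 3, 4, 5 in turn
  print(i^2)     # the body runs once per value of i
}
```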
I’m going to provide a toy example here that is a little contrived but is based on something I actually encountered once.
Because of a very goofy data processing decision by a government agency, I had a situation where I had something like 10,000 large spreadsheets (like, hundreds of thousands of rows), and I needed to determine whether the spreadsheet included a certain kind of data. If the answer to that question was “yes”, I needed to extract the mean of a single column of that spreadsheet, and I wanted to put those means together into a vector.
Because I was dealing with 10,000 individual spreadsheets that were all very large, I couldn’t do this by hand, and I didn’t know how to load this kind of data into a single database that could handle that volume. I knew there were ways to do that, but my database skills were poor and I would have had to learn a whole new technology.
The solution I adopted was to write a for loop. Perhaps not the most elegant choice, but it worked.
I wrote a loop that iterated over all of the spreadsheet files in a directory on my computer, each time testing whether it was a spreadsheet from which I wanted to extract the mean of one column. I then collected all of those means in a vector so I could use them in further calculations and reporting.
So, to illustrate, first I am going to use a for loop to create 100 .csv files, some of which will be spreadsheets that I want to collect data from, based on the values in the test_col column. If the sum of that column is greater than 1000, I want to extract the mean of the data_col column and add it to a vector for future analysis.
Note: this code is going to create a temporary directory within your working directory with 100 .csv files in it. You can delete them when you are done.
if(!("temp"%in%list.files())){dir.create("temp")} # create a subdirectory of the working directory, if it doesn't existfor (i inseq(1:100)) { #do this 100 times temp_dat <-tibble( # create a tibbletest_col =rnorm(1000,1000,100000), # test_col is a bunch of numbers, don't worry about the specificsdata_col =rnorm(1000,sample(seq(50,125,by=5),1),20) # data_col is a bunch of numbers, don't worry about the specifics )write.csv(temp_dat,paste0(here("temp"),"/","dat_",i,".csv")) # write the .csv file into that temporary directory - don't worry about the specifics of the file location}
Now that we have the fake data, let’s write a loop that will run that test and, if the data in a spreadsheet meet the test, append the mean of data_col to the empty vector my_means.
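Here is a sketch of what that loop could look like, assuming the 100 files created above are sitting in the temp directory and that the test is whether the sum of test_col exceeds 1000:

```r
my_means <- c() # start with an empty vector

for (f in list.files(here("temp"), full.names = TRUE)) { # loop over every file in the temp directory
  temp_dat <- read.csv(f) # read in one spreadsheet
  if (sum(temp_dat$test_col) > 1000) { # does this spreadsheet pass the test?
    my_means <- c(my_means, mean(temp_dat$data_col)) # if so, append the mean of data_col
  }
}

my_means # one mean per spreadsheet that passed the test
```

Growing a vector with c() inside a loop is fine at this scale; for the real 10,000-file job you would typically pre-allocate the vector (or use a function like vapply()) to keep things fast.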
Voila! Rather than having to open up hundreds of .csv files and record the mean for each one by hand, I used looping and a little conditional logic to have the computer do the work for me.
This is exactly the kind of task that comes up constantly in data wrangling. With a little programming skill, you are well on your way.