The goal of these notes is to introduce you to some foundational concepts in computing and statistical programming. It may seem like a lot at first but it will become clearer with time. As you practice working in R, you’ll learn many techniques and ways of thinking abstractly and quantitatively about problems. These notes will familiarize you with some “meta” details about R and programming, which will help you focus more on the substantive problems we’ll discuss in the class and which you find interesting.
These notes borrow generously and without specific attribution from many sources, including Norman Matloff’s The Art of R Programming and Kieran Healy’s Data Visualization: A Practical Introduction. Healy’s book is a very nice introduction to R if you’re starting with no background in R or programming, while Matloff’s is aimed at users who have experience with other (particularly C-based) programming languages and want to learn R. Healy’s book is also great if you want to learn more about visualizing data. (We’ll discuss some of the key messages from that book throughout the course, so you don’t need to get it.)
This lab might be most helpful as a reference to read over/skim at the start of your journey so that you learn the layout, and then return to periodically or when you have a question. The most immediately-useful pieces are probably the notes about installation details and file paths, and the general principles.
Learning programming languages can be a disorienting experience. There’s a lot of jargon, with terms like “object”, “class”, or “function” which have different meanings in other contexts. The syntactic rules are picky; the error messages can somehow manage to tell you too much and too little; help pages aren’t as clear as they could be; other people on the internet have had similar but not quite the same problems. When things work, it’s not always obvious that they’ve worked as you intended. If you want to learn to do one thing, there’s a sense that you need to learn a few other things about how the language works. It’s not always clear what those things are or how to go about learning them most efficiently.
Don’t panic. Everyone starts somewhere, and those feelings of frustration are more common than not. Even veteran programmers feel those things when learning a new language. Our goal in this class is to help you get started by focusing on substantive applications and building up your ability to visualize your results alongside your ability to derive results. Visualization can be particularly rewarding as you quickly see the results of your efforts.
As you gain experience in R, you’ll find that many of the tools or conventions which seemed strange start making sense. You’ll see how tools which seemed obscure will help you achieve your goals, making new tools and techniques easier to learn. The best way to go about this is simply to get started and not worry too much about the details under the hood. The understanding will come with time and practice. Initially acquiring tools piecemeal and without understanding what’s going on may feel strange, but have faith—you will gain understanding and confidence along the way. If you keep this in mind—“be patient, don’t sweat the details immediately, with time and practice you will gain fluency”—it will be easier to approach statistics with a sense of joy and play than a sense of fear and self-doubt.
Many of the applications in this class are focused on issues of inequality. Statistics can be especially powerful in revealing details of inequality and suggesting areas for further qualitative and quantitative inquiry. Programming is a useful way to complement human intuition and insight with computer speed and precision. Our goal is that you take from this class an understanding of how to use statistics and visualization to better understand the world around us, the computational skills to carry out your inquiries, and the ability to communicate your findings to a broad audience.
At its core, statistics is a set of tools for making sense of data. Neither the data nor the tools exist in a vacuum. No amount of data or computational skill will substitute for a deep understanding of the underlying subject matter and real-world issues, and no tool on its own will deliver a universal answer. But data analysis and visualization can be a powerful complement to subject matter understanding. We hope that our explorations together in this class encourage you to engage with subject matter experts and their writings to develop a multifaceted understanding of inequality and other issues in the world.
“Statistical” programming is a lot like “regular” programming in that they both involve typing out commands, feeding them to a computer, and having the computer implement some actions based on the commands. Where they differ is in the focus. Whereas “regular” programming is focused on getting computers to do a very broad range of tasks (manage a database, stream web content, drive a car), “statistical” programming is focused on statistics.
This has a few implications for what “best practices” are in statistical vs. regular programming. For example, statistical programming often involves working with data. While modifying the raw data may be a reasonable thing to do in some regular programming tasks, it is almost never a good idea to modify the raw data in a statistical programming setting. Similarly, plotting data may not make much sense in a regular programming workflow (when would a database management routine need to generate pictures?), but building plots into your statistical analysis code at different points is extremely helpful.
We’ll mix text and code in the lab notes for this class. Plain text for humans will look like this. References to objects in code will be in gray boxes like this. Code chunks will be in separate boxes like below:
object_name <- some_code("cool things") # pseudocode
The code chunks will sometimes produce results, which will be in white boxes below the gray code box with “##” on each line (scroll down to “Data classes” for an example). The code chunks in this lab will not always be meant to run; the chunk above is an example of code that won’t actually run. Such code is just meant to illustrate some broader concept, like naming conventions, and will have gray comment text “# pseudocode” in the code chunk. As you become familiar with R you’ll learn to identify chunks of pseudocode from the structure and syntax.
R is a popular open-source programming language. “Open-source” means that anyone can look under the hood of the language, make modifications, and contribute code to the project. R was originally designed for academic statistical analysis, and while it’s evolved beyond that, it’s still a creature of its roots.
To get going in this class, we’ll need to install two pieces of software. The first is the R language. The second is RStudio, a popular IDE (“Integrated Development Environment”) which makes it much easier to work in R. While R and RStudio are clearly related (you need R to run RStudio), they are different things maintained and developed by different groups of people. Pretty much all the coding we do in this class will happen in RStudio, using R.
To install R go to “https://www.r-project.org” and click the “download R” link. R is hosted on a website called CRAN, which is distributed (“mirrored”) across computers all over the world. To download R, you’ll need to choose a mirror to download it from. (It doesn’t really matter which mirror you choose, but a good rule of thumb is to choose a mirror that is geographically close to you.) Once you’ve chosen a mirror, you’ll be guided to a page with links to download R for whatever your operating system is. Download the appropriate file and go through the installation process. The video below will show you how this process looks on a Mac.
To download RStudio, go to “https://rstudio.com/products/rstudio/download/#download”. You’ll again be able to choose a download link based on your operating system, but won’t need to choose a mirror. Download the appropriate file and go through the installation process. The video below will show you how this process looks on a Mac.
There are many versions of R. Which version should you use? In general, “the latest one”. As I write this, the latest version is 4.0.2, “Taking Off Again”. (In addition to machine-friendly numbers, R versions also have user-friendly names. My personal laptop currently runs 3.6.3, “Holding the Windsock”.)
Which version of RStudio should you use? Again, in general, “the latest”. As I write this the latest version is 1.3.1073. Unfortunately RStudio versions don’t have fun names.
Using an older version isn’t necessarily a problem. As long as important features/packages aren’t missing/broken it should be fine. But sometimes an update will change something substantial and packages built for the new version won’t work with the old ones anymore (this happened between versions 3.5.x and 3.6.x). If that happens, no problem—just get the latest version and everything should be fine (you’ll probably want to uninstall the older version to make sure there’s no confusion).
Updating can take some time, so you don’t need to chase every new release. I keep the older version on my laptop because updating would require me to reinstall all my packages, and the new packages I install are still compatible with 3.6.3. If you’re just getting started or haven’t used R in a while, it’s a good idea to get the latest versions.
Most of the data manipulation in this class will be done using functions from the tidyverse, which are quite convenient and often speed up routine tasks. The tidyverse is contrasted with “base” R, which refers to the functions present in a fresh install of R. In general, tidyverse functions are nicer to use, though not every base R function has a tidyverse equivalent (like class() or levels()).
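To use the tidyverse you’ll need to install it once (this downloads it from CRAN) and then load it in each new R session:
install.packages("tidyverse") # download and install the package; only needed once
library(tidyverse) # load the installed package; needed in every new session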
File paths are addresses for computers. When you tell R or any other program to open a file, the program needs to know where on the computer the file is located. For example, many of my files are located in my Dropbox, which is at /home/akhil/Dropbox/. Say I created a folder R_stuff_for_ECON_210 in my Dropbox and used it to store a data file named 210_practice_data.csv. If I wanted to open the file, I would need to navigate to my Dropbox folder, then open R_stuff_for_ECON_210, then open the file. If I wanted to access 210_practice_data.csv in R, I would need to tell R to look for 210_practice_data.csv at the path /home/akhil/Dropbox/R_stuff_for_ECON_210. Below is an example of some code telling R to read that data (using the read_csv() function from the tidyverse package) and store it in the object data.
data <- read_csv("/home/akhil/Dropbox/R_stuff_for_ECON_210/210_practice_data.csv")
Don’t worry if this is a little obscure right now. As we go through the semester, you’ll see lots of examples. For now, just remember: a file path is an address on a computer, to access a file or store some output you need to specify the address.
The “working directory” is the file path that R or RStudio is currently operating in. Even if you haven’t set the file path explicitly there’s still a default working directory. You can check the current working directory by running getwd() (the “wd” stands for “working directory”). You can change the working directory by running setwd("file/path/you/want/to/use").
When you run setwd(), R will try to find the file path you gave it starting from the directory it’s currently in. Going back to the example in the previous section, suppose the current working directory is /home/akhil and I want to change the working directory to R_stuff_for_ECON_210, which is inside Dropbox. I could run
setwd("Dropbox/R_stuff_for_ECON_210")
Even though the full file path is /home/akhil/Dropbox/R_stuff_for_ECON_210, since I’m already in /home/akhil I can just tell R to go directly to Dropbox/R_stuff_for_ECON_210. This is called a “relative” file path, since it’s only a valid path “relative” to my current location.
I could also have run
setwd("/home/akhil/Dropbox/R_stuff_for_ECON_210")
which would take me to the right place even if I was currently in some other directory, like /home/akhil/Documents. If my working directory was /home/akhil/Documents and I’d run setwd("Dropbox/R_stuff_for_ECON_210"), R would give me an error telling me it cannot change working directory. I would get the same error if I’d asked R to change working directory to a folder that doesn’t exist.
When I want to go “up” one directory using a relative file path, I can use the .. shorthand. For example, if I want to get from R_stuff_for_ECON_210 to the folder Research_projects in my Dropbox, I could write
setwd("../Research_projects")
This tells R “go one directory up and then go to Research_projects.” If I wanted to go two directories up from Research_projects and go to my Photos folder under Documents, I would run
setwd("../../Documents/Photos")
The first .. tells R to go up from Research_projects to Dropbox. The second .. tells R to go up again from Dropbox to akhil. From there I’ve told it to go to Documents and then to Photos.
So why not always write out the full file path? While it’s certainly the safest option, writing out the full file path limits the code’s portability across computers. My computer has paths that begin with /home/akhil; yours (probably) doesn’t. If I write code with the full path, it’s guaranteed to work on my computer, but it would almost surely not work on your computer without edits.
Before starting RStudio, create a folder for the class named Econ_210. Now start RStudio and run getwd() to get your current working directory. Use the full file path of your Econ_210 folder and setwd() to change your working directory to Econ_210. Going forward in the class we’ll use this folder as the main place to store course materials.
Now change the working directory back from Econ_210 to the original working directory using a relative file path.
Everything you interact with in R has a name. You refer to things by name when you want to view or modify them. Examples of named entities include variables, data, and functions.
Names in R are case sensitive: the_data is not the same as the_Data. Some names are reserved for the system and forbidden to use. These reserved words include TRUE, FALSE, NA, NaN (which stands for “Not a Number”), Inf (which stands for “Infinity”) and so on. Some names are not explicitly forbidden, but you should almost never use them to name things you create. These are mostly names which R already uses for functions and objects, including basic functions like c() and t(), statistical functions like mean() and var(), and built-in constants like pi.
As you become more familiar with R, you’ll learn which names are forbidden or reserved. A useful strategy when you’re considering using a name which is short and possibly generic (like mean) is to try typing it into the console. If the name is already in use, you’ll see some output describing what it’s used for. If it’s unused, you’ll get an error telling you the object doesn’t exist.
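For example, pi is a built-in constant, so typing it prints its value, while a name that isn’t in use produces an error (assuming you haven’t created an object called my_new_thing):
pi
## [1] 3.141593
my_new_thing
## Error in eval(expr, envir, enclos): object 'my_new_thing' not found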
Everything you interact with in R is an object. Data, variables, functions—they’re all objects. Some are built-in, some come from packages you install and load, and some are created by you. This can lead to a more-advanced discussion about object-oriented programming in R, but we won’t go there. Instead, we’ll leave it at this: because everything is an object, and everything has a name, you can access and modify anything you can interact with in R. The assignment operator, <- (the less-than symbol followed by the minus sign), allows you to create objects by assigning them to names. For example, if I want to create a set of numbers and assign it to the name my_numbers, I could use the c() function and run
my_numbers <- c(1,5,9,11,13) # assign the object
my_numbers # print the numbers to the console
## [1] 1 5 9 11 13
my_Numbers # this won't work
## Error in eval(expr, envir, enclos): object 'my_Numbers' not found
You do almost everything in R using functions. A function is an object that performs an action, usually on another object. For example, suppose you wanted to calculate the mean of the numbers we just created:
mean(my_numbers)
## [1] 7.8
If the object we give to mean doesn’t exist, or we don’t give it an object at all, we’ll get an error:
mean(my_Numbers)
## Error in mean(my_Numbers): object 'my_Numbers' not found
mean()
## Error in mean.default(): argument "x" is missing, with no default
If we just type in the name of the function with no parentheses, we’ll see details about the function:
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x7fefaac42098>
## <environment: namespace:base>
When a function accepts multiple arguments (objects), you can either give the function its arguments with names telling it what’s what (the safer option), or give them unnamed, in which case the function assumes you’ve supplied just what it needs and in the right order (a little risky unless you know what you’re doing). Below is an example using the seq() function, which creates a sequence of numbers:
seq(from=0,to=5,by=1)
## [1] 0 1 2 3 4 5
seq(0,5,1)
## [1] 0 1 2 3 4 5
Sometimes a function can have arguments which can’t be used together. For example, seq can be used to create a sequence in desired increments using the by argument, or a sequence of a specified length (number of elements) using the length.out argument. By default, seq will assume the third argument supplied is by. When you use seq you should specify either by or length.out, but not both:
seq(from=0,to=5,by=1)
## [1] 0 1 2 3 4 5
seq(from=0,to=5,length.out=6)
## [1] 0 1 2 3 4 5
seq(from=0,to=5,length.out=6,by=1)
## Error in seq.default(from = 0, to = 5, length.out = 6, by = 1): too many arguments
Some functions, like getwd(), produce results without being given any arguments. Each use of a function is called a “function call”, e.g. “Call the setwd() function”, or simply “Call setwd()”.
Create a sequence from 10 to 100 in increments of 2 using seq and assign it the name sequence_1. Calculate the mean using mean.
Use the summary function on the sequence you created to calculate some summary statistics.
Now create a new sequence from 10 to 100 of length 20 using seq’s length.out argument.
Calculate the mean of the new sequence using either mean or summary. Is it the same as the mean of the first sequence?
Statistics is about working with data so we’ll spend some time thinking through how to work with different kinds of data. The textbook covers different data types in some detail. Section 1.2.2 in particular works through some examples—the language there is “types of variables” rather than “types of data”, but the ideas are the same. But to use R (or any other statistical programming language) comfortably, it’s worth distinguishing between the types of data and the types of structures which contain data. Section 1.2.2 describes the different types of data well, so we’ll focus on different types of data structures here.
All the material below is in base R. We’ll be working with the tidyverse in this class so you won’t need to think about data classes and structures very often; the tidyverse was written to make life much easier in these respects. But it’s still useful to have a quick reference for base R things like classes and structures in case you run into errors or want to know a little more about what’s happening under the hood.
Data can have different classes, and can be stored in different kinds of structures. The class describes how the object is understood by R. For example, variables with numeric class can be added and subtracted, while variables with character class can’t be; on the other hand, character-class variables can contain letters or mixtures of letters and numbers, while numeric-class variables must be numbers. There are three classes worth focusing on right now: numeric, character, and factor.
Numeric and character are probably the most intuitive classes. Numeric is used for data which are numbers, and character for data which may be letters, numbers, punctuation, and so forth. Collections of characters are called strings, e.g. "a" is a character and "a house for my friend" is a string. ("a" is also a one-character string.)
A factor is a variable with a fixed set of discrete categories. These categories can look like numerics or character strings; factors are used to represent categorical data. Every factor has levels, which are the underlying categories, and labels, which are the names attached to those categories. The code below shows examples of all three classes, along with the class() function.
class(2500) # this is numeric
## [1] "numeric"
class("2500") # with the quotes around it, 2500 becomes a character
## [1] "character"
2500+"2500" # numerics can't be added to characters!
## Error in 2500 + "2500": non-numeric argument to binary operator
test <- as.factor(c(2500,"a","a","a",2500)) # this creates a factor with 5 entries and 2 levels.
class(test)
## [1] "factor"
summary(test) # the "summary" function is very useful
## 2500 a
## 2 3
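You can see a factor’s underlying categories with levels(), and attach friendlier names to them with factor()’s labels argument:
levels(test) # the two underlying categories
## [1] "2500" "a"
test_labeled <- factor(test, labels=c("number","letter")) # attach new labels to the existing levels
summary(test_labeled)
## number letter
## 2 3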
Unfortunately, R doesn’t (yet) read minds—if you’re running into errors about types being mismatched or arguments being non-numeric, make sure the data classes are appropriate for whatever you’re trying to do.
All data in R (and any other programming language) are stored in structures. Here, we’ll look at four types of data structures in R, in increasing order of generality: vectors, matrices, data frames, and lists.
The most basic structure in R is the vector. A vector is a collection of elements with the same class. Even a single number in R (a “scalar” in math lingo) is actually a vector with a single element. All data structures in R are ultimately composed of vectors. You can have a vector with three character elements or three numeric elements, but not a vector with two numbers and one character. If you assign mixed classes to a single vector, they’ll all get coerced to characters.
first_vector <- c(1,2,3) # three numbers, no problem
class(first_vector)
## [1] "numeric"
second_vector <- c("a","b","c") # three characters, no problem
class(second_vector)
## [1] "character"
third_vector <- c(1,2,"c") # two numbers and a character? R: "you must have meant three characters"
class(third_vector)
## [1] "character"
A matrix is like the mathematical object of the same name, a rectangular array of numbers. Matrices can be composed from vectors using the rbind (“row bind”) and cbind (“column bind”) functions. We won’t be working with matrices much in this class.
first_matrix <- rbind(c(1,2,3),c(4,5,6)) # a matrix with two rows and three columns
class(first_matrix)
## [1] "matrix" "array"
first_matrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
second_matrix <- cbind(c(1,2,3),c(4,5,6)) # a matrix with three rows and two columns
class(second_matrix)
## [1] "matrix" "array"
second_matrix
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
A data frame is like a matrix whose columns can hold different data classes. A typical dataset is probably best represented as a data frame, since it will usually mix classes. For example, a dataset of student grades will have columns for student names (character), test scores (numeric), and letter grades (character or factor).
grades_dfrm <- data.frame(student_names=c("student A","student B","student C"), test_scores=c(89,97,99), letter_grades=c("A-","A","A")) # a data frame with three rows and three columns
class(grades_dfrm)
## [1] "data.frame"
grades_dfrm
## student_names test_scores letter_grades
## 1 student A 89 A-
## 2 student B 97 A
## 3 student C 99 A
Most of the datasets we work with will be data frames. In the tidyverse, data frames have been generalized a bit to a structure called a tibble. Tibbles are just data frames with some quality-of-life improvements for users, like more refined printing behavior.
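For example, with the tidyverse loaded you can convert grades_dfrm into a tibble using the as_tibble() function:
grades_tibble <- as_tibble(grades_dfrm) # same data, now stored as a tibble
class(grades_tibble)
## [1] "tbl_df" "tbl" "data.frame"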
A list is arguably the most general type of data structure in R. A list is a shapeless container which can hold different data types or data structures. You can, for example, create a list which holds a vector, a matrix, and a data frame. They’re meta, too—you can make a list which holds other lists.
list_of_things <- list(the_vector=first_vector, the_matrix=first_matrix, the_dfrm=grades_dfrm)
list_of_things
## $the_vector
## [1] 1 2 3
##
## $the_matrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
##
## $the_dfrm
## student_names test_scores letter_grades
## 1 student A 89 A-
## 2 student B 97 A
## 3 student C 99 A
very_meta_list <- list(first_element=list_of_things, second_element=list_of_things)
very_meta_list
## $first_element
## $first_element$the_vector
## [1] 1 2 3
##
## $first_element$the_matrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
##
## $first_element$the_dfrm
## student_names test_scores letter_grades
## 1 student A 89 A-
## 2 student B 97 A
## 3 student C 99 A
##
##
## $second_element
## $second_element$the_vector
## [1] 1 2 3
##
## $second_element$the_matrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
##
## $second_element$the_dfrm
## student_names test_scores letter_grades
## 1 student A 89 A-
## 2 student B 97 A
## 3 student C 99 A
You can even store models and function output in lists, e.g. a list of regression models or histogram output. Lists are especially useful when working with multiple outputs of different types or with elaborate internal structures. We might never use lists explicitly in this class, but they’re still worth knowing about because they’re under the hood of a lot of things we will be working with.
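For example, here’s a quick sketch using lm(), R’s linear regression function, and the built-in mtcars dataset:
model_list <- list(small_model = lm(mpg ~ wt, data = mtcars), # regress fuel economy on weight
                   big_model = lm(mpg ~ wt + hp, data = mtcars)) # add horsepower as a second regressor
summary(model_list$big_model) # pull one model out of the list by name and summarize it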
Below are some general principles of statistical programming. These aren’t necessarily rigid commandments—there are cases where you may find them in tension, or otherwise need to break them—but they are extremely good rules of thumb. This list is adapted from Jonathan Eyer’s “10 commandments for policy/economics coding” and the USGS water data blog, and reflects hard-won lessons learned from many years of statistical programming (hours and hours of correcting mistakes, staring at the screen in puzzlement and frustration, having to erase weeks of work and start over, and more horror stories we hope to spare you).
Coding in the console quickly becomes unreproducible. What was that command you ran at the start of the session? Did it need to go before the one you just ran, or can it be ignored? Relying on the console for the bulk of your analysis is error-prone and fraught with danger.
The much better approach is to work with code chunks in an Rmarkdown file or R script. An Rmarkdown file is a file like these notes (with a .Rmd file extension), and an R script is a plain text file of R code (with a .r or .R file extension). R scripts will work with or without RStudio but won’t blend plain text, code chunks, and visualizations the way Rmarkdown files do; Rmarkdown files are most easily worked with in RStudio.
If you work in an Rmarkdown file or R script, all the steps will be laid out in the correct order. If you want to reproduce the results up to a certain point, just run all the lines of code until that point. You can also easily send the Rmarkdown file or script to someone else for them to run.
It’s fine to use the console in RStudio when you’re just trying out some code and want to make sure you have the right idea/syntax. Just be sure to commit everything that works to the Rmarkdown file or script you’re using and verify that it works as expected. To verify, I usually clear my workspace and re-run the script.
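One way to do that clearing from code (rm() removes objects; ls() lists every object in the workspace):
rm(list = ls()) # remove all objects from the current workspace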
When you modify raw data, you remove your ability to go back and repeat the same calculation or modify it. You might also make the script you just used useless since the data it operated on has been changed. Don’t modify raw data.
If you need to save a copy of the data with some new attributes (new columns, fewer rows, cleaned, etc), save a new copy. Storage space is relatively cheap these days—having a few extra bytes on your hard disk dedicated to holding a copy is much much better than overwriting a copy of the raw data.
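For example, here’s a sketch reusing the practice file name from earlier (na.omit() is one possible cleaning step; it drops rows with missing values):
raw_data <- read_csv("210_practice_data.csv") # read the raw data
cleaned_data <- na.omit(raw_data) # e.g., drop rows with missing values
write_csv(cleaned_data, "210_practice_data_cleaned.csv") # save the cleaned copy to a NEW file; the raw file stays untouched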
This is an offshoot of the intent behind “Comment your code” (discussed below). Imagine you needed to calculate the unemployment rate. You go through a bunch of steps to clean and transform the data into the most useful form, and now you have to name the result:
Uninformative:
x <- function_to_calculate_unemployment_rate(cleaned_data) # pseudocode
More informative:
unemployment_rate <- function_to_calculate_unemployment_rate(cleaned_data) # pseudocode
The first option assigns the output of all your efforts to the uninformative object x. The second option assigns it to the more-informative object unemployment_rate. The latter is much easier to keep track of in subsequent calculations. It also acts as a bit of “self-commenting code”: when you look back at the code you’ll wonder, “What the heck was x supposed to be again?”—you’ll never ask that question of unemployment_rate. Be kind to your collaborators and future selves; use informative names.
This is more of a style point, again in the spirit of “Comment your code”. If you start your code by naming things using a certain convention, stick with it. Don’t change naming conventions mid-way through the code. Below is an example where I start with the “lowercase full words separated by underscores” convention and then move to what’s called “camelCase” convention with abbreviated words.
unemployment_rate <- function_to_calculate_unemployment_rate(cleaned_data) # pseudocode
some_lines_of_code
unempCount <- function_to_calculate_unemployment_count(cleaned_data)
“Ok, it’s a bit inconsistent, but what’s the problem?” The problems come up later on, when I want to check or extend my calculations. I might end up writing something like
unempCount/laborForce == unempRate # pseudocode
which should return TRUE if my calculations were consistent. But it won’t; it’ll just throw an error. Why? Because there’s no object called unempRate. As I was writing that line, I forgot about the older naming convention and created unnecessary confusion for myself. If I’m lucky, I’ll remember “aha! I changed conventions, it’s actually unemployment_rate” and use the correct name. More typically, I’ll spend the next 10-60 minutes scratching my head and trying increasingly desperate things to resolve the problem until I think “aha! …”
(This may sound silly, but it’s a real thing and gets worse as your code gets more complicated. Once I defined some functions as f(D,S). A couple years later after I’d forgotten what convention I used, while working on a new related project I used the convention f(S,D). To speed things up I copied over some old code that used the f(D,S) convention. The results produced were very unintuitive but also interesting. I spent about a week increasingly convinced I’d found some complicated new phenomenon before I found the bug.)
Suppose you need to calculate the number of people who are unemployed in your region. You look up the current unemployment rate (let’s say it’s 5%) and write something like this in your script:
number_unemployed <- 0.05*labor_force # pseudocode
This works for now, but next quarter your manager asks you for the same number. You pull in the data, generate the labor_force variable, and re-run your script. Problem: the unemployment rate has changed, but your code still outputs 5% of the labor force. What happened? You hard-coded an unemployment rate of 5% into your script.
What you should have done instead was either make the unemployment rate an input to the script or calculate it from the data pulled into the script. That way you can re-run the code on different data and still get useful results. The same goes for hard-coding regions: if the region name were hard-coded in, you couldn’t use your code to get results about other regions without modifying the script.
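Here’s a sketch of the second approach with a tiny made-up dataset (the data frame and its status column are hypothetical):
labor_force_data <- data.frame(status=c("employed","unemployed","employed","employed")) # tiny made-up example
unemployment_rate <- mean(labor_force_data$status == "unemployed") # share of the labor force that is unemployed
number_unemployed <- unemployment_rate * nrow(labor_force_data) # no hard-coded 0.05 anywhere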
While hard-coding makes things easier now, you’ll pay a price for it later. Where possible, avoid hard-coding values.
You should look at your data. Humans are visual creatures, and our brains are usually much better at processing pictures than text. Graphs and charts help you learn about the structure of the data. Good visualizations also make it easier to communicate your ideas or results to other people. Whenever you start working with new data, your first step should be to visualize the data.
You should also build visualizations into your analysis workflows as a matter of practice. The most effective ways to do so will vary by project and person, but it’s almost always a good idea to have some kind of diagnostic plots somewhere in the code.
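Even a single line can serve as a diagnostic. For example, using R’s built-in mtcars dataset:
hist(mtcars$mpg) # histogram of fuel economy: a fast check that the distribution looks sensible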
When you want to make a figure or table to communicate your results, it can be tempting to say “I’ll just save the data, and make the table/figure later by hand or in the console.” Don’t fall into this trap!! While it can be easier in the short run to make outputs (especially tables) by hand, in the long run you’re almost always better off producing at least the skeleton of the output as a part of your R script/Rmarkdown file.
This will help keep your analysis reproducible, and will save you a lot of time when you’re iterating on what the final output should look like. For example, it’s much easier to try out different table contents (say, should you have standard errors below or not? Should the cell entries be normalized at some level or not?) when the output is being generated in the code rather than by hand. It will also help you avoid typos and errors.
Put library() calls and any hard-coded variables at the top of the script/Rmarkdown file
When you need to hard-code a variable or use a specific package (even the tidyverse) for some analysis, put them at the top of the script. This makes package dependencies and any variables that might need to be changed easy to find. It’s a good idea to put in comments explaining what the hard-coded variables are and the meaning of the value they’ve been set to.
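For example, the top of a script might look like this (a sketch; analysis_year and its value are hypothetical):
# --- packages ---
library(tidyverse) # used throughout for reading data and plotting

# --- hard-coded values ---
analysis_year <- 2019 # the survey year being analyzed; update this when new data arrive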
When you exit R/RStudio, you’ll be asked whether you want to save the workspace image. Don’t do it!
If you save your workspace image, R will automatically load it the next time you start R/RStudio in that directory. This is antithetical to the notion of reproducible workflows, where the goal is to make the code as self-contained as possible. If your code is reproducible, there’s no need to save the workspace image—everything you need (packages, variables, etc.) will be loaded or created by running the script/Rmarkdown file.
Autocomplete is a wonderful feature in RStudio and some other IDEs. When you’re typing something that you’ve already typed before in the script, you can type in the first few characters and press “tab” on the keyboard. RStudio will autocomplete the word you’re typing. (Up to the point where it’s ambiguous; if you have both unemp_rate and unemp_count, typing un-tab will complete up to unemp_. You can then type r-tab to get unemp_rate, or c-tab to get unemp_count.)
Spelling mistakes and typos are a major source of errors and programmer frustration. Using autocomplete or copy-paste will help you avoid those errors.
Learning new skills is a little like weight training: to see the gains, you need to feel a little resistance during the exercise. You have to struggle a bit. If you find the exercise so easy that it’s no struggle, you probably aren’t learning as much as you could be. This article describes the philosophy of failing productively in some more detail.
What does this mean for learning to program, in this class and beyond? Two implications:
In general: When you’re trying to implement something new in R, try to do it with the tools you have before you start searching the internet for guidance. This will help you build your own mental model of the problem you’re trying to solve, which will make it much easier to understand and remember the eventual solution. This becomes easier to implement as you acquire more familiarity with R.
In this class: Attempt problem sets before looking at suggested solutions or asking for help. You may not get very far—that’s ok! We want to challenge you to help you grow, and the act of exerting effort will help you level up faster.
Comment your code
“Comments” are bits of text in your code which the computer ignores but which help humans keep track of what’s going on. They’re usually expressed using some kind of special character which tells the computer to ignore some text. In R, you can comment out a line (or start a comment partway through a line) using the # character. Any text that follows a # (including more #s) will be ignored until the next line.
It’s hard to overstate how helpful a good comment can be in explaining an obscure bit of code or the flow of a script. Below is a sketch of some code from a data cleaning script (the file and column names are hypothetical), where comments are used to track the flow of operations, explain the logic behind some choices, and flag an issue that needs attention the next time the analyst works on the code.
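raw_data <- read_csv("survey_raw.csv") # read in the raw survey file

# Respondents who declined to state their age were recorded as -1,
# so treat those values as missing along with genuine NAs
cleaned_data <- filter(raw_data, !is.na(age), age != -1)

# Incomes were reported in thousands of dollars; convert to dollars
# so the units match the other data in this project
cleaned_data <- mutate(cleaned_data, income = income * 1000)

# TODO: a handful of respondents appear twice -- need to decide whether
# to drop or merge the duplicates next time this script is updated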
The comments make the block much more readable and user-friendly. When someone works on this code, whether it’s the person who wrote it coming back after some months or someone totally new looking to extend the code, they have some guidance from when the code was written. Always write informative comments in your code.
This was a long read, but hopefully you’re now familiar with the layout of these notes and can search them for help with specific issues when you have them. Remember to be patient with R and yourself. As you’ll see, R (like all programming languages) will do exactly what you tell it to do, which may not be what you intended it to do. No one writes error-free code on the first go—there’s always some iteration. Mistakes are a standard part of programming, no matter how experienced you are. Checking for errors and bugs (and fixing them) is a central activity for any programmer. So, be patient. Everyone struggles and makes mistakes.
Happy programming!