Introduction

The purpose of this assignment is to get you familiar with a few key features of R and RStudio. You will import some data and create an R script that calculates some statistics and generates some graphs. This will involve installing and loading a package (a collection of code that lets you do more stuff with R).

Please note that the results I’m presenting will be slightly different than yours because I’ve modified my data (so that I don’t give away the answers).

Grading

This assignment will be graded on a \(\checkmark\)/\(\checkmark +\) basis. Completing the assignment gets you a \(\checkmark\) (worth 85%) and getting the hardest part right gets you a \(\checkmark +\) (worth 100%). Incomplete work is worth 0%.

Requirements:

I want to go over this assignment in class on September 27th, so I’m asking you to print out hard copies of the following things before class:

  1. Two copies of your script (one for you, and one to turn in; what a “script” is will make sense by the time you’ve gone through this tutorial),
  2. The summary statistics your script generated,
  3. At least two graphs your script generated.

Your script should do the following things:

  1. Import the data
  2. Calculate the mean, median, minimum, maximum, and standard deviation of each variable.
  3. Generate at least one histogram (different than the one I made), and one scatterplot (also different than the one I made).

Note: the library has $0 printing… but you have to show up early enough that you can get on a computer.

Sidenote: Keeping Organized, RStudio, and hotkeys

To keep your work organized, you should set up a series of folders on your hard drive (preferably inside of your Dropbox folder, or whatever automatic backup program you use) with a different folder for each project you’re working on.

For example, you might have a folder structure like this:

  • Dropbox:
    • ECO 260
    • ECO 340
    • ECO 380
      • Notes
      • Computer Assignments
        • Wooldridge Data
        • HW1
        • HW2
    • ECO 410

Keeping HW1 and HW2 in separate folders can save you countless hours of trying to figure out which files belong to which assignment.

RStudio has a feature that makes it easier to stay organized: projects. Start a new project by going to File > New Project. Then click through the options, making sure to set the project directory to the correct folder for this assignment.

This is especially useful when you’ve got more than one project going on at once (e.g. if your boss wants you to research your company’s marketing and its supply chain). A project file (Computer Assignments.Rproj for the project I’m using right now) will keep track of all the files you’ve got open, and will set up RStudio exactly the way you left it when you last saved your project file.

There are a lot of different hotkeys that will be helpful. Using them can save you a second here and there, which adds up over time. But the real value of hotkeys is that they keep your focus on the problems you’re working on rather than the mechanics of menus. Not all hotkeys are worth learning, but the ones you use a lot will help you maintain a laser focus worth much more than the time you save.

RStudio (the company) provides a lot of resources to make your life easier. For example, their website includes cheatsheets for things like using the ggplot2 package (a collection of code the folks at RStudio maintain for the generation of plots that are consistent with the “grammar of graphics”). The one that will be important right now is the cheatsheet for RStudio itself.

You’ll notice that my RStudio looks different from yours. That’s because I’ve made some changes to the default settings. The most obvious difference is the layout. You can change your layout by opening Tools > Global Options, selecting “Pane Layout” in the newly opened Options window, and changing things to work better for you. You can also click and drag the boundaries between panes to change the size of each.

There are four panes in RStudio: one for “Source” (your R scripts, or “source code”), one for “Console” (this is where you run R code), one for “Environment/History”, and one for “Files/Plots/Packages/Help/Viewer”.

The ones we’ll use most are “Source”, “Console”, “Environment”, and “Help”. “Plots” will also be used quite a bit, but you’ll see how smoothly it works when you generate your first graph.

The assignment

For this assignment you will

  1. Import the data
  2. Create a script
  3. Generate some summary statistics
  4. Generate some graphs of the data

Step 1: Importing the data

To start, download the data sets that accompany our textbook at this link.

By now you should have the data sets all smooshed together in a file called 130527010X_514733.zip in your Downloads folder. Copy it over to the appropriate folder, right click the file, look for an option like “Extract all” or “Extract here”, then follow the prompts from your computer. You should end up with a folder named 130527010X_514733, which contains another folder called Data Sets- R. (To make everything consistent with the folder structure above, I would rename Data Sets- R to Wooldridge Data and copy it into my ECO 380 folder.)

Open that data folder and find gpa1.RData. Copy that file into your homework folder (this isn’t strictly necessary, but it will make things simpler).

Under the File menu, click Open File (or use the shortcut Ctrl + O–hold Ctrl then press O for “open”). Find and select the file named gpa1.RData, click the Open button (or hit Enter), and click the Yes button on the pop up that asks if you want to load that data file into your “global environment”.
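
If you prefer typing to clicking, you can accomplish the same thing from the console. This assumes gpa1.RData is in your working directory, which it will be if you copied it into your project folder as described above:

load("gpa1.RData") # load the data into your global environment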

End Step 1

Step 2: Creating a script

You won’t always be able to finish all your work at once, so it’s important to set yourself up in a way that makes it easy to pick up where you left off.

The first step is to create a script file. You can use the hotkey Ctrl + Shift + N (hold down the Ctrl and Shift buttons, then press the n button). Using your mouse, you would go to the File menu, look for New File, then click the option for R Script. Both accomplish the same result, but the hotkey does it faster.

Now you’ve got a blank script and some data that you can see in the “Environment” Pane. Click around the Environment Pane and see what you can figure out.

Find your “History” Pane (which you can access by clicking the appropriate tab in your Environment Pane).

Each time you run a line of code in the console, R saves that line of code in your History. This is useful because it allows us to tinker with code on the Console, then when we get it working we can copy it into our script.

You should have just one line in your history right now. I’ve got two lines because I loaded my data in two different ways. The first line loaded the data from my folder of Wooldridge’s data. The second line loaded the data from the folder I’m working in (Assignment 1).

Click your line of code, then click the “To Source” button. That will copy that line of code into your script. Now save the script (something like script1.R or HW1.R). Be careful to save it into the correct folder (not your data folder).

Before we go on, let’s make our script a little more useful with comments. R reads your script the same way you do: from left to right and from top to bottom. Anything after the comment character (#) is ignored by R. This allows you to put in helpful comments, such as what a specific line of code is doing. You can also use them to leave yourself helpful notes.

Modify your script so it looks more like this:

# Homework assignment 1
# Assignment to practice importing data, creating a script,
# calculating summary statistics, and creating graphs
###########################################################
### Step 1: Import data
load("gpa1.RData") # import GPA data
###########################################################
### Step 2: calculate summary statistics
# mean
# median
# range
# standard deviation
###########################################################
### Step 3: draw pretty pictures
# ???

End Step 2

Step 3: Summary statistics

Before we go too far, we need to poke around our data. Your environment should include two data objects, data and desc. These objects’ names have two important features: they’re short, and it’s obvious what they mean.
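
If you’d rather check from the console than the Environment Pane, ls() will list every object currently in your workspace:

ls() # at this point it should list "data" and "desc"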

Let’s make sure desc is going to give us descriptions:

print(head(desc))
##   variable                   label
## 1      age                in years
## 2     soph         =1 if sophomore
## 3   junior            =1 if junior
## 4   senior            =1 if senior
## 5  senior5 =1 if fifth year senior
## 6     male              =1 if male

There are a few different ways to look at our data more thoroughly, but let’s start by just actually looking at the data. On the console, type View(data). You can see that a lot of the variables are binary (0/1). Looking through desc we can see that these variables take a value of 1 to indicate that the survey respondent fits into a category.
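
If View() feels clunky, another handy option is base R’s str() function, which prints a compact one-line-per-variable overview of the whole dataset:

str(data) # each variable's type plus its first few values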

Another way to look at the data is to ask R to summarize it for us using summary(data). But that’s sort of hard to look at, so let’s focus on a specific column. The $ operator is used to select a specific variable from a dataset. So data$colGPA will focus on just the college GPA column of the object data.

print(summary(data$colGPA))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.200   2.800   3.000   3.062   3.300   4.000

This gives us some useful information right off the bat. We can see the minimum, maximum, and quartiles of our data. This lets us know that nobody has a GPA higher than 4.0 in this dataset, and nobody is below 2.2. Half of our sample has a GPA above 3.0 and half are below 3.0. 50% of respondents have a GPA between 2.8 and 3.3. And the average GPA is 3.062.

One thing that’s missing from summary() is a measure of dispersion (e.g. variance or standard deviation). We’ll stick with standard deviation.

Recall that the formula for sample standard deviation is

\[s = \sqrt{\frac{\sum(x-\bar{x})^2}{n-1}}\]

We can calculate that using simple mathematical operators in R.

x <- data$colGPA # create a vector of x values.
n <- length(x)
x.bar <- sum(x)/n # add up each x value. 
                          # divide by the number of elements in x
# more simply:
# x.bar <- mean(x) # verify that this does the same thing.
if(x.bar - mean(x)==0) print("It works!")
# "==" tests for equality. "=" assigns a value.
# "x == y" --> "is x equal to y?" vs 
# "x = y" --> "now x is whatever y is"
# usually we use the "<-" assignment operator instead of "="
# this is a stylistic choice that people feel makes R code clearer
x.dev <- x-x.bar # deviation from the mean
x.dev2 <- x.dev^2 # squared deviation from the mean
s <- sqrt(sum(x.dev2)/(n-1)) # sample standard deviation

Doing that for every variable is going to be a pain. We can make this easier by writing a function to do the work:

stdev <- function(x){
  n <- length(x)
  x.bar <- mean(x)
  x.dev <- x-x.bar
  x.dev2 <- x.dev^2
  s <- sqrt(sum(x.dev2)/(n-1))
  return(s)
}

Now, when we want to calculate the standard deviation for a variable we can just do this:

print(stdev(data$ACT))
## [1] 2.844179

That’s not bad, but we can do one better:

print(sd(data$ACT)) # use the built in standard deviation function!
## [1] 2.844179
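
If you want to convince yourself that our hand-rolled stdev() and the built-in sd() really do agree (in the spirit of the “It works!” check above), all.equal() compares two numbers while tolerating tiny floating point differences:

all.equal(stdev(data$ACT), sd(data$ACT)) # should print TRUE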

Now we need to find a way to do this for all the variables at once. There is a family of relevant functions built around apply() (with variations like sapply() and lapply() that are streamlined for particular circumstances). But the apply() functions can be confusing, and there are more user-friendly alternatives better suited to our purposes.
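
For the curious, here’s a sketch of what the sapply() approach would look like (you’re welcome to try it, but it’s not required):

sapply(data, mean) # apply mean() to each column; returns a named vector
sapply(data, sd)   # same idea for the standard deviation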

We’re going to use the function summarize_each() from the dplyr package. This package is meant to make it easier to work with data. If R is a language, dplyr is a collection of slang phrases that are useful for particular circumstances. It’s like comparing the English language with the language of professional cooks. Phrases like mise en place convey particular ideas more efficiently and effectively than their non-cook-speak equivalents.

First we have to install dplyr (you’ll only have to do this once), then we’ll load the package (you’ll do this every time you start R… if you’re using functions from this package).

# install.packages("dplyr") # install dplyr
library(dplyr) # load it into the workspace

I’ve already installed dplyr, so I’ve commented out the line install.packages("dplyr"). There are clever ways to have R check to see if a package is installed, but let’s economize our attention–install it once, then comment out that line of code so you remember how you did it.
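
If you’re curious, here’s one common idiom for that check (a sketch, not something you need for this assignment):

# install dplyr only if it isn't installed already
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")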

Now we can start using it.

data.mean <- summarize_each(data,funs(mean)) # calculate the mean
                                             # for every column
print(t(data.mean)) # display the results
##                 [,1]
## age      20.88732394
## soph      0.02112676
## junior    0.38028169
## senior    0.50704225
## senior5   0.09154930
## male      0.52816901
## campus    0.17605634
## business  0.78873239
## engineer  0.04225352
## colGPA    3.06197184
## hsGPA     3.40633803
## ACT      24.17605634
## job19     0.41549296
## job20     0.16901408
## drive     0.20422535
## bike      0.36619718
## walk      0.42957746
## voluntr   0.21830986
## PC        0.40140845
## greek     0.31690141
## car       0.76760563
## siblings  0.93661972
## bgfriend  0.47183099
## clubs     0.60563380
## skipped   1.06866197
## alcohol   1.93697183
## gradMI    0.87323944
## fathcoll  0.59154930
## mothcoll  0.54225352
# t(X) is the "transpose" of X. In this case it turns a hard-to-read
# row of information into an easier-to-read column

We can do the same thing for other variables:

# median
data.med <- summarize_each(data,funs(median))
# range
data.min <- summarize_each(data,funs(min))
data.max <- summarize_each(data,funs(max))
# standard deviation
data.sd <- summarize_each(data,funs(sd))
data.summary <- data.frame(mean   = t(data.mean),
                           median = t(data.med),
                           min    = t(data.min),
                           max    = t(data.max),
                           sd     = t(data.sd))
print(head(data.summary))
##                mean median min max        sd
## age     20.88732394     21  19  30 1.2665841
## soph     0.02112676      0   0   1 0.1443159
## junior   0.38028169      0   0   1 0.4871744
## senior   0.50704225      1   0   1 0.5017201
## senior5  0.09154930      0   0   1 0.2894095
## male     0.52816901      1   0   1 0.5009730

When you’re presenting a statistical analysis, your audience will usually want to know something about the variables you’re using. We can turn data.summary into a table that will give our audience that basic information. You’ll probably want to make it prettier than what R is giving us by default. To do so, you can save it as a csv file, open that file in Excel, then modify it as necessary to make the results more readable:

write.csv(data.summary,"summ.stats.csv",row.names=TRUE)

You can copy this table from Excel and paste it into a Word document. For someone who hasn’t already been working with the data (e.g. your boss) this is much easier to understand and all I had to do was make some minor cosmetic changes. It also means my hypothetical boss won’t be knocking on my door every five minutes to ask “what does bgfriend mean?”

End Step 3

Step 4: Graphing

Now, let’s look at some graphs. As always, there are a few different ways to do this. We’re going to use the ggplot2 package. First install the package, then load it.

# install.packages("ggplot2")
library(ggplot2)

To keep things simple, we’re going to use the qplot() function. More advanced graphs can be drawn using ggplot(). See the cheatsheet for more details, or run ?ggplot on the console to bring up the relevant help file.

To draw a histogram, we want to identify one \(x\) variable (as opposed to an \(x\) variable and a \(y\) variable), what data we’re using (in this case, our data is named data), and what sort of graph we want (a histogram).

print(qplot(x=colGPA,data=data,geom="histogram"))

ggsave("colGPA.png") # This will save the image of the last graph
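
By default, ggsave() sizes the image to match your Plots pane. It also takes explicit dimensions if you want consistent output (the numbers here are just an example):

ggsave("colGPA.png", width = 6, height = 4) # width and height in inches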

To get a scatterplot we need to add a y variable, and change the sort of “geometry” we’re dealing with–i.e. instead of a histogram we want to draw points.

print(qplot(x=colGPA,y=hsGPA,data=data,geom="point"))

# That doesn't look quite right. Let's use jitter so we don't have overlapping data.
print(qplot(x=colGPA,y=hsGPA,data=data,geom="jitter"))

ggsave("colGPA-hsGPA.png")

Just for kicks, here are the same graphs drawn using ggplot(), with a little bit of extra functionality thrown in.

g <- ggplot(data)
h <- geom_histogram(aes(x=colGPA,fill=as.factor(bike)))
print(g+h)

s <- geom_jitter(aes(x=colGPA,y=hsGPA,color=as.factor(male)))
print(g+s)
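
ggsave() can also save a specific plot object instead of whatever was drawn last, which is handy now that we have several graphs floating around (the file names here are just examples):

ggsave("colGPA-by-bike.png", plot = g + h)       # the fancier histogram
ggsave("colGPA-hsGPA-by-male.png", plot = g + s) # the fancier scatterplot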

End Step 4