Lab 1: Introduction to R and RStudio

Complete all questions and submit your final PDF under Assignments in Sakai.

Welcome to statistical computing! Throughout this course, we will be using statistical software called R to practice the statistical concepts discussed in class, to analyze real data and to draw informed conclusions. We will be working with R through an interface called RStudio. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface that we will be using. RStudio allows us to see visualizations, explore stored data sets and create PDF documents all within the same interface.

We assume you have no experience with programming or with R as you come into this course, so don't worry if programming is something you have never tried. We will build the skills we need as we move through the class.

Why do we need to learn a computing language at all? In practice, real statistical analysis is performed using some sort of computing language. R, Stata, SAS, Python, MATLAB and Excel are just a few examples. We will work with R because it is (1) free, (2) widely used in industry and academia and (3) highly adaptable to multiple disciplines. Familiarity with R is something you can put on your resume, and it can be beneficial when applying for internships and jobs.

If you haven't already, make sure to watch the video on Canvas to set up your RMarkdown file and get ready for lab! The section below goes through the same steps as the video, but in words. If you have already watched the video, you can skip to the section called "Answering the lab questions".

Creating a reproducible lab report

For everything we do in this class that involves R, we will use RMarkdown. The beauty of RMarkdown, as we will see, is that plots and any other output you create in R can be instantly combined with any text you want. In other words, there is no need to copy and paste plots or tables as you create them. Instead, you will create the plot where you want it in your RMarkdown. This allows you to complete your lab entirely in RStudio, as well as ensuring reproducibility of your analysis and results.

To create an RMarkdown document, look at the top of your RStudio screen. Below the word "File" in the upper toolbar, you should see a symbol that looks like a small piece of paper with a green plus sign in the upper left corner. Click on it.

Choose the option RMarkdown, which should be the third option presented. You will be prompted to name your document. Make sure the title you choose matches the lab title. In this case, use "Lab 1". For the author, write your name. Finally, select PDF as your preferred output format.

Your RMarkdown document will appear in template form. Before doing anything else, find the Knit option at the top of your window. You will see a ball of yarn with a knitting needle next to the word Knit. Click on the drop down arrow and choose Knit to PDF. At this point, you will be prompted to save your document. I would highly recommend saving the document somewhere you can find it later. When you name your document, make sure that you choose a name without spaces. For instance, choose "Lab1" over "Lab 1". Including spaces will make it difficult for the computer to save your document, and will sometimes cause it to refuse to knit. Remember: When you are naming a document, use no spaces and no special characters like apostrophes or hyphens.

After you have saved the document, your Markdown will start to compile. After a few seconds, you should see your document in a new pop-up window.

NOTE for PC Users: If you are using a PC, you may get a prompt asking you to update or install packages. Agree to the requests! You may have to do this a few times (meaning a few little windows may pop up), but once you have everything updated, you will not have to do this again.

NOTE for Mac Users: If you are using a Mac, you may get an error when you try to knit. This means that instead of a document appearing, you may see a red error message in the console that tells you that MacTeX is not installed, even though you have installed it already. This is a common issue. Go to the top of your window and find Knit again, but this time click on the black drop down arrow to the right of Knit. Instead of Knit to PDF, choose Knit to Word. A Word file should appear. If so, this is your process for the rest of the course: do not knit to PDF; just knit to Word. When you are ready to submit to Sakai, save your Word document as a PDF. Only PDF documents will be accepted for grading.

Take a look at the file that appeared when you Knit. You will notice text, code and plots in your document. Again, this is the power of Markdown. We can combine code, output and text all in one convenient place. Now, this is just template material to help you see what a Markdown document will look like. This same template will appear every time you create a new Markdown. As we move through the class, we will learn tricks for formatting Markdown files to suit your own preferences.

Formatting your Markdown file

Go back to your Markdown file. At this stage, we are still looking at the template. Let's delete everything that is not necessary for our work today. For every Markdown you create in this class, this includes everything from line 12 to the end of the document. Go ahead and delete everything from line 12 to the end. DO NOT delete anything above line 12. You need those commands in order for Markdown to knit.

Now that we've cleared away what we don't need, we are ready to begin putting in our own commands. The very first step in starting any new Markdown file is to load in the data that you need. If you do not load the data inside of your Markdown file, your Markdown will not Knit. Go to line 12 on your Markdown file. Here, we are going to insert a code chunk so that we can load the data.

Look at the upper right hand corner of your Markdown file and find Insert. Click Insert and choose R. On line 12 of your Markdown file, a gray box will appear. This box is called a code chunk, or chunk. Anything you type inside of the code chunk will be processed as an R command, in other words, something that the computer will recognize as code. For this particular chunk, we want to load in data. To do that, enter the following command in the R chunk. You can either type it in manually or copy and paste it from this document.

source("http://www.openintro.org/stat/data/present.R")

Now, when you paste the command above, the words will appear in your gray box, but nothing else appears to happen. This is because we have not yet run the command. This command will actually tell R "hey, go to this website and download a data set called present", but R won't do that until we tell it to do so.

To actually tell R to go and download the data, hit the green triangle button to the right of your chunk. We will call this the play button. When you hit play, R runs all the commands in the chunk. In this case, it runs the one line command that instructs R to access the OpenIntro website and fetch some data: the number of boys and girls born in the US each year.

You should see that the workspace area in the upper right hand corner of the RStudio window now lists a data set called present that has 63 observations on 3 variables. As you interact with R, you will create a series of objects in this panel. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed. Note that because you are accessing data from the web, this command (and the entire analysis) will work in a computer lab, in the library, or in your dorm room; anywhere you have access to the Internet.

Now, at this point we can work with the data AND, because the command to load the data is part of our Markdown file, our document will Knit. We have downloaded it from the web into R and can now create summaries, make plots, etc., using this data. However, this does not mean that we can delete the chunk. Why? Because RMarkdown (the tool you will use to create your final assignment to turn in) needs it.

This is an important distinction. For R, there is no real need to keep the command sitting in a chunk once you have hit play; the command has already been run. In this case, the command was supposed to download data. Once we've used that command, we no longer need it.

This is not true for RMarkdown. Unlike R, RMarkdown creates a document by running every single command in the document from top to bottom. Think of RMarkdown as a creature with a very, very short memory. Every single time you knit, RMarkdown literally forgets everything you told it to do last time. It doesn't matter that last time you hit knit it saw the chunk that said "hey, download data". It needs to see the chunk again. "Hey, you want me to download data!" So, any command that you need in your Markdown must be in a code chunk when you knit. And the code chunks need to be in the order you want them run. RMarkdown starts at the top and works its way down.

So, every single time you make a Markdown file, you should have a chunk at the top that tells RMarkdown where to find the data it will need for the analysis you want to perform. Every. Single. Time.

Answering the lab questions

Now that you have loaded the data, you are ready to start the lab. Go to line 16 of your Markdown. Notice that this leaves one line of blank space between your code chunk and what you are about to type. This space between text and code chunks is necessary in order for your document to format properly.

On line 16, you are going to create a section header for the first lab question. Type two # signs on line 16, followed by a space, and then type Question 1. In other words, line 16 should have ## Question 1. This creates a new section in your Markdown called Question 1. Go ahead and knit; see that you have created a section.

You can create as many sections as you like. Under a section header, you have the ability to type, just like you would in a Word document, so you can type the responses to questions. The only thing you need to remember is to put a single blank line between the ## Question 1 and whatever you start to type. RMarkdown really likes these blank line separators. Let's try typing in our Question 1 section.

What is the very first step in creating any new Markdown file?

Answering the question above involves only words. What if you needed to include code? For instance, what if you were asked to make a plot as a part of your answer? Within a section, you can insert code chunks. Remember, a new chunk can be inserted by clicking on the Insert Chunk button (drop down menu under Chunks on the upper right corner of your markdown document). To practice this, let's move on to question 2. Create a new section called ## Question 2.

Create a code chunk. In the chunk, type the command plot(cars). Show the plot as part of your answer.

Knit your document now, and make sure that so far, your document contains two sections, Question 1 and Question 2, the latter of which has a plot beneath it.

The plot you have created examines the distance (dist) a car takes to stop in relation to its speed (speed), and this information is stored in a data set called cars. Notice that we did not have to load the data before making our plot. This is because there are a few data sets (like cars and ChickWeight) that R (and RMarkdown) already know. In general, you will need to load the data yourself.

The command you just used is a very typical one for R. We take an object (like the cars data that is pre-loaded into R) and we tell R to do something with it (in this case, make a plot). The command structure says "I want to make a plot using the cars data, therefore I use the command plot(cars)."
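The same command(object) pattern works with other commands too. The short sketch below applies a few of them to the built-in cars data; head, nrow and summary are not needed for this lab and appear here only to illustrate the pattern.

```r
# Each line follows the pattern command(object), using the built-in cars data.
head(cars)      # show the first six rows of the data
nrow(cars)      # count the rows in the data
summary(cars)   # give a numeric summary of each column
```

In each case the command name says what to do, and the object inside the parentheses says what to do it to.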

At this point, we are going to begin working with the present data. Remember, every time you have a new question, create a new section.

Exploring the Data Set

Recall that at the beginning of the lab, we loaded a data set called present using the command:

source("http://www.openintro.org/stat/data/present.R")

This data set tells us the number of male and female births in the United States from 1940 to 2002. These data come from a report by the Centers for Disease Control (https://wonder.cdc.gov/natality.html). The CDC source also contains a lot of additional information on births, causes of death, etc.

Let's take a look at the data that we will be working on. To take a look at the data set present that we have loaded, take a look at your workspace, i.e., the upper right hand panel of your RStudio window. See the name of the data set? Click on the name (not on the arrow next to it!). The data set should pop open.

What information is provided in the first row of the data set?

Another way of looking at the data allows you to look at the data within the Markdown document. As a first step, create an R chunk. Note: As we have not actually begun a new question yet, you don't need to create a new section unless you want to. This is just some data exploration.

Now, we can take a look at the data by typing its name into the chunk and hitting the play button (the green arrow at the upper right hand corner of the chunk).

present

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls born that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of the present data set. R adds them as part of its printout to help you make visual comparisons. It just says "hey, this is the 17th row", etc. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored the data in a kind of spreadsheet or table called a data frame.

When you are done looking at the data, look at the top of the data window, right below the base of the chunk. See the little x? Click it. This will close the output of the code. Notice that the code itself does not disappear, only the output does.

Remember that the output of any code you put in the chunk will appear in your PDF after you knit. Do we really want all of the data to be printed out in the PDF? Probably not. To make sure that does not happen, put a # in front of present. In other words, your chunk should now look like #present. Hit the play button on the chunk again. See how nothing appears? The # sign tells R to ignore this line of code; it will not run the line, which is what we want.
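As a small illustration of how # behaves, the sketch below uses the built-in cars data rather than present. It also shows head, a command not otherwise used in this lab, as one possible way to print only a few rows when you do want a peek at the data in your knitted document.

```r
# A line that starts with # is a comment: R skips it entirely.
# cars

# Text after a # on the same line as a command is also skipped.
head(cars, 3)   # print only the first 3 rows instead of the whole data set
```

Here the first command never runs, so nothing is printed for it, while the second prints three rows.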

Rows and Variables

Now that we have the data, it's always a good idea to explore the size or dimensions of the data set. In other words, we should determine how many rows are in the data set, and how many variables.

You can see the dimensions of this data frame by creating a new line in your chunk, typing in the code below, and hitting the play button:

dim(present)

In R, the format for coding is command(object). In other words, the command tells R what you want to do. The command dim tells R we want to print the dimensions of a data set. The object piece tells R what object or data set you want the command to work with. So, dim(present) tells R we want to see the dimensions of the present data set.

The command dim(present) should output ‘[1] 63 3’, indicating that there are 63 rows and 3 columns (you can ignore the [1]; it simply marks the position of the first value printed on that line). The same dimensions can be seen in your workspace (upper right hand corner) next to present.
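To see the same idea on a different data set, here is a brief sketch using the built-in cars data. The commands nrow and ncol are extras not required for this lab; they return the two pieces of dim separately.

```r
dim(cars)    # rows first, then columns: prints [1] 50 2
nrow(cars)   # number of rows only
ncol(cars)   # number of columns only

# When output is long enough to wrap onto several lines, each line starts
# with a bracketed number giving the position of its first value.
1:40
```

The last line is only there to show the bracketed position markers in action.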

How many cases, or observational units, are in the present data set? What are they? In other words, how many rows are there and what does each row represent?

How many variables are included in the present data set? What are they? Hint: There are two ways to learn this. You can (1) open the data set and look or (2) use the code
```
colnames(present)
```
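As an illustration on a different data set, the sketch below runs the same command on the built-in ChickWeight data, which we will use later in this lab. The length command is an extra, shown only as one way to count the names.

```r
colnames(ChickWeight)           # prints "weight" "Time" "Chick" "Diet"
length(colnames(ChickWeight))   # counts the variable names
```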

Types of Variables

There are two main types of variables. Numeric, or quantitative, variables are variables that we want to treat as numbers. It should make sense to add, subtract, multiply, etc., these variables. Some examples include height, weight, temperature, number of textbooks in your backpack, etc. Categorical, or qualitative, variables are variables that we want to treat as categories. One example might be your class: are you a first year, sophomore, junior or senior? The answer to this question assigns you to a category.

When deciding whether a variable is numeric or categorical, be careful not to just look at the name of the variable. Instead, you want to look at the variable itself. What do the responses look like?

Is the variable boys numeric or categorical? What about the variable girls? Explain your choice.

Now, the variable year is a little strange. The variable contains numbers, ranging from 1940 to 2002. So, is year a numeric variable? Actually, it depends. We can use year as a number, adding, subtracting, etc. However, even though year is a number, we can also use it as a category. Each row corresponds to a specific year, and then tells us how many boys and girls were born in that year. So, we can use year as a category describing that particular set of birth counts. Think about a hypothetical variable academic_class. What if instead of freshman, sophomore, junior, senior, the variable was recorded as 1,2,3,4? Those numbers still denote categories, and as such academic_class should still be treated as a categorical variable. The moral of the story? Think before you classify.
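The academic_class idea can be sketched in R. The values below are made up purely for illustration, and the commands <- (store a result under a name), factor and table are not needed for this lab; they are shown only as one way to tell R that numbers should be treated as categories.

```r
# Hypothetical data: class year recorded as the codes 1 to 4 for six students.
academic_class <- c(1, 2, 3, 4, 2, 1)

# Treated as numbers, R will happily average the codes,
# even though an "average class" is not meaningful here.
mean(academic_class)

# factor() tells R to treat the codes as categories and attaches a label to each.
academic_class_cat <- factor(academic_class,
                             levels = c(1, 2, 3, 4),
                             labels = c("First year", "Sophomore", "Junior", "Senior"))

# table() counts how many students fall into each category.
table(academic_class_cat)
```

R does whatever we ask with the numbers, so deciding whether a variable is numeric or categorical is up to us.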

What if we want to look at just one variable in the data set? We know how to look at the whole data set, but we can also access the data in a single column of a data frame separately using a command like

present$boys

This command will only show the number of boys born each year. The structure of this command is data set$column. We are telling R "hey R, take a look at the present data set. Go inside the present data set ($) and take a look at the column called boys."

Remember, if you run a command but do not want it to show up in your final document, put a # before the command: #present$boys
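The data set$column structure works the same way for any data frame. As a brief sketch, here it is with the built-in cars data, whose columns are called speed and dist; head and length are extras used only to keep the output short.

```r
cars$speed           # every value in the speed column
head(cars$dist)      # just the first six values of the dist column
length(cars$speed)   # one value for each row of cars
```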

What command would you use to extract just the counts of girls born?

Making Plots

We have learned in class about different types of visuals that we can create to explore the data. Let's try making one in R. We are going to use the ChickWeight data set to create the plot. This is a data set that R has pre-loaded, so you do not need to load it. The data set explores the impact of different diets on the weights of chicks. There are 4 different diets used during the study.

Use the command plot(ChickWeight$Diet, col="cyan") to make a plot. What type of plot have you made? Based on this plot, have you plotted the categorical variable representing the diet of the chick or the numeric variable weight representing the weight of the chick?

Create the same plot, but make the bars red rather than cyan. Show your code.

Use the command plot(ChickWeight$weight, col="cyan") to make a plot. What type of plot have you made? Based on this plot, have you plotted the categorical variable representing the diet of the chick or the numeric variable weight representing the weight of the chick?

Let's explore those plotting commands in more depth. We can create a simple plot of the number of girls born per year with the command

plot(x = present$year, y = present$girls)

Again, this gives R a command, plot. What do we want to plot? On the x-axis, we want to plot present$year and on the y-axis we want present$girls.

If you put the command above in a code chunk and hit play, the plot appears in your Markdown window. If you want the plot to close, hit the x in the upper right hand corner of the plot.

What kind of plot have we just created? Why is it appropriate to use this kind of plot?

Notice that the command above again looks like a function, this time with two arguments separated by a comma. The first argument in the plot function specifies the variable for the x-axis and the second for the y-axis. If we wanted to connect the data points with lines, we could add a third argument, the letter l for line.

plot(x = present$year, y = present$girls, type = "l")

In this case, type="l" tells R we want a graph with lines. If we used type="p", we would see points instead of lines (try and see!).
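To compare the options side by side, here is a small sketch using pressure, another data set built into R, with columns called temperature and pressure. It is used here only for illustration and is not part of this lab. The option type = "b" is an extra that draws both points and lines.

```r
# pressure is built into R and has two columns: temperature and pressure.
plot(x = pressure$temperature, y = pressure$pressure, type = "p")  # points only
plot(x = pressure$temperature, y = pressure$pressure, type = "l")  # lines only
plot(x = pressure$temperature, y = pressure$pressure, type = "b")  # both points and lines
```

Each command draws a separate plot, so running the chunk produces three plots, one for each type.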

Looking at the plot, we notice that the names of the x and y axes are a little strange. By default, R names these axes with the exact names of the variables. We can change the label for the axes like so.

plot(x = present$year, y = present$girls, type = "l",  xlab= "Year",  ylab= "Girls")

How do you need to change the above code if you want the y-axis to be labeled "Number of Girls"? Show the code.

Is there an apparent trend in the number of girls born over the years? Describe the trend.

We can make another kind of plot using the number of girls.

Use the command hist(present$girls) to create a plot. What kind of plot has been created? Label the x-axis so it says "Number of Girls".

Does the plot suggest that the data is right skewed, left skewed, or symmetric?

Based on this, would you want to use the mean or the median as the measure of center? Explain.

Run the command summary(present$girls). State the mean and the median number of female births, and explain why the mean is less than the median.

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted.

This lab was written by Nicole Dalzell at Wake Forest University. It is based on a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license and that was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.