The Very Basics

Begin by opening the RStudio application from the desktop.

You should see three windows within the application: a large Console window on the left, an Environment/History window on the top right, and a window on the bottom right with several tabs.

While RStudio (and the underlying program R) exist for the analysis of data, they can do other mathematical tasks, such as simple calculations.

Just to get started, let’s add 2+2. In the Console window, type 2+2. You should see something like this:

2+2
## [1] 4

Checkpoint: Use R to evaluate \(\displaystyle\frac{1+\sqrt{5}}{2}\). Use sqrt for the square root, and use parentheses to make sure you are dividing the whole top expression by 2.

(When you type parentheses, brackets, quotation marks, etc., RStudio will furnish them in pairs. You can either retype the closing parenthesis or just use it as provided.)
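
To see why the parentheses matter, here is a small sketch with made-up numbers (deliberately not the checkpoint expression). R follows the usual order of operations, so division happens before addition unless parentheses say otherwise.

```r
# Without parentheses, only the 6 is divided by 2
without_parens <- 4 + 6 / 2
without_parens
## [1] 7

# With parentheses, the whole sum is divided by 2
with_parens <- (4 + 6) / 2
with_parens
## [1] 5

# sqrt() takes the number it works on inside parentheses
root <- sqrt(16)
root
## [1] 4
```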

Often it will be useful for us to store numerical values (and other things) for use later. We do this by making use of variables. Variable names can be anything from single letters to longer expressions. You can assign a value to a variable using the equal sign or by using an “arrow” made of a less-than sign and a dash. If you want to know the value already assigned to a variable, just type the variable name.

bottles.of.beer = 99 # I generally use the = but you may see the arrow too
bottles.of.beer
## [1] 99
bottles.of.beer <- 98 # there is the arrow
bottles.of.beer
## [1] 98

The # hashtag delineates “comments” in R. Anything to the right of the hashtag will not be treated as a command to do something. This is really useful when you put your code together in a script, because you can “take notes” so that you can remember what commands do and even share your code with others.

We can change the value of a variable. (Hence, “variable”, rather than “constant”.)

bottles.of.beer = bottles.of.beer-1
bottles.of.beer
## [1] 97

Checkpoint: Enter your age into R as a variable called “age” and then have R output your age 10 years from now, but do it without changing the value of “age”.

But we generally want to use R for things much more involved than a simple calculator. One of the most important features of RStudio is that it gives you the ability to organize your work, keeping your data, analyses, and even your documents all in one place.

The first step is to establish a Project. This creates a directory where all of the materials associated with the project are kept, and RStudio will then also remember all of the commands you typed when working on that project, all of the data that you used for that project, and all of the code and documents you created in that project.

To create a Project, find the “Project: None” tab in the upper right corner of the RStudio window, then go to “New Project”. In the pop-up window, go to “New Directory” then “Empty Project”. Then you can Browse to where you want to make the project. If you are working on a Kenyon computer, I highly recommend that you make a directory on your H: drive, then name your project something like “Math106” or “Stats”. Your RStudio window will then refresh and you may notice a few changes to some of the windows.

When you use RStudio for this class, always open this project before you start working. That will keep all of your work in one place. You may also want to create other projects for more specific…well, projects…

In addition to projects, you also need to keep track of your analysis code using R-scripts or just scripts. Rather than typing directly into the console window, we can open a fourth script window in the top left of the screen. From here we can type our commands and run them in the console, but we also have the full functionality of a text editor, in case we want to copy and paste and (especially) save a set of commands we’ve typed and come back to use them later.

To open an empty script, go to File -> New File -> R Script. At the top of any new script file, use hashtag comments (#) to enter your name, the date, and “My first R Script” into the first three lines of the file.

Next, type 2+2 in the script window and hit “Enter”.

You’ll notice that nothing happened. The script is just a text file, but we can run the commands down in the console window by hitting the “Run” button at the top right of the script window. Just position your cursor anywhere on the line with the 2+2 and hit “Run”. Now look down in the console window. You should see 2+2 copied down there and executed to give you the answer.
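
Putting the last few steps together, a first script might look something like the sketch below. The name and date are placeholders, so substitute your own; storing the result in a variable called answer is just one illustrative choice.

```r
# Name: Your Name (placeholder)
# Date: today's date (placeholder)
# My first R Script

answer <- 2 + 2  # put the cursor on this line and hit "Run"
answer           # typing the name prints the stored value
## [1] 4
```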

You can save scripts just like you would any text file, to your computer or a networked drive or wherever. If you simply save and name the script, like “MyFirstScript”, RStudio will keep it in your project directory, which you can see using the “Files” tab in the lower right window.

Speaking of the windows on the right, let’s explore them a little more.

First, let’s load some data. (In the top right window, the “Environment” tab should be in front; if it’s not, bring it to the front now.) Click, reasonably enough, on “Import Dataset”. The data we’ll be working with today is in the P: drive in the folder Data/Math/Kerkhoff/math106/Data/. It’s called “classdata15.csv”. (The extension “csv” stands for “comma-separated values” and it’s the lingua franca for spreadsheet data saved as plain text.) Data entry, while not literally impossible in R, is definitely not recommended. Any time you want to enter data by hand, use a spreadsheet program and save as .csv for import by R.

A scary-looking window will pop up. For now, just accept all of the defaults (please double-check that “Yes” is checked next to “Heading”) and click “Import”. A few things should happen.

classdata15 <- read.csv("/Volumes/public/Data/MATH/Kerkhoff/math106/Data/classdata15.csv")
View(classdata15)  #Your command above will look different on PC

This is what actually did the importing of the data. You can also put these lines (or their analogs for different files) into scripts so that you can just click “Run” to import data if you are working with a data set repeatedly.
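
If you want to try read.csv without the class file, the sketch below may help. It writes a tiny made-up CSV to a temporary file first, so it runs anywhere; the name toydata and the values in it are invented for illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Height,Sex",
             "64,Female",
             "70,Male",
             "67,Female"),
           csv_path)

# read.csv() treats the first line as column names by default (header = TRUE)
toydata <- read.csv(csv_path)

str(toydata)   # compact summary of the columns and their types
head(toydata)  # first few rows
```

In your own scripts you would replace csv_path with the path to the real file, as in the import command above.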

The “classdata15” object that you see in the environment window is a Data Frame. This is how R keeps data. It is important to note that the data frame is internal to R. It is not the .csv file on your computer, so changes that you make to it in R are not saved to that file. Notice in the “spreadsheet” on the upper left that the data frame has a particular format, with each column representing a different “variable” and each row representing a different “case”. This format is important because each row is then a set of observations about the same object or person. Also notice that the column headings are relatively short, but meaningful, and they contain no spaces. Using sensible variable names like this will make writing R code a lot easier, so try to get into the habit.
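
Here is a small sketch of what “internal to R” means in practice. It uses a made-up two-row file rather than the class data, and the names toydata and fresh_copy are invented for the example.

```r
# A tiny made-up CSV stands in for the real data file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Height,Sex", "64,Female", "70,Male"), csv_path)

toydata <- read.csv(csv_path)

# Pull out one column (one "variable") with the $ operator
toydata$Height
## [1] 64 70

# Change the copy that lives inside R
toydata$Height[1] <- 65

# Re-reading the file shows that the .csv itself was not changed
fresh_copy <- read.csv(csv_path)
fresh_copy$Height[1]
## [1] 64
```

If you ever do want to keep a changed data frame, write.csv can save it to a new file.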

And finally, let’s use the lower right window. Bring the “Packages” tab to the front. Lots of people have written extra code to go on top of basic R, for everything from specialized data analysis to fancy graphics. We’ll be making regular use of a package called “mosaic”, which was written with intro stats students in mind. Click on the box next to “mosaic” to load that package. (Lots of stuff happens in the console window when you do this. Don’t worry about the details, but let me know if ugly warnings crop up.)

If you want to make sure this happens when you run a script, you can put the command

require(mosaic)

in your scripts before you get to any actual data analysis. If you install R and RStudio on a personal computer, mosaic is not part of the basic distribution, so you will need to install it once; the “Install” button on the “Packages” tab will get it for you.
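
If you would like a script to check for the package before loading it, something like the sketch below is one option. The variable name mosaic_installed is made up for the example. Note that library stops with an error when a package is missing, while require only returns FALSE with a warning.

```r
# Check whether mosaic is installed, without loading it yet
mosaic_installed <- requireNamespace("mosaic", quietly = TRUE)

if (mosaic_installed) {
  library(mosaic)  # load the package for this session
} else {
  # Installing needs an internet connection and only has to be done once
  message("mosaic is not installed; run install.packages(\"mosaic\") and try again.")
}
```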

Okay, we’re finally ready to look at some data!

Dotplots

Go up to the script window, type the following, and hit “Run” (or ctrl-R). Notice the capitalization and the ~. These are important. R will only do exactly what you tell it to do! (In particular, dotPlot and dotplot are different functions in R.)

dotPlot(~Height, data=classdata15)

You should see a graph appear in the “Plots” tab on the lower right of your screen. As you produce more plots, they’ll all be there. You can navigate back and forth with the left and right arrow buttons.

This is a dotplot of the type you read about in Chapter 1. Click on the “Zoom” button to get a larger version in a pop-up window.

Checkpoint: Make a dotplot of some other numerical data from our class data set.

But we will almost never use dotplots like that. You almost never see them outside of statistics textbooks. Instead, let’s focus on some basics about histograms and bar graphs, which are important tools for visualizing quantitative and categorical data.

Checkpoint: What is the difference between a histogram and a bar graph? Which is used for quantitative data, which is used for categorical data?

In what follows, I won’t reproduce all of the graphs you should be getting, but I encourage you to compare with your neighbors to make sure you’re getting the same figures.

Histograms

Here’s a histogram version of the same data, with an x-axis label added.

histogram(~Height, data=classdata15, xlab="Height (inches)")

Don’t worry for right now about what “Density” means on the vertical axis. (It has to do with some probability we’ll learn in a couple of weeks.) We can change that to a relative frequency measure. That will tell us what percentage of the class is in each of the height categories indicated by the different vertical bars.

histogram(~Height, data=classdata15, type="percent",
          xlab="Height (inches)")
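
To see where those percentages come from, here is a rough sketch that does the arithmetic by hand in base R. The heights are made up (they are not the class data), and the 5-inch bins are an arbitrary choice for illustration.

```r
# Made-up heights in inches, standing in for the Height column
heights <- c(61, 63, 64, 64, 66, 67, 67, 68, 70, 73)

# Sort each height into 5-inch bins: [60,65), [65,70), [70,75)
bins <- cut(heights, breaks = c(60, 65, 70, 75), right = FALSE)

counts   <- table(bins)                     # how many heights fall in each bin
percents <- 100 * counts / length(heights)  # share of the group in each bin
percents

sum(percents)  # the bar heights always add up to 100
## [1] 100
```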

But how did R decide how many vertical bars to make and where to put the break points? That ends up being complicated, but the good news is that we can tell R how we want them done. This choice of break points between bars (called “binning”, from the idea of sorting heights into various bins before counting) makes a big difference to how a histogram looks. Try these:

histogram(~Height, data=classdata15, type="percent", width=6)
histogram(~Height, data=classdata15, type="percent", width=1)
histogram(~Height, data=classdata15, type="percent", nint=14)
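
To look at the effect of binning in numbers rather than pictures, the sketch below uses base R’s hist function with plot = FALSE, which does the counting without drawing anything. This is a stand-in for illustration: the heights are made up, and mosaic’s histogram may place its bin edges differently for the same width.

```r
# Made-up heights in inches, standing in for the Height column
heights <- c(61, 63, 64, 64, 66, 67, 67, 68, 70, 73)

# Same data, two different sets of break points
wide   <- hist(heights, breaks = seq(60, 78, by = 6), plot = FALSE)
narrow <- hist(heights, breaks = seq(60, 74, by = 1), plot = FALSE)

wide$counts    # three 6-inch bins: overall shape, little detail
## [1] 5 4 1

narrow$counts  # fourteen 1-inch bins: lots of detail, many small or empty bars
```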

Checkpoint: Make histograms with different binning for the data in the column PulseRest. Play around; see how things look different with different bin widths. Which one seems to give the clearest idea of a distribution?

We can get R to give us histograms broken down by category in separate panels using the | operator:

histogram(~Height | Sex, data=classdata15, type="percent")
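
As a rough illustration of what the | does, the sketch below builds a small made-up data frame and draws the same kind of paneled histogram with lattice, the graphics package that mosaic builds on. The data values and the names toydata and p are invented for the example.

```r
library(lattice)  # mosaic's histogram() extends the lattice version

toydata <- data.frame(
  Height = c(62, 64, 65, 66, 68, 69, 71, 73),
  Sex    = factor(c("Female", "Female", "Female", "Female",
                    "Male", "Male", "Male", "Male"))
)

# How many cases will land in each panel
table(toydata$Sex)

# The variable after the | defines the panels: one histogram per category
p <- histogram(~Height | Sex, data = toydata, type = "percent")
print(p)
```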

Checkpoint: Look at resting pulse by cat vs. dog preference.

Continue exploring. Whatever strikes your fancy.

Bar Graphs

We won’t make many bar graphs in this course, but R can make them and they might be fun when exploring our class data. Here’s how to make a simple chart of the number of people in each class year.

bargraph(~Class, data=classdata15)

Notice that R defaults to presenting any categorical data in alphabetical order. It turns out to be just a little complicated to change that. Feel free to Google how to switch the order if you’d like, but we’ll be making so few bar charts in this class that I didn’t think it worth our time to go over it.
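
For the curious, one common approach is to store the column as a factor with its levels listed in the order you want. The sketch below uses a made-up data frame, and the class-year labels are hypothetical; the labels in our class data may be spelled differently.

```r
toydata <- data.frame(
  Class = c("Sophomore", "First-year", "Senior", "Junior", "First-year", "Sophomore")
)

# By default, the categories are sorted alphabetically
levels(factor(toydata$Class))

# Listing the levels explicitly sets the order used in tables
toydata$Class <- factor(toydata$Class,
                        levels = c("First-year", "Sophomore", "Junior", "Senior"))
levels(toydata$Class)
table(toydata$Class)
```

Plots generally follow the factor’s level order as well, so a bar graph made after this step should show the bars in the new order.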

R can also make comparative bar charts.

bargraph(~Class, groups=Sex, data=classdata15, auto.key=TRUE)
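
The bar heights in a comparative chart are just counts from a two-way table. Here is a minimal sketch with a made-up data frame (not the class data) using base R’s table function; the name counts is invented for the example.

```r
toydata <- data.frame(
  Class = c("Junior", "Junior", "Senior", "Senior", "Senior", "Junior"),
  Sex   = c("Female", "Male", "Female", "Female", "Male", "Female")
)

# One row per class year, one column per group; each cell is one bar's height
counts <- table(toydata$Class, toydata$Sex)
counts
```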

Notice the color scheme. We can change that, but it’s surprisingly complicated:

bargraph(~Class, groups=Sex, data=classdata15,
         par.settings=simpleTheme(col=c("blue", "red")), auto.key=TRUE)

I’ve got to think there’s a better way. If you know of it, please let me know!