Activity 1: Learning R
The Goal
When you work with real-world data, you will use statistical software: computing languages and programs specifically designed to let you work with data efficiently and quickly. In this lab, we are going to start exploring one of the more popular statistical languages, R.
R is a computing language that is used in both academia and in industry, so knowing R is a nice thing to add to your CV as you apply for internships, jobs or graduate school. Today, we are going to start familiarizing ourselves with R through the process of exploring data. Don’t worry if you have never done computing before - we are going to learn from the ground up in this course!
Loading the Data
The first thing we need for any data analysis is…data. This means one of the first things we need to learn how to do in R is get data. Today, we are going to work with data on nutrition information for Tall (12 ounce) drinks at Starbucks.
Step 1: Download the Data
The data you need can be downloaded here. Save the data file on your computer somewhere you can find it - you will need to access it in just a second.
Step 2: Move the data into R
Now that we have downloaded the data, we need to move the data from your computer into R. To do that, look at your Environment Panel in RStudio - this is the upper right hand panel on your RStudio screen. You will see an option at the top of this tab called Import Dataset. Click on it!
After clicking on Import Dataset, you will be presented with a few choices. We are looking for one of two choices: “From Text (base)” or “From CSV”. The option you are presented with will depend on your machine, but either one works! The purpose of these options is to tell R what type of file it should expect to read.
Once you have selected the appropriate option, you will be prompted to browse within your computer files to find the data set. Go ahead and navigate to the folder where you stored the Starbucks data, and click on it. A preview of the data set will appear.
Make sure that your data has a heading/header, i.e., that there are column labels on the data set. If it does not, find the button on the import window that says “Heading” and click TRUE.
Once the preview looks okay, go ahead and load/import the data. Take a look in your Environment: you should now have a data set called Starbucks with 230 observations.
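If you are curious, the same kind of import can also be written as code. Here is a minimal sketch, assuming a tiny made-up CSV file in place of the real download: the file location, the Beverage column, and all of the values are hypothetical, and only the Calories and Sugars column names come from this activity.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a
# temporary location. The values below are made up for illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Beverage,Calories,Sugars",
             "Drink A,80,10",
             "Drink B,150,25",
             "Drink C,350,50"),
           toy_path)

# header = TRUE tells R that the first row holds the column labels
toy <- read.csv(toy_path, header = TRUE)

nrow(toy)   # number of observations (rows)
names(toy)  # column labels read from the header row
```

With the real file, you would put the path to wherever you saved the Starbucks data in place of toy_path.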
Step 3: Looking at the Data
Now the data is in R. Great…now what? Well, the first thing we generally want to do is look at the data.
To open your data set, look back at your Environment Tab (the upper right hand panel of your RStudio screen). You will see Starbucks, which is the name of the data set you loaded. Click on the word Starbucks and you will be able to see the data set!
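Clicking works well, but you can also take a quick look at a data set with code. The sketch below uses a small made-up data set called toy (a hypothetical stand-in, not the real Starbucks data) so that it runs on its own.

```r
# Hypothetical toy data standing in for the Starbucks data set
toy <- data.frame(
  Calories = c(80, 120, 150, 200, 350),
  Sugars   = c(10, 18, 25, 32, 50)
)

dim(toy)      # number of rows, then number of columns
head(toy, 3)  # the first three rows
str(toy)      # each column's name, type, and first few values
```

The same three commands should work on Starbucks once it is loaded.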
Exploring the Data: Measures of Center
The Starbucks data set has 230 rows and 18 columns. One of those columns, Calories, tells us the number of calories in each Starbucks drink. This is the variable we want to focus on for now.
To focus on just one column in R, we use the $ symbol. The name of the data set goes before the $ and the name of the column goes after it. This means that to print out just the column Calories in our Starbucks data set, we use
Starbucks$Calories
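To see what the $ symbol hands back, here is a small sketch on a made-up data set called toy (hypothetical values, not the real Starbucks data). The result is a list of numbers with one entry per row.

```r
# Hypothetical toy data standing in for the Starbucks data set
toy <- data.frame(
  Calories = c(80, 120, 150, 200, 350),
  Sugars   = c(10, 18, 25, 32, 50)
)

# Data set name, then $, then column name
calories <- toy$Calories
calories

# One value per row of the data set
length(calories) == nrow(toy)  # TRUE
```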
Let’s try a different column.
Question 1
Print out just the column for Sugars in the data set.
When we print out the column, we are able to see every value in that column. That’s useful, but typically we want to get some sort of summary of the column. What is the mean? The median? And so on.
Luckily, obtaining these summary statistics is simple in R. To obtain the mean (the average number of calories), we use
mean(Starbucks$Calories)
Question 2
What is the average number of calories in the 230 Starbucks drinks in this sample?
Essentially, this code (mean(Starbucks$Calories)) is built up of two pieces. The first (mean) is the command. This tells R what we are hoping to find - in this case, the mean. The second piece of code tells R the data (Starbucks$Calories). This is what we want R to take the mean of.
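To make the "command plus data" idea concrete, the sketch below computes a mean by hand (add up the values, then divide by how many there are) and compares it with the mean command. The numbers are made up for illustration and are not taken from the Starbucks data.

```r
# Hypothetical calorie values, made up for illustration
calories <- c(80, 120, 150, 200, 350)

# The mean by hand: total divided by the number of values
by_hand <- sum(calories) / length(calories)
by_hand         # 180

# The same calculation using the command
mean(calories)  # 180
```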
Question 3
What is the average number of grams of sugar in the 230 Starbucks drinks in this sample? Hint: The information is stored in the Sugars column in the Starbucks data set.
Okay, so we can find the mean. The mean isn’t the only measure of center. What code do we use if we want to find the median? Luckily, this is just what you might expect - we use median as the command instead of mean.
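Here is a small sketch of the median command at work, again on made-up numbers rather than the Starbucks data. The median is the middle value once the data are sorted; with an even number of values, R averages the two middle ones.

```r
# Hypothetical calorie values, made up for illustration (not sorted)
calories <- c(350, 80, 200, 120, 150)

sort(calories)            # 80 120 150 200 350
median(calories)          # 150, the middle of the five sorted values

# Adding a sixth value gives an even count, so the two middle
# values (150 and 200) are averaged
median(c(calories, 400))  # 175
```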
Question 5
What is the median number of calories in these 230 Starbucks drinks? Based on this, do you think that the distribution of Calories is right skewed, left skewed, or symmetric? Why?
Exploring Data: Measures of Spread
In addition to measures of center (what is a typical value of the variable), we also report measures of spread (about how far do we expect data to deviate from this typical value). Just like measures of center, we can use R commands to find measures of spread.
To obtain the standard deviation of calories, we use
sd(Starbucks$Calories)
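If you want to see what sd is doing, the sketch below rebuilds the sample standard deviation step by step on made-up numbers (not the Starbucks data) and compares it with the command. Note that the sum of squared deviations is divided by one less than the number of values.

```r
# Hypothetical calorie values, made up for illustration
calories <- c(80, 120, 150, 200, 350)

# How far each value sits from the mean
deviations <- calories - mean(calories)

# Sample standard deviation: divide by (n - 1), then take the square root
by_hand <- sqrt(sum(deviations^2) / (length(calories) - 1))
by_hand

# The same calculation using the command
sd(calories)
```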
Question 6
What is the standard deviation of the number of grams of sugar in the 230 Starbucks drinks in this sample? Hint: The information is stored in the Sugars column in the Starbucks data set.
Standard deviation is a great measure of spread if our data is fairly symmetric. However, for skewed data, we typically use the interquartile range (IQR), where
\[IQR = Q3 - Q1.\]
Here, Q3 is the third quartile and Q1 is the first quartile. R does have a built-in IQR command, but in this activity we will build the IQR ourselves so we can see where the number comes from: we find the third quartile and the first quartile and then subtract them. So, how do we do that?
There is one convenient line of code that helps us find that information:
summary(Starbucks$Calories)
Question 7
What information about calories in these 230 Starbucks drinks do you get when you run the code above?
Question 8
What is the IQR for calories in these 230 Starbucks drinks?
Question 9
Using the summary command, find the largest and smallest number of grams of sugar in a Starbucks drink in this sample.
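The quartile subtraction described above can be sketched as follows, using made-up numbers rather than the Starbucks data. The last line uses R's built-in IQR function purely as a check; it is not required for this activity.

```r
# Hypothetical calorie values, made up for illustration
calories <- c(80, 120, 150, 200, 350)

# summary() reports the minimum, quartiles, median, mean, and maximum
five_num <- summary(calories)
five_num

# Pull out the first and third quartiles by their labels
q1 <- five_num[["1st Qu."]]  # 120
q3 <- five_num[["3rd Qu."]]  # 200

# IQR = Q3 - Q1
q3 - q1        # 80

# Check against the built-in function
IQR(calories)  # 80
```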
Next Steps
Now we know how to find some basic summary statistics in R. The next step is to use visualizations, like box plots and histograms, to further explore the distribution of the data. We’ll tackle that in the next activity.