STA 111 Lab 2: Exploring Data

In this lab, we are going to explore the distribution of heights of teenagers. We will be using the graphical and numerical techniques we have learned in class to describe the shape, center and spread of this distribution.

Clearing the workspace

We recall from last lab that the "workspace" (the upper right hand panel of RStudio) is where all of our data sets are stored. If you look now, you probably see present, which is the data set from the last lab. If your workspace is blank, don't worry about it! If it's not blank, there is a way to clear your workspace, i.e., remove data sets that you no longer need.

To clear all of the data in the workspace, look at the top of the panel. Beside "Import Dataset" there is an image that looks like a small broom. Pushing that will "clean" your workspace, meaning that it will remove all the contents within the panel. Since we are done with the present data set for now, go ahead and clean the workspace. This keeps us from having a lot of unnecessary stuff cluttering up this window.

Everything we learned in Lab 1 still applies. Flip back to those more detailed instructions as you need to throughout the course. We recall that we begin every new lab by creating a Markdown file. Go ahead and do that, and remember that you need to delete everything after line 12 in the template that comes up. You will do this for every Markdown you create in this class.

Now that you have a clean template and a clean workspace, we can start!

Loading the data

As we discussed in Lab 1, the first thing we need to do to start any analysis in R is to load the data. Last class, we loaded data directly from the internet using a single line of code. We were able to do this because the data was available on a website. However, not all data sets can be loaded into R in this way. Today, we are going to load the data from a very common file type, a .csv file.

The data that you need is on Canvas, under Lab 2 in the assignments tab. Go ahead and download the data set. A note for Mac users: it is strongly recommended that you use a browser other than Safari when you do this. When you are prompted to save the data, make sure that you save the .csv file in the same folder where you generally save your STA 111 work. It is not helpful if you download the file and then cannot find it!

Now that we have downloaded the data, we need to move the data from your computer into R. To do that, look at the bottom right hand panel of your screen. Find Upload. Click that, and then upload your .csv file from your computer.

Next, look at your workspace (the upper right hand panel). There is an icon called Import Dataset. When you click on the drop down arrow, you will be presented with a few choices. We are looking for one of two choices: "From Text (base)" or "From CSV". The option you are presented with will depend on your machine, but either one works! The purpose of these options is to tell R what type of file it should expect to read.

Once you have selected the appropriate option, you will be prompted to choose the data set. Go ahead and click on it. A preview of the data set will appear. Make sure that your data has a heading/header, i.e., that there are column labels on the data set. Once the preview looks okay, go ahead and load/import the data. Take a look in your workspace and see that you should have a data set called anthrokids with 3900 observations.

At this point, R can see the data set. This is what it means to see the data set in the workspace. However, RMarkdown cannot see the data set yet. This is important. RMarkdown will only be able to see the data if the line of code that loads the data is in a chunk in your Markdown document. Which line of code? Look at your Console, which should be the lower left panel of your screen. You should see a line of code that includes anthrokids <- read.csv( STUFF ) or anthrokids <- read_csv( STUFF ), with the file path for your computer as the STUFF in between the parentheses. Copy the entire line of code. Create a code chunk in your Markdown, and paste the line of code into that chunk. If the line uses read_csv, change it to read.csv. Congratulations, you have now successfully loaded the data into RMarkdown!
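If you would like to see what read.csv does without depending on your own file path, here is a minimal sketch. It assumes nothing about the real anthrokids file: the file toy_path and the data set toy_kids are made up for illustration, and the three rows of values are invented.

```r
# Hypothetical stand-in for a downloaded file: write a tiny CSV to a temporary location.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("age,height,weight",
             "12,58.5,90",
             "15,64.0,120",
             "17,68.2,150"), toy_path)

# header = TRUE tells read.csv that the first row holds the column labels.
toy_kids <- read.csv(toy_path, header = TRUE)

# Check the import: number of rows and columns, then the column names.
dim(toy_kids)
names(toy_kids)
```

In your own Markdown chunk, the path inside read.csv( ) will be the one copied from your Console rather than toy_path.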

Now, why did we have to do all that? It seemed much easier just to use the single line of code from last time! While it is true that loading that way is faster, it is often the case that data files are not able to be loaded by using a URL. More often than not, the data sets are in the form of an Excel or .csv file, and they need to be moved into R before we can work with them. The process you have just gone through allows you to load any such data set into R.

Creating a subset

Now that we have everything loaded, let's start working with the anthrokids data set. This data set includes measurements such as height (in inches) and weight (in pounds) for a large group of children and teens. This data set is the result of a Consumer Product Safety Commission (CPSC) study that was conducted in the 1970s. For more extensive information and details on the variables included in this data set, look at the source website.

The first step in the process of a data analysis is exploratory data analysis (EDA). Let's use some of the tools we have been learning to conduct EDA.

How many individuals are in this data set? How many variables?

Classify the variables as either numeric or categorical. For the categorical variables, are they ordered or un-ordered? Are any binary?

For today, our main focus is going to be on exploring the distribution of heights. In this data set, the variable storing information about height is called height. Remember that to access a single variable from a data set, we use the code nameofdataset$thevariablewewant. In our case, we pull the height variable by using anthrokids$height.
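As a reminder of how these tools fit together, here is a small sketch on made-up data. The data set toy_kids and its values are hypothetical and are not taken from anthrokids; the commands nrow, ncol, str and $ are standard R.

```r
# Made-up data set for illustration only (not the anthrokids values).
toy_kids <- data.frame(
  age    = c(12, 15, 17, 9),
  height = c(58.5, 64.0, 68.2, 51.0),
  sex    = c("F", "M", "F", "M")
)

nrow(toy_kids)   # number of individuals (rows)
ncol(toy_kids)   # number of variables (columns)
str(toy_kids)    # how R stores each variable (numbers, text, and so on)

# Pull a single variable with nameofdataset$thevariablewewant.
toy_heights <- toy_kids$height
toy_heights
```

Note that str only tells you how R stores a variable. Deciding whether a variable is numeric or categorical in the statistical sense still requires thinking about what the values mean.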

However, before we really start to work with this variable, let's think. The data contains information on the heights of children and teens. Run the command summary(anthrokids$age) to see exactly what ages are included in this data set. The range is huge! We know that the height of children changes rapidly as they grow, so let's consider narrowing our scope a little. Specifically, we are going to focus on analyzing only data pertaining to teens age 15 and older.

When we want to select specific data points from a data set based on the value of one or more variables, we are doing what is called sub-setting. In other words, we are using only particular rows from the original data set, and these rows are chosen based on some characteristics of the data points. The result is a new data set which is a subset of the original data, meaning that only some rows from the original data set are in the new data set. In our case, we will create a subset of the anthrokids data set that contains only individuals who are at least 15. The subset command will allow us to do this. Copy and paste the following line into a code chunk:

Over15 <- subset(anthrokids, anthrokids$age >= 15)

Let's break that down. The command subset tells R that we want to pull only some of the rows from the main anthrokids data set. The first entry in the subset command is anthrokids. This tells R which data set we are going to subset. The next step is to tell R which variable we are using to determine our subset. In this case, we want to look at the age variable, anthrokids$age. We are interested in only pulling those individuals such that the age is at least 15, i.e., age>=15. The code above will produce such a subset called Over15, which will appear in your workspace (the upper right hand window of your RStudio screen).
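To see sub-setting in action on something small enough to check by eye, here is a sketch on made-up data. The names toy_kids, toy_over15 and toy_over15_alt are hypothetical. The second version uses square-bracket indexing, which is another standard way to keep only the rows that meet a condition.

```r
# Made-up data set for illustration only (not the anthrokids values).
toy_kids <- data.frame(
  age    = c(12, 15, 17, 9, 16),
  height = c(58.5, 64.0, 68.2, 51.0, 66.1)
)

# Keep only the rows where age is at least 15.
toy_over15 <- subset(toy_kids, toy_kids$age >= 15)

# The same rows, selected with logical indexing: rows before the comma, columns after.
toy_over15_alt <- toy_kids[toy_kids$age >= 15, ]

toy_over15
nrow(toy_over15)   # how many rows survived the filter
```

Printing the subset and checking the smallest age with min(toy_over15$age) is a quick way to confirm that the condition did what you intended.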

How many rows are in the Over15 subset?

Exploratory Data Analysis (EDA)

At this point, we are ready to begin EDA. Our goal is to explore the numeric variable height in the Over15 data set.

There are a few different kinds of plots that we can make to explore a single numeric variable. List them.

Use the command boxplot(Over15$height) to create a boxplot. If you would like to make the boxplot horizontal, you can use boxplot(Over15$height, horizontal = TRUE). Show the plot as part of your response.
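The boxplot command also hands back the numbers it used to draw the plot, which can be handy when you want to check what you are seeing. Here is a small sketch on made-up heights. The vector toy_heights and the name bp are hypothetical, and the value 90 was put in on purpose so that something gets flagged.

```r
# Made-up heights for illustration only; 90 is deliberately far from the rest.
toy_heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 90)

# Draw a horizontal boxplot and keep the values R used to build it.
bp <- boxplot(toy_heights, horizontal = TRUE)

bp$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp$out     # points drawn individually beyond the whiskers
```

By default, R draws a point individually when it lies more than 1.5 times the length of the box beyond the edge of the box.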

Based on the box plot you have created, is the distribution symmetric, right-skewed or left-skewed? How can you tell?

From the plot, can we determine if there are any outliers? Explain.

We know that a boxplot is made by computing percentiles of the data. To see these percentiles in R, we can use the command summary(Over15$height). This returns the five-number summary of the data (the minimum, the first quartile, the median, the third quartile and the maximum), along with the mean. Note here that 1st Qu. means Q1 (the first quartile) and 3rd Qu. means Q3 (the third quartile).
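Here is a short sketch, on made-up heights, of how the pieces of summary relate to one another. The vector toy_heights and the name toy_quartiles are hypothetical; summary, quantile and IQR are standard R commands.

```r
# Made-up heights for illustration only (not the Over15 values).
toy_heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 75)

# Minimum, Q1, median, mean, Q3 and maximum in one call.
summary(toy_heights)

# Q1 and Q3 on their own: the 25th and 75th percentiles.
toy_quartiles <- quantile(toy_heights, c(0.25, 0.75))
toy_quartiles

# The IQR is Q3 minus Q1; the IQR command computes the same thing.
unname(toy_quartiles[2] - toy_quartiles[1])
IQR(toy_heights)
```

Computing Q3 minus Q1 by hand from the summary output and comparing it with IQR( ) is a good way to check your arithmetic.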

What is the IQR for this data set?

Why are the median and the mean very similar for this data set?

Histogram

When we conduct EDA, we often use a variety of different plots to explore a variable. This is because each different type can provide us with new information.

What information does a histogram show us that a boxplot does not?

To make a histogram of height, we use the command hist(Over15$height). Note that, as we have learned, the command is hist, and the object we want to create a histogram of is Over15$height.
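Like most plotting commands in R, hist accepts extra arguments that control titles, axis labels and color. Here is a sketch on made-up heights. The vector toy_heights, the name toy_hist, the label text and the color choice are all just for illustration.

```r
# Made-up heights for illustration only (not the Over15 values).
toy_heights <- c(60, 62, 63, 64, 64, 65, 66, 66, 67, 68, 70, 72)

# main sets the title, xlab labels the x-axis, and col fills the bars.
toy_hist <- hist(toy_heights,
                 main = "Toy heights",
                 xlab = "Height (inches)",
                 col  = "steelblue")

# hist also returns the bin edges and the count in each bin.
toy_hist$breaks
toy_hist$counts
```

The counts always add up to the number of data points, which is a quick sanity check that no observations were dropped.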

Make the histogram for height, and label the x-axis "Height of Individuals 15 and older". Make the plot any color you like, but do not use white! For suggestions, look here.

Describe the distribution of height based on the histogram. Make sure to comment on skew and modality!

Numeric Descriptions

In addition to creating plots, we also use numeric descriptions like the mean or the standard deviation as tools to explore the distribution of a variable during EDA. Generally, finding numeric descriptions in R requires commands with straightforward names. For instance, the mean can be found by using mean(), the standard deviation uses the command sd() and the median requires median(). We have also seen that the mean and the median are produced as part of the summary() command.
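Here is a small sketch of these commands on made-up heights, including how to find the value that sits a given number of standard deviations away from the mean. The vector toy_heights and the names toy_mean and toy_sd are hypothetical.

```r
# Made-up heights for illustration only (not the Over15 values).
toy_heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 75)

toy_mean <- mean(toy_heights)   # the average height
toy_sd   <- sd(toy_heights)     # the sample standard deviation (divides by n - 1)
median(toy_heights)             # the middle value

# The height that is 1 standard deviation above the mean, and 1 below.
toy_mean + 1 * toy_sd
toy_mean - 1 * toy_sd
```

Changing the 1 to any other number gives the height that many standard deviations away from the mean.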

How much do we expect a teen's height might deviate from the mean height? What numeric description did you compute in order to answer this question?

How tall does a teen in this data set need to be in order to have a height that is 2 standard deviations above the mean?

How tall does a teen in this data set need to be in order to have a height that is 1.5 standard deviations below the mean?

This lab is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was created by Nicole Dalzell at Wake Forest University. Last updated July 8, 2021.