In this lab, we are going to explore the distribution of heights of teenagers. We will be using the graphical and numerical techniques we have learned in class to describe the shape, center and spread of this distribution.
We recall from last lab that the "workspace" (the upper right hand panel of RStudio) is where all of our data sets are stored. If you look now, you probably see present, which is the data set from the last lab. If your workspace is blank, don't worry about it! If it's not blank, there is a way to clear your workspace, i.e., remove data sets that you no longer need.
To clear all of the data in the workspace, look at the top of the panel. Beside "Import Dataset" there is an image that looks like a small broom. Clicking it will "clean" your workspace, meaning that it will remove all the contents within the panel. Since we are done with the present data set for now, go ahead and clean the workspace. This keeps us from having a lot of unnecessary stuff cluttering up this window.
Everything we learned in Lab 1 still applies. Flip back to those more detailed instructions as you need to throughout the course. We recall that we begin every new lab by creating a Markdown file. Go ahead and do that, and remember that you need to delete everything after line 12 in the template that comes up. You will do this for every Markdown you create in this class.
Now that you have a clean template and a clean workspace, we can start!
As we discussed in Lab 1, the first thing we need to do to start any analysis in R is to load the data. Last class, we loaded data directly from the internet using a single line of code. We were able to do this because the data was available on a website. However, not all data sets can be loaded into R in this way. Today, we are going to load the data from a very common file type, a .csv file.
The data that you need is on Canvas, under Lab 2 in the assignments tab. Go ahead and download the data set. A note for Mac users: it is strongly recommended that you use a browser other than Safari when you do this. When you are prompted to save the data, make sure that you save the .csv file in the same folder where you generally save your STA 111 work. It is not helpful if you download the file and then cannot find it!
Now that we have downloaded the data, we need to move the data from your computer into R. To do that, look at the bottom right hand panel of your screen. Find Upload. Click that, and then upload your .csv file from your computer.
Next, look at your workspace (the upper right hand panel). There is an icon called Import Dataset. When you click on the drop down arrow, you will be presented with a few choices. We are looking for one of two choices: "From Text (base)" or "From CSV". The option you are presented with will depend on your machine, but either one works! The purpose of these options is to tell R what file type it should expect to read.
Once you have selected the appropriate option, you will be prompted to choose the data set. Go ahead and click on it. A preview of the data set will appear. Make sure that your data has a heading/header, i.e., that there are column labels on the data set. Once the preview looks okay, go ahead and load/import the data. Take a look in your workspace: you should now have a data set called anthrokids with 3900 observations.
At this point, R can see the data set. This is what it means for the data set to appear in the workspace. However, RMarkdown cannot see the data set yet. This is important: RMarkdown will only be able to see the data if the line that loads the data is in a chunk in your Markdown document. Which line of code? Look at your Console, which should be the lower left panel of your screen. You should see a line of code that includes anthrokids <- read.csv( STUFF ) or anthrokids <- read_csv( STUFF ), with the file path for your computer as the STUFF between the parentheses. Copy the entire line of code. Create a code chunk in your Markdown, and paste the line of code into that chunk. If the line uses read_csv, change it to read.csv. Congratulations, you have now successfully loaded the data into RMarkdown!
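As a rough sketch of what the pasted chunk does, the example below writes a tiny made-up .csv file to a temporary location and reads it back with read.csv. The file contents and the names demo_path and demo are invented for illustration; in your own chunk, the path will be whatever RStudio printed in your Console.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV with made-up values.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("age,height,weight",
             "12,58.5,90",
             "15,64.0,120",
             "17,68.2,150"),
           demo_path)

# header = TRUE tells R that the first row holds the column labels.
demo <- read.csv(demo_path, header = TRUE)

# Quick checks that the data loaded as expected.
dim(demo)    # number of rows, then number of columns
names(demo)  # the column labels
```

Running dim() and names() right after loading is a cheap way to confirm that the header row was read as column labels rather than as data.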
Now, why did we have to do all that? It seemed much easier just to use the single line of code from last time! While it is true that loading that way is faster, it is often the case that data files are not able to be loaded by using a URL. More often than not, the data sets are in the form of an Excel or .csv file, and they need to be moved into R before we can work with them. The process you have just gone through allows you to load any such data set into R.
Now that we have everything loaded, let's start working with the anthrokids data set. This data set includes measurements such as height (in inches) and weight (in pounds) for a large group of children and teens. This data set is the result of a Consumer Product Safety Commission (CPSC) study that was conducted in the 1970s. For more extensive information and details on the variables included in this data set, look at the source website.
The first step in the process of a data analysis is exploratory data analysis (EDA). Let's use some of the tools we have been learning to conduct EDA.
For today, our main focus is going to be on exploring the distribution of heights. In this data set, the variable storing information about height is called height. Remember that to access a single variable from a data set, we use the code nameofdataset$thevariablewewant. In our case, we pull the height variable by using anthrokids$height.
However, before we really start to work with this variable, let's think. The data contains information on the heights of children and teens. Run the command summary(anthrokids$age) to see exactly what ages are included in this data set. The range is huge! We know that the height of children changes rapidly as they grow, so let's consider narrowing our scope a little. Specifically, we are going to focus on analyzing only data pertaining to teens age 15 and older.
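To illustrate the $ notation and the summary command without relying on the real file, here is a minimal sketch using a toy data frame. The data frame toy and all of its values are made up; it only stands in for anthrokids.

```r
# Toy data frame with made-up values; it stands in for anthrokids here.
toy <- data.frame(age    = c(3, 8, 12, 15, 17, 19),
                  height = c(37, 50, 59, 64, 67, 69))

# dataset$variable pulls one column out of the data set as a vector.
toy$age

# summary() on a numeric variable reports the minimum, the quartiles,
# the median, the mean, and the maximum.
age_summary <- summary(toy$age)
age_summary
```

The minimum and maximum in the output show the range of ages at a glance, which is how you can tell how wide the age span in a data set is.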
When we want to select specific data points from a data set based on a value of one or more variables, we are doing what is called sub-setting. In other words, we are using only particular rows from the original data set, and these rows are chosen based on some characteristics of the data points. The result is a new data set which is a subset of the original data, meaning that only some rows from the original data set are in the new data set. In our case, we will create a subset of the anthrokids data set that contains only individuals who are at least 15. The subset command will allow us to do this. Copy and paste the following line into a code chunk:
Over15 <- subset(anthrokids, anthrokids$age >= 15)
Let's break that down. The command subset tells R that we want to pull only some of the rows from the main anthrokids data set. The first entry in the subset command is anthrokids. This tells R which data set we are going to subset. The next step is to tell R which variable we are using to determine our subset. In this case, we want to look at the age variable, anthrokids$age. We are interested in only pulling those individuals such that the age is at least 15, i.e., age >= 15. The code above will produce such a subset called Over15, which will appear in your workspace (the upper right hand window of your RStudio screen).
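The same pattern can be tried on a small scale. The sketch below applies subset to a toy data frame with made-up values (toy and toy_over15 are hypothetical names), so you can check by eye which rows are kept.

```r
# Toy data frame with made-up values; it stands in for anthrokids here.
toy <- data.frame(age    = c(3, 8, 12, 15, 17, 19),
                  height = c(37, 50, 59, 64, 67, 69))

# Keep only the rows where age is at least 15.
toy_over15 <- subset(toy, toy$age >= 15)

# How many rows survived, and which ones?
nrow(toy_over15)
toy_over15

# subset() also lets you name the column directly, without the toy$ prefix.
same <- subset(toy, age >= 15)
```

Note that >= keeps the rows where age is exactly 15 as well; using > instead would drop them.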
Over15 subset?
At this point, we are ready to begin EDA. Our goal is to explore the numeric variable height in the Over15 data set.
Use the command boxplot(Over15$height) to create a boxplot. If you would like to make the boxplot horizontal, you can use boxplot(Over15$height, horizontal = TRUE). Show the plot as part of your response.
We know that a boxplot is made by computing percentiles of the data. To see these percentiles in R, we can use the command summary(Over15$height). This returns the five-number summary of the data (the minimum, the first quartile, the median, the third quartile, and the maximum) together with the mean. Note here that 1st Qu. means Q1 (the first quartile) and 3rd Qu. means Q3 (the third quartile).
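To see how the boxplot and the numeric summary line up, here is a small sketch on a made-up vector of heights (toy_heights is a hypothetical name and the values are invented, not taken from anthrokids). It also uses fivenum(), a base R function not otherwise covered in this lab, which returns only the five numbers without the mean.

```r
# Made-up heights in inches, for illustration only.
toy_heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 74)

# Draw a horizontal boxplot of the toy heights.
boxplot(toy_heights, horizontal = TRUE)

# summary() gives min, Q1, median, mean, Q3, and max.
height_summary <- summary(toy_heights)
height_summary

# fivenum() gives min, lower hinge, median, upper hinge, and max (no mean).
five <- fivenum(toy_heights)
five
```

Comparing the two outputs against the plot, the box edges sit at the quartiles and the line inside the box sits at the median.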
When we conduct EDA, we often use a variety of different plots to explore a variable. This is because each different type can provide us with new information.
To make a histogram of height, we use the command hist(Over15$height). Note that, as we have learned, the command is hist, and the object we want to create a histogram of is Over15$height.
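A minimal sketch of the hist command on made-up data follows. The vector toy_heights and the title and axis label are invented for illustration; main and xlab are optional arguments to hist that set the plot title and the x-axis label.

```r
# Made-up heights in inches, for illustration only.
toy_heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 74)

# Draw the histogram; hist() also returns the bin information it used.
h <- hist(toy_heights,
          main = "Toy heights",
          xlab = "Height (inches)")

h$breaks  # the edges of the bins
h$counts  # how many values fell into each bin
```

Saving the result of hist() is optional. It is done here only so the bin edges and counts can be inspected alongside the plot.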
In addition to creating plots, we also use numeric descriptions like the mean or the standard deviation as a tool to explore the distribution of a variable during EDA. Generally, finding numeric descriptions in R requires commands with straightforward names. For instance, the mean can be found by using mean(), standard deviation uses the command sd() and the median requires median(). We have also seen that the mean and the median are produced as part of the summary() command.
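These commands can be tried on a short made-up vector first. In the sketch below, toy_heights and heights_with_na are hypothetical names with invented values. The second half assumes a variable might contain missing values (NA); whether Over15$height has any is not stated in this lab, so treat the na.rm = TRUE part as an optional aside.

```r
# Made-up heights in inches, for illustration only.
toy_heights <- c(62, 64, 65, 67, 72)

toy_mean   <- mean(toy_heights)    # the average
toy_median <- median(toy_heights)  # the middle value
toy_sd     <- sd(toy_heights)      # the sample standard deviation

toy_mean
toy_median
toy_sd

# If a variable contains a missing value, mean() returns NA
# unless we ask R to remove missing values first.
heights_with_na <- c(toy_heights, NA)
mean(heights_with_na)
mean(heights_with_na, na.rm = TRUE)
```

Here the mean is larger than the median because the single tall value (72) pulls the average upward, which is the kind of comparison that helps describe the shape of a distribution.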