Complete all questions and submit your final PDF or HTML file (either works!) under Assignments in Canvas.

Today we are going to explore the foundations of coding in R. As we move through this course, you will get experience with different things that R can do. For today, we want to start getting familiar with the setup and with how to run code in R.

The First Data Set

Load the Data

Load the data set that records, for each year from 1860 to 1959, the number of discoveries that the World Almanac and Book of Facts designates as "great" discoveries.

To load the data, you need to create a Code Chunk (chunk for short). Remember that to do that, you go to Code at the very upper left of your RMarkdown window and then choose Insert Chunk.

Once you have created your chunk, type the following inside and press play (the green triangle on the right hand side of the chunk).

data("discoveries")

Viewing the Data

Question 1

Create a new chunk. In this chunk, type discoveries and press play. Some information about the discoveries will appear. What years of data do we have information on? We have information from 1860 to 1959.

discoveries
## Time Series:
## Start = 1860 
## End = 1959 
## Frequency = 1 
##   [1]  5  3  0  2  0  3  2  3  6  1  2  1  2  1  3  3  3  5  2  4  4  0  2  3  7
##  [26] 12  3 10  9  2  3  7  7  2  3  3  6  2  4  3  5  2  2  4  0  4  2  5  2  3
##  [51]  3  6  5  8  3  6  6  0  5  2  2  2  6  3  4  4  2  2  4  7  5  3  3  0  2
##  [76]  2  2  1  3  4  2  2  1  1  1  2  1  4  4  3  2  1  4  1  1  1  0  0  2  0

Question 2

Based on the data, how many great discoveries were there in 1863? There were two great discoveries in 1863.
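Counting along the printed output works, but it is easy to miscount. As a minimal sketch, the value for a single year can also be looked up by year. The name `discoveries_1863` is only an illustrative choice and is not part of the lab.

```r
data("discoveries")

# time() returns the year attached to each observation in the series,
# so comparing it to 1863 picks out only that year's value
discoveries_1863 <- discoveries[time(discoveries) == 1863]
discoveries_1863
```

This should agree with the value read from the printed output above.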

Question 3

What is another question we could explore using this data set? Note: There is no one right answer here!

How many great discoveries were found between 1875 and 1900?
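One possible way to answer a question like this one is sketched below. It assumes that "between 1875 and 1900" includes both end years, and the names `discoveries_1875_1900` and `total_1875_1900` are illustrative only.

```r
data("discoveries")

# window() keeps only the part of a time series between two time points
# (both the start year and the end year are included)
discoveries_1875_1900 <- window(discoveries, start = 1875, end = 1900)

# Add up the yearly counts in that window
total_1875_1900 <- sum(discoveries_1875_1900)
total_1875_1900
```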

What was the largest number of great discoveries recorded by the World Almanac and Book of Facts?

Maximum number of discoveries:

max(discoveries)
## [1] 12

What is the total number of great discoveries recorded by the World Almanac and Book of Facts for the period 1860-1959?

Total number of discoveries from 1860 to 1959:

sum(discoveries)
## [1] 310

Question 4

Was the largest number of great discoveries made in the 19th century (before 1900) or in the 20th century (1900 or later)? It was made in the 19th century: the largest yearly count before 1900 is 12, while the largest yearly count from 1900 onward is 8.
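The comparison between the two centuries can be checked in code. The sketch below is one way to do it, using the question's own split (before 1900 versus 1900 or later). The names `max_19th`, `max_20th`, and `peak_year` are illustrative.

```r
data("discoveries")

# Largest yearly count before 1900 and from 1900 onward
max_19th <- max(window(discoveries, end = 1899))
max_20th <- max(window(discoveries, start = 1900))

# which.max() gives the position of the largest value;
# time() turns that position back into a year
peak_year <- time(discoveries)[which.max(discoveries)]

max_19th
max_20th
peak_year
```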

plot(discoveries)

This plot we have created is just one example of a data visualization, or a way to use graphs and tables to explore data.

Question 5

Suppose you are asked to describe what information this graph is showing. How would you reply? Note: There is no one right answer, but a good way to start is to comment on what information is on the x axis and what information is on the y axis, and then comment on any general story you think can be told based on the graph.

The x axis shows time (the year), and the y axis shows the number of great discoveries recorded in that year. The graph shows how the number of discoveries changes from year to year.

Using the plot, we can see the year with the maximum number of discoveries. We can also see how the number of discoveries changes over time.

Question 7

What is the average (mean) number of great discoveries recorded in this data set? Hint: Think about the structure of coding we have learned. You need a command and an object.

mean(discoveries)
## [1] 3.1

Question 8

What is the smallest (min) number of great discoveries recorded in this data set?

min(discoveries)
## [1] 0
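The commands `min`, `mean`, and `max` each follow the same command-and-object structure. As a small sketch, R's `summary` command reports several of these values in one step. The name `discoveries_summary` is an illustrative choice.

```r
data("discoveries")

# summary() reports the minimum, quartiles, median, mean, and maximum at once
discoveries_summary <- summary(discoveries)
discoveries_summary
```

The minimum, mean, and maximum in this summary should match the answers found one at a time above.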

A Second Data Set

Let’s load a second data set, “faithful”, into R. Create a chunk, paste in the following, and press play.

data("faithful")

This data set provides information on eruption cycles of a very famous geyser called Old Faithful. This geyser can be seen in Yellowstone National Park in Wyoming and draws visitors from all over the world as it shoots water more than 100 feet into the air. The geyser is called Old Faithful because it has an eruption cycle about every 40 to 120 minutes, which is very regular for a geyser!

Question 9

How many rows are in this data set? How many columns?

str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

Number of columns = 2

Number of rows = 272
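The `str` output lists the rows as "obs." and the columns as "variables". As a quick sketch, the same counts can also be requested directly. The names `n_rows` and `n_cols` are illustrative.

```r
data("faithful")

# nrow() and ncol() count the rows and columns of a data frame
n_rows <- nrow(faithful)
n_cols <- ncol(faithful)

# dim() returns both at once, rows first and then columns
dim(faithful)
```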

Types of Variables

For each eruption cycle (each case) we have information on two different variables:

eruptions: how many times did the geyser go off during this eruption cycle? We may have decimals if the geyser did not reach its full height each time it went off.

waiting: how many minutes until the geyser had another eruption cycle?

Question 10

How many numeric variables do we have in the faithful data set?

We have 2 numeric variables in the data set.

How many categorical variables?

We have 0 categorical variables in the data set.

When deciding whether a variable is numeric or categorical, be careful not to just look at the name of the variable. Instead, you want to look at the variable itself. What do the responses look like?
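Looking at the responses is the main check. As a sketch of a second check, R can also report whether it stores each column as numbers. The names `is_numeric_column` and `n_numeric` are illustrative, and note that R only reports how the values are stored, so a categorical variable coded with numbers would still need a judgment call.

```r
data("faithful")

# sapply() applies is.numeric() to every column in the data set
is_numeric_column <- sapply(faithful, is.numeric)
is_numeric_column

# TRUE counts as 1, so the sum is the number of numeric columns
n_numeric <- sum(is_numeric_column)
n_numeric
```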

Considering One Variable

What if we want to look at just one variable in the data set? We know how to look at the whole data set, but suppose I just want to look at the waiting variable. This tells us how long a visitor to Yellowstone would have to wait in between eruptions of Old Faithful. We can access a single column of a data set using a $.

faithful$waiting
##   [1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74
##  [26] 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59
##  [51] 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71 62
##  [76] 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50 78 63 72 84 75 51 82
## [101] 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56 88
## [126] 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53
## [151] 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81
## [176] 81 73 50 85 74 55 77 83 83 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78
## [201] 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78
## [226] 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74
## [251] 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74

The structure of this command is dataset$column. We are telling R “hey R, take a look at the faithful data set. Go inside the faithful data set ($) and take a look at the column called waiting.”
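Printing all 272 values is a lot to read. A minimal sketch of two related ideas follows: `head` shows only the first few values of a column, and double square brackets are another way to reach the same column. The names `first_waits` and `same_column` are illustrative.

```r
data("faithful")

# head() shows the first six values instead of the whole column
first_waits <- head(faithful$waiting)
first_waits

# dataset[["column"]] reaches the same column as dataset$column
same_column <- identical(faithful$waiting, faithful[["waiting"]])
same_column
```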

Question 11

What command would access just the column in the data set that tells us about the number of eruptions?

faithful$eruptions
##   [1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600 1.950 4.350 1.833 3.917
##  [13] 4.200 1.750 4.700 2.167 1.750 4.800 1.600 4.250 1.800 1.750 3.450 3.067
##  [25] 4.533 3.600 1.967 4.083 3.850 4.433 4.300 4.467 3.367 4.033 3.833 2.017
##  [37] 1.867 4.833 1.833 4.783 4.350 1.883 4.567 1.750 4.533 3.317 3.833 2.100
##  [49] 4.633 2.000 4.800 4.716 1.833 4.833 1.733 4.883 3.717 1.667 4.567 4.317
##  [61] 2.233 4.500 1.750 4.800 1.817 4.400 4.167 4.700 2.067 4.700 4.033 1.967
##  [73] 4.500 4.000 1.983 5.067 2.017 4.567 3.883 3.600 4.133 4.333 4.100 2.633
##  [85] 4.067 4.933 3.950 4.517 2.167 4.000 2.200 4.333 1.867 4.817 1.833 4.300
##  [97] 4.667 3.750 1.867 4.900 2.483 4.367 2.100 4.500 4.050 1.867 4.700 1.783
## [109] 4.850 3.683 4.733 2.300 4.900 4.417 1.700 4.633 2.317 4.600 1.817 4.417
## [121] 2.617 4.067 4.250 1.967 4.600 3.767 1.917 4.500 2.267 4.650 1.867 4.167
## [133] 2.800 4.333 1.833 4.383 1.883 4.933 2.033 3.733 4.233 2.233 4.533 4.817
## [145] 4.333 1.983 4.633 2.017 5.100 1.800 5.033 4.000 2.400 4.600 3.567 4.000
## [157] 4.500 4.083 1.800 3.967 2.200 4.150 2.000 3.833 3.500 4.583 2.367 5.000
## [169] 1.933 4.617 1.917 2.083 4.583 3.333 4.167 4.333 4.500 2.417 4.000 4.167
## [181] 1.883 4.583 4.250 3.767 2.033 4.433 4.083 1.833 4.417 2.183 4.800 1.833
## [193] 4.800 4.100 3.966 4.233 3.500 4.366 2.250 4.667 2.100 4.350 4.133 1.867
## [205] 4.600 1.783 4.367 3.850 1.933 4.500 2.383 4.700 1.867 3.833 3.417 4.233
## [217] 2.400 4.800 2.000 4.150 1.867 4.267 1.750 4.483 4.000 4.117 4.083 4.267
## [229] 3.917 4.550 4.083 2.417 4.183 2.217 4.450 1.883 1.850 4.283 3.950 2.333
## [241] 4.150 2.350 4.933 2.900 4.583 3.833 2.083 4.367 2.133 4.350 2.200 4.450
## [253] 3.567 4.500 4.150 3.817 3.917 4.450 2.000 4.283 4.767 4.533 1.850 4.250
## [265] 1.983 2.250 4.750 4.117 2.150 4.417 1.817 4.467

Question 12

What command would tell us the largest (max) number of eruptions in this data set? Hint: We are putting together things we have learned. We need a command (max) and then the data we want the command to run on.

max(faithful$eruptions)
## [1] 5.1

Back to Data Visualization

In our discoveries data set, we only had one variable. Now, we have two. This means we can

  1. create a plot of just eruptions,
plot(faithful$eruptions)

  2. create a plot of just waiting times, and
plot(faithful$waiting)

  3. create a plot comparing the eruptions and waiting times.

The type of graph we need is going to depend on what question we want to answer using the data.

For now, let’s try (3). To create a scatter plot with the waiting time on the x axis and the number of eruptions on the y-axis, we need:

plot(x = faithful$waiting, y = faithful$eruptions)

This gives R a command, plot. What do we want to plot? On the x-axis, we want to plot faithful$waiting and on the y-axis we want faithful$eruptions.

Question 13

If you had to wait 80 minutes in between Old Faithful eruption cycles, how many times do you expect the geyser to erupt during its eruption cycle? Feel free to give a range (for instance, between 0 and 1 times). I expect the geyser to erupt between 3.5 and 4.5 times.
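Reading a range off the scatter plot is the intended approach here. As a sketch of how the same question could be checked against the data, the code below keeps only the cycles with a waiting time of exactly 80 minutes. The name `eruptions_at_80` is illustrative.

```r
data("faithful")

# Keep the eruptions values only for rows where waiting equals 80
eruptions_at_80 <- faithful$eruptions[faithful$waiting == 80]
eruptions_at_80

# range() gives the smallest and largest of those values
range(eruptions_at_80)
```

Using exactly 80 is a simplification; a band such as 78 to 82 minutes would include more cycles.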

Looking at the plot, we notice that the names of the x and y axes are a little strange. By default, R names these axes with the exact names of the variables. We can change the label for the axes like so.

plot(x = faithful$waiting, y = faithful$eruptions,
     xlab = "The X Axis Label", ylab = "The Y Axis Label")

We can also change the color of the dots in the graph.

plot(x = faithful$waiting, y = faithful$eruptions,
     xlab = "The X Axis Label", ylab = "The Y Axis Label", col = "blue")

Question 14

By adapting the code above, create a scatter plot where (1) the x axis is labelled “Waiting Time in Minutes”, (2) the y axis is labelled “Number of Eruptions” and (3) the color of the dots is “red”.

plot(x = faithful$waiting, y = faithful$eruptions,
     xlab = "Waiting Time in Minutes", ylab = "Number of Eruptions", col = "red")

Question 15

An individual who is visiting the park wants to know if your work so far indicates that longer waiting times are associated with more eruptions during the eruption cycle. Based on the plot, does it look like this is the case for these 272 eruptions? Explain your reasoning.

Answer: Based on the plot, it does look like this is the case for these 272 eruptions. The points trend upward from left to right, so on average, as the waiting time increases, so does the number of eruptions. The plot shows an association between the two variables; it does not by itself show that longer waits cause more eruptions.
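The plot is the evidence this question asks for. As an optional sketch, the strength of a straight-line association can also be summarized with one number, the correlation, using R's `cor` command. Correlation has not been introduced in this lab, so treat this as a preview, and the name `waiting_eruptions_cor` is illustrative.

```r
data("faithful")

# cor() returns a value between -1 and 1;
# values near 1 mean the two variables tend to increase together
waiting_eruptions_cor <- cor(faithful$waiting, faithful$eruptions)
waiting_eruptions_cor
```

A value close to 1 would be consistent with the upward trend seen in the scatter plot.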

Summary

In this lab, we have started to explore how to use R to work with data. As we move through the course, we will continue to use R to apply the concepts we learn in class on different data sets.