Instructions

This is an R Markdown file. This file allows you to include both your R code and your commentary in a single file. All lab work - code and answers - should be done in this notebook.

When you click Knit, R Studio creates (or overwrites) an html file containing your answers as well as the code and output. You can change this to a pdf file by changing the output: html_document line to output: pdf_document. Note that creating a pdf file requires your \(LaTeX\) distribution to work nicely with R Studio, which some students have reported problems with.

R code is written in chunks. Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I. A chunk looks like this:

# write your R comments here
#write your R code here

Note that the text in curly braces must start with the name of the language your code is written in (r) followed immediately by a (unique) chunk name, a comma, and options for how R Studio should print the code and output. Some common options are:

To run an individual chunk, click the green triangle (Play button) in the top right of the chunk.

This lab assignment is worth a total of 15 points.

Pair Programming

Pair programming is a collaborative technique used in all disciplines of software development (including data science). In typical pair programming, the “driver” has total control over the keyboard and mouse, but can only type in response to instructions from the “navigator.”

While we are virtual, all labs will be done using this pair-programming technique. The “driver” will share their screen in the breakout room, do all the typing, and run all the chunks for the group. The “navigator(s)” will tell the driver what to do, either by giving explicit code to run or by suggesting the next step and working with the driver to find the correct code to accomplish it.

(Obviously, when it is time to stop coding and interpret results or answer questions, the driver will have keyboard control but everyone should collaborate.)

Programmers are encouraged to switch roles roughly every 15 minutes. You should try to rotate drivers after finishing a problem (or a part of a problem). While we are virtual, you can share the updated files either by saving and sending via Zoom chat/e-mail or by pushing updates to a Github repository your whole group has access to.

Problem 1: Introductory Material (Code: 1 pt)

You should always include two chunks at the beginning of your R Markdown file. The first chunk is the default chunk that R Studio puts in when you create a new R Markdown file:

knitr::opts_chunk$set(echo = TRUE)

This chunk sets the global options for printing code and output. Generally the only global option you need to change is the one already in the default chunk.

The second chunk that you should include at the beginning of the file is a chunk that imports the packages (a package is a set of additional functions and data) and datasets you will be using. For example, let’s import the Auto dataset from the ISLR2 package.

If you do not have the ISLR2 package installed, in R Studio, click on the Packages tab (in the bottom right) and install the package.

library(ISLR2)  # load the library
View(Auto) # View the data in a separate window, you can also use the fix() function

Generally, to import a dataset from Canvas, download the file and then click Import Dataset in the Environment tab (top right). Select the right type of file (usually you want From Text (readr) for my csv files), click Browse and find the file, then copy the code from the Code Preview section.

Note that the code to import your dataset includes your entire file structure, which means that I and your partners won’t be able to replicate your data import. To fix this, type setwd("Your Working Directory") in the Console tab (bottom left) and run it, make sure that the dataset(s) you are importing are in that directory, and then fix the data import code to include only the filename.

The first thing you want to do after importing a dataset is do some sanity checks to make sure everything imported okay.

Typically you should check that the number of observations and number of variables in the dataset are what you expected:

dim(Auto)  # number of observations and number of variables
## [1] 392   9

It’s also good to check the names of your variables. Generally, variable names in R should be one word or multiple words separated by underscores (_) or periods (.).

names(Auto)  # names of variables in the dataset
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"

Sometimes it may be useful to see a summary of the variables in the dataset, including missing data.

summary(Auto)  # brief summary of each variable in the dataset
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

To view information about a dataset or help about a function, you can type ? followed by the name of the dataset or function, or just search for it in the Help tab in R Studio.

Problem 2: More Basic R Commands

Part a (Code: 1 pt)

Work through ISLR Lab 2.3.1. Copy each chunk in the lab to its own chunk in this file.

x <- c(1, 3, 2, 5)
x
## [1] 1 3 2 5
x= c(1, 6, 2)
x
## [1] 1 6 2
y= c(1, 4, 3)
length (x)
## [1] 3
length (y)
## [1] 3
x+y
## [1]  2 10  5
ls ()
## [1] "x" "y"
rm(x, y)
ls ()
## character(0)
character(0)
## character(0)
rm(list = ls ())
?matrix
## starting httpd help server ... done
x <- matrix (data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
x <- matrix (c(1, 2, 3, 4), 2, 2)
matrix (c(1, 2, 3, 4), 2, 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
sqrt (x)
##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
x^2
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
x <- rnorm (50)
y <- x + rnorm (50, mean = 50, sd = .1)
cor (x, y)
## [1] 0.9950154
set.seed (1303)
rnorm (50)
##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557
set.seed (3)
y <- rnorm (100)
mean (y)
## [1] 0.01103557
var (y)
## [1] 0.7328675
sqrt ( var (y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768

Part b (Explanation: 1 pt)

Explain in your own words the purpose of the set.seed() function. What happens if you don’t include this line before running the rnorm function?

Answer: The purpose of set.seed() is to be able to create a sense of reproducibility in randomness. In other words we want to be able to call the same random value in order to reproduce our experiment/procedure and determine if the results seem appropriate.

Problem 3: Indexing Data

Part a (Code: 1 pt)

Work through ISLR Lab 2.3.3. Copy each chunk in the lab to its own chunk in this file.

A <- matrix (1:16, 4, 4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
A[2, 3]
## [1] 10
A[c(1, 3), c(2, 4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3, 2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2, ]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[, 1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
A[1, ]
## [1]  1  5  9 13
A[-c(1, 3), ]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
A[-c(1, 3), -c(1, 3, 4)]
## [1] 6 8
dim (A)
## [1] 4 4

Part b (Code: 0.5 pts)

When you import data into R, it usually will be stored in a data frame object instead of a matrix. There are a variety of ways to extract an individual column from a data frame.

Auto2 <- Auto[1:10,]  # first 10 rows
Auto2[, 1]
##  [1] 18 15 18 16 17 15 14 14 14 15
Auto2[[1]]
##  [1] 18 15 18 16 17 15 14 14 14 15
Auto2$mpg
##  [1] 18 15 18 16 17 15 14 14 14 15

Note that each of these methods extracts the column as a vector. There are also ways to extract the column as a one-column data frame:

Auto2[, 1, drop = FALSE]
##    mpg
## 1   18
## 2   15
## 3   18
## 4   16
## 5   17
## 6   15
## 7   14
## 8   14
## 9   14
## 10  15
Auto2["mpg"]
##    mpg
## 1   18
## 2   15
## 3   18
## 4   16
## 5   17
## 6   15
## 7   14
## 8   14
## 9   14
## 10  15

In the code chunks below, find two ways to subset the Auto2 data to get a two-column dataset containing mpg and horspower (columns 1 and 4). Note that drop = FALSE is only necessary when you want a one-column data frame.

Auto2[c(1,4)]
##    mpg horsepower
## 1   18        130
## 2   15        165
## 3   18        150
## 4   16        150
## 5   17        140
## 6   15        198
## 7   14        220
## 8   14        215
## 9   14        225
## 10  15        190
Auto2[c("mpg","horsepower")]
##    mpg horsepower
## 1   18        130
## 2   15        165
## 3   18        150
## 4   16        150
## 5   17        140
## 6   15        198
## 7   14        220
## 8   14        215
## 9   14        225
## 10  15        190

Problem 3: Hypothesis Testing in R

Part a (Code: 1 pt)

Fix the chunk below to create a vector x consisting of 50 random numbers from a normal distribution with mean 100 and standard deviation 15 and a vector y consisting of 50 random numbers from a normal distribution with mean 90 and standard deviation 15. (Refer back to the end of Problem 2 if you need help.)

set.seed(123)
x <- rnorm(n = 50, mean = 100, sd = 15) # replace the ... with appropriate arguments and values
y <- rnorm(n = 50, mean = 90, sd = 15)

Now, run the chunk below to perform a two-sample t-test to determine whether the population means of x and y are different, and store the output in a variable called t_test_sim:

t_test_sim <- t.test(x,y, alternative = "t") # "t" for two-sided

After you run this chunk, the variable t_test_sim should appear in your Environment tab as a “List of 10.” In R, the output of a hypothesis test is an htest object containing information about the methods and results of the inference. Let’s see what information we can get out of the t_test_sim object:

names(t_test_sim)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

We can extract the p-value for our test using the $ operator:

t_test_sim$p.value
## [1] 0.003141266

Part b (Code: 1 pt)

Sometimes our data is in this “wide” format (where each column represents the values from a group and we can use the individual vectors as the arguments), but more often our data is in “long” format (where each column represents a variable). Let’s see what this data would look like in long format:

group_values <- rep(c("x", "y"), each = 50) # 50 x followed by 50 y
numerical_values <- c(x, y)
sim_df_long <- data.frame(
  group = group_values,
  value = numerical_values
)

Write code in the chunk below that finds the number of rows and columns in the sim_df data frame.

dim(sim_df_long)
## [1] 100   2

Write code in the chunk below that finds the names of the variables in the data frame.

names(sim_df_long)
## [1] "group" "value"

To perform a two-sample t-test when the data is in “long” format, we use a formula argument of the form response ~ explanatory:

t_test_sim_long <- t.test(value ~ group, data = sim_df_long, alternative = "t")

Part c (Explanation: 0.5 pts)

Write a chunk to extract the p-value from t_test_sim_long. Do you get the same p-value that you got originally?

t_test_sim_long$p.value
## [1] 0.003141266

Original value: 0.003141266

Part d (Explanation: 1.5 pts)

What information is contained the statistic, parameter, and conf.int objects within t_test_sim and t_test_sim_long? (You can write and run code chunks to verify your answer.)

t_test_sim_long$statistic
##        t 
## 3.028571
t_test_sim_long$parameter
##       df 
## 97.95128
t_test_sim_long$conf.int
## [1]  2.868272 13.771586
## attr(,"conf.level")
## [1] 0.95
t_test_sim$statistic
##        t 
## 3.028571
t_test_sim$parameter
##       df 
## 97.95128
t_test_sim$conf.int
## [1]  2.868272 13.771586
## attr(,"conf.level")
## [1] 0.95

We can verify that the same values result for both simulations.

Problem 4: Graphing with Base R

For parts (a) through (c):

Part a (Code: 0.5 pts; Explanation: 1 pt)

Create and describe a scatterplot of mpg (response) against cylinders (numerical predictor). Do not attach the dataset; use the $ sign to select the variables.

plot(Auto$cylinders, Auto$mpg, main = "MPG Vs Cylinders", xlab = "Cylinders", ylab = "MPG")

The following graph shows the MPG vs Cylinders from the Auto data set. The x-axis indicates the size of the cylinders within a range of 3 to 8 and the y-axis shows the MPG in a range of 10 to 50. Our data shows the MPG range for six kinds of cylinder sizes. The data appears to have more information for Cylinder sizes 4, 6, and 8.

Part b (Code: 0.5 pts; Explanation: 1 pt)

Create and describe a boxplot of mpg (response) by cylinders (categorical predictor). The boxes should be horizontal and red.

boxplot(Auto$mpg,Auto$cylinders, horizontal = T, col = "red")

The following boxplots are from the Auto data showing the spread of the MPG and Cylinders variables. The x-axis shows the range of each of the box plots. The y-axis indicates each box plot. The upper boxplot represents the spread of the Cylinders data and the lower box plot represents the spread of the MPG data. The MPG variable has more spread out data than the Cylinders variable.

Part c (Code: 0.5 pts; Explanation: 1 pt)

Create and describe a histogram of mpg. The bars should be red and there should be roughly 15 bars.

hist(Auto$mpg, breaks = 15, col = "red")

The following histogram shows frequency of the MPG values in the Auto data set. The x-axis shows the specific values and the y-axis shows the frequency. The MPG appears to have a left-skewed distribution.

Part d (Code: 0.5 pts)

Using the pairs function, create a scatter plot matrix showing mpg, displacement, horsepower, weight, and acceleration.

pairs( ∼ mpg + displacement + horsepower + weight + acceleration , data = Auto)

The following figure illustrates a scatter plot matrix composed of variables from the Auto data set, specifically the MPG, displacement, horsepower, weight and acceleration variables. Each graph shows either a positive or linear relationship between the variables.

Problem 5: Graphing with ggplot2

The plotting commands in the textbook are designed to work with base R. However, most people who use R find the base R graphics system to be cumbersome and use use either the ggplot2 package or the lattice package. We will use the ggplot2 package. This problem just requires you to run the code chunks and understand the commands.

Install the package if you do not have it installed, then load the package.

library(ggplot2)

(This is probably a good chunk to set options message = FALSE, warning = FALSE for.)

Part a (Code: 0.5 pts)

Now let’s start by creating the basic scatterplot.

plot_setup <- ggplot(data = Auto, mapping = aes(x = cylinders, y = mpg))  # setup the plot
scatterplot <- plot_setup + geom_point()  # add a scatterplot to the setup
print(scatterplot)  # show what's been created

Yes, it took three lines to do what we did in one line in Problem 1. So why do we use ggplot2? Customization is much easier and intuitive in ggplot2. For example, to create the red points:

scatterplot_red <- plot_setup + geom_point(color = "red")
print(scatterplot_red)

We can also add a title and axis labels using the labs function:

scatterplot_labeled <- plot_setup + geom_point(color = "red") +
  labs(title = "Gas Mileage by Number of Cylinders for Old Cars",
       x = "Number of Cylinders",
       y = "Miles per Gallon")
print(scatterplot_labeled)

Part b (Code: 0.5 pts)

What about making a set of boxplots?

Auto.factor <- Auto
Auto.factor$cylinders <- as.factor(Auto.factor$cylinders) # convert to factor variable
boxplot_setup <- ggplot(data = Auto.factor, mapping = aes(x = cylinders, y = mpg))
boxplot_red_horizontal <- boxplot_setup + geom_boxplot(color = "red") + coord_flip() # flips what goes on the x vs y axis - gets us horizontal boxplots
print(boxplot_red_horizontal)

What about adding labels?

boxplot_labeled <- boxplot_red_horizontal + 
  labs(title = "Gas Mileage by Number of Cylinders", 
       x = "Number of Cylinders", 
       y = "Miles Per Gallon") # yes, it's weird, you still need to pretend you didn't flip x and y
print(boxplot_labeled)

Sometimes, especially if you are trying to plot multiple datasets on the same graph, it is easier to specify the data and mappings in the geom_ function itself:

plot_bare <- ggplot()  # no setup
boxplot2 <- plot_bare + geom_boxplot(data = Auto.factor, mapping = aes(x = cylinders, y = mpg))  # add a scatterplot to the setup
print(boxplot2)  # show what's been created

Part c (Code: 0.5 pts)

The ggplot2 package cannot create a scatterplot matrix by itself, but the GGally package can. Install it and load it.

library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(Auto.factor, columns = c("mpg", "displacement", "horsepower", "weight", "acceleration"))