For this WPA, we will use a dataset stored in a tab-separated text file called capture.txt. This dataset contains a historical record of 1,000 ships that your ship has captured in the past 10 years. For each ship that you captured, you made a note of several aspects of the ship (e.g., the number of cannons it had, its speed, etc.). The key column in the dataset is called treasure; it tells you how much gold you recovered from the ship.

In this WPA, you’ll try to understand how different aspects of ships are related to how much treasure they have. This way, when you spot a new ship on the horizon, you can try to estimate how much treasure it is likely to have!

Method 1: Install yarrr

install.packages("devtools")
library("devtools")
install_github("ndphillips/yarrr")
library("yarrr")

Method 2: Read the data file directly

# Read the tab-separated file straight from the web into a dataframe
capture <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/12/capture.txt", 
                      sep = "\t", 
                      header = TRUE)

Q0

First, you’ll need to re-install and load the yarrr package to access the data.

  1. Install and load the latest version of the yarrr package. When you do, the capture dataset will be loaded as a new dataframe called…wait for it…capture. If you cannot install the yarrr package, you can download the datafile directly from http://nathanieldphillips.com/wp-content/uploads/2015/12/capture.txt

  2. If you were able to install the latest yarrr package, look at the help menu for capture to learn more about the dataset.

  3. What are the names of the columns in the capture dataframe?

  4. What are the first few rows of the dataframe?
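If you haven't poked around a dataframe in a while, here is a minimal sketch of the kind of calls that help with parts 3 and 4. It uses the built-in ChickWeight dataset as a stand-in (an assumption for illustration, so it runs without yarrr); swap in capture for the assignment. The object names col.names and first.rows are made up for this example.

```r
# Stand-in dataset: ChickWeight ships with R (this is NOT the capture data)
col.names <- names(ChickWeight)    # column names of the dataframe
first.rows <- head(ChickWeight)    # first 6 rows by default

col.names
first.rows

# The help page for a packaged dataset opens with ?, e.g. ?ChickWeight
```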

Q1

Plot the relationship between each of the following continuous independent variables and treasure. For each plot, add axis labels, a plot title, and a regression line showing the relationship between the independent and dependent variables.

  1. size

  2. cannons

  3. date

  4. decorations

  5. daysfromshore

  6. speed
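One possible way to build such a plot is sketched below. It uses the built-in mtcars dataset as a stand-in (wt plays the role of the independent variable and mpg the role of treasure), and the name wt.model is just a placeholder; this is one approach among several, not the required one.

```r
# Stand-in data: built-in mtcars, with wt as the continuous IV and mpg as the DV
plot(x = mtcars$wt,
     y = mtcars$mpg,
     main = "Weight vs. fuel efficiency",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

# Fit a simple regression and draw its line on top of the existing plot
wt.model <- lm(mpg ~ wt, data = mtcars)
abline(wt.model, col = "red")
```

Because abline() adds to whatever plot is currently open, run it right after the plot() call it belongs to.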

Q2

Now do the same for the following categorical independent variables and treasure (hint: try using the new pirateplot() function in the yarrr package! Look at how it works by running ?pirateplot). Again, add appropriate labels (but don’t add a regression line).

  1. style

  2. warnshot

  3. heardof
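As a rough sketch of what a labelled plot for a categorical variable might look like, the code below uses the built-in ChickWeight dataset as a stand-in, with Diet as the categorical independent variable. It assumes pirateplot() accepts the formula, data, main, xlab and ylab arguments (check ?pirateplot for your installed version), and falls back to a base R boxplot if yarrr is not available.

```r
# Stand-in data: ChickWeight, with Diet as the categorical IV and weight as the DV
if (requireNamespace("yarrr", quietly = TRUE)) {
  # pirateplot() takes a formula of the form dv ~ iv
  yarrr::pirateplot(formula = weight ~ Diet,
                    data = ChickWeight,
                    main = "Chicken weight by diet",
                    xlab = "Diet",
                    ylab = "Weight (g)")
} else {
  # Fallback if yarrr is not installed: a base R boxplot with the same labels
  boxplot(weight ~ Diet,
          data = ChickWeight,
          main = "Chicken weight by diet",
          xlab = "Diet",
          ylab = "Weight (g)")
}
```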

Q3

  1. For each of the following variables (separately), calculate the median amount of treasure earned for each level of the IV: style, warnshot, decorations (hint: use aggregate or dplyr!)
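The aggregate() hint can be sketched as follows, again on the built-in ChickWeight dataset rather than capture. The name diet.medians is a placeholder for this example.

```r
# Stand-in data: ChickWeight. Median weight for each level of Diet.
# The formula reads "dv ~ grouping variable"; FUN is applied within each group.
diet.medians <- aggregate(weight ~ Diet,
                          data = ChickWeight,
                          FUN = median)
diet.medians
```

The result is a dataframe with one row per level of the grouping variable.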

Q4

The formula notation for conducting a correlation test with cor.test() is a bit different from regular formula notation. Instead of dv ~ iv, you use ~ dv + iv. For example, the following code will test the correlation between chickens’ age and weight using the ChickWeight dataset.

cor.test(~ Time + weight, 
         data = ChickWeight)

  1. Using the formula notation above, conduct a correlation test between the number of cannons a ship has and its size. What is the p-value?

  2. Now do the same with linear regression. What is the p-value?
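If you are unsure where the p-values live in the returned objects, the sketch below shows one way to pull them out. It uses the built-in mtcars dataset as a stand-in, and the object names (wt.mpg.test, wt.mpg.lm, lm.coefs) are placeholders.

```r
# Stand-in data: built-in mtcars (car weight and miles per gallon)
wt.mpg.test <- cor.test(~ mpg + wt, data = mtcars)
wt.mpg.test$p.value               # p-value of the correlation test

# The same pair of variables in a regression, using regular dv ~ iv notation
wt.mpg.lm <- lm(mpg ~ wt, data = mtcars)
lm.coefs <- summary(wt.mpg.lm)$coefficients   # matrix of coefficient statistics
lm.coefs["wt", "Pr(>|t|)"]        # p-value for the slope of wt
```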

Q5

  1. Conduct a linear regression with treasure as the dependent variable, and with all other variables as independent variables. Save the object as treasure.model.

  2. Using the summary() function, print the coefficients and main statistics of the regression.

  3. What are your conclusions? Which variables are significantly related to treasure and in which direction (i.e., positive or negative)?

  4. Which variables are NOT significantly related to treasure?
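A compact way to say "all other variables" in a formula is the dot shorthand. Here is a minimal sketch on the built-in mtcars dataset (a stand-in, with mpg as the dependent variable); mpg.model is a placeholder name.

```r
# Stand-in data: built-in mtcars. The dot means "every other column in data".
mpg.model <- lm(mpg ~ ., data = mtcars)

# summary() prints coefficients, standard errors, t- and p-values, and R-squared
summary(mpg.model)
```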

Q6

  1. Now tell me again, what was your conclusion about the relationship between decorations and treasure?

  2. Ok, now plot the relationship between decorations and treasure again. Do you see anything strange?

  3. Repeat your regression analysis from Question 5 again, but ONLY include ships with treasure less than 3500. Save the object as treasure.lt3500.model

  4. Using the summary function, show me the new results from the regression analysis.

  5. Does your conclusion about the relationship between decorations and treasure change? What about the other variables?
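Restricting a regression to part of a dataset can be done by passing a subsetted dataframe to lm(). The sketch below shows the idea on the built-in mtcars dataset; the cutoff of 200 horsepower and the name mpg.lt200.model are arbitrary choices for illustration.

```r
# Stand-in data: built-in mtcars. Keep only cars with hp below 200.
mpg.lt200.model <- lm(mpg ~ .,
                      data = subset(mtcars, hp < 200))

# Same summary as before, now based on the reduced set of rows
summary(mpg.lt200.model)
```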

Q7

  1. Conduct a new regression analysis on the capture dataset, but only using the independent variables size, cannons and speed. Call this regression object treasure.model2.

  2. Using your regression results from part 1, use the predict() function to predict the amount of treasure in a new ship with a size of 60, with 80 cannons, traveling at a speed of 100.

  3. Now, imagine that the ship has an extra 2 cannons (82 total). According to your regression analysis, what should the new prediction be?

  4. Test your prediction from part 3!
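The mechanics of predict() can be sketched on the built-in mtcars dataset, used here as a stand-in. All object names (mpg.model2, new.car, pred.1 and so on) and the input values are made up for this example.

```r
# Stand-in data: built-in mtcars, predicting mpg from wt, hp and qsec
mpg.model2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# newdata needs one column per IV, named exactly as in the formula
new.car <- data.frame(wt = 3, hp = 110, qsec = 18)
pred.1 <- predict(mpg.model2, newdata = new.car)

# Raising hp by 2 should shift the prediction by 2 times the hp coefficient
pred.by.hand <- pred.1 + 2 * coef(mpg.model2)["hp"]

# Check the hand calculation against predict() on the modified car
new.car.2 <- data.frame(wt = 3, hp = 112, qsec = 18)
pred.2 <- predict(mpg.model2, newdata = new.car.2)

c(by.hand = unname(pred.by.hand), from.predict = unname(pred.2))
```

If the column names in newdata don't match the names in the formula, predict() will stop with an error, so spelling matters here.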

Q8

  1. Let’s generate a dataset called my.data. Copy and paste the following code.

my.data <- data.frame(a = c(1, 5, 3, 6, 3, 5, 3, 8, 3),
                      b = c(8, 3, 1, 4, 2, 6, 4, 8, 3))

  2. Add a new variable to my.data called c, where c = 3 * a - 5 * b

  3. Imagine that you will conduct a linear regression on these data, with c as the dependent variable and a and b as the independent variables. What do you think the coefficients for a and b will be? What do you think the intercept will be?

  4. Run the regression and see if you’re right!