Q0

First, you’ll need to re-install and load the yarrr package to access the data.

capture <- read.delim("/var/folders/_b/0ddnfxtd72jcc2vv2s6xr72c0000gn/T//RtmpwCTJrf/data4d137b8532a")

What are the names of the columns in the capture dataframe?

names(capture)
##  [1] "size"          "cannons"       "style"         "warnshot"     
##  [5] "date"          "heardof"       "decorations"   "daysfromshore"
##  [9] "speed"         "treasure"

What are the first few rows of the dataframe?

head(capture)
##   size cannons   style warnshot date heardof decorations daysfromshore
## 1   48      54 classic        0  172       1           8            28
## 2   51      56  modern        0   15       0           3             6
## 3   50      44  modern        0   63       0           3            23
## 4   54      54  modern        0  362       1           2            23
## 5   50      56  modern        0  183       1           2            12
## 6   51      48  modern        0  279       0           1             3
##   speed treasure
## 1    16     2175
## 2    29     2465
## 3    18     1925
## 4    19     2200
## 5    21     2290
## 6    24     2195

Q1

Plot the relationship between the following continuous independent variable and treasure. For each plot, add axis and plot labels and a regression line showing the relationship between the independent and dependent variables.

size

plot(x = capture$size,
     y = capture$treasure,
     xlab = "Size",
     ylab = "Treasure",
     pch = 16)

size.lm <- lm(treasure ~ size, data = capture)

abline(size.lm,
       lty = 1,
       lwd = 2,
       col = "coral")

cannons

plot(x = capture$cannons,
     y = capture$treasure,
     xlab = "Cannons",
     ylab = "Treasure",
     pch = 16)

cannons.lm <- lm(treasure ~ cannons, data = capture)

abline(cannons.lm,
       lty = 1,
       lwd = 2,
       col = "coral")

date

plot(x = capture$date,
     y = capture$treasure,
     xlab = "Date",
     ylab = "Treasure",
     pch = 16)

date.lm <- lm(treasure ~ date, data = capture)

abline(date.lm,
       lty = 1,
       lwd = 2,
       col = "coral")

decorations

plot(x = capture$decorations,
     y = capture$treasure,
     xlab = "Decorations",
     ylab = "Treasure",
     pch = 16)

decorations.lm <- lm(treasure ~ decorations, data = capture)

abline(decorations.lm,
       lty = 1,
       lwd = 2,
       col = "coral")

daysfromshore

plot(x = capture$daysfromshore,
     y = capture$treasure,
     xlab = "Days from shore",
     ylab = "Treasure",
     pch = 16)

daysfromshore.lm <- lm(treasure ~ daysfromshore, data = capture)

abline(daysfromshore.lm,
       lty = 1,
       lwd = 2,
       col = "coral")

speed

plot(x = capture$speed,
     y = capture$treasure,
     xlab = "Speed",
     ylab = "Treasure",
     pch = 16)

speed.lm <- lm(treasure ~ speed, data = capture)

abline(speed.lm,
       lty = 1,
       lwd = 2,
       col = "coral")
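The six plots above repeat the same three steps (scatter plot, `lm()`, `abline()`), so the axis labels are easy to get wrong when copying and pasting. A minimal sketch of one way to avoid that is to wrap the steps in a helper and loop over the variable names. The helper name `plot_with_fit` is hypothetical, and the built-in `mtcars` data stands in for `capture` so that the example runs on its own.

```r
# Hypothetical helper: scatter plot of dv against iv, plus the fitted
# simple-regression line. Axis labels are taken from the column names,
# so they cannot drift out of sync with the data being plotted.
plot_with_fit <- function(data, iv, dv) {
  plot(x = data[[iv]], y = data[[dv]],
       xlab = iv, ylab = dv,
       main = paste(dv, "by", iv),
       pch = 16)
  fit <- lm(reformulate(iv, response = dv), data = data)
  abline(fit, lty = 1, lwd = 2, col = "coral")
  invisible(fit)  # return the model so it can be inspected later
}

# mtcars is a stand-in for the capture data (an assumption for illustration)
ivs <- c("wt", "hp", "disp")
fits <- lapply(setNames(ivs, ivs),
               function(iv) plot_with_fit(mtcars, iv, "mpg"))
```

With the real data, the same loop would run over `c("size", "cannons", "date", "decorations", "daysfromshore", "speed")` with `"treasure"` as the dependent variable.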

Q2

Now do the same for the following categorical independent variables and treasure (hint: try using the new pirateplot() function in the yarrr package! Look at how it works by running ?pirateplot). Again, add appropriate labels and a regression line in each plot.

style

boxplot(treasure ~ style, data = capture, xlab = "Style", ylab = "Treasure")

warnshot

boxplot(treasure ~ warnshot, data = capture, xlab = "Warnshot", ylab = "Treasure")

heardof

boxplot(treasure ~ heardof, data = capture, xlab = "Heardof", ylab = "Treasure")

Q3

For each of the following variables (separately), calculate the median amount of treasure earned for each level of the IV: style, warnshot, decorations (hint: use aggregate or dplyr!)

style

aggregate(treasure ~ style, data = capture, median)
##     style treasure
## 1 classic     2000
## 2  modern     1895

warnshot

aggregate(treasure ~ warnshot, data = capture, median)
##   warnshot treasure
## 1        0     1885
## 2        1     1945

decorations

aggregate(treasure ~ decorations, data = capture, median)
##    decorations treasure
## 1            1   2657.5
## 2            2   1780.0
## 3            3   1905.0
## 4            4   1797.5
## 5            5   1880.0
## 6            6   1855.0
## 7            7   1920.0
## 8            8   1935.0
## 9            9   1935.0
## 10          10   1955.0
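The hint also allows approaches other than `aggregate()`. As a small sketch, base R's `tapply()` gives the same group medians as a named vector instead of a data frame. The built-in `mtcars` data is used here as a stand-in for `capture`, which is an assumption made only so the example is self-contained.

```r
# Median mpg for each number of cylinders, computed two equivalent ways.
# mtcars stands in for the capture data (assumption for illustration).
by_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)
by_tapply  <- tapply(mtcars$mpg, mtcars$cyl, median)

by_formula  # data frame: one row per level of cyl
by_tapply   # named numeric vector: one element per level of cyl
```

The data frame form is convenient for further merging or plotting, while the named vector is handy for quick lookups such as `by_tapply["4"]`.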

Q4

The formula notation for conducting a correlation test with cor.test() is a bit different from regular formula notation. Instead of dv ~ iv, you use ~ dv + iv. For example, the following code will test the correlation between chickens’ age and weight using the ChickWeight dataset.

cor.test(~ Time + weight, 
         data = ChickWeight)
## 
##  Pearson's product-moment correlation
## 
## data:  Time and weight
## t = 36.7252, df = 576, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8109073 0.8599481
## sample estimates:
##       cor 
## 0.8371017

Using the formula notation above, conduct a correlation test between the number of cannons a ship has and its size. What is the p-value?

corr <- cor.test(~ cannons + size, data = capture)
corr$p.value
## [1] 0.3656786

Now do the same with linear regression. What is the p-value?

lmt <- lm(cannons ~ size, data = capture)
s <- summary(lmt)
coefs <- as.data.frame(s$coefficients)  # named coefs to avoid masking base::c()
coefs$"Pr(>|t|)"
## [1] 0.0003083438 0.3656786097
an <- anova(lmt)
an$"Pr(>F)"
## [1] 0.3656786        NA
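The p-value 0.3656786 appears in the correlation test, in the slope row of the regression, and in the ANOVA table. That is expected: with a single predictor, the t test on the regression slope is equivalent to the Pearson correlation test. The sketch below checks this on the `ChickWeight` example from the question, and also shows a slightly more direct way to pull one p-value out of the coefficient matrix by row and column name.

```r
# Correlation test and simple regression on the same pair of variables
ct  <- cor.test(~ Time + weight, data = ChickWeight)
fit <- lm(weight ~ Time, data = ChickWeight)

# Row of the coefficient matrix for the slope, indexed by name
slope <- coef(summary(fit))["Time", ]

# The two t statistics (and hence the two p-values) agree
c(cor_t = unname(ct$statistic), lm_t = unname(slope["t value"]))
c(cor_p = ct$p.value,           lm_p = unname(slope["Pr(>|t|)"]))
```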

Q5

Conduct a linear regression with treasure as the dependent variable, and with all other variables as independent variables. Save the object as treasure.model

treasure.model <- lm(treasure ~ ., data = capture)

Using the summary() function, print the coefficients and main statistics of the regression

s <- summary(treasure.model)
s$coefficients
##                   Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)    749.8956830 351.0513830  2.1361422 3.291256e-02
## size            22.5203254   5.9601665  3.7784725 1.672129e-04
## cannons         19.3817475   1.2932014 14.9874159 6.449723e-46
## stylemodern   -165.0931822  84.6313741 -1.9507326 5.137070e-02
## warnshot        89.0164370  61.0609674  1.4578288 1.452049e-01
## date             0.1508469   0.2313377  0.6520637 5.145114e-01
## heardof         92.1270252  54.7238129  1.6834906 9.259542e-02
## decorations    -96.3997848  10.0248667 -9.6160665 5.463530e-21
## daysfromshore   -8.6118703   2.8179702 -3.0560544 2.302824e-03
## speed            9.2638772   8.3892459  1.1042563 2.697503e-01
s
## 
## Call:
## lm(formula = treasure ~ ., data = capture)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -880.96 -443.16 -211.02   66.08 2427.97 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    749.8957   351.0514   2.136 0.032913 *  
## size            22.5203     5.9602   3.778 0.000167 ***
## cannons         19.3817     1.2932  14.987  < 2e-16 ***
## stylemodern   -165.0932    84.6314  -1.951 0.051371 .  
## warnshot        89.0164    61.0610   1.458 0.145205    
## date             0.1508     0.2313   0.652 0.514511    
## heardof         92.1270    54.7238   1.683 0.092595 .  
## decorations    -96.3998    10.0249  -9.616  < 2e-16 ***
## daysfromshore   -8.6119     2.8180  -3.056 0.002303 ** 
## speed            9.2639     8.3892   1.104 0.269750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 771.4 on 990 degrees of freedom
## Multiple R-squared:  0.2661, Adjusted R-squared:  0.2594 
## F-statistic: 39.88 on 9 and 990 DF,  p-value: < 2.2e-16

What are your conclusions? Which variables are significantly related to treasure and in which direction (i.e., positive or negative)?

The variables significantly related to treasure (p < 0.05) are:

- size (positive)
- cannons (positive)
- decorations (negative)
- daysfromshore (negative)

Which variables are NOT significantly related to treasure?

Variables not significantly related to treasure at the 0.05 level are: style (borderline, p = 0.051), warnshot, date, heardof, and speed.
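Reading significance and direction off a long coefficient table by eye is error-prone. As a rough sketch, the coefficient matrix returned by `coef(summary(...))` can be filtered on its p-value column instead. The data below are simulated (an assumption for illustration, not the capture data): `x1` has a true positive effect and `x2` has none.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)      # x2 has no true effect on y

fit   <- lm(y ~ x1 + x2)
coefs <- coef(summary(fit))      # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# Keep only rows with p < 0.05, then report the sign of each estimate
sig <- coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
data.frame(term      = rownames(sig),
           direction = ifelse(sig[, "Estimate"] > 0, "positive", "negative"))
```

Applied to `treasure.model`, the same filter would list the intercept along with the predictors, so the `(Intercept)` row usually needs to be ignored when answering the question.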

Q6

Now tell me again, what was your conclusion about the relationship between decorations and treasure?

Decorations are significantly related to treasure, with a negative coefficient.

Ok, now plot the relationship between decorations and treasure again. Do you see anything strange?

plot(treasure ~ decorations, data = capture, 
    main = "Treasure/Decoration Relationship")

Repeat your regression analysis from Question 5 again, but ONLY include ships with treasure less than 3500. Save the object as treasure.lt3500.model

treasure.lt3500.model <- lm(treasure ~ ., data = capture, subset = treasure <  3500)

Using the summary function, show me the new results from the regression analysis.

summary(treasure.lt3500.model)
## 
## Call:
## lm(formula = treasure ~ ., data = capture, subset = treasure < 
##     3500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.703  -1.926   2.320   5.420   8.845 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -1.046e+01  3.844e+00   -2.722  0.00662 ** 
## size           2.000e+01  6.540e-02  305.746  < 2e-16 ***
## cannons        1.999e+01  1.405e-02 1422.085  < 2e-16 ***
## stylemodern    6.147e+00  9.457e-01    6.500 1.32e-10 ***
## warnshot       1.001e+02  6.702e-01  149.289  < 2e-16 ***
## date          -6.736e-04  2.561e-03   -0.263  0.79258    
## heardof        1.462e+01  6.046e-01   24.182  < 2e-16 ***
## decorations    3.183e+01  1.137e-01  279.890  < 2e-16 ***
## daysfromshore -1.000e+01  3.107e-02 -321.940  < 2e-16 ***
## speed          9.972e+00  9.173e-02  108.711  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.101 on 905 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 2.587e+05 on 9 and 905 DF,  p-value: < 2.2e-16

Does your conclusion about the relationship between decorations and treasure change? What about the other variables?

All variables except date are now significantly related to treasure, and the coefficient for decorations is now positive.
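The sign flip for decorations is what a small cluster of extreme values can do to a least-squares fit. The sketch below reproduces the effect on simulated data (an assumption for illustration, not the capture data): the true slope is positive, but a group of very large outcomes planted at the lowest value of `x` drags the fitted slope below zero until those rows are excluded.

```r
set.seed(42)                                   # arbitrary seed
x <- rep(1:10, each = 20)                      # 200 observations
y <- 100 + 30 * x + rnorm(length(x), sd = 5)   # true slope is +30
y[x == 1] <- y[x == 1] + 3000                  # plant outliers at the lowest x

sim <- data.frame(x = x, y = y)

full    <- lm(y ~ x, data = sim)                    # outliers included
trimmed <- lm(y ~ x, data = sim, subset = y < 500)  # outliers excluded

c(full = unname(coef(full)["x"]), trimmed = unname(coef(trimmed)["x"]))
```

Plotting `y` against `x` before fitting, as the question asks for decorations, is what reveals that the negative slope comes from a handful of unusual rows and not from the bulk of the data.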

Q7

Conduct a new regression analysis on the capture dataset, but only using the independent variables size, cannons and speed. Call this regression object treasure.model2

treasure.model2 <- lm(treasure ~ size + cannons + speed, data = capture)

Using your regression results from part A, use the predict() function to predict the amount of treasure in a new ship with a size of 60, with 80 cannons, going a speed of 100

new.ship <- data.frame(
  "size" = 60, 
  "cannons" = 80, 
  "speed" = 100)

predict(treasure.model2, new.ship)
##        1 
## 3974.313

Now, imagine that the ship has an extra 2 cannons (82 total). According to your regression analysis, what should the new prediction be?

new.ship$cannons <- 82

Test your prediction in part C!

predict(treasure.model2, new.ship)
##        1 
## 4013.005
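The two predictions differ by 4013.005 - 3974.313 = 38.692, or about 19.35 per extra cannon, which is what the cannons coefficient of `treasure.model2` would have to be (it is not printed above). In a linear model, changing one predictor by k units while holding the others fixed changes the prediction by k times that predictor's coefficient. A minimal check of that rule, using the built-in `mtcars` data as a stand-in (an assumption for illustration):

```r
# Stand-in model: mpg explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

car    <- data.frame(wt = 3, hp = 100)   # hypothetical new observation
before <- predict(fit, car)

car$hp <- car$hp + 2                     # add 2 units of one predictor
after  <- predict(fit, car)

# The change in prediction equals 2 * the coefficient for hp
c(difference = unname(after - before),
  two_slopes = unname(2 * coef(fit)["hp"]))
```

For the ship, the equivalent shortcut would be `3974.313 + 2 * coef(treasure.model2)["cannons"]`, with no second call to `predict()` needed.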

Q8

Let’s generate a dataset called my.data. Copy and paste the following code.

my.data <- data.frame(a = c(1, 5, 3, 6, 3, 5, 3, 8, 3),
                      b = c(8, 3, 1, 4, 2, 6, 4, 8, 3))

Add a new variable to my.data called c, where c = 3 * a - 5 * b

my.data$c <- 3 * my.data$a - 5 * my.data$b

Imagine that you will conduct a linear regression on these data, with c as the dependent variable and a and b as the independent variables. What do you think the coefficients for a and b will be? What do you think the intercept will be?

Run the regression and see if you’re right!

lin <- lm(c ~ ., data = my.data)
summary(lin)
## Warning in summary.lm(lin): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = c ~ ., data = my.data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.266e-15 -9.778e-16  5.658e-16  9.625e-16  2.773e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  9.474e-15  1.889e-15  5.016e+00  0.00241 ** 
## a            3.000e+00  3.702e-16  8.103e+15  < 2e-16 ***
## b           -5.000e+00  3.093e-16 -1.617e+16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.114e-15 on 6 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.391e+32 on 2 and 6 DF,  p-value: < 2.2e-16
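Because c was built exactly as 3 * a - 5 * b, the expected answer is a coefficient of 3 for a, -5 for b, and an intercept of 0. The intercept printed above (9.474e-15) is floating-point rounding error, not a real effect, and its "significant" p-value is an artifact of the essentially perfect fit that the warning mentions. A short sketch that makes this explicit by rounding the coefficients and comparing them with the expected values up to numerical tolerance:

```r
my.data <- data.frame(a = c(1, 5, 3, 6, 3, 5, 3, 8, 3),
                      b = c(8, 3, 1, 4, 2, 6, 4, 8, 3))
my.data$c <- 3 * my.data$a - 5 * my.data$b

lin <- lm(c ~ a + b, data = my.data)

# Rounding removes the floating-point noise in the intercept
round(coef(lin), 10)

# Compare with the expected values (intercept, a, b) up to tolerance
all.equal(unname(coef(lin)), c(0, 3, -5))
```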