Q0
First, you’ll need to reinstall and load the yarrr package to access the data.
capture <- read.delim("/var/folders/_b/0ddnfxtd72jcc2vv2s6xr72c0000gn/T//RtmpwCTJrf/data4d137b8532a")
What are the names of the columns in the capture dataframe?
names(capture)
## [1] "size" "cannons" "style" "warnshot"
## [5] "date" "heardof" "decorations" "daysfromshore"
## [9] "speed" "treasure"
What are the first few rows of the dataframe?
head(capture)
## size cannons style warnshot date heardof decorations daysfromshore
## 1 48 54 classic 0 172 1 8 28
## 2 51 56 modern 0 15 0 3 6
## 3 50 44 modern 0 63 0 3 23
## 4 54 54 modern 0 362 1 2 23
## 5 50 56 modern 0 183 1 2 12
## 6 51 48 modern 0 279 0 1 3
## speed treasure
## 1 16 2175
## 2 29 2465
## 3 18 1925
## 4 19 2200
## 5 21 2290
## 6 24 2195
Q1
Plot the relationship between each of the following continuous independent variables and treasure. For each plot, add axis and plot labels and a regression line showing the relationship between the independent and dependent variables.
size
plot(x = capture$size,
y = capture$treasure,
xlab = "Size",
ylab = "Treasure",
pch = 16)
size.lm <- lm(treasure ~ size, data = capture)
abline(size.lm,
lty = 1,
lwd = 2,
col = "coral")
cannons
plot(x = capture$cannons,
y = capture$treasure,
xlab = "Cannons",
ylab = "Treasure",
pch = 16)
cannons.lm <- lm(treasure ~ cannons, data = capture)
abline(cannons.lm,
lty = 1,
lwd = 2,
col = "coral")
date
plot(x = capture$date,
y = capture$treasure,
xlab = "Date",
ylab = "Treasure",
pch = 16)
date.lm <- lm(treasure ~ date, data = capture)
abline(date.lm,
lty = 1,
lwd = 2,
col = "coral")
decorations
plot(x = capture$decorations,
y = capture$treasure,
xlab = "Decorations",
ylab = "Treasure",
pch = 16)
decorations.lm <- lm(treasure ~ decorations, data = capture)
abline(decorations.lm,
lty = 1,
lwd = 2,
col = "coral")
daysfromshore
plot(x = capture$daysfromshore,
y = capture$treasure,
xlab = "Days from shore",
ylab = "Treasure",
pch = 16)
daysfromshore.lm <- lm(treasure ~ daysfromshore, data = capture)
abline(daysfromshore.lm,
lty = 1,
lwd = 2,
col = "coral")
speed
plot(x = capture$speed,
y = capture$treasure,
xlab = "Speed",
ylab = "Treasure",
pch = 16)
speed.lm <- lm(treasure ~ speed, data = capture)
abline(speed.lm,
lty = 1,
lwd = 2,
col = "coral")
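The six blocks above repeat the same three steps (scatterplot, fit, line), so they could be wrapped in a small helper. The sketch below is an assumption, not part of the original solution: `plot_with_fit` is a hypothetical name, and it is demonstrated on the built-in `mtcars` data so that it runs without the `capture` file.

```r
# Hypothetical helper (not from the original solution):
# scatterplot of dv against iv, with the fitted regression line added.
plot_with_fit <- function(iv, dv, data) {
  fit <- lm(reformulate(iv, response = dv), data = data)
  plot(x = data[[iv]],
       y = data[[dv]],
       xlab = iv,
       ylab = dv,
       main = paste(dv, "by", iv),
       pch = 16)
  abline(fit, lty = 1, lwd = 2, col = "coral")
  invisible(fit)  # return the model so it can be inspected later
}

# Demonstration on built-in data; one plot per independent variable.
fits <- lapply(c("wt", "hp"), plot_with_fit, dv = "mpg", data = mtcars)
```

With the assignment data, the equivalent call would be something like `plot_with_fit("size", "treasure", capture)`, which also keeps the axis labels tied to the variable actually plotted.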
Q2
Now do the same for the following categorical independent variables and treasure (hint: try using the new pirateplot() function in the yarrr package! Look at how it works by running ?pirateplot). Again, add appropriate labels and a regression line in each plot.
style
boxplot(treasure ~ style, data = capture, xlab = "Style", ylab = "Treasure")
warnshot
boxplot(treasure ~ warnshot, data = capture, xlab = "Warnshot", ylab = "Treasure")
heardof
boxplot(treasure ~ heardof, data = capture, xlab = "Heardof", ylab = "Treasure")
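The question hints at pirateplot(), while the answers above use boxplot(). A minimal sketch of the pirateplot() version follows. It assumes the yarrr package is installed (the guard skips the plot otherwise) and uses the built-in `ChickWeight` data in place of `capture` so that it runs on its own.

```r
# Only draw the plot if yarrr is available on this machine.
have_yarrr <- requireNamespace("yarrr", quietly = TRUE)

if (have_yarrr) {
  # Same formula interface as boxplot(): dv ~ categorical iv
  yarrr::pirateplot(formula = weight ~ Diet,
                    data = ChickWeight,
                    xlab = "Diet",
                    ylab = "Weight",
                    main = "Chick weight by diet")
}
```

For the assignment data the formula would become `treasure ~ style`, `treasure ~ warnshot`, or `treasure ~ heardof` with `data = capture`.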
Q3
For each of the following variables (separately), calculate the median amount of treasure earned for each level of the IV: style, warnshot, decorations (hint: use aggregate or dplyr!)
style
aggregate(treasure ~ style, data = capture, median)
## style treasure
## 1 classic 2000
## 2 modern 1895
warnshot
aggregate(treasure ~ warnshot, data = capture, median)
## warnshot treasure
## 1 0 1885
## 2 1 1945
decorations
aggregate(treasure ~ decorations, data = capture, median)
## decorations treasure
## 1 1 2657.5
## 2 2 1780.0
## 3 3 1905.0
## 4 4 1797.5
## 5 5 1880.0
## 6 6 1855.0
## 7 7 1920.0
## 8 8 1935.0
## 9 9 1935.0
## 10 10 1955.0
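As a cross-check on aggregate(), the same group medians can be computed with base R's tapply(). The sketch below uses the built-in `ChickWeight` data rather than `capture`, purely so that it is self-contained. The variable names are illustrative.

```r
# Median weight per diet, computed two ways.
by_aggregate <- aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
by_tapply <- tapply(ChickWeight$weight, ChickWeight$Diet, median)

# Both approaches order the groups by factor level, so the values line up.
all(by_aggregate$weight == as.numeric(by_tapply))
```

aggregate() returns a data frame, which is convenient for further processing, while tapply() returns a named array that is handy for quick lookups such as `by_tapply["1"]`.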
Q4
The formula notation for conducting a correlation test with cor.test() is a bit different from regular formula notation. Instead of dv ~ iv, you use ~ dv + iv. For example, the following code will test the correlation between chickens’ age and weight using the ChickWeight dataset.
cor.test(~ Time + weight,
data = ChickWeight)
##
## Pearson's product-moment correlation
##
## data: Time and weight
## t = 36.7252, df = 576, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8109073 0.8599481
## sample estimates:
## cor
## 0.8371017
Using the formula notation above, conduct a correlation test between the number of cannons a ship has and its size. What is the p-value?
corr <- cor.test(~ cannons + size, data = capture)
corr$p.value
## [1] 0.3656786
Now do the same with linear regression. What is the p-value?
lmt <- lm(cannons ~ size, data = capture)
s <- summary(lmt)
coefs <- s$coefficients
coefs.df <- as.data.frame(coefs)
coefs.df$"Pr(>|t|)"
## [1] 0.0003083438 0.3656786097
an <- anova(lmt)
an$"Pr(>F)"
## [1] 0.3656786 NA
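The correlation test and the regression report the same p-value (0.3656786) because, with a single predictor, the t-test on the slope is the same test as the t-test on Pearson's r. The sketch below checks this on the built-in `cars` data, which stands in for `capture` here.

```r
# Correlation test between the two variables
ct <- cor.test(~ speed + dist, data = cars)

# Simple regression with the same two variables
fit <- lm(dist ~ speed, data = cars)
slope <- summary(fit)$coefficients["speed", ]

# The t statistic and p-value of the slope match those of the correlation
same_t <- all.equal(unname(ct$statistic), unname(slope["t value"]))
same_p <- all.equal(ct$p.value, unname(slope["Pr(>|t|)"]))
c(same_t = isTRUE(same_t), same_p = isTRUE(same_p))
```

The equivalence holds only for simple regression. Once a second predictor is added, each slope is tested while holding the others constant, and its p-value no longer matches the pairwise correlation test.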
Q5
Conduct a linear regression with treasure as the dependent variable, and with all other variables as independent variables. Save the object as treasure.model
treasure.model <- lm(treasure ~ ., data = capture)
Using the summary() function, print the coefficients and main statistics of the regression
s <- summary(treasure.model)
s$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 749.8956830 351.0513830 2.1361422 3.291256e-02
## size 22.5203254 5.9601665 3.7784725 1.672129e-04
## cannons 19.3817475 1.2932014 14.9874159 6.449723e-46
## stylemodern -165.0931822 84.6313741 -1.9507326 5.137070e-02
## warnshot 89.0164370 61.0609674 1.4578288 1.452049e-01
## date 0.1508469 0.2313377 0.6520637 5.145114e-01
## heardof 92.1270252 54.7238129 1.6834906 9.259542e-02
## decorations -96.3997848 10.0248667 -9.6160665 5.463530e-21
## daysfromshore -8.6118703 2.8179702 -3.0560544 2.302824e-03
## speed 9.2638772 8.3892459 1.1042563 2.697503e-01
s
##
## Call:
## lm(formula = treasure ~ ., data = capture)
##
## Residuals:
## Min 1Q Median 3Q Max
## -880.96 -443.16 -211.02 66.08 2427.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 749.8957 351.0514 2.136 0.032913 *
## size 22.5203 5.9602 3.778 0.000167 ***
## cannons 19.3817 1.2932 14.987 < 2e-16 ***
## stylemodern -165.0932 84.6314 -1.951 0.051371 .
## warnshot 89.0164 61.0610 1.458 0.145205
## date 0.1508 0.2313 0.652 0.514511
## heardof 92.1270 54.7238 1.683 0.092595 .
## decorations -96.3998 10.0249 -9.616 < 2e-16 ***
## daysfromshore -8.6119 2.8180 -3.056 0.002303 **
## speed 9.2639 8.3892 1.104 0.269750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 771.4 on 990 degrees of freedom
## Multiple R-squared: 0.2661, Adjusted R-squared: 0.2594
## F-statistic: 39.88 on 9 and 990 DF, p-value: < 2.2e-16
What are your conclusions? Which variables are significantly related to treasure and in which direction (i.e., positive or negative)?
The variables significantly related to treasure (p < .05) are size (positive), cannons (positive), decorations (negative), and daysfromshore (negative).
Which variables are NOT significantly related to treasure?
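The variables not significantly related to treasure at the .05 level are style (p = .051, just above the cutoff), warnshot, date, heardof, and speed.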
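Reading significance and direction off a long coefficient table by eye is error-prone, so the classification can also be computed from summary()$coefficients. This is a sketch on the built-in `mtcars` data, with a conventional .05 cutoff as an assumption. The object names are illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

# One row per term: sign of the estimate and whether p < .05
tab <- data.frame(
  term = rownames(coefs),
  direction = ifelse(coefs[, "Estimate"] > 0, "positive", "negative"),
  significant = coefs[, "Pr(>|t|)"] < 0.05,
  row.names = NULL
)
tab
```

Applied to the assignment, replacing `fit` with `treasure.model` would produce the same kind of table for the ten terms printed above.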
Q6
Now tell me again, what was your conclusion about the relationship between decorations and treasure?
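Decorations is significantly and negatively related to treasure: each additional decoration is associated with about 96 less treasure, holding the other variables constant.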
Ok, now plot the relationship between decorations and treasure again. Do you see anything strange?
plot(treasure ~ decorations, data = capture,
main = "Treasure/Decoration Relationship")
Repeat your regression analysis from Question 5 again, but ONLY include ships with treasure less than 3500. Save the object as treasure.lt3500.model
treasure.lt3500.model <- lm(treasure ~ ., data = capture, subset = treasure < 3500)
Using the summary function, show me the new results from the regression analysis.
summary(treasure.lt3500.model)
##
## Call:
## lm(formula = treasure ~ ., data = capture, subset = treasure <
## 3500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.703 -1.926 2.320 5.420 8.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.046e+01 3.844e+00 -2.722 0.00662 **
## size 2.000e+01 6.540e-02 305.746 < 2e-16 ***
## cannons 1.999e+01 1.405e-02 1422.085 < 2e-16 ***
## stylemodern 6.147e+00 9.457e-01 6.500 1.32e-10 ***
## warnshot 1.001e+02 6.702e-01 149.289 < 2e-16 ***
## date -6.736e-04 2.561e-03 -0.263 0.79258
## heardof 1.462e+01 6.046e-01 24.182 < 2e-16 ***
## decorations 3.183e+01 1.137e-01 279.890 < 2e-16 ***
## daysfromshore -1.000e+01 3.107e-02 -321.940 < 2e-16 ***
## speed 9.972e+00 9.173e-02 108.711 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.101 on 905 degrees of freedom
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
## F-statistic: 2.587e+05 on 9 and 905 DF, p-value: < 2.2e-16
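Does your conclusion about the relationship between decorations and treasure change? What about the other variables?
Yes. All variables except date are now significantly related to treasure, and the decorations coefficient flips from negative (about -96) to positive (about +32). The style coefficient also changes sign (from about -165 to about +6), and the residual standard error drops from 771.4 to 8.1.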
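A few extreme observations are enough to reverse the sign of a slope. The sketch below illustrates this with made-up numbers (not the `capture` data): ten ordinary ships whose treasure rises by about 10 per decoration, plus three hypothetical outliers with few decorations and very large treasure.

```r
# Made-up data: treasure rises by about 10 per decoration for ordinary ships
ordinary <- data.frame(decorations = 1:10,
                       treasure = 10 * (1:10) + rep(c(1, -1), times = 5))

# Three hypothetical outliers: few decorations, very large treasure
outliers <- data.frame(decorations = c(1, 1, 1),
                       treasure = c(500, 500, 500))

ships <- rbind(ordinary, outliers)

full.fit <- lm(treasure ~ decorations, data = ships)
trimmed.fit <- lm(treasure ~ decorations, data = ships,
                  subset = treasure < 200)

coef(full.fit)["decorations"]     # negative: the outliers dominate the fit
coef(trimmed.fit)["decorations"]  # close to 10 once they are excluded
```

This is the same pattern as in the two regressions above: the `subset` argument removes the extreme ships, and the remaining data show the underlying positive relationship.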
Q7
Conduct a new regression analysis on the capture dataset, but only using the independent variables size, cannons, and speed. Call this regression object treasure.model2
treasure.model2 <- lm(treasure ~ size + cannons + speed, data = capture)
Using your regression results from part A, use the predict() function to predict the amount of treasure in a new ship with a size of 60, with 80 cannons, going a speed of 100
new.ship <- data.frame(
"size" = 60,
"cannons" = 80,
"speed" = 100)
predict(treasure.model2, new.ship)
## 1
## 3974.313
Now, imagine that the ship has an extra 2 cannons (82 total). According to your regression analysis, what should the new prediction be?
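# Each extra cannon adds the cannons coefficient, so the prediction should
# rise by 2 * coef(treasure.model2)["cannons"]
new.ship$cannons <- 82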
Test your prediction in part C!
predict(treasure.model2, new.ship)
## 1
## 4013.005
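The two predictions above differ by 4013.005 - 3974.313 = 38.692, which is about 19.35 per extra cannon. In a linear model, changing one predictor by k units while holding the others fixed always shifts the prediction by k times that predictor's coefficient. The sketch below confirms this on the built-in `mtcars` data, used here only as a stand-in for `capture`.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

car.a <- data.frame(wt = 3, hp = 100)
car.b <- data.frame(wt = 3, hp = 102)  # same car with 2 more units of hp

# Difference between the two predictions
diff.pred <- unname(predict(fit, car.b) - predict(fit, car.a))

# It equals 2 * the hp coefficient (up to floating-point error)
matches <- all.equal(diff.pred, 2 * unname(coef(fit)["hp"]))
matches
```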
Q8
Let’s generate a dataset called my.data. Copy and paste the following code.
my.data <- data.frame(a = c(1, 5, 3, 6, 3, 5, 3, 8, 3),
b = c(8, 3, 1, 4, 2, 6, 4, 8, 3))
Add a new variable to my.data called c, where c = 3 * a - 5 * b
my.data$c <- 3 * my.data$a - 5 * my.data$b
Imagine that you will conduct a linear regression on these data, with c as the dependent variable and a and b as the independent variables. What do you think the coefficients for a and b will be? What do you think the intercept will be?
Run the regression and see if you’re right!
lin <- lm(c ~ ., data = my.data)
summary(lin)
## Warning in summary.lm(lin): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = c ~ ., data = my.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.266e-15 -9.778e-16 5.658e-16 9.625e-16 2.773e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.474e-15 1.889e-15 5.016e+00 0.00241 **
## a 3.000e+00 3.702e-16 8.103e+15 < 2e-16 ***
## b -5.000e+00 3.093e-16 -1.617e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.114e-15 on 6 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.391e+32 on 2 and 6 DF, p-value: < 2.2e-16
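Because c is built as exactly 3 * a - 5 * b, with no constant and no noise, the expected coefficients are 3 for a, -5 for b, and 0 for the intercept. The intercept estimate of 9.474e-15 in the output is floating-point noise, and its apparent significance is a side effect of the essentially perfect fit that the warning flags. The check below is a sketch that compares the estimates to those expected values within numerical tolerance.

```r
my.data <- data.frame(a = c(1, 5, 3, 6, 3, 5, 3, 8, 3),
                      b = c(8, 3, 1, 4, 2, 6, 4, 8, 3))
my.data$c <- 3 * my.data$a - 5 * my.data$b

lin <- lm(c ~ ., data = my.data)

# Expected values: intercept 0, a = 3, b = -5
expected <- c(0, 3, -5)
recovered <- all.equal(unname(coef(lin)), expected)
recovered
```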