Here we are at the end of term, for me to create the summative assigned to assess my knowledge of the course, and likely if you’re reading this you’re here to do the assessing. If you are not the assessor…you’re probably lost. To keep it brief, as each student gets 3,000 words of your time in the paper, here you will find the code for the tests done to play with the data enough to know what statistical analyses are appropriate for each question as well as the tests and figures deemed appropriate for utilizing in the paper.
Please excuse the fun I’m having with my wording here as the paper has to be professional and formal.
Libraries are Open
#| label: Opening the librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
library(broom)# This is just calling the libraries that will be used in this project
#| label: Naming file 1sharks <-read.csv("C:/Users/talan/Documents/R/Bowne_files/sharks.csv")# While my files were imported and running I had to do this to make the files render. I honestly do not know why but it made it work
#| label: Naming file 2sharksub <-read.csv("C:/Users/talan/Documents/R/Bowne_files/sharksub.csv")# same as previous chunk just with the second dataset
Question 1
Is there a correlation between the air and water variables?
Checking the Data
The first thing for choosing a test was checking the data. The histogram, qq test and shapiro-wilk test was checking for normality in the distribution of the data. The shapiro-wilk result was small so that was an indication that the data was not normally distributed. The processes were the same for both air and water to check the data. Both variables were continuous and numeric, The data was unpaired and due to the non-normal distribution a non-parametric test was appropriate thus I went with the spearman’s correlation.
Air
#| label: Air Distribution Checkggplot(data = sharks, aes(x = air)) +# creating my histogram of air temps in the shark data geom_histogram(bins =30, col ="purple4", fill ="darkorchid") +# setting the bins aka bars and picking colorstheme_dark() +# lines 67-75 are just setting the theme to dark modetheme(plot.background =element_rect(fill ="black"), panel.background =element_rect(fill ="black"), panel.grid.major =element_line(color ="gray"), panel.grid.minor =element_line(color ="gray30"),axis.text =element_text(color ="white"), axis.title =element_text(color ="white"), plot.title =element_text(color ="white", size =14, face ="bold") ) +scale_color_brewer() +labs(x ="Air (Celsius)", # 78-80 are creating labelsy ="Frequency", title ="Data Distribution of Ambient Air Temperatures")
#| label: QQplot Airpar(bg ="black", col.axis ="white", col.lab ="white", col.main ="white") # turing my plot to dark modeqqnorm(sharks$air, pch =1, col ="white", frame =FALSE, #makiing a QQ plot of the air data in the shark datasetmain ="QQ Plot of Air Data", #labels for the next 3 linesxlab ="Theoretical Quantiles", ylab ="Sample Quantiles")qqline(sharks$air, col ="darkorchid", lwd =2) # adding the line to the plot and making it a pretty color
#| label: Air Shapiro-Wilkshapiro.test(sharks$air)
Shapiro-Wilk normality test
data: sharks$air
W = 0.95885, p-value = 1.338e-10
# shapiro-wilk test to check for normality of air temp distribution in the shark dataset
Water
#| label: Water Distribution Checkggplot(data = sharks, aes(x = water)) +geom_histogram(bins =30, col ="purple4", fill ="darkorchid") +theme_dark() +theme(plot.background =element_rect(fill ="black"), panel.background =element_rect(fill ="black"), panel.grid.major =element_line(color ="gray"), panel.grid.minor =element_line(color ="gray30"), axis.text =element_text(color ="white"), axis.title =element_text(color ="white"), plot.title =element_text(color ="white", size =14, face ="bold") ) +scale_color_brewer() +labs(x ="Water temperature at surface at time of capture (Celsius)", y ="Frequency", title ="Data Distribution of Surface Water Temperature")
# repeated the process for the air histogram with the water temps from the shark dataset
#| label: QQplot Waterpar(bg ="black", col.axis ="white", col.lab ="white", col.main ="white")qqnorm(sharks$water, pch =1, col ="white", frame =FALSE, main ="QQ Plot of Water Data", xlab ="Theoretical Quantiles", ylab ="Sample Quantiles")qqline(sharks$water, col ="darkorchid", lwd =2)
#repeated the process from the air QQ plot with the water temps from the shark dataset
#| label: Water Shapiro-Wilkshapiro.test(sharks$water)
Shapiro-Wilk normality test
data: sharks$water
W = 0.96035, p-value = 2.371e-10
# shapiro-wilk test to check for normality of water temp distribution in the shark dataset
The Stats
After running the spearman’s correlation the p value was above 0.05 and as such above the threshold for significance showing no relationship between the air and water variables.
#| label: Soearmans's Correlation for Air & Watercor.test(x=sharks$air, y=sharks$water, method ='spearman')
Spearman's rank correlation rho
data: sharks$air and sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.05637344
# code to run the spearman's correlation to check for a relationship between air and water temps
Question 2
Do multiple captures have an effect on blotching time?
Checking the Data
Checking the data for this question started just the same as question one with creating histograms, QQ plots and running the shapiro-wilk test to check for normality in the distribution of the data. For this test like the previous the data was continuous and numeric. However this question is based upon looking for differences and the data in this question was paired. The 3 distribution checks for this question showed that the blotching times for the first and second capture was normally distributed. In this care the shapiro-wilk test had a high p value which indicated normal distribution. For this question based upon the checks of the data was the paired T-test.
First Capture
#| label: Blotch1 Distribution Checkggplot(data = sharksub, aes(x = blotch1)) +geom_histogram(bins =30, col ="purple4", fill ="darkorchid") +theme_dark() +theme(plot.background =element_rect(fill ="black"), panel.background =element_rect(fill ="black"), panel.grid.major =element_line(color ="gray"), panel.grid.minor =element_line(color ="gray30"), axis.text =element_text(color ="white"), axis.title =element_text(color ="white"), plot.title =element_text(color ="white", size =14, face ="bold") ) +scale_color_brewer() +labs(x ="Time for blotching to cover 30% of ventral surface (s)", y ="Frequency", title ="Data Distribution of blotching time for first capture")
# Same as the previous histograms but with blotching time for the first capture in the sharksub dataset
#| label: QQplot Blotch1par(bg ="black", col.axis ="white", col.lab ="white", col.main ="white")qqnorm(sharksub$blotch1, pch =1, col ="white", frame =FALSE, main ="QQ Plot of Blotch1 Data", xlab ="Theoretical Quantiles", ylab ="Sample Quantiles")qqline(sharksub$blotch1, col ="darkorchid", lwd =2)
# Same as previous QQ plot code but for first capture blotch time in the sharksub dataset
Shapiro-Wilk normality test
data: sharksub$blotch1
W = 0.97958, p-value = 0.5345
# shapiro-wilk normality check for first capture blotch times of sharksub dataset
####Second Capture
#| label: Blotch2 Distribution Checkggplot(data = sharksub, aes(x = blotch2)) +geom_histogram(bins =30, col ="purple4", fill ="darkorchid") +theme_dark() +theme(plot.background =element_rect(fill ="black"), panel.background =element_rect(fill ="black"), panel.grid.major =element_line(color ="gray"), panel.grid.minor =element_line(color ="gray30"), axis.text =element_text(color ="white"), axis.title =element_text(color ="white"), plot.title =element_text(color ="white", size =14, face ="bold") ) +scale_color_brewer() +labs(x ="Time for blotching to cover 30% of ventral surface (s)", y ="Frequency", title ="Data Distribution of Blotching Time for Second Capture")
# only change in this histogram is checking the second capture blotch time in the sharksub dataset
#| label: QQplot Blotch2par(bg ="black", col.axis ="white", col.lab ="white", col.main ="white")qqnorm(sharksub$blotch2, pch =1, col ="white", frame =FALSE, main ="QQ Plot of Blotch2 Data", xlab ="Theoretical Quantiles", ylab ="Sample Quantiles")qqline(sharksub$blotch2, col ="darkorchid", lwd =2)
# Only change is second capture blotch times in the sharksub dataset
Shapiro-Wilk normality test
data: sharksub$blotch2
W = 0.97936, p-value = 0.5255
# shapiro-wilk normality check for second capture blotch times of sharksub dataset
The Stats
The paired T-test showed that there was a significant difference between the means of the blotch times for the first catch and the second catch. The combined histogram had transparent bars to show where the two different capture blotch times overlap. The combined histogram illustrates how the second capture blotch times trends to longer than the first capture.
Paired t-test
data: sharksub$blotch1 and sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.037176 -0.822301
sample estimates:
mean difference
-0.9297384
# code for the paired t-test which test for difference in means of the first capture and second capture blotch times in the sharksub dataset
sharksub_long <- sharksub %>%pivot_longer(cols =c(blotch1, blotch2), names_to ="Capture", values_to ="BlotchTime") # these three lines of code reshaped the data of blotching time into long format to better visualize it in the combined histogram. utilized the blotching times for the first and second capture in the sharksub dataset ggplot(data = sharksub_long, aes(x = BlotchTime, fill = Capture)) +# the same as previous histograms but calling that long format data to used for the combines histogram geom_histogram(position ="identity", alpha =0.85, bins =30, color ="black") +theme_dark() +theme(plot.background =element_rect(fill ="black"), panel.background =element_rect(fill ="black"), panel.grid.major =element_line(color ="gray"), panel.grid.minor =element_line(color ="gray30"), axis.text =element_text(color ="white"), axis.title =element_text(color ="white"), plot.title =element_text(color ="white", size =14, face ="bold") ) +scale_fill_manual(values =c("blotch1"="darkorchid1", "blotch2"="purple4")) +labs(x ="Time for blotching to cover 30% of ventral surface (s)",y ="Frequency",title ="Distribution of Blotching Time After Capture") +theme(legend.title =element_blank()) # taking out the legend title
Question 3
Is it possible to predict blotching time?
Checking the Data
Initially a mutlitple regression was chosen to check the relationship between multiple variables and how they affected blotching time. However as the only variable that affected blotching time was depth, this became a convenient way to check that a lineear regression model was the most appropriate test, something that can be called a happy accident. The data was unpaired, continuous and numeric for all variables. The question that was asked was looking for a relationship where a dependent variable was being altered by an independent variable.
#| label: Multi-Regression Blotching Timemulti_reg <-lm(blotch ~ BPM + weight + length + air + water + meta + depth, data = sharks) # creates the multi regressionsummary(multi_reg) # displays the summary of the data for the multi regression
Call:
lm(formula = blotch ~ BPM + weight + length + air + water + meta +
depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.83745 -0.66117 -0.00702 0.60110 2.74108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.1405851 1.8958668 5.876 7.74e-09 ***
BPM -0.0019723 0.0031890 -0.618 0.537
weight 0.0016283 0.0033511 0.486 0.627
length 0.0012295 0.0009710 1.266 0.206
air -0.0281474 0.0318707 -0.883 0.378
water -0.0188934 0.0270782 -0.698 0.486
meta -0.0009712 0.0025951 -0.374 0.708
depth 0.5061285 0.0223191 22.677 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.002 on 492 degrees of freedom
Multiple R-squared: 0.514, Adjusted R-squared: 0.507
F-statistic: 74.32 on 7 and 492 DF, p-value: < 2.2e-16
regression_summary <-tidy(multi_reg) #creating a version of the multi regression summary which can easily be displayed in the tableregression_summary %>%# calling the tidy version of the data to be put into the table kable(col.names =c("Variable", "Estimate", "Std. Error", "t-value", "p-value"), #labeling the columns caption ="Regression Model Summary for Blotch Prediction",format ="html") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"), # a shadow occurs when the mouse hovers over the table full_width =FALSE,font_size =14) %>%row_spec(0, bold =TRUE, background ="purple4", color ="white") %>%# the last bit of this chunk is just making pretty colorscolumn_spec(1, background ="purple", bold =TRUE) %>%column_spec(5, background =ifelse(regression_summary$p.value <0.05, "darkviolet", "violet"),color =ifelse(regression_summary$p.value <0.05, "pink", "black"))
Regression Model Summary for Blotch Prediction
Variable
Estimate
Std. Error
t-value
p-value
(Intercept)
11.1405851
1.8958668
5.8762490
0.0000000
BPM
-0.0019723
0.0031890
-0.6184807
0.5365447
weight
0.0016283
0.0033511
0.4859077
0.6272489
length
0.0012295
0.0009710
1.2662214
0.2060330
air
-0.0281474
0.0318707
-0.8831753
0.3775729
water
-0.0188934
0.0270782
-0.6977369
0.4856714
meta
-0.0009712
0.0025951
-0.3742564
0.7083748
depth
0.5061285
0.0223191
22.6769242
0.0000000
The Stats
The linear regression of the relationship between blotch time and depth revealed a positive correlation between the two. A scatter plot with a regression line was created which displayed this positive relationship where blotching time was longer when the sharks were captured at a deeper depth. To further confirm the linear regression was accurate a qq plot of the residuals was done and showed a normal distribution which is the assumption of the residuals from ta linear regrression.
#| label: linear Regression of Depth and Blotchinglin_reg <-lm(blotch ~ depth, data = sharks) #creating the linear regressionsummary(lin_reg) # calling the summary to display the data from the linear regression
Call:
lm(formula = blotch ~ depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.81869 -0.65427 -0.01035 0.58825 2.83116
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.82178 1.11207 8.832 <2e-16 ***
depth 0.50467 0.02216 22.772 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared: 0.5101, Adjusted R-squared: 0.5091
F-statistic: 518.6 on 1 and 498 DF, p-value: < 2.2e-16
#| label: Linear Model 95% Confidenceconfint(lin_reg) # generates a confidence interval for the linear regression
Linear Regression Model Summary: Depth as Predictor of Blotch
Variable
Estimate
Std. Error
t-value
p-value
(Intercept)
9.8217799
1.1120672
8.832002
0
depth
0.5046733
0.0221619
22.772112
0
# I'm not going to keep repeating things anymore
ggplot(sharks, aes(x = depth, y = blotch)) +# creates a scatter plot with blotch time as a dependent variable of depthgeom_point(color ="purple3", size =3) +geom_smooth(method ="lm", color ="darkorchid", se =TRUE) +# adds in the regression linelabs(title ="Blotch vs. Depth",x ="Depth (m)",y ="Blotch (seconds)") +theme_dark() +# do I even need to point out dark mode anymore? theme(plot.background =element_rect(fill ="black"), panel.background =element_rect(fill ="black"), panel.grid.major =element_line(color ="gray"), panel.grid.minor =element_line(color ="gray30"), axis.text =element_text(color ="white"), axis.title =element_text(color ="white"), plot.title =element_text(color ="white", size =14, face ="bold"))
`geom_smooth()` using formula = 'y ~ x'
#| label: QQplot Residualspar(bg ="black", col.axis ="white", col.lab ="white", col.main ="white")qqnorm(lin_reg$residuals, pch =1, col ="white", frame =FALSE, main ="QQ Plot of residuals", xlab ="Theoretical Quantiles", ylab ="Sample Quantiles")qqline(lin_reg$residuals, col ="darkorchid", lwd =2)
Extra
Bits of extra code to supplement the paper
I don’t even know if I’ll use any of these, but just in case.
#| label: Summary of Sex of sharks dataset table kable(table, col.names =c("Sex", "Count"), caption ="Summary of Sharks by Sex") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width = F) %>%row_spec(0, bold =TRUE, color ="white", background ="purple4") %>%row_spec(1:2, background ="darkorchid")
Summary of Sharks by Sex
Sex
Count
Female
236
Male
264
#| label: Summary of Sex of sharksub dataset table2 <- sharksub %>%group_by(sex) %>%summarize(count =n()) %>%ungroup()
#| label: Summary of Sex of sharksub dataset table kable(table2, col.names =c("Sex", "Count"), caption ="Summary of Sharks by Sex") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width = F) %>%row_spec(0, bold =TRUE, color ="white", background ="purple4") %>%row_spec(1:2, background ="darkorchid")
Summary of Sharks by Sex
Sex
Count
Female
25
Male
25
summary(sharksub)
ID sex blotch1 blotch2
Length:50 Length:50 Min. :32.49 Min. :33.47
Class :character Class :character 1st Qu.:34.38 1st Qu.:35.31
Mode :character Mode :character Median :34.94 Median :35.94
Mean :35.03 Mean :35.96
3rd Qu.:35.90 3rd Qu.:36.78
Max. :37.07 Max. :38.18
summary_table <- sharksub %>%summarize(blotch1_Minimum =min(blotch1, na.rm =TRUE), # summarizing the mean, min, and max of the first and second capture blotch times blotch1_Mean =mean(blotch1, na.rm =TRUE),blotch1_Maximum =max(blotch1, na.rm =TRUE),blotch2_Minimum =min(blotch2, na.rm =TRUE),blotch2_Mean =mean(blotch2, na.rm =TRUE),blotch2_Maximum =max(blotch2, na.rm =TRUE)) %>%pivot_longer(cols =everything(), names_to ="Metric", values_to ="Value")summary_table <- summary_table %>%mutate(Variable =case_when(grepl("blotch1", Metric) ~"blotch1",grepl("blotch2", Metric) ~"blotch2"),Statistic =case_when(grepl("Minimum", Metric) ~"Minimum",grepl("Mean", Metric) ~"Mean",grepl("Maximum", Metric) ~"Maximum")) %>%select(Variable, Statistic, Value) %>%arrange(Variable, Statistic)
#| label: Making a table of the blotch times kable(summary_table, caption ="Comparison of Blotch1 and Blotch2")
Comparison of Blotch1 and Blotch2
Variable
Statistic
Value
blotch1
Maximum
37.07165
blotch1
Mean
35.03042
blotch1
Minimum
32.49322
blotch2
Maximum
38.18380
blotch2
Mean
35.96016
blotch2
Minimum
33.46802
#| label: Making a pretty table of blotch times kable(summary_table, col.names =c("Variable", "Statistic", "Value"), caption ="Summary of Blotch Times") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width = F) %>%row_spec(0, bold =TRUE, color ="white", background ="purple4") %>%row_spec(1:6, background ="darkorchid")
Summary of Blotch Times
Variable
Statistic
Value
blotch1
Maximum
37.07165
blotch1
Mean
35.03042
blotch1
Minimum
32.49322
blotch2
Maximum
38.18380
blotch2
Mean
35.96016
blotch2
Minimum
33.46802
#| label: Summary of shark dataset summary(sharks)
ID sex blotch BPM
Length:500 Length:500 Min. :30.78 Min. :119.0
Class :character Class :character 1st Qu.:34.16 1st Qu.:129.0
Mode :character Mode :character Median :35.05 Median :142.0
Mean :35.13 Mean :141.8
3rd Qu.:36.05 3rd Qu.:153.2
Max. :40.08 Max. :166.0
weight length air water
Min. : 65.10 Min. :128.3 Min. :33.00 Min. :20.01
1st Qu.: 75.68 1st Qu.:172.0 1st Qu.:34.42 1st Qu.:21.55
Median : 87.82 Median :211.1 Median :35.43 Median :23.11
Mean : 87.94 Mean :211.0 Mean :35.54 Mean :23.02
3rd Qu.:100.40 3rd Qu.:251.8 3rd Qu.:36.71 3rd Qu.:24.37
Max. :110.94 Max. :291.0 Max. :38.00 Max. :25.99
meta depth
Min. : 50.03 Min. :44.64
1st Qu.: 67.39 1st Qu.:48.90
Median : 82.45 Median :50.14
Mean : 82.04 Mean :50.14
3rd Qu.: 95.97 3rd Qu.:51.35
Max. :112.45 Max. :56.83
summary_table2 <- sharks %>%# pulling the data and stats I want from the sharks dataset summarize(air_Minimum =min(air, na.rm =TRUE),air_Mean =mean(air, na.rm =TRUE),air_Maximum =max(air, na.rm =TRUE),water_Minimum =min(water, na.rm =TRUE),water_Mean =mean(water, na.rm =TRUE),water_Maximum =max(water, na.rm =TRUE)) %>%pivot_longer(cols =everything(),names_to ="Metric",values_to ="Value") %>%mutate(Variable =case_when(grepl("air", Metric) ~"Air",grepl("water", Metric) ~"Water"),Statistic =case_when(grepl("Minimum", Metric) ~"Minimum",grepl("Mean", Metric) ~"Mean",grepl("Maximum", Metric) ~"Maximum")) %>%select(Variable, Statistic, Value) %>%arrange(Variable, Statistic)kable(summary_table2, # creating the table of the data I want from the sharks dataset of the mean, min and max of the air and water temperatures col.names =c("Variable", "Statistic", "Value"),caption ="Summary of Air and Water Temperatures")
Summary of Air and Water Temperatures
Variable
Statistic
Value
Air
Maximum
37.99978
Air
Mean
35.53526
Air
Minimum
33.00454
Water
Maximum
25.98523
Water
Mean
23.02052
Water
Minimum
20.00503
#| label: Another pretty table kable(summary_table2, col.names =c("Variable", "Statistic", "Value"), caption ="Summary Air and Water Temperatures") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width = F) %>%row_spec(0, bold =TRUE, color ="white", background ="purple4") %>%row_spec(1:6, background ="darkorchid")
Summary Air and Water Temperatures
Variable
Statistic
Value
Air
Maximum
37.99978
Air
Mean
35.53526
Air
Minimum
33.00454
Water
Maximum
25.98523
Water
Mean
23.02052
Water
Minimum
20.00503
#| label: Another Summary created...Depth this time depth_summary <- sharks %>%# summarizing the mean, min and max depths of the sharks dataset summarize(Depth_Minimum =min(depth, na.rm =TRUE),Depth_Mean =mean(depth, na.rm =TRUE),Depth_Maximum =max(depth, na.rm =TRUE)) %>%pivot_longer(cols =everything(),names_to ="Statistic",values_to ="Value") %>%mutate(Statistic =case_when(grepl("Minimum", Statistic) ~"Minimum",grepl("Mean", Statistic) ~"Mean",grepl("Maximum", Statistic) ~"Maximum"))kable(depth_summary,col.names =c("Statistic", "Value"), # another table made of the summarized datacaption ="Summary Statistics for Depth in Sharks Data")
Summary Statistics for Depth in Sharks Data
Statistic
Value
Minimum
44.64474
Mean
50.13864
Maximum
56.82916
#| label: Pretty table of depth data kable(depth_summary, col.names =c("Statistic", "Value"), caption ="Summary of Catch Depth") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width = F) %>%row_spec(0, bold =TRUE, color ="white", background ="purple4") %>%row_spec(1:3, background ="darkorchid")