My summative

Author

Renad Galal

Published

January 1, 2025

Data Wrangling

install.packages("readxl") # Install the package needed to read Excel files

Installing package into 'C:/Users/renad/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)

package 'readxl' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\renad\AppData\Local\Temp\Rtmp2DlQSQ\downloaded_packages

library(readxl) # Load the Library

Warning: package 'readxl' was built under R version 4.4.2

sharks_data <- read_excel("C:/Users/renad/Desktop/NTU/1 Study/1 slides/1 Rsrch methods and data analyis/summative R/data files/sharks.xlsx")
sharksub_data <- read_excel("C:/Users/renad/Desktop/NTU/1 Study/1 slides/1 Rsrch methods and data analyis/summative R/data files/sharksub.xlsx") # Load both Excel files with read_excel(); the paths must point to where the files are stored
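Because the absolute paths above only work on one computer, a more portable variant could build the paths from a single folder variable. This is a minimal sketch: `data_dir` and the folder name "data files" are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical folder name: adjust to wherever the Excel files are stored
data_dir <- "data files"

# file.path() joins the components with "/" on every platform
sharks_path <- file.path(data_dir, "sharks.xlsx")
sharksub_path <- file.path(data_dir, "sharksub.xlsx")

# Only attempt the import when the file actually exists
if (file.exists(sharks_path)) {
  sharks_data <- readxl::read_excel(sharks_path)
}
if (file.exists(sharksub_path)) {
  sharksub_data <- readxl::read_excel(sharksub_path)
}
```

Keeping the folder in one variable means only one line needs changing if the files are moved.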

head(sharks_data)

# A tibble: 6 × 10
  ID    sex    blotch   BPM weight length   air water  meta depth
  <chr> <chr>   <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 SH001 Female   37.2   148   74.7   187.  37.7  23.4  64.1  53.2
2 SH002 Female   34.5   158   73.4   189.  35.7  21.4  73.7  49.6
3 SH003 Female   36.3   125   71.8   284.  34.8  20.1  54.4  49.4
4 SH004 Male     35.3   161  105.    171.  36.2  21.6  86.3  50.3
5 SH005 Female   37.4   138   67.1   264.  33.6  21.8 108.   49.0
6 SH006 Male     33.5   126  110.    270.  36.4  20.9 109.   46.8

head(sharksub_data) # Check the first rows of each dataset

# A tibble: 6 × 4
  ID    sex    blotch1 blotch2
  <chr> <chr>    <dbl>   <dbl>
1 SH269 Female    36.1    37.2
2 SH163 Female    33.4    34.4
3 SH008 Female    36.3    36.5
4 SH239 Female    35.0    36.0
5 SH332 Female    35.7    36.8
6 SH328 Female    34.9    35.9

View(sharks_data)
View(sharksub_data) # Open each dataset in the data viewer

str(sharks_data)

tibble [500 × 10] (S3: tbl_df/tbl/data.frame)
 $ ID    : chr [1:500] "SH001" "SH002" "SH003" "SH004" ...
 $ sex   : chr [1:500] "Female" "Female" "Female" "Male" ...
 $ blotch: num [1:500] 37.2 34.5 36.3 35.3 37.4 ...
 $ BPM   : num [1:500] 148 158 125 161 138 126 166 135 132 127 ...
 $ weight: num [1:500] 74.7 73.4 71.8 104.6 67.1 ...
 $ length: num [1:500] 187 189 284 171 264 ...
 $ air   : num [1:500] 37.7 35.7 34.8 36.2 33.6 ...
 $ water : num [1:500] 23.4 21.4 20.1 21.6 21.8 ...
 $ meta  : num [1:500] 64.1 73.7 54.4 86.3 108 ...
 $ depth : num [1:500] 53.2 49.6 49.4 50.3 49 ...

str(sharksub_data) # Display the structure (column types and first values) of each dataset

tibble [50 × 4] (S3: tbl_df/tbl/data.frame)
 $ ID     : chr [1:50] "SH269" "SH163" "SH008" "SH239" ...
 $ sex    : chr [1:50] "Female" "Female" "Female" "Female" ...
 $ blotch1: num [1:50] 36.1 33.4 36.3 35 35.7 ...
 $ blotch2: num [1:50] 37.2 34.4 36.5 36 36.8 ...

names(sharks_data)

 [1] "ID"     "sex"    "blotch" "BPM"    "weight" "length" "air"    "water" 
 [9] "meta"   "depth"

names(sharksub_data) # A quick view of the variable names

[1] "ID"      "sex"     "blotch1" "blotch2"
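Beyond looking at structure and names, it can help to count missing values and repeated IDs before any analysis. The sketch below uses a small made-up data frame (`toy_sharks` and its values are hypothetical), so it runs without the Excel files; the same calls should work on `sharks_data`.

```r
# Hypothetical toy data with one missing value and one repeated ID
toy_sharks <- data.frame(
  ID = c("SH001", "SH002", "SH002", "SH004"),
  air = c(37.7, NA, 34.8, 36.2),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_sharks))

# IDs that appear more than once
duplicated_ids <- toy_sharks$ID[duplicated(toy_sharks$ID)]

print(missing_per_column)
print(duplicated_ids)
```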

sharks_data wrangling

library(dplyr) # Load dplyr for group_by() and summarize()


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Grouping by sex, calculating means, and counting each group

# Grouping and summarizing
sharks_data_grouped <- sharks_data %>%
  group_by(sex) %>%
  summarize(
    mean_air_temp = mean(air, na.rm = TRUE),
    mean_water_temp = mean(water, na.rm = TRUE),
    count = n()  # Count of observations in each group
  )

print(sharks_data_grouped) # View the grouped summary

# A tibble: 2 × 4
  sex    mean_air_temp mean_water_temp count
  <chr>          <dbl>           <dbl> <int>
1 Female          35.5            23.1   236
2 Male            35.6            22.9   264

Results

The number of female sharks = 236.
The number of male sharks = 264.
The average air temperature for female sharks was 35.48716°C, and the average water temperature was 23.10948°C.
The average air temperature for male sharks was 35.57826°C, and the average water temperature was 22.94100°C.
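If more columns need the same summary, `across()` from dplyr avoids repeating `mean()` for each one, and rounding inside the summary keeps the reported values readable. This is a sketch on made-up numbers (`toy_sharks` is hypothetical), not the actual shark data.

```r
library(dplyr)

# Hypothetical toy data standing in for sharks_data
toy_sharks <- data.frame(
  sex = c("Female", "Female", "Male", "Male"),
  air = c(35, 36, 34, 37),
  water = c(23, 24, 22, 23)
)

toy_summary <- toy_sharks %>%
  group_by(sex) %>%
  summarize(
    # Apply the same rounded mean to both temperature columns
    across(c(air, water),
           ~ round(mean(.x, na.rm = TRUE), 2),
           .names = "mean_{.col}_temp"),
    count = n() # Count of observations in each group
  )

print(toy_summary)
```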

sharksub_data wrangling

Is there a correlation between the variables air and water?

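sharks_data_summary <- sharks_data %>%
  group_by(ID) %>%
  summarize(mean_air_temp = mean(air, na.rm = TRUE),
            mean_water_temp = mean(water, na.rm = TRUE)) # If each ID appears only once, this just returns each shark's own values

mean_air <- mean(sharks_data$air, na.rm = TRUE)
mean_water <- mean(sharks_data$water, na.rm = TRUE) # Overall means across all sharks

summary(select(sharks_data, air, water)) # Descriptive statistics for both variables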

Type of data? Both variables are continuous.
Do we expect a correlation? Yes, because air and water temperature are often influenced by the same environmental factors and exchange heat with each other.

sharks_data %>% select(air, water) # Isolate the air and water columns to test the correlation between them

# A tibble: 500 × 2
     air water
   <dbl> <dbl>
 1  37.7  23.4
 2  35.7  21.4
 3  34.8  20.1
 4  36.2  21.6
 5  33.6  21.8
 6  36.4  20.9
 7  33.1  21.8
 8  36.8  21.3
 9  35.3  22.2
10  35.7  24.6
# ℹ 490 more rows

air_water_data <- select(sharks_data, air, water) # Create a new table with only air and water columns

write.csv(air_water_data, "air_water_data.csv", row.names = FALSE) # save as csv file

Normality of data

shapiro.test(sharks_data$air)


    Shapiro-Wilk normality test

data:  sharks_data$air
W = 0.95885, p-value = 1.338e-10

shapiro.test(sharks_data$water) # Check for normality to determine whether to use Pearson's correlation (for normally distributed data) or Spearman's correlation (for non-normally distributed data)


    Shapiro-Wilk normality test

data:  sharks_data$water
W = 0.96035, p-value = 2.371e-10

Tip

If the p-value from the Shapiro-Wilk test is < 0.05, the data is not normally distributed.

The p-values for air (1.338e-10) and water (2.371e-10) are both far below 0.05, so the null hypothesis of normality is rejected for both variables. Because the data are not normally distributed, we’ll use Spearman’s correlation.
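The decision rule used here (Shapiro-Wilk first, then Pearson or Spearman) can be written as a small helper. This is a sketch of that rule only: `choose_cor_method` is a hypothetical function name, and the example inputs are simulated rather than taken from the shark data.

```r
# Hypothetical helper: returns "pearson" only if both variables pass Shapiro-Wilk
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  if (p_x < alpha || p_y < alpha) "spearman" else "pearson"
}

set.seed(123)
skewed <- rexp(200)                   # Strongly right-skewed, clearly non-normal
bell_shaped <- qnorm(ppoints(200))    # Evenly spaced normal quantiles

method_mixed <- choose_cor_method(skewed, bell_shaped)
method_normal <- choose_cor_method(bell_shaped, bell_shaped)

print(method_mixed)
print(method_normal)
```

With the real data this would be called as `choose_cor_method(sharks_data$air, sharks_data$water)`.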

cor.test(air_water_data$air, air_water_data$water, method = "spearman") # testing the correlation


    Spearman's rank correlation rho

data:  air_water_data$air and air_water_data$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.05637344

Tip

rho (Spearman’s rank correlation coefficient) measures the strength and direction of the monotonic relationship between two variables
rho = 1 : Perfect positive monotonic relationship (as one variable increases, the other increases).
rho = -1: Perfect negative monotonic relationship (as one variable increases, the other decreases).
rho = 0 : No monotonic relationship between the variables.
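To make the idea of a monotonic relationship concrete, the sketch below uses made-up vectors in which `y` always rises with `x` but not in a straight line. Spearman's rho equals Pearson's correlation computed on the ranks, which is why it reaches exactly 1 or -1 for any perfectly monotonic pattern.

```r
# Made-up values: y doubles each step, so the relationship is monotonic but not linear
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 8, 16, 32)

rho_increasing <- cor(x, y, method = "spearman")
rho_decreasing <- cor(x, -y, method = "spearman")

# Spearman's rho is Pearson's correlation applied to the ranks
rho_from_ranks <- cor(rank(x), rank(y), method = "pearson")

print(c(rho_increasing, rho_decreasing, rho_from_ranks))
```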

Tip

Reject H₀: if the p-value is less than or equal to 0.05, there is significant evidence to reject the null hypothesis in favor of the alternative hypothesis.
Fail to reject H₀: if the p-value is greater than 0.05, there is insufficient evidence to reject the null hypothesis.
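The rule above can be applied directly to the object that `cor.test()` returns, instead of reading the numbers off the printed output. This sketch uses simulated, unrelated variables (`x` and `y` are made up), and `alpha` and `decision` are names chosen for illustration.

```r
set.seed(42)
x <- rnorm(100) # Simulated values, not the shark data
y <- rnorm(100)

result <- cor.test(x, y, method = "spearman")

alpha <- 0.05
rho <- unname(result$estimate) # Spearman's rho
p_value <- result$p.value

# Same threshold as in the tip: reject when p <= 0.05
decision <- if (p_value <= alpha) "reject H0" else "fail to reject H0"

print(c(rho = rho, p_value = p_value))
print(decision)
```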

Results

p-value = 0.2082, which is greater than 0.05, so there is not enough evidence to reject H₀: the correlation is not statistically significant.
rho = -0.05637344, a very weak negative relationship between air and water.
In this sample, water tends to decrease very slightly as air increases, but the effect is too small to distinguish from no relationship.

Conclusion

There is no strong or significant monotonic relationship between air temperature and water temperature.

Visualizing

library(ggplot2) # Load required libraries

Warning: package 'ggplot2' was built under R version 4.4.2

ggplot(air_water_data, aes(x = air, y = water)) +
  geom_point(color = "blue") +
  ggtitle("Correlation between air and water") +
  xlab("air") +
  ylab("water") +
  theme_minimal() # Scatterplot
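A fitted line can make a weak or absent trend easier to see in a scatterplot. The sketch below adds `geom_smooth(method = "lm")` to the same kind of plot, using simulated temperatures (`toy_temps` is hypothetical). Note that a straight line summarizes a linear trend, whereas Spearman's rho measures a monotonic one, so the line is only a visual aid.

```r
library(ggplot2)

set.seed(1)
# Simulated, unrelated temperatures standing in for air_water_data
toy_temps <- data.frame(
  air = rnorm(50, mean = 35, sd = 1.5),
  water = rnorm(50, mean = 23, sd = 1.5)
)

trend_plot <- ggplot(toy_temps, aes(x = air, y = water)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") + # Straight-line fit
  labs(
    title = "Air vs. water temperature (simulated)",
    x = "Air temperature (°C)",
    y = "Water temperature (°C)"
  ) +
  theme_minimal()
```

Calling `print(trend_plot)` draws the figure; a nearly flat line would be consistent with a correlation close to zero.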

I don't think I can do it this easily; I must wrangle the data properly first, but at least I kind of know how to answer the correlation question :)