Formative Assessment 1: Data Exploration & Choosing Statistical Analysis

Author

Wendy Furness, Natasha Richardson-Law, Oliver Barnes, Weronika Staniak, Tia Fernandez, Nick Park

FORMATIVE ASSESSMENT 1 Group activity: Data Exploration & Choosing Statistical Analyses

The question

Is there a difference in wingspan between female and male mosquitoes?

In mosquitoes, males and females exhibit distinct behaviours, reproductive roles, and physiological traits, which influence their ecological interactions and mating strategies.

Pre-work

  • setting working directory

  • installing and loading packages

  • setting up YAML (see above)

  • loading the data: the file was downloaded from NOW and imported using the Import Dataset function. The data must then also be loaded within the Quarto document itself, as below

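    # Load packages first; tidyverse already attaches ggplot2 and dplyr,
    # so the two explicit calls are redundant but harmless
    library(tidyverse)
    library(ggplot2)
    library(rlang)
    library(dplyr)

    # Read the tab-delimited data file (path is specific to the author's machine)
    mosquitos <- read.delim("~/Desktop/MSC/RMDA/Assessments/RMDA Assessment1/mosquitos.txt")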

Exploring the data & performing some basic descriptive analysis

  • there are 3 variables (2 categorical and one numerical) and 100 observations in the dataset

  • there are no missing values (NA) or unexpected values in any column

    glimpse(mosquitos) 
    Rows: 100
    Columns: 3
    $ ID   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
    $ wing <dbl> 37.83925, 50.63106, 39.25539, 38.05383, 25.15835, 57.95632, 46.58…
    $ sex  <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", …
    str(mosquitos)
    'data.frame':   100 obs. of  3 variables:
     $ ID  : int  1 2 3 4 5 6 7 8 9 10 ...
     $ wing: num  37.8 50.6 39.3 38.1 25.2 ...
     $ sex : chr  "f" "f" "f" "f" ...
  • some basic descriptive statistics on the wing data, calculated first for the whole dataset (both sexes combined): mean, standard deviation, median, variance and range of wingspan:

mean(mosquitos$wing)
[1] 48.77822
sd(mosquitos$wing)
[1] 9.680198
median(mosquitos$wing)
[1] 48.41719
var(mosquitos$wing)
[1] 93.70623
range(mosquitos$wing)
[1] 25.15835 69.81825
summary(mosquitos$wing) #or could just use the summarise function for all these values
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.16   41.42   48.42   48.78   56.24   69.82 
  • it is more informative to group the data by sex and then summarise, so that the two sexes can be compared:
summary_stats <- mosquitos %>%
  group_by(sex) %>%
  summarise(
    mean_wing = mean(wing),
    sd_wing = sd(wing),
    median_wing = median(wing),
    var_wing = var(wing),
    min_wing = min(wing),
    max_wing = max(wing))
print(summary_stats)
# A tibble: 2 × 7
  sex   mean_wing sd_wing median_wing var_wing min_wing max_wing
  <chr>     <dbl>   <dbl>       <dbl>    <dbl>    <dbl>    <dbl>
1 f          47.2    9.99        46.4     99.7     25.2     69.8
2 m          50.4    9.19        52.0     84.4     27.4     66.1
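
The statement earlier in this section that there are no missing or unexpected values can be checked explicitly rather than by eye. The sketch below is illustrative only: `toy` is a hypothetical stand-in for the `mosquitos` data frame, with made-up values that deliberately include one NA and one mis-coded sex so that the checks have something to find.

```r
# Hypothetical stand-in for the mosquitos data frame (made-up values, NOT the real data)
toy <- data.frame(
  ID   = 1:6,
  wing = c(41.2, 38.5, NA, 52.0, 47.3, 55.1),
  sex  = c("f", "f", "f", "m", "m", "M")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Any sex codes other than the expected "f" and "m"
unexpected_sex <- setdiff(unique(toy$sex), c("f", "m"))
print(unexpected_sex)

# Number of observations per sex code
n_per_sex <- table(toy$sex, useNA = "ifany")
print(n_per_sex)
```

Running the same three checks on `mosquitos` instead of `toy` should report zero NAs per column and no unexpected sex codes if the description above holds.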

Visualising the data:

  1. Are there similar numbers of males and females in the dataset? Sex is a single categorical variable, so a bar graph is a suitable way to display it. The graph shows that there is no difference in the number of males and females in the data, so it should be possible to make useful comparisons:
ggplot(mosquitos, aes(x = sex, fill = sex)) + 
  geom_bar(color = "black",
           show.legend = FALSE,
           alpha = 0.5) +
  labs(title = "Number of Mosquitos by Sex", 
       x = "Sex",
       y = "Count") +
  theme_minimal()

  2. Is the distribution of wing size reasonably normal? Wing size is a single numerical variable, so a histogram, density plot or box plot could be used.

    A histogram shows the shape of the data (how the values are distributed), the spread (the amount of variability) and where the centre of the data lies. The histogram below displays the frequency of each wingspan range, with bars coloured by sex, which helps identify the most common sizes among male and female mosquitoes and allows the two distributions to be compared. Histograms also help to reveal skewness, indicating whether wingspan is approximately normally distributed, and can flag potential outliers.

    ggplot(mosquitos, aes(x = wing,
                          fill = sex)) + 
      geom_histogram(bins = 50,
                     alpha = 0.6) +
        labs(x = "WingSpan (mm)",
             y = "Count",
             title = "Histogram of Wings by sex")

A density plot, by contrast, shows a smoothed version of the distribution for each sex, which makes the overlap between male and female wingspans easy to see. Density plots can reveal whether the data have multiple modes (peaks), which might indicate distinct subgroups within a sex that would not be evident in a box plot. Because they show the entire distribution, they also give insight into how much wingspan varies within each sex and can highlight subtle differences between the two.

ggplot(mosquitos, aes(x = wing, 
                      fill = sex)) + 
  geom_density(trim = FALSE, 
               color = "black",
               alpha = 0.3) +
    labs(x = "WingSpan (mm)",
         y = "Density",
         title = "Density plot of wing measures by sex")
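
As a complement to these visual checks, a formal normality test could be run separately for each sex. The sketch below is only illustrative: `toy` is a simulated stand-in for the `mosquitos` data frame (made-up values, not the real measurements), and the Shapiro-Wilk test is one option among several rather than something the original analysis used.

```r
# Simulated stand-in for the mosquitos data (hypothetical values, NOT the real data)
set.seed(1)
toy <- data.frame(
  wing = c(rnorm(50, mean = 47, sd = 10), rnorm(50, mean = 50, sd = 9)),
  sex  = rep(c("f", "m"), each = 50)
)

# Shapiro-Wilk test within each sex; a small p-value suggests
# a departure from normality in that group
shapiro_p <- tapply(toy$wing, toy$sex, function(x) shapiro.test(x)$p.value)
print(shapiro_p)
```

With groups of around 50, a non-significant result is weak evidence at best, so it is sensible to read the test alongside the plots rather than instead of them.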

Box plots summarise the distribution of wingspan for each sex, showing the median, the quartiles, and any potential outliers that might warrant further investigation; none are visible here. They allow an at-a-glance comparison of male and female mosquitoes and help identify whether one sex generally has a larger wingspan than the other.

ggplot(mosquitos, aes(x = wing, y = sex,
                      fill = sex)) + 
  geom_boxplot(alpha = 0.4) +
    labs(x = "WingSpan (mm)",
         y = "Sex",
         title = "Boxplot of Wings by sex")

And adding the individual data points on top of the box plot:

ggplot(mosquitos, aes(x = wing, y = sex,
                      fill = sex, colour = sex)) + 
  geom_boxplot(alpha = 0.4) +
  geom_jitter() +
  labs(x = "WingSpan (mm)",
       y = "Sex",
       title = "Boxplot of Wings by sex")

Performing statistical analysis

Using an independent t-test to compare wing size by sex

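One of the obvious questions to answer is whether there is a statistically significant difference between male and female wingspan. This requires a test that compares the mean of a numerical variable (wing) between the two levels of a categorical variable (sex), i.e. a two-sample test of means. Note that t.test() applies the Welch correction by default, which does not assume equal variances in the two groups.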

t_test_result <- t.test(wing ~ sex, data = mosquitos)
print(t_test_result)

    Welch Two Sample t-test

data:  wing by sex
t = -1.6686, df = 97.324, p-value = 0.09842
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
 -7.0098735  0.6064862
sample estimates:
mean in group f mean in group m 
       47.17738        50.37907 
p_value <- t_test_result$p.value
print(paste("P-value:", p_value))
[1] "P-value: 0.0984171086062613"
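
To make the output above less of a black box, the Welch statistic can be recomputed from the grouped summary table. This is a sketch under two assumptions: 50 mosquitoes per sex (implied by 100 observations split evenly between the sexes, but not printed anywhere in the output), and the rounded variances 99.7 and 84.4, so the results agree with t.test() only approximately.

```r
# Values copied from the grouped summary table and the t-test output
mean_f <- 47.17738
mean_m <- 50.37907
var_f  <- 99.7   # rounded in the printed tibble
var_m  <- 84.4   # rounded in the printed tibble
n_f <- 50        # ASSUMPTION: equal group sizes, 100 rows in total
n_m <- 50

# Squared standard error contributed by each group
se2_f <- var_f / n_f
se2_m <- var_m / n_m

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean_f - mean_m) / sqrt(se2_f + se2_m)
df <- (se2_f + se2_m)^2 / (se2_f^2 / (n_f - 1) + se2_m^2 / (n_m - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

print(round(c(t = t_stat, df = df, p = p_val), 3))
```

The three numbers should land close to the reported t = -1.6686, df = 97.324 and p = 0.098, which also supports the assumption of 50 individuals per sex.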

The p-value is 0.098, which is greater than the typical significance level of 0.05, so the result is not statistically significant: there is not enough evidence to conclude that mean wing size differs between the sexes. Consistent with this, the 95% confidence interval for the difference in means (-7.01 to 0.61) includes zero. The analysis could be repeated on a much larger dataset to reassess.
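
How much larger a dataset might be needed can be estimated with a rough sample-size calculation. The sketch below uses power.t.test() from base R's stats package and treats the observed difference in means (about 3.2) and a pooled standard deviation of about 9.6 (from the variances 99.7 and 84.4) as if they were the true population values, which is a strong assumption. The 80% power target is a conventional choice, not something specified in the assessment.

```r
# Observed values taken from the summary table and t-test output
delta   <- 50.37907 - 47.17738        # difference in group means
sd_pool <- sqrt((99.7 + 84.4) / 2)    # pooled SD, assuming equal group sizes

# ASSUMPTION: the observed difference and SD equal the true population values
pw <- power.t.test(delta = delta, sd = sd_pool,
                   sig.level = 0.05, power = 0.80,
                   type = "two.sample", alternative = "two.sided")

# Required number of mosquitoes PER SEX, rounded up
n_per_group <- ceiling(pw$n)
print(n_per_group)
```

power.t.test() is based on the classic equal-variance t-test rather than the Welch version used above, so the answer is a ballpark figure only. With group variances as similar as these, the two approaches should give much the same result.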