mosquitos <- read.delim("C:/Users/Owen King/Downloads/mosquitos.txt") # first, load the data set into R using the code above, with the file path tailored to wherever the data file was downloaded.
View(mosquitos) # this opens the data set in a separate tab.
1 Formative 1 - Mosquitoes:
I chose the data set recording the ID, sex, and wingspan (mm) of 100 mosquitoes.
2 Question:
Can you identify the sex of a mosquito by measuring its wingspan?
3 Hypothesis:
The wingspan of a mosquito is a sexually dimorphic feature.
4 What statistical test is needed?
Because the predictor variable (wingspan, mm) is quantitative and the outcome variable (sex) is categorical with two levels, a logistic regression model is the best fit for this data set.
Running the logistic regression model gives a positive coefficient of 0.0353 for wingspan, but this is not statistically significant (p = 0.100). It is therefore not possible to reliably identify whether a mosquito is male or female based on its wingspan, so the hypothesis is not supported by these data.
5 Graphical analysis:
As the graph shows, there is a slight difference in the distribution of wingspans, with males having a slightly larger average wingspan than females. However, the boxplots overlap substantially, so wingspan alone cannot be relied on to determine the sex of a mosquito. Expanding the data set and adding more variables could reveal more.
6 R code used:
- loading and viewing the data set in R
- creating the graphs which can be used for analysis
library(ggplot2) # this line of code loads the ggplot2 package in R
ggplot(mosquitos, aes(x = sex, y = wing, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = c("m" = "skyblue", "f" = "pink")) +
  scale_x_discrete(labels = c("m" = "male", "f" = "female")) +
  labs(title = "Comparison of wingspan for different sex of mosquitoes",
       x = "Sex", y = "Wingspan (mm)")
# the ggplot() function starts the plot and identifies the data set the boxplot will be built from. The aes() function maps variables to the x and y axes: here the categorical variable "sex" is mapped to x and the numerical variable "wing" is mapped to y. The last argument inside these brackets, fill = sex, tells R to colour the boxes according to sex.
# geom_boxplot() tells R that we want to create a boxplot.
# scale_fill_manual() tells R which colours to fill the boxes with, rather than letting R allocate colours automatically. In this case I chose pink for females and skyblue for males.
# scale_x_discrete(labels = ) changes the labels on the x axis to whatever is desired, in this case changing "f" and "m" to "female" and "male".
# finally, the labs() function adds titles to the graph: the main title is "Comparison of wingspan for different sex of mosquitoes", the x axis is labelled "Sex" and the y axis "Wingspan (mm)".
- running the logistic regression model
str(mosquitos$sex)
 chr [1:100] "f" "f" "f" "f" "f" "f" "f" "f" "f" "f" "f" "f" "f" "f" "f" ...
mosquitos$sex <- as.factor(mosquitos$sex)
# before running any statistical test we need to check whether sex is stored as a factor in the mosquitos data set. The str() output shows that it is a character vector (chr), not a factor, so it has to be converted using the as.factor() function. Combined with the assignment operator <-, this overwrites the sex column in mosquitos with a factor. This is needed because, to run a logistic regression model, R has to be able to code each value in the sex column as one of two levels. If we run str(mosquitos$sex) again we can see that f (female) is coded as 1 and m (male) is coded as 2. glm() treats the first level (f) as the reference and models the probability of the second level (m).
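# the level coding described above can be checked on a small example. This is a minimal sketch using a made-up toy vector (sex_chr is hypothetical and is not the mosquitos data). It also shows relevel(), in case the model should describe the probability of "f" instead.

```r
# Toy vector standing in for mosquitos$sex (hypothetical values, not the real data)
sex_chr <- c("m", "f", "f", "m", "f")
sex_fct <- as.factor(sex_chr)

levels(sex_fct)      # levels are sorted alphabetically: "f" then "m"
as.integer(sex_fct)  # underlying integer codes: f = 1, m = 2

# To make "m" the reference level (so a binomial glm would model "f"), relevel the factor
sex_releveled <- relevel(sex_fct, ref = "m")
levels(sex_releveled)
```

# the same two checks, levels() and as.integer(), can be run on mosquitos$sex after the as.factor() conversion.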
lrmodel <- glm(sex ~ wing, data = mosquitos, family = "binomial")
# after preparing the data for the statistical test we can run the logistic regression model. The assignment operator <- stores the fitted model in an object called lrmodel, allowing us to view the results with the summary() function later on. The glm() function tells R we are fitting a generalised linear model. Similar to how ggplot() works, we open the function and give it a formula stating which variables we are relating, in this case sex (the outcome) and wing (the predictor). We also have to tell R to take these variables from the mosquitos data set using the argument data = mosquitos. Finally, the argument family = "binomial" tells R that we want a logistic regression model.
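# to show why family = "binomial" gives a logistic regression, the sketch below checks the default link of the binomial family and converts between probability and log-odds. The names p, log_odds and back are only for illustration and are not part of the analysis.

```r
# The binomial family uses the logit link by default
binomial()$link

# The logit link turns a probability into log-odds, and plogis() turns it back
p <- 0.5
log_odds <- qlogis(p)     # log(p / (1 - p))
back <- plogis(log_odds)  # 1 / (1 + exp(-log_odds))

c(p = p, log_odds = log_odds, back = back)
```

# this is why the coefficients from the model are on the log-odds scale rather than the probability scale.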
summary(lrmodel)
Call:
glm(formula = sex ~ wing, family = "binomial", data = mosquitos)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.72276    1.06805  -1.613    0.107
wing         0.03531    0.02149   1.643    0.100
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.63 on 99 degrees of freedom
Residual deviance: 135.84 on 98 degrees of freedom
AIC: 139.84
Number of Fisher Scoring iterations: 4
# to see the results, run the summary() function, which prints them in the console. The most important values here are the estimated coefficients and the p-values, shown in the Pr(>|z|) column. The estimate for wing is on the log-odds scale: each 1 mm increase in wingspan increases the log-odds of the individual being male by 0.0353. It refers to males because m is the second level of the sex factor, and the positive sign agrees with the boxplot, where males on average have a slightly larger wingspan. However, with p = 0.100 this effect is not statistically significant at the conventional 0.05 level.
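# the coefficient can be easier to read as an odds ratio or as predicted probabilities. The sketch below is an illustration added to the write-up, not part of the original analysis. It uses the estimates copied from the summary output rather than the fitted lrmodel object, so it runs on its own. The wingspan values in wing_mm are hypothetical and chosen only for illustration, and the Wald interval is an approximation.

```r
# Coefficients and standard error copied from the summary(lrmodel) output
intercept <- -1.72276
slope     <-  0.03531
slope_se  <-  0.02149

# Odds ratio for a 1 mm increase in wingspan
odds_ratio <- exp(slope)

# Approximate 95% Wald interval for the slope (log-odds scale) and for the odds ratio
slope_ci <- slope + c(-1, 1) * qnorm(0.975) * slope_se
or_ci    <- exp(slope_ci)

# Predicted probability of being male at some hypothetical wingspans (mm)
wing_mm <- c(40, 50, 60)
p_male  <- plogis(intercept + slope * wing_mm)

round(odds_ratio, 3)
round(or_ci, 3)
round(data.frame(wing_mm, p_male), 3)
```

# the interval for the slope includes zero (equivalently, the interval for the odds ratio includes 1), which matches the non-significant p-value in the summary output.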