When studying the penguins on the Palmer islands, how do penguin body mass, penguin species, and penguin sex all interact with each other? We will try to explore these questions by looking at an appropriate data set.
We do some exploratory data analysis to determine more about this data set.
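The commands in this report assume that the penguins data and the tidyverse functions are already available in the R session. The report does not show its setup step, so the following is only a sketch of what it might look like, assuming the data come from the palmerpenguins package (the column names match that package) and that the tidyverse is installed:

```r
# Assumed setup (not shown in the original report).
# install.packages(c("palmerpenguins", "tidyverse"))  # run once if needed
library(palmerpenguins)  # provides the `penguins` data set
library(tidyverse)       # provides %>%, drop_na(), and ggplot()

# A compact overview of every column and its type
glimpse(penguins)
```

`glimpse()` is an optional extra here; it shows all eight columns at once, which is handy because `head()` truncates wide tibbles.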
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
dim(penguins)
## [1] 344 8
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
As we can see, there are eight variables: species (categorical), island (categorical), sex (categorical), body mass (quantitative), flipper length (quantitative), bill depth (quantitative), bill length (quantitative), and year (the study year, stored as an integer).
We might be interested in knowing what the “categories” in our categorical variables might be. We enter the following R code to check this:
levels(as.factor(penguins$species))
## [1] "Adelie" "Chinstrap" "Gentoo"
levels(as.factor(penguins$island))
## [1] "Biscoe" "Dream" "Torgersen"
levels(as.factor(penguins$sex))
## [1] "female" "male"
We see that each categorical variable has the following categories: species (Adelie, Chinstrap, Gentoo), island (Torgersen, Dream, Biscoe), and sex (male and female). We did see that this data set had some NA values in it. We will “clean” the data by removing the rows that contain NAs, and naming this new data set “penguins2”. We will only work with penguins2 for the rest of this report.
colSums(is.na(penguins))
## species island bill_length_mm bill_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex year
## 2 2 11 0
penguins2 <- penguins %>% drop_na()
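It can be worth confirming that the cleaning step did what we expect. The check below is a minimal sketch, again assuming the palmerpenguins and tidyr packages; it counts any remaining missing values and how many rows were dropped:

```r
library(palmerpenguins)
library(tidyr)

# Same cleaning step as above, written without the pipe
penguins2 <- drop_na(penguins)

# Total number of missing values left in the cleaned data
sum(is.na(penguins2))

# Number of rows removed by drop_na()
nrow(penguins) - nrow(penguins2)
```

Note that `drop_na()` removes an entire row if any of its columns is missing, so a penguin with only `sex` missing is dropped even though its body mass was recorded.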
We are interested in checking the independence of the categorical variables species and island. We can do this by making a contingency table and by checking appropriate graphs. We include a few of those below:
table(penguins2$species, penguins2$island)
##
## Biscoe Dream Torgersen
## Adelie 44 55 47
## Chinstrap 0 68 0
## Gentoo 119 0 0
addmargins(table(penguins2$species, penguins2$island))
##
## Biscoe Dream Torgersen Sum
## Adelie 44 55 47 146
## Chinstrap 0 68 0 68
## Gentoo 119 0 0 119
## Sum 163 123 47 333
In the tables above, we see that the distribution of species changes from island to island. Specifically, while Gentoo and Chinstrap penguins are only found on Biscoe and Dream respectively, the Adelie penguins are distributed relatively evenly across all three islands. This difference in species distribution across the islands suggests that these two variables are NOT independent.
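One way to make this comparison more concrete is to convert the counts into proportions within each island, and to compare the observed counts with the counts we would expect if species and island were independent. The sketch below rebuilds the contingency table from the counts printed above, so it runs without any packages; the object names (`counts`, `col_props`, `expected`) are our own choices for illustration.

```r
# Contingency table of species by island, typed in from the counts above
counts <- matrix(
  c( 44, 55, 47,
      0, 68,  0,
    119,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Proportion of each species within each island (each column sums to 1)
col_props <- prop.table(counts, margin = 2)
round(col_props, 2)

# Counts expected under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)
```

If the two variables were independent, the columns of `col_props` would look roughly alike and `counts` would be close to `expected`. The zero cells for Chinstrap and Gentoo show that this is far from the case.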
We might also be interested in comparing body mass (a quantitative variable) across categories of species and of sex. We can do this by calculating summary statistics (mean, standard deviation, and so on) for each species and/or sex category. We include a few of those below:
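As a sketch of how such summary statistics could be computed, the code below groups the cleaned data by species and sex and summarises body mass for each group. It assumes the palmerpenguins, dplyr, and tidyr packages; the column names in the summary (`mean_mass_g`, `sd_mass_g`, and so on) are our own labels, not part of the data set.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

penguins2 <- penguins %>% drop_na()

# One row per species/sex combination, with basic summaries of body mass
mass_summary <- penguins2 %>%
  group_by(species, sex) %>%
  summarise(
    n             = n(),                  # number of penguins in the group
    mean_mass_g   = mean(body_mass_g),    # mean body mass in grams
    sd_mass_g     = sd(body_mass_g),      # standard deviation in grams
    median_mass_g = median(body_mass_g),  # median body mass in grams
    .groups = "drop"
  )

mass_summary
```

Grouping by both variables at once gives six rows (three species times two sexes), which makes it easy to compare males and females within a species as well as species within a sex.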
ggplot(data = penguins2, aes(x = body_mass_g, fill = species)) +
geom_histogram(binwidth = 200) +
facet_wrap(~sex)
We can draw the following conclusion: Gentoo penguins tend to be heavier than Adelie and Chinstrap penguins for both males and females, with males tending to be heavier than females.
To further explore the difference in body mass across categories of sex and/or species, we can create displays of the data! We do so here, creating appropriate histograms and boxplots:
ggplot(data = penguins2, aes(y = interaction(species, sex), x = body_mass_g, fill = sex)) +
geom_boxplot()
By looking at the histograms and the boxplots, we see the following: Gentoo males are the heaviest group, with Gentoo females being the second heaviest.
This data set allows us to explore the interplay between species and island, and between species, sex, and body mass, of the Palmer Penguins. By using a contingency table and some bar charts, we see that there is a relationship between species and island (with different islands having vastly different distributions of penguin species). By comparing summary statistics and looking at histograms and boxplots, we see that there is a relationship between body mass and species / sex (with males typically having more body mass than females for each species, and Gentoo penguins typically having more body mass than the other two species of penguin). There are many other comparisons we could make in this data set, and we may return to it in the future!