The presentation approach is up to you but it should contain the following:
Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
BONUS – place the original .csv in a GitHub repository and have R read it from the link. This will be a very useful skill as you progress in your data science education and career.
library(ggplot2)
library(ggthemes)
library(dplyr)
URL <- "https://raw.githubusercontent.com/okhaimova/CerealR/master/penguins.csv"
penguins <- na.omit(read.csv(URL)) #omitting the NA values
str(penguins) #identifies the variables
## 'data.frame': 333 obs. of 8 variables:
## $ X : int 1 2 3 5 6 7 8 13 14 15 ...
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: int 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : chr "male" "female" "female" "female" ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
## ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
#removing the X (row index) and island variables
penguins <- subset(penguins, select = -c(X,island))
#converting some columns from character to factor
penguins$species <- as.factor(penguins$species)
penguins$sex <- as.factor(penguins$sex)
#replace male/female values with M/F
levels(penguins$sex) <- c("F","M")
#adding a column for length to depth ratio of bills
penguins$billRatio <- penguins$bill_length_mm / penguins$bill_depth_mm
#reordering the data
penguins <- penguins[c(1,6,2:3,7,4:5)]
head(penguins)
## species sex bill_length_mm bill_depth_mm billRatio flipper_length_mm
## 1 Adelie M 39.1 18.7 2.090909 181
## 2 Adelie F 39.5 17.4 2.270115 186
## 3 Adelie F 40.3 18.0 2.238889 195
## 5 Adelie F 36.7 19.3 1.901554 193
## 6 Adelie M 39.3 20.6 1.907767 190
## 7 Adelie F 38.9 17.8 2.185393 181
## body_mass_g
## 1 3750
## 2 3800
## 3 3250
## 5 3450
## 6 3650
## 7 3625
#Summary statistics for the data
summary(penguins)
##       species    sex     bill_length_mm  bill_depth_mm     billRatio
##  Adelie   :146   F:165   Min.   :32.10   Min.   :13.10   Min.   :1.640
##  Chinstrap: 68   M:168   1st Qu.:39.50   1st Qu.:15.60   1st Qu.:2.163
##  Gentoo   :119           Median :44.50   Median :17.30   Median :2.577
##                          Mean   :43.99   Mean   :17.16   Mean   :2.607
##                          3rd Qu.:48.60   3rd Qu.:18.70   3rd Qu.:3.097
##                          Max.   :59.60   Max.   :21.50   Max.   :3.613
##  flipper_length_mm  body_mass_g
##  Min.   :172        Min.   :2700
##  1st Qu.:190        1st Qu.:3550
##  Median :197        Median :4050
##  Mean   :201        Mean   :4207
##  3rd Qu.:213        3rd Qu.:4775
##  Max.   :231        Max.   :6300
#finding the standard deviation for the numeric columns
penguins %>% summarize_if(is.numeric, sd)
## bill_length_mm bill_depth_mm billRatio flipper_length_mm body_mass_g
## 1 5.468668 1.969235 0.495436 14.01577 805.2158
#calculating the average for numeric columns, grouping by sex
aggregate(cbind(bill_length_mm, bill_depth_mm, billRatio, flipper_length_mm, body_mass_g) ~ sex, penguins, mean)
## sex bill_length_mm bill_depth_mm billRatio flipper_length_mm body_mass_g
## 1 F 42.09697 16.42545 2.609508 197.3636 3862.273
## 2 M 45.85476 17.89107 2.604988 204.5060 4545.685
#calculating the average for numeric columns, grouping by species
aggregate(cbind(bill_length_mm, bill_depth_mm, billRatio, flipper_length_mm, body_mass_g) ~ species, penguins, mean)
## species bill_length_mm bill_depth_mm billRatio flipper_length_mm
## 1 Adelie 38.82397 18.34726 2.121478 190.1027
## 2 Chinstrap 48.83382 18.42059 2.653756 195.8235
## 3 Gentoo 47.56807 14.99664 3.176602 217.2353
## body_mass_g
## 1 3706.164
## 2 3733.088
## 3 5092.437
#calculating the average for numeric columns, grouping by sex and species to highlight the differences between males and females among each species
aggregate(cbind(bill_length_mm, bill_depth_mm, billRatio, flipper_length_mm, body_mass_g) ~ sex+species, penguins, mean)
## sex species bill_length_mm bill_depth_mm billRatio flipper_length_mm
## 1 F Adelie 37.25753 17.62192 2.119121 187.7945
## 2 M Adelie 40.39041 19.07260 2.123835 192.4110
## 3 F Chinstrap 46.57353 17.58824 2.651010 191.7353
## 4 M Chinstrap 51.09412 19.25294 2.656501 199.9118
## 5 F Gentoo 45.56379 14.23793 3.202391 212.7069
## 6 M Gentoo 49.47377 15.71803 3.152081 221.5410
## body_mass_g
## 1 3368.836
## 2 4043.493
## 3 3527.206
## 4 3938.971
## 5 4679.741
## 6 5484.836
# Alternatively, the data could be loaded from the palmerpenguins package:
# install.packages("palmerpenguins")
# library(palmerpenguins)
# data(penguins)
After wrangling, there are seven variables. Once species and sex are converted to factors, their levels and frequencies appear in the summary table. The data cover three species found on three islands of the Palmer Archipelago in Antarctica. The other five variables are numeric and describe the bill length, bill depth, bill length-to-depth ratio, flipper length, and body mass of the penguins. I also omitted the 11 rows containing NA values so that the summaries could be computed directly.
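Before dropping rows with na.omit(), it can be useful to see how many rows are affected and which columns the missing values come from. The sketch below is illustrative only: `toy` is a hypothetical four-row data frame, not the penguin data, and the variable names are my own.

```r
# Hypothetical toy data frame standing in for the raw data (not the real values)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 47.5, 48.8),
  sex = c("male", NA, "female", NA),
  stringsAsFactors = FALSE
)

# Number of rows with at least one NA; these are the rows na.omit() would drop
n_incomplete <- sum(!complete.cases(toy))

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Keep only the complete rows
cleaned <- na.omit(toy)

print(n_incomplete)
print(na_per_column)
print(nrow(cleaned))
```

Applied to the raw data, the same two calls would be run on the result of read.csv(URL) before na.omit() is applied.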
Based on the summary, there is variation among the penguins. Bill length ranges from 32.10 mm to 59.60 mm, with a mean of 43.99 mm and a standard deviation of 5.4687 mm. Likewise, flipper length ranges from 172 mm to 231 mm, with a mean of 201 mm (median 197 mm) and a standard deviation of 14.0158 mm. Body mass ranges from 2700 g to 6300 g, with a mean of 4207 g and a standard deviation of 805.2158 g.
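Because summary() reports both a median and a mean, it helps to pull the two out side by side along with the range and standard deviation. The following is a minimal sketch, not part of the original analysis: `describe` is a hypothetical helper name, and the input is only the first ten flipper lengths visible in the str() output, so the numbers will not match the full-data summary.

```r
# Hypothetical helper: the statistics quoted in the text, in one named vector
describe <- function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x), sd = sd(x))
}

# First ten flipper lengths (mm) as printed by str(penguins); a small sample only
flipper_demo <- c(181, 186, 195, 193, 190, 181, 195, 182, 191, 198)

stats <- describe(flipper_demo)
print(round(stats, 2))
```

With the full data frame, something like sapply(penguins[sapply(penguins, is.numeric)], describe) should give the same statistics for every numeric column at once.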
Furthermore, females have smaller bills in both length and depth, along with shorter flippers. They also tend to weigh less. However, the two sexes have similar bill length-to-depth ratios.
Different species also have different characteristics. For example, Adelie penguins have the shortest bills on average (38.82 mm), while the other two species differ much less from each other (48.83 mm for Chinstrap and 47.57 mm for Gentoo). Gentoo penguins have the shallowest bills on average, whereas the other two species have similar bill depths. Gentoo penguins also have the longest flippers and weigh the most. Their bill length-to-depth ratio is the greatest, while Adelie penguins have the smallest.
Within each species, there are also differences between male and female penguins. For each species, females tend to have shorter and shallower bills, shorter flippers, and lower body mass. The ratio of bill length to bill depth is similar between the sexes within each species, with Gentoo showing a slightly larger difference (3.20 for females versus 3.15 for males).
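The size of the male-female gap can be made explicit by subtracting the group means. The sketch below is an illustration under my own naming (`means`, `gap`, `diff_g`, `pct_of_female` are all hypothetical); the body mass figures are copied from the sex-by-species aggregate() output above rather than recomputed from the raw data.

```r
# Mean body mass (g) by sex and species, copied from the aggregate() output
means <- data.frame(
  species = c("Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"),
  sex = c("F", "M", "F", "M", "F", "M"),
  body_mass_g = c(3368.836, 4043.493, 3527.206, 3938.971, 4679.741, 5484.836)
)

females <- means[means$sex == "F", ]
males <- means[means$sex == "M", ]

# Guard: both subsets must list the species in the same order
stopifnot(identical(females$species, males$species))

# Absolute gap in grams and the gap relative to the female mean
gap <- data.frame(
  species = females$species,
  diff_g = males$body_mass_g - females$body_mass_g,
  pct_of_female = 100 * (males$body_mass_g - females$body_mass_g) / females$body_mass_g
)
print(gap)
```

On these figures the absolute gap is largest for Gentoo, while the gap relative to female body mass is largest for Adelie, so which species shows the "most noticeable" difference depends on the measure chosen.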
#species is categorical, so a bar chart of counts (geom_bar) is used here
ggplot(penguins, aes(x=species, fill=species)) +
  geom_bar() +
  ggtitle("Species Count") +
  ylab("Frequency")
There are more observations for Adelie (146) and Gentoo (119) penguins than for Chinstrap penguins (68).
ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) +
geom_point(aes(color=species)) + facet_wrap(~sex) +
ggtitle("Flipper Length vs Bill Length")
The scatterplot supports the finding that Adelie penguins have shorter bills and flippers than the other species. Gentoo penguins have longer flippers, but their bills appear to be about the same length as those of Chinstrap penguins. The plot also shows that flipper length and bill length are shorter for female penguins than for their male counterparts: the female points look shifted down and to the left.
ggplot(data=penguins, aes(y=body_mass_g, x=species)) +
geom_boxplot(aes(color=species)) + facet_wrap(~sex) +
ggtitle("Body Mass Distribution") +
stat_summary(fun=mean, geom="point", shape=3, size=2)
The boxplot is faceted by sex and shows each species separately, which highlights how much body mass differs between males and females. Female penguins weigh less than male penguins, and the difference is most noticeable for Gentoo penguins. Male Gentoo and Adelie penguins also seem to vary more in body mass than the females. On average, Adelie and Chinstrap penguins are similar in body mass.
ggplot(penguins, aes(x=bill_depth_mm, fill=species, color=species)) +
geom_histogram(position="identity", alpha=0.4) +
ggtitle("Bill Depth Distribution") +
facet_grid(species~sex)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms show the distribution of bill depth for each species and sex. Counts are lower for Chinstrap penguins because there are fewer observations for that species. Chinstrap and Adelie penguins have similar bill depths, while Gentoo penguins have the shallowest bills of the three species. The distributions do not appear to follow a normal distribution. The charts do show that females have shallower bills than their male counterparts, by about 1.5 mm on average.
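Judging normality from a histogram is subjective, so a formal check per group can complement the plot. The sketch below uses shapiro.test() from base R's stats package; `check_normality` is a hypothetical helper name, and `demo` is simulated stand-in data, not the penguin measurements, so its results say nothing about the actual bill depths.

```r
# Hypothetical helper: Shapiro-Wilk test returned as a small named vector
check_normality <- function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p_value = res$p.value)
}

# Simulated stand-in data (NOT the penguin measurements): two groups of depths
set.seed(42)
demo <- data.frame(
  group = rep(c("A", "B"), each = 50),
  depth_mm = c(rnorm(50, mean = 17.5, sd = 0.9), rnorm(50, mean = 19, sd = 1))
)

# One test per group; a small p-value suggests a departure from normality
results <- sapply(split(demo$depth_mm, demo$group), check_normality)
print(results)
```

For the report's data, splitting by interaction(penguins$species, penguins$sex) instead of `group` should give one test per histogram panel. With samples this small, a non-significant result does not prove normality.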
Overall, certain characteristics help differentiate the species and sexes of penguins. Females tend to have shorter and shallower bills, shorter flippers, and lower body mass. Gentoo penguins stand out the most because they weigh more than the other penguins, have shallower bills, and have longer flippers. Adelie penguins have the shortest bills and flippers. Their other measurements can be similar to those of Chinstrap penguins, which can make it harder to distinguish between the two.
It should also be noted that, although bill depth, bill length, flipper length, and body mass distinguish the species and sexes, I did not take into account possible variation between the three islands.