The presentation approach is up to you but it should contain the following:

  1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

  2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together).

  3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

  4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

  5. BONUS – place the original .csv in a GitHub repository and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Are there certain characteristics that help differentiate the different species and sexes of penguins?

library(ggplot2)
library(ggthemes)
library(dplyr)
URL <- "https://raw.githubusercontent.com/okhaimova/CerealR/master/penguins.csv"
penguins <- na.omit(read.csv(URL)) #omitting rows that contain NA values

str(penguins) #identifies the variables
## 'data.frame':    333 obs. of  8 variables:
##  $ X                : int  1 2 3 5 6 7 8 13 14 15 ...
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : chr  "male" "female" "female" "female" ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
#removing the X (row index) and island variables
penguins <- subset(penguins, select = -c(X,island))

#converting some columns from character to factor
penguins$species <- as.factor(penguins$species)
penguins$sex <- as.factor(penguins$sex)

#replace male/female values with M/F (explicit mapping, so it does not depend on level order)
levels(penguins$sex) <- list(F = "female", M = "male")

#adding a column for length to depth ratio of bills
penguins$billRatio <- penguins$bill_length_mm / penguins$bill_depth_mm

#reordering the data
penguins <- penguins[c(1,6,2:3,7,4:5)]

head(penguins)
##   species sex bill_length_mm bill_depth_mm billRatio flipper_length_mm
## 1  Adelie   M           39.1          18.7  2.090909               181
## 2  Adelie   F           39.5          17.4  2.270115               186
## 3  Adelie   F           40.3          18.0  2.238889               195
## 5  Adelie   F           36.7          19.3  1.901554               193
## 6  Adelie   M           39.3          20.6  1.907767               190
## 7  Adelie   F           38.9          17.8  2.185393               181
##   body_mass_g
## 1        3750
## 2        3800
## 3        3250
## 5        3450
## 6        3650
## 7        3625
#Summary statistics for the data
summary(penguins)
##       species    sex     bill_length_mm  bill_depth_mm     billRatio    
##  Adelie   :146   F:165   Min.   :32.10   Min.   :13.10   Min.   :1.640  
##  Chinstrap: 68   M:168   1st Qu.:39.50   1st Qu.:15.60   1st Qu.:2.163  
##  Gentoo   :119           Median :44.50   Median :17.30   Median :2.577  
##                          Mean   :43.99   Mean   :17.16   Mean   :2.607  
##                          3rd Qu.:48.60   3rd Qu.:18.70   3rd Qu.:3.097  
##                          Max.   :59.60   Max.   :21.50   Max.   :3.613  
##  flipper_length_mm  body_mass_g  
##  Min.   :172       Min.   :2700  
##  1st Qu.:190       1st Qu.:3550  
##  Median :197       Median :4050  
##  Mean   :201       Mean   :4207  
##  3rd Qu.:213       3rd Qu.:4775  
##  Max.   :231       Max.   :6300
#finding the standard deviation for the numeric columns
penguins %>% summarize_if(is.numeric, sd) 
##   bill_length_mm bill_depth_mm billRatio flipper_length_mm body_mass_g
## 1       5.468668      1.969235  0.495436          14.01577    805.2158
#calculating the average for numeric columns, grouping by sex
aggregate(cbind(bill_length_mm, bill_depth_mm, billRatio, flipper_length_mm, body_mass_g) ~ sex, penguins, mean)
##   sex bill_length_mm bill_depth_mm billRatio flipper_length_mm body_mass_g
## 1   F       42.09697      16.42545  2.609508          197.3636    3862.273
## 2   M       45.85476      17.89107  2.604988          204.5060    4545.685
#calculating the average for numeric columns, grouping by species
aggregate(cbind(bill_length_mm, bill_depth_mm, billRatio, flipper_length_mm, body_mass_g) ~ species, penguins, mean)
##     species bill_length_mm bill_depth_mm billRatio flipper_length_mm
## 1    Adelie       38.82397      18.34726  2.121478          190.1027
## 2 Chinstrap       48.83382      18.42059  2.653756          195.8235
## 3    Gentoo       47.56807      14.99664  3.176602          217.2353
##   body_mass_g
## 1    3706.164
## 2    3733.088
## 3    5092.437
#calculating the average for numeric columns, grouping by sex and species to highlight the differences between males and females among each species
aggregate(cbind(bill_length_mm, bill_depth_mm, billRatio, flipper_length_mm, body_mass_g) ~ sex+species, penguins, mean)
##   sex   species bill_length_mm bill_depth_mm billRatio flipper_length_mm
## 1   F    Adelie       37.25753      17.62192  2.119121          187.7945
## 2   M    Adelie       40.39041      19.07260  2.123835          192.4110
## 3   F Chinstrap       46.57353      17.58824  2.651010          191.7353
## 4   M Chinstrap       51.09412      19.25294  2.656501          199.9118
## 5   F    Gentoo       45.56379      14.23793  3.202391          212.7069
## 6   M    Gentoo       49.47377      15.71803  3.152081          221.5410
##   body_mass_g
## 1    3368.836
## 2    4043.493
## 3    3527.206
## 4    3938.971
## 5    4679.741
## 6    5484.836
# Alternatively, the data could be loaded from the palmerpenguins package:
  # install.packages("palmerpenguins")
  # library(palmerpenguins)
  # data(penguins)

Summary Analysis

There are seven variables after wrangling. Once the species and sex variables are converted to factors, their levels and frequencies are easy to see in the summary table. There are three species, found on three islands in the Palmer Archipelago in Antarctica. The other five variables are numeric and describe the bill length, bill depth, bill length to depth ratio, flipper length, and body mass of the penguins. I also omitted the 11 rows containing NA values so the data could be summarized more effectively.

Based on the summary, there is considerable variation among the penguins. Bill lengths range from 32.10 mm to 59.60 mm, with a mean of 43.99 mm and a standard deviation of 5.4687 mm. Likewise, flipper lengths range from 172 mm to 231 mm, with a mean of 201 mm (median 197 mm) and a standard deviation of 14.0158 mm. Body mass ranges from 2700 g to 6300 g, with a mean of 4207 g and a standard deviation of 805.2158 g.
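Because the variables are measured in different units, their standard deviations are hard to compare directly. One possible way to put them on a common scale is the coefficient of variation (standard deviation divided by mean). The sketch below is an illustration rather than part of the original analysis; the names `means`, `sds`, and `cv_pct` are made up for this example, and the numbers are copied from the `summary()` and standard deviation output above.

```r
# Means and standard deviations copied from the summary() and sd output above
means <- c(bill_length_mm = 43.99, bill_depth_mm = 17.16, billRatio = 2.607,
           flipper_length_mm = 201, body_mass_g = 4207)
sds <- c(bill_length_mm = 5.468668, bill_depth_mm = 1.969235, billRatio = 0.495436,
         flipper_length_mm = 14.01577, body_mass_g = 805.2158)

# Coefficient of variation: standard deviation as a percentage of the mean
cv_pct <- 100 * sds / means
print(round(sort(cv_pct), 1))
```

On these figures, flipper length varies the least relative to its mean (about 7%), while body mass and the bill ratio vary the most (about 19% each).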

Furthermore, females have smaller bills in both length and depth, along with shorter flippers. They also tend to weigh less. However, the two sexes have similar bill length to depth ratios.

Different species also have different characteristics. For example, Adelie penguins have the shortest bills on average, while the other two species differ less from each other in bill length. Gentoo penguins have the shallowest bills on average, while the other two species have similar bill depths. Gentoo penguins also have the longest flippers and weigh the most. The bill length to depth ratio is greatest for Gentoo penguins and smallest for Adelie penguins.

When the penguins are further broken down by species, differences between the male and female penguins remain. For each species, the female penguins tend to have shorter and shallower bills, shorter flippers, and lower body mass. The ratios of bill length to bill depth are similar between the sexes within each species, with Gentoo showing a slightly more noticeable difference (3.20 for females versus 3.15 for males).
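The size of the male/female gap within each species can be made explicit with a little arithmetic on the group means. The following is a minimal sketch, not part of the original analysis: the data frame `group_means` is typed in by hand from the sex + species aggregate output above, and the names `females`, `males`, and `sex_gap` are hypothetical.

```r
# Group means copied from the sex + species aggregate output above
group_means <- data.frame(
  sex = c("F", "M", "F", "M", "F", "M"),
  species = c("Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"),
  body_mass_g = c(3368.836, 4043.493, 3527.206, 3938.971, 4679.741, 5484.836),
  billRatio = c(2.119121, 2.123835, 2.651010, 2.656501, 3.202391, 3.152081)
)

females <- group_means[group_means$sex == "F", ]
males <- group_means[group_means$sex == "M", ]

# Rows line up because both subsets keep the same species order
sex_gap <- data.frame(
  species = females$species,
  mass_gap_g = males$body_mass_g - females$body_mass_g,
  mass_gap_pct = 100 * (males$body_mass_g - females$body_mass_g) / females$body_mass_g,
  ratio_gap = males$billRatio - females$billRatio
)
print(sex_gap)
```

With these means, Gentoo has the largest absolute gap in body mass, and it is the only species where the male bill ratio is lower than the female one. The percentage column is included because the ranking by relative gap need not match the ranking by absolute gap.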

# species is categorical, so geom_bar() (which counts rows per category) is used
ggplot(penguins, aes(fill=species)) + 
    geom_bar(aes(x=species)) +
    ggtitle("Species Count") +
    ylab("Frequency")

There is more data available for Adelie (146) and Gentoo (119) penguins than for Chinstrap penguins (68).
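To quantify how uneven the sample is, the counts can be converted to percentages. This is a small illustrative sketch using the species counts reported by `summary()` above; the names `species_counts` and `species_share` are made up for the example.

```r
# Species counts copied from the summary() output above
species_counts <- c(Adelie = 146, Chinstrap = 68, Gentoo = 119)

# Share of each species as a percentage of all rows
species_share <- round(100 * prop.table(species_counts), 1)
print(species_share)
```

Chinstrap penguins make up roughly a fifth of the rows, which is worth keeping in mind when comparing group summaries and histogram counts.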

ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) + 
    geom_point(aes(color=species)) + facet_wrap(~sex) + 
    ggtitle("Flipper Length vs Bill Length")

The scatterplot supports the observation that Adelie penguins have shorter bills and flippers than the other species. Gentoo penguins, on the other hand, have longer flippers, but their bills appear to be similar in length to those of the Chinstrap penguins. The plot also shows that flipper length and bill length are shorter for female penguins than for their male counterparts, as the female points look shifted down and to the left.

ggplot(data=penguins, aes(y=body_mass_g, x=species)) + 
    geom_boxplot(aes(color=species)) + facet_wrap(~sex) + 
    ggtitle("Body Mass Distribution") + 
    stat_summary(fun=mean, geom="point", shape=3, size=2) 

The boxplot is faceted by sex and shows each species separately, which highlights how body mass differs between males and females. Within each species, the female penguins weigh less than the male penguins, and the Gentoo penguins show the most noticeable difference between the sexes. There also seems to be more variation among the male Gentoo and Adelie penguins than among the females. On average, the Adelie and Chinstrap penguins are more similar to each other in body mass.

# binwidth is set explicitly (0.5 mm) instead of relying on the default of 30 bins
ggplot(penguins, aes(x=bill_depth_mm, fill=species, color=species)) + 
    geom_histogram(position="identity", alpha=0.4, binwidth=0.5) + 
    ggtitle("Bill Depth Distribution") +
    facet_grid(species~sex) 

The histograms show the distribution of bill depths for each species and sex. Note that the counts are lower for Chinstrap penguins because there are fewer observations for that species. The Chinstrap and Adelie penguins have similar bill depths, while the Gentoo penguins have the shallowest bills of the three species. The distributions do not appear to follow a normal distribution, although this was not formally tested. The charts also show that females have shallower bills than their male counterparts, by about 1.5 mm on average.
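One way to check the impression about normality more formally would be a Shapiro-Wilk test within each group. The sketch below is an assumption-laden illustration, not something run in the original analysis: `normality_by_group` is a hypothetical helper, and the demo data is synthetic (generated with `rnorm` and `rexp`), not the penguin measurements.

```r
# Hypothetical helper: Shapiro-Wilk p-value for one numeric column within each group
normality_by_group <- function(df, value_col, group_cols) {
  # split() accepts a list of grouping columns and combines them into one factor
  groups <- split(df[[value_col]], df[group_cols], drop = TRUE)
  data.frame(
    group = names(groups),
    n = vapply(groups, length, integer(1)),
    p_value = vapply(groups, function(x) shapiro.test(x)$p.value, numeric(1)),
    row.names = NULL
  )
}

# Synthetic demo data (NOT the penguin data): one roughly normal group, one skewed group
set.seed(1)
demo <- data.frame(
  sex = rep(c("F", "M"), each = 30),
  bill_depth_mm = c(rnorm(30, mean = 16.4, sd = 1), rexp(30) + 17)
)
print(normality_by_group(demo, "bill_depth_mm", "sex"))
```

Assuming the wrangled data frame is available as `penguins`, the same helper could be called as `normality_by_group(penguins, "bill_depth_mm", c("sex", "species"))`. A small p-value would suggest a departure from normality in that group, though with small groups the test has limited power.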

Conclusion

Overall, it can be concluded that certain characteristics help differentiate the species and sexes of penguins. Females tend to have shorter and shallower bills, shorter flippers, and lower body mass. Gentoo penguins are the most distinctive because they weigh more than the other penguins, have shallower bills, and have longer flippers. Adelie penguins have the shortest bills and flippers. Their bill depth and body mass are similar to those of the Chinstrap penguins, which can make the two harder to distinguish on those measurements alone.

It should also be noted that, although there are distinguishing characteristics between species and sexes based on bill depth, bill length, flipper length, and body mass, I did not take into account possible variation between the three islands.