Introduction

  • This is the midterm project presentation for DAT 301, Exploring Data in R and Python. In this presentation, I will first describe the data set I chose to use. I will then present the data in four graphs created in R, each highlighting a different pattern in the data. Lastly, I will show a statistical analysis of the data and discuss the results. Thank you!

Data

  • I chose to use the penguins data set from the palmerpenguins library.
  • This data set records each penguin’s species, island, bill length, bill depth, flipper length, body mass, and sex, along with the year of observation.
  • The data set contains 344 penguins.
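
A minimal sketch of how the data set can be inspected before any cleaning, assuming the palmerpenguins package is installed. The variable name n_penguins is only illustrative.

```r
library(palmerpenguins)

# Number of rows (one per penguin) and columns (one per variable)
dim(penguins)

# Variable names and types
str(penguins)

# Store the row count for later comparison
n_penguins <- nrow(penguins)
n_penguins
```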

Data Cleaning Code

library(palmerpenguins)
library(dplyr)
# Removes rows with any NA values
penguins_df <- penguins %>%
  filter(complete.cases(.))
  • To make the data usable for visualization and analysis, it was necessary to remove the rows containing NA values. This leaves 333 of the 344 penguins.
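
To see where the missing values are and how many rows the cleaning step drops, something like the following can be run. This is a sketch that repeats the same filter; the names na_per_column and rows_removed are illustrative.

```r
library(palmerpenguins)
library(dplyr)

# Count missing values in each column of the raw data
na_per_column <- colSums(is.na(penguins))
na_per_column

# Same cleaning step: keep only rows with no NA in any column
penguins_df <- penguins %>%
  filter(complete.cases(.))

# Number of rows dropped by the cleaning step
rows_removed <- nrow(penguins) - nrow(penguins_df)
rows_removed
```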

Plot 1: Boxplot of Body Mass by Island and Species

Plot 1: Discussion

  • This plot is useful because it shows the distribution of penguin body mass on each island.
  • It also shows which species of penguins live on each island and how their body masses differ from one another.
  • This plot shows that Gentoo penguins are by far the heaviest.
  • This plot also shows that Biscoe and Dream islands each have two penguin species, whereas Torgersen has only one.
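
The observations in this discussion can also be checked numerically. The sketch below tabulates species by island and summarizes body mass per species. It assumes the same cleaning step as before, and the summary column names are illustrative.

```r
library(palmerpenguins)
library(dplyr)

penguins_df <- penguins %>%
  filter(complete.cases(.))

# Count of penguins of each species on each island
species_by_island <- table(penguins_df$island, penguins_df$species)
species_by_island

# Median body mass and sample size per species
mass_by_species <- penguins_df %>%
  group_by(species) %>%
  summarise(median_mass_g = median(body_mass_g), n = n())
mass_by_species
```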

Plot 1: Code

library(ggplot2)
# Sets the x, y, and fill
ggplot(penguins_df, aes(x = island, y = body_mass_g, fill = species)) +
  # Creates the box plot
  geom_boxplot() +
  # Sets the labels and legend
  labs(title = "Body Mass by Island and Species", 
       x = "Island", 
       y = "Body Mass (g)", 
       fill = "Species") +
  # Adds a minimal theme
  theme_minimal()

Plot 2: Pie Chart for the Proportion of Penguin Species

Plot 2: Discussion

  • This plot is useful because it shows each species' share of the total number of penguins.
  • This plot shows that Adelie penguins are the most common species, followed closely by Gentoo penguins, with Chinstrap penguins by far the least common.
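
One way such a pie chart could be built is sketched below. It uses ggplot2 with polar coordinates, which is an assumption and not necessarily how the chart in this presentation was made. The names species_share and pie_plot are illustrative.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

penguins_df <- penguins %>%
  filter(complete.cases(.))

# Count penguins per species and convert the counts to proportions
species_share <- penguins_df %>%
  count(species) %>%
  mutate(share = n / sum(n))

# A stacked bar wrapped into polar coordinates gives a pie chart
pie_plot <- ggplot(species_share, aes(x = "", y = share, fill = species)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Proportion of Penguin Species", fill = "Species") +
  theme_void()

pie_plot
```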

Plot 3: Bar Plot of Average Bill Depth by Species and Sex

Plot 3: Discussion

  • This plot is useful because it shows the average bill depth for each species.
  • It also has a bar for each sex, showing how bill depth differs between males and females.
  • This plot shows that Adelie and Chinstrap penguins have similar bill depths, whereas Gentoo penguins have shallower bills.
  • This plot also shows that male bill depths tend to be larger than female bill depths in all three species.
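
A plot of this kind could be produced along the following lines. This is a sketch that assumes a grouped bar chart of mean bill depth, based on the bars described in the discussion. The names mean_depth and depth_plot are illustrative.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

penguins_df <- penguins %>%
  filter(complete.cases(.))

# Mean bill depth for every species and sex combination
mean_depth <- penguins_df %>%
  group_by(species, sex) %>%
  summarise(mean_bill_depth_mm = mean(bill_depth_mm), .groups = "drop")

# Side-by-side bars: one bar per sex within each species
depth_plot <- ggplot(mean_depth,
                     aes(x = species, y = mean_bill_depth_mm, fill = sex)) +
  geom_col(position = "dodge") +
  labs(title = "Average Bill Depth by Species and Sex",
       x = "Species",
       y = "Average Bill Depth (mm)",
       fill = "Sex") +
  theme_minimal()

depth_plot
```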

Plot 4: 3D Scatter Plot of Bill Length vs. Bill Depth vs. Flipper Length

Plot 4: Discussion

  • This plot is useful because it places each individual penguin in space according to its bill length, bill depth, and flipper length.
  • It also indicates each penguin's species through the color of its point on the scatter plot.
  • This plot shows that Gentoo penguins tend to have the longest flippers among the species.
  • This plot also shows that Adelie and Chinstrap penguins tend to have similar flipper lengths and bill depths, but differ in bill lengths.
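
An interactive 3D scatter plot like this one could be drawn with the plotly package, as sketched below. The choice of plotly is an assumption, since the presentation does not say which library was used, and the name fig is illustrative.

```r
library(palmerpenguins)
library(dplyr)
library(plotly)

penguins_df <- penguins %>%
  filter(complete.cases(.))

# One marker per penguin, colored by species
fig <- plot_ly(
  penguins_df,
  x = ~bill_length_mm,
  y = ~bill_depth_mm,
  z = ~flipper_length_mm,
  color = ~species,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 3)
) %>%
  layout(
    title = "Bill Length vs. Bill Depth vs. Flipper Length",
    scene = list(
      xaxis = list(title = "Bill Length (mm)"),
      yaxis = list(title = "Bill Depth (mm)"),
      zaxis = list(title = "Flipper Length (mm)")
    )
  )

fig
```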

Statistical Analysis

  • This is the result of a one-way ANOVA test of whether mean body mass differs among the penguin species:
##              Df    Sum Sq  Mean Sq F value Pr(>F)    
## species       2 145190219 72595110   341.9 <2e-16 ***
## Residuals   330  70069447   212332                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
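  • The results of the ANOVA test show that mean body mass differs significantly among the species: the p-value (less than 2e-16) is far below any conventional significance level, so at least one species' mean body mass differs from the others.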
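
A table in this format is what summary() prints for a one-way aov() model, so the test can be sketched as follows. The TukeyHSD() call at the end is not part of the original analysis. It is shown only as one possible follow-up for seeing which pairs of species differ. The names body_mass_aov and tukey are illustrative.

```r
library(palmerpenguins)
library(dplyr)

penguins_df <- penguins %>%
  filter(complete.cases(.))

# One-way ANOVA: body mass explained by species
body_mass_aov <- aov(body_mass_g ~ species, data = penguins_df)
summary(body_mass_aov)

# Optional follow-up (an addition, not in the original analysis):
# pairwise comparisons of species means with Tukey's HSD
tukey <- TukeyHSD(body_mass_aov)
tukey
```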

Conclusion

  • In conclusion, the penguins data set was explored through four graphs with accompanying discussions, along with a statistical analysis.
  • Thank you!