While taking a course on Coursera I came across this dataset, which the course used to give hands-on experience with the R packages tidyverse and ggplot2. The objective of this project is to do exploratory data analysis (EDA) on this data and address the following points:
Create various kinds of plots using ggplot2.
How many species are there, and what are their details?
Which is the largest penguin species?
Which species has the greatest body mass?
While researching the dataset I learned that it was collected and made available to the public by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. The dataset consists of 2 data frames, penguins and penguins_raw; both contain data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.
ROCCC analysis
Reliable
Original
Comprehensive
Current
Cited
The dataset meets all five criteria of the ROCCC analysis.
All the data used in this EDA ships with the palmerpenguins package (not with ggplot2) and becomes available after running library(palmerpenguins).
The packages used in this data analysis are: tidyverse (which includes ggplot2) and palmerpenguins. The clean_names() function used later comes from the janitor package.
They are installed using: install.packages("package_name")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0 v purrr 0.3.5
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.5.0
## v readr 2.1.3 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
Loading the dataset in RStudio
library(palmerpenguins)
str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_l~1 body_~2 sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema~ 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema~ 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## # ... with abbreviated variable names 1: flipper_length_mm, 2: body_mass_g
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
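# clean_names() comes from the janitor package, which is not attached above,
# so it is called with its namespace. The column names are already snake_case,
# so this step leaves them unchanged.
penguins <- janitor::clean_names(penguins)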
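Because the summaries that follow call drop_na(), it helps to see first how many values are missing and how many rows survive. The lines below are a minimal sketch, assuming the tidyverse and palmerpenguins packages are installed; the names na_per_column and complete_rows are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Count the missing values in each column.
na_per_column <- penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_per_column

# Rows that remain once every row with at least one NA is dropped.
complete_rows <- penguins %>% drop_na()
nrow(penguins)       # all rows
nrow(complete_rows)  # rows with no missing values
```

Note that drop_na() with no arguments removes a row if any column is missing, so a penguin with a missing sex is also excluded from the bill and body mass summaries.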
penguins %>%
  group_by(island) %>%
  drop_na() %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            maximum_bill_length = max(bill_length_mm),
            minimum_bill_length = min(bill_length_mm))
## # A tibble: 3 x 4
## island mean_bill_length_mm maximum_bill_length minimum_bill_length
## <fct> <dbl> <dbl> <dbl>
## 1 Biscoe 45.2 59.6 34.5
## 2 Dream 44.2 58 32.1
## 3 Torgersen 39.0 46 33.5
penguins %>%
  group_by(species) %>%
  drop_na() %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            maximum_bill_length = max(bill_length_mm),
            minimum_bill_length = min(bill_length_mm))
## # A tibble: 3 x 4
## species mean_bill_length_mm maximum_bill_length minimum_bill_length
## <fct> <dbl> <dbl> <dbl>
## 1 Adelie 38.8 46 32.1
## 2 Chinstrap 48.8 58 40.9
## 3 Gentoo 47.6 59.6 40.9
penguins %>%
  group_by(island, species) %>%
  drop_na() %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            maximum_bill_length = max(bill_length_mm),
            minimum_bill_length = min(bill_length_mm))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 x 5
## # Groups: island [3]
## island species mean_bill_length_mm maximum_bill_length minimum_bill_le~1
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Biscoe Adelie 39.0 45.6 34.5
## 2 Biscoe Gentoo 47.6 59.6 40.9
## 3 Dream Adelie 38.5 44.1 32.1
## 4 Dream Chinstrap 48.8 58 40.9
## 5 Torgersen Adelie 39.0 46 33.5
## # ... with abbreviated variable name 1: minimum_bill_length
penguins %>%
  group_by(island) %>%
  drop_na() %>%
  summarise(mean_bill_depth_mm = mean(bill_depth_mm),
            maximum_bill_depth = max(bill_depth_mm),
            minimum_bill_depth = min(bill_depth_mm))
## # A tibble: 3 x 4
## island mean_bill_depth_mm maximum_bill_depth minimum_bill_depth
## <fct> <dbl> <dbl> <dbl>
## 1 Biscoe 15.9 21.1 13.1
## 2 Dream 18.3 21.2 15.5
## 3 Torgersen 18.5 21.5 15.9
penguins %>%
  group_by(species) %>%
  drop_na() %>%
  summarise(mean_bill_depth_mm = mean(bill_depth_mm),
            maximum_bill_depth = max(bill_depth_mm),
            minimum_bill_depth = min(bill_depth_mm))
## # A tibble: 3 x 4
## species mean_bill_depth_mm maximum_bill_depth minimum_bill_depth
## <fct> <dbl> <dbl> <dbl>
## 1 Adelie 18.3 21.5 15.5
## 2 Chinstrap 18.4 20.8 16.4
## 3 Gentoo 15.0 17.3 13.1
penguins %>%
  group_by(island, species) %>%
  drop_na() %>%
  summarise(mean_bill_depth_mm = mean(bill_depth_mm),
            maximum_bill_depth = max(bill_depth_mm),
            minimum_bill_depth = min(bill_depth_mm))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 x 5
## # Groups: island [3]
## island species mean_bill_depth_mm maximum_bill_depth minimum_bill_depth
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Biscoe Adelie 18.4 21.1 16
## 2 Biscoe Gentoo 15.0 17.3 13.1
## 3 Dream Adelie 18.2 21.2 15.5
## 4 Dream Chinstrap 18.4 20.8 16.4
## 5 Torgersen Adelie 18.5 21.5 15.9
penguins %>%
  group_by(island) %>%
  drop_na() %>%
  summarise(mean_body_mass = mean(body_mass_g),
            maximum_body_mass = max(body_mass_g),
            minimum_body_mass = min(body_mass_g))
## # A tibble: 3 x 4
## island mean_body_mass maximum_body_mass minimum_body_mass
## <fct> <dbl> <int> <int>
## 1 Biscoe 4719. 6300 2850
## 2 Dream 3719. 4800 2700
## 3 Torgersen 3709. 4700 2900
penguins %>%
  group_by(species) %>%
  drop_na() %>%
  summarise(mean_body_mass = mean(body_mass_g),
            maximum_body_mass = max(body_mass_g),
            minimum_body_mass = min(body_mass_g))
## # A tibble: 3 x 4
## species mean_body_mass maximum_body_mass minimum_body_mass
## <fct> <dbl> <int> <int>
## 1 Adelie 3706. 4775 2850
## 2 Chinstrap 3733. 4800 2700
## 3 Gentoo 5092. 6300 3950
penguins %>%
  group_by(island, species) %>%
  drop_na() %>%
  summarise(mean_body_mass = mean(body_mass_g),
            maximum_body_mass = max(body_mass_g),
            minimum_body_mass = min(body_mass_g))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 x 5
## # Groups: island [3]
## island species mean_body_mass maximum_body_mass minimum_body_mass
## <fct> <fct> <dbl> <int> <int>
## 1 Biscoe Adelie 3710. 4775 2850
## 2 Biscoe Gentoo 5092. 6300 3950
## 3 Dream Adelie 3701. 4650 2900
## 4 Dream Chinstrap 3733. 4800 2700
## 5 Torgersen Adelie 3709. 4700 2900
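The nine summaries above repeat one pattern (group, drop missing rows, take mean, max and min) for three measurements. As a possible shortcut, the sketch below computes all three measurements in a single call with dplyr's across(). The object name species_summary and the column naming scheme are illustrative choices, not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Summarise three measurements at once instead of repeating the pipeline.
# Result columns are named <column>_<statistic>, e.g. body_mass_g_mean.
species_summary <- penguins %>%
  drop_na() %>%
  group_by(species) %>%
  summarise(
    across(
      c(bill_length_mm, bill_depth_mm, body_mass_g),
      list(mean = mean, max = max, min = min),
      .names = "{.col}_{.fn}"
    ),
    .groups = "drop"
  )
species_summary
```

Swapping group_by(species) for group_by(island) or group_by(island, species) gives the other two groupings, and .groups = "drop" avoids the "grouped output" message seen above.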
Num_of_species <- penguins %>%
  count(species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  labs(title = "Number Of Species")
Num_of_species # Viewing plot
Num_of_species_Gender <- penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  facet_wrap(~sex) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Number Of Species by Gender")
Num_of_species_Gender # Viewing plot
Num_of_species_island <- penguins %>%
  drop_na() %>%
  count(island, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  facet_wrap(~island) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Number Of Species On Islands")
Num_of_species_island # Viewing plot
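The bar charts above print their counts as labels; the same numbers can be checked as plain tables. This is a minimal sketch, assuming the same packages as the rest of the analysis. The names species_counts and island_counts are illustrative. Keep in mind that the first chart uses every row, while the other two drop incomplete rows first.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Penguins per species, using every row (as in Num_of_species).
species_counts <- penguins %>% count(species)
species_counts

# Penguins per island after dropping incomplete rows
# (the island totals behind Num_of_species_island).
island_counts <- penguins %>% drop_na() %>% count(island)
island_counts
```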
Flipper_body_mass <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
  labs(title = "Body Mass VS. Flipper Length")
Flipper_body_mass # Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Flipper_Bill_len <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm, color = species)) +
  labs(title = "Bill Length mm VS. Flipper Length mm")
Flipper_Bill_len # Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Bill_len_depth <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  labs(title = "Bill Length mm VS. Bill depth mm")
Bill_len_depth # Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Bill_len_body_mass <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  labs(title = "Bill length mm VS. Body mass (g)")
Bill_len_body_mass # Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
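The scatter plots above fit a single loess curve through all three species together. When the species form separate clusters, a pooled curve can hide or even reverse the trend inside each species, as happens for bill length against bill depth in this dataset. The sketch below is one possible variation rather than the original method: it maps color in ggplot() so that geom_smooth() fits one straight line per species, then compares the pooled correlation with the per-species ones. The names Bill_len_depth_by_species, pooled_cor and species_cor are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping colour in ggplot() makes both layers use it, so geom_smooth()
# fits one straight line per species instead of a single pooled curve.
Bill_len_depth_by_species <- ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species)
) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(title = "Bill Length vs. Bill Depth, Fitted per Species")
Bill_len_depth_by_species

# Correlation for all penguins pooled, then within each species.
pooled_cor <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
                  use = "complete.obs")
species_cor <- sapply(split(penguins, penguins$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
})
pooled_cor
species_cor
```

The pooled correlation is negative while each within-species correlation is positive, so the per-species fit is the safer one to read for this pair of variables.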
Flipper_len_frequency <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species)) +
  labs(title = "Flipper Length Frequency Grouped by Species")
Flipper_len_frequency # Viewing plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
Body_mass_frequency <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = body_mass_g, fill = species)) +
  labs(title = "Body Mass Frequency Grouped by Species")
Body_mass_frequency # Viewing plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
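Both histograms trigger the `stat_bin()` message because no bin width was chosen. The sketch below sets one explicitly. The value binwidth = 5 (millimetres) is an illustrative assumption, not a value taken from the analysis, and the overlay settings (position and alpha) are an optional variation on the stacked bars used above.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth = 5 (mm) is an illustrative choice, not a value from the analysis.
# position = "identity" overlays the species instead of stacking them, and
# alpha makes the overlapping bars see-through.
Flipper_len_frequency_5mm <- ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = flipper_length_mm, fill = species),
    binwidth = 5, position = "identity", alpha = 0.6, na.rm = TRUE
  ) +
  labs(title = "Flipper Length Frequency Grouped by Species")
Flipper_len_frequency_5mm
```

Setting na.rm = TRUE also silences the warning about the two rows with missing flipper lengths; the rows are still excluded from the plot.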
# Helper that arranges several ggplot objects on one page in a grid.
# `cols` is the number of columns; `layout`, if given, overrides `cols`.
# The `file` argument is accepted but not used.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  library(grid)
  # Combine plots passed via ... with those passed in plotlist
  plots <- c(list(...), plotlist)
  numPlots <- length(plots)
  # If no layout is given, build one from the number of columns
  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
                     ncol = cols, nrow = ceiling(numPlots / cols))
  }
  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    for (i in seq_len(numPlots)) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
multiplot(Num_of_species, Num_of_species_Gender, Num_of_species_island, Flipper_body_mass , cols = 2) #Plot 1
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
multiplot(Flipper_len_frequency, Body_mass_frequency, Flipper_Bill_len, Bill_len_depth, cols = 2) #Plot 2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).
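While analyzing the dataset I found that:
After rows with missing values are dropped, Biscoe, Dream, and Torgersen islands have 163, 123, and 47 penguins respectively (333 in total). Across all 344 rows, the species counts are Adelie 152, Chinstrap 68, and Gentoo 124.
One of the key findings was the positive relationship between body mass and flipper length: heavier penguins tend to have longer flippers. Gentoo penguins had the largest body mass (a mean of about 5,092 g, against roughly 3,700 g for Adelie and Chinstrap) and the longest flippers of the three species. Biscoe has the highest mean body mass of the three islands, but only because it is the one island with Gentoo penguins. Adelie penguins, the only species found on all three islands, average about 3,700 g on each (3,710 g on Biscoe, 3,701 g on Dream, 3,709 g on Torgersen), so island by itself did not appear to determine size or body mass.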
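The findings above can be backed with a few numbers. The sketch below is one way to do it, assuming the same packages as before: it computes the Pearson correlation between body mass and flipper length, the mean size per species, and the mean Adelie body mass per island. The object names are illustrative. Unlike the tables earlier, it removes missing values per column with na.rm = TRUE rather than with drop_na(), so the means can differ slightly from those tables.

```r
library(dplyr)
library(palmerpenguins)

# Pearson correlation between body mass and flipper length, ignoring NAs.
mass_flipper_cor <- cor(penguins$body_mass_g, penguins$flipper_length_mm,
                        use = "complete.obs")
mass_flipper_cor

# Mean body mass and flipper length for each species.
size_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE)
  )
size_by_species

# Mean Adelie body mass on each island (Adelie is the only species
# that appears on all three islands).
adelie_by_island <- penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarise(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
adelie_by_island
```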