We are interested in how bill length differs among the three penguin species (Adelie, Chinstrap and Gentoo). We use the penguins dataset from the palmerpenguins package, which has 344 observations and 8 variables.
We conducted an exploratory data analysis, computed summary statistics (measures of centrality, spread and skewness), and carried out hypothesis testing and a correlation analysis. Section 5 covers the theory that supports each analysis and conclusion.

1. Exploratory Analysis

1.1. Variables

species island sex flipper_length_mm body_mass_g year bill_length_mm bill_depth_mm
factor factor factor integer integer integer numeric numeric

1.2. Summary of Bill Lengths

Variable type: numeric

skim_variable species n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm Adelie 1 0.99 38.79 2.66 32.1 36.75 38.80 40.75 46.0 ▁▆▇▆▁
bill_length_mm Chinstrap 0 1.00 48.83 3.34 40.9 46.35 49.55 51.08 58.0 ▂▇▇▅▁
bill_length_mm Gentoo 1 0.99 47.50 3.08 40.9 45.30 47.30 49.55 59.6 ▃▇▆▁▁

1.3. Scatter of Species Bill Lengths

2. Summary Statistics Analysis

Species Mean Median sd Skewness Skewness Classification
Adelie 38.824 38.85 2.663 0.156 fairly symmetrical
Chinstrap 48.834 49.55 3.339 -0.089 fairly symmetrical
Gentoo 47.568 47.40 3.106 0.604 moderately skewed

2.1. Mean & Median

The mean is more sensitive to outliers than the median. Upon examining the results, we observe a relatively small difference between the mean and median values for each species. This suggests either a limited presence of outliers or their absence altogether.

2.2. Standard Deviation (sd)

Assuming bill length is approximately normally distributed within each species:
  • Adelie: about 68% of values are expected to lie between 36.161 mm (mean - sd) and 41.487 mm (mean + sd).
  • Chinstrap: about 68% of values are expected to lie between 45.495 mm and 52.173 mm.
  • Gentoo: about 68% of values are expected to lie between 44.462 mm and 50.674 mm.
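
The one-standard-deviation intervals above can be reproduced from the summary table in section 2. The snippet below is a minimal sketch in Python (standard library only); the names `summary` and `one_sd_interval` are illustrative and not part of the original analysis.

```python
# Mean and standard deviation of bill length (mm) per species,
# copied from the summary table in section 2.
summary = {
    "Adelie": (38.824, 2.663),
    "Chinstrap": (48.834, 3.339),
    "Gentoo": (47.568, 3.106),
}


def one_sd_interval(mean, sd):
    """Return (mean - sd, mean + sd), rounded to 3 decimals."""
    return (round(mean - sd, 3), round(mean + sd, 3))


# Build the interval for every species.
intervals = {species: one_sd_interval(m, s) for species, (m, s) in summary.items()}

for species, (low, high) in intervals.items():
    print(f"{species}: {low} mm to {high} mm")
```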

2.3. Skewness

Skewness describes the direction in which the distribution of the observations tends to be distorted.
  • Adelie (skewness = 0.156): slightly distorted to the right. By magnitude, the distribution is classified as fairly symmetrical (|skewness| < 0.5).
  • Chinstrap (skewness = -0.089): slightly distorted to the left. By magnitude, the distribution is classified as fairly symmetrical (|skewness| < 0.5).
  • Gentoo (skewness = 0.604): distorted to the right. By magnitude, the distribution is classified as moderately skewed (0.5 ≤ |skewness| ≤ 1).
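
The classification rule used in the bullets above can be written as a small function. This is a sketch in Python; `classify_skewness` is a hypothetical helper name, and the thresholds are the ones from the magnitude table in section 5.

```python
def classify_skewness(skewness):
    """Return (direction, magnitude label) for a skewness value."""
    # The sign gives the direction of the distortion.
    if skewness > 0:
        direction = "right"
    elif skewness < 0:
        direction = "left"
    else:
        direction = "none"

    # The absolute value gives the magnitude class.
    magnitude = abs(skewness)
    if magnitude < 0.5:
        label = "fairly symmetrical"
    elif magnitude <= 1:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    return direction, label


# Skewness values from the summary table in section 2.
for species, value in [("Adelie", 0.156), ("Chinstrap", -0.089), ("Gentoo", 0.604)]:
    print(species, classify_skewness(value))
```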

3. Hypothesis test & Probability

We run a hypothesis test to check whether the mean bill length of Gentoo penguins is significantly longer than the mean bill length of Adelie penguins.

H0: The mean bill length of the Gentoo species is the same as the mean bill length of the Adelie species.
\[H_0: \mu_{\text{Gentoo bill length}} - \mu_{\text{Adelie bill length}} = 0\]
H1: The mean bill length of the Gentoo species is longer than the mean bill length of the Adelie species.
\[H_1: \mu_{\text{Gentoo bill length}} - \mu_{\text{Adelie bill length}} > 0\]

Observed Statistic p-value min max
8.713 0 -2.618 2.457
The observed statistic, 8.713 (marked with a red line), lies far beyond the right tail of the null distribution (the simulated sampling distribution under H0), whose values range from -2.618 to 2.457. None of the simulated statistics is as extreme as the observed one, so the estimated p-value is 0. This does not mean that such a value is impossible under H0, only that it is too unlikely to have appeared in the simulation, and H0 is therefore rejected.
A type I error (false positive) occurs when the null hypothesis is rejected although it is true. With a significance level of 0.05, there is a 5% probability of rejecting the hypothesis that the two species have equal mean bill lengths when that hypothesis is in fact true.
The p-value is lower than the significance level (0.05), so there is strong evidence to reject H0 in favour of H1 and conclude that the mean bill length of the Gentoo species is longer than the mean bill length of the Adelie species.
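
The report does not state how the null distribution was generated. As an illustration only, the sketch below assumes a permutation approach: species labels are shuffled many times, the difference in means is recomputed each time, and the right-tailed p-value is the share of shuffled differences at least as large as the observed one. The function name `permutation_p_value` and the toy measurements are hypothetical and are not taken from the penguins dataset.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, reps=2000, seed=1):
    """Right-tailed permutation test for mean(group_a) - mean(group_b).

    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(reps):
        # Shuffling the pooled values breaks any link between value and label,
        # which is what H0 (no difference between groups) implies.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            extreme += 1
    return observed, extreme / reps


# Hypothetical toy values in mm, NOT taken from the penguins dataset.
longer = [47.1, 48.3, 46.5, 49.0, 47.8]
shorter = [38.2, 39.5, 37.9, 40.1, 38.8]

observed, p_value = permutation_p_value(longer, shorter)
print(f"observed difference = {observed:.2f}, p-value = {p_value}")
```

With a finite number of shuffles the smallest non-zero p-value that can be reported is 1/reps, which is why a simulated p-value of 0 is better read as "smaller than the resolution of the simulation".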

4. Correlation Analysis

4.1. Correlation Matrix

Bill length has a weak negative correlation with bill depth (r = -0.24) and a strong positive correlation with flipper length (r = 0.66) and body mass (r = 0.6).
Because the linear relationship with bill depth is weak, it is not reasonable to predict a penguin's bill length from its bill depth.
On the other hand, flipper length and body mass could be used to predict bill length. Both of these physical characteristics have a strong linear relationship with bill length, which indicates that they are dependent on one another.

4.2. Correlation

5. Auxiliary - Theory Concepts

The topics below give more detail on the theory concepts that support the analysis.

Boxplot Interpretation Explanation
The central box corresponds to the interquartile range (IQR) of the data distribution. 50% of the values fall within the IQR, which runs from Q1 to Q3. Values beyond the whiskers on either side are normally considered outliers.
The bar inside the central box represents the median (Q2) and gives an indication of the symmetry and skewness of the data. Skewness is the distortion or asymmetry of a distribution relative to a symmetric one, such as the normal distribution.
  • Symmetric distribution: the median is in the middle of the box, and the whiskers are about the same length on both sides.
  • Positively skewed (right skewed) distribution: the median is closer to the bottom of the box (Q1), and the whisker is shorter on the lower end of the box.
  • Negatively skewed (left skewed) distribution: the median is closer to the top of the box (Q3), and the whisker is shorter on the upper end of the box.
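
The quantities behind a boxplot can be computed directly. The sketch below uses Python's `statistics.quantiles` and assumes the common convention that whiskers extend at most 1.5 × IQR beyond the box; that convention is an assumption here, since the report does not say which rule its plots use. The sample values and the name `boxplot_summary` are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return Q1, median, Q3, IQR and the points outside the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical sample with one unusually large value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
result = boxplot_summary(sample)
print(result)
```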
Summary Statistics: Skewness & Standard Deviation (sd)
Skewness: describes how symmetrical a distribution is. Skewness refers to asymmetry, that is, a tendency of the distribution to be distorted to the left or to the right.
  • Left skewed distribution:
    • Skewness value < 0
    • Centrality measures typically fall in the order: mean < median < mode
    • The bulk of the data appears on the right of the plot, with the tail extending out to the left.
  • Right skewed distribution:
    • Skewness value > 0
    • Centrality measures typically fall in the order: mode < median < mean
    • The bulk of the data appears on the left of the plot, with the tail extending out to the right.
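
To make the sign convention concrete, the sketch below computes a simple moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 3/2). This is one of several common estimators; which one the report's software used is not stated, so the values in section 2 may have been produced by a slightly different formula. The function name and sample data are hypothetical.

```python
def moment_skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5, using population moments."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - centre) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples illustrating the three cases.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 2, 10]   # long tail to the right
left_tail = [-10, -2, -1, -1, -1]  # mirror image, long tail to the left

print(moment_skewness(symmetric))
print(moment_skewness(right_tail))
print(moment_skewness(left_tail))
```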
The magnitude of the skewness is given by its absolute value and can be interpreted using the table below:

Magnitude of Skewness Classification
< 0.5 fairly symmetrical
0.5 - 1 moderately skewed
> 1 highly skewed

Standard deviation (sd): a quantity expressing how much, on average, the values in a distribution differ from the distribution mean. It tells us how spread out the observations are. For a normal distribution, we expect that:
  • 68% of values fall within one standard deviation of the mean,
  • 95% of values fall within two standard deviations, and
  • 99.7% of values fall within three standard deviations.
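
These percentages follow from the normal cumulative distribution function and can be checked with Python's `statistics.NormalDist`. The helper name `coverage_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {coverage_within(k):.4f}")
```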
Theory Concepts for Hypothesis testing Interpretation
Significance level: a pre-determined threshold for the p-value that determines whether we reject H0 or not. The significance level equals the probability of a Type I error, that is, the probability of rejecting the null hypothesis when the null hypothesis is true.
Null distribution: a sampling distribution, typically generated by resampling methods (such as permutation or bootstrap), representing what one would expect if the null hypothesis (H0) were true. In essence, H0 is used to simulate this null sampling distribution.
p-value: signifies the probability of observing a test statistic as extreme as, or more extreme than, the obtained value, assuming the null hypothesis is true. It serves as the smallest significance level at which we would reject the null hypothesis, considering the observed sample statistic.
Correlation Concepts
Magnitude of r(x, y)   Correlation Strength
0             none
0.01 - 0.19   very weak
0.20 - 0.39   weak
0.40 - 0.59   moderate
0.60 - 0.79   strong
0.80 - 0.99   very strong
1             perfect
The correlation provides both the direction and the strength of a linear relationship between two variables. The direction can be negative or positive, and the strength is measured by the correlation coefficient (r), which ranges from -1 to +1.
The closer the absolute value of r is to 1, the stronger the linear relationship between the two variables. Intermediate values indicate that the variables tend to be related.
  • Null Correlation: a correlation equal to 0 indicates no linear relationship between the two variables. This does not by itself imply that the variables are independent, since they may still be related in a non-linear way.
  • Positive Correlation: a correlation of +1 indicates a perfect positive linear relationship, so a specific value of one variable, X, predicts the other variable, Y, exactly. The variables are positively dependent.
  • Negative Correlation: a correlation of -1 indicates a perfect negative linear relationship, so Y decreases exactly linearly as X increases. The variables are negatively dependent.
The magnitude scale used to classify the strength of the correlation between variables is shown in the table above.
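
The sketch below shows how a Pearson correlation coefficient can be computed and then labelled with the strength scale from the table. It is written in plain Python; the function names and the toy x/y values are hypothetical. Because the table's ranges leave small gaps (for example between 0.19 and 0.20), the code treats each range as a half-open interval, which is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_strength(r):
    """Map r to (direction, strength) using the magnitude table above."""
    magnitude = abs(r)
    if magnitude == 0:
        return "none", "none"
    direction = "positive" if r > 0 else "negative"
    if math.isclose(magnitude, 1.0):
        return direction, "perfect"
    # Assumed half-open intervals to close the gaps in the table.
    for upper, label in [(0.2, "very weak"), (0.4, "weak"), (0.6, "moderate"),
                         (0.8, "strong"), (1.0, "very strong")]:
        if magnitude < upper:
            return direction, label
    return direction, "perfect"


# Coefficients reported in section 4.1.
for r in (-0.24, 0.66, 0.6):
    print(r, correlation_strength(r))

# Hypothetical toy data with an exact linear relationship.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```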