Dataset Overview and Source

HairEyeColor Dataset

This analysis examines data from 592 individuals to identify patterns in the distribution of hair color, eye color, and sex.

Data Source: Built-in R dataset (HairEyeColor)

Key Variables:

  • Hair: Hair color (Black, Brown, Red, Blond)
  • Eye: Eye color (Brown, Blue, Hazel, Green)
  • Sex: Male or Female
  • Freq: Count of individuals in each category combination
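
As a quick check of the description above, the built-in table can be inspected directly. This is a minimal sketch using only base R; the names `dims`, `var_levels`, and `n_total` are illustrative and not part of the original analysis.

```r
# HairEyeColor is a 3-way contingency table in base R's datasets package.
dims <- dim(HairEyeColor)              # cells per dimension: Hair, Eye, Sex
var_levels <- dimnames(HairEyeColor)   # category labels for each variable
n_total <- sum(HairEyeColor)           # total number of individuals counted

print(dims)
print(var_levels)
print(n_total)
```

Note that `Freq` only appears once the table is converted to a data frame; in the raw table the counts are the cell values themselves.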

R Code for Data Preparation

This is how I loaded and prepared the data:

library(ggplot2)
library(plotly)
library(dplyr)

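# Convert the 3-way contingency table (Hair x Eye x Sex) to a long data frame
hec_df <- as.data.frame(HairEyeColor)

# Preview the first rows: Hair, Eye, Sex, Freq
head(hec_df)

# Hair and Eye are already factors after as.data.frame(); these calls
# keep that explicit and preserve the original level order
hec_df$Hair <- factor(hec_df$Hair)
hec_df$Eye <- factor(hec_df$Eye)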

3D Plotly: Hair, Eye, and Frequency

3D Plot Analysis

Key Observations:

  • Dominant Combination: Brown hair and brown eyes form the largest concentration, indicating this is the most common trait pairing in the dataset.

  • Blond-Blue Cluster: There is a clear secondary cluster where blond hair strongly aligns with blue eyes, suggesting a notable association between these traits.

  • Rarity of Traits: Red hair and green eyes appear at much lower frequencies, highlighting their relative rarity in the population.

  • Structured Distribution: The data is not evenly distributed. Distinct clusters show that certain trait combinations occur far more frequently than others.
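
The observations above can be cross-checked against the counts themselves. The following is a rough sketch, assuming the plot is based on Hair and Eye counts summed over Sex; `hair_eye` and `ranked` are hypothetical names introduced here for illustration.

```r
# Collapse over Sex to get one count per Hair x Eye combination.
hair_eye <- as.data.frame(
  xtabs(Freq ~ Hair + Eye, data = as.data.frame(HairEyeColor))
)

# Sort from the most common to the least common combination.
ranked <- hair_eye[order(-hair_eye$Freq), ]
rownames(ranked) <- NULL

head(ranked, 3)   # most common pairings
tail(ranked, 3)   # rarest pairings
```

Ranking the 16 combinations this way makes the dominant pairings and the sparse ones visible without relying on the 3D view.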

Plotly Scatter: Frequency by Hair Color and Eye Color

ggplot Boxplot: Hair Color Distribution by Sex

ggplot Bar Chart: Eye Color Distribution

Statistical Analysis: Summary Statistics

hec_df %>%
  group_by(Hair) %>%
  summarise(
    Total = sum(Freq),
    Mean = mean(Freq),
    Max = max(Freq),
    Min = min(Freq)
  )
## # A tibble: 4 × 5
##   Hair  Total  Mean   Max   Min
##   <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Black   108 13.5     36     2
## 2 Brown   286 35.8     66    14
## 3 Red      71  8.88    16     7
## 4 Blond   127 15.9     64     3

Summary Statistics: Interpretation

Detailed Findings:

  • Uneven Distribution: Hair color frequencies are not balanced. Brown hair accounts for nearly half of the sample (286 of 592), followed by blond (127) and black (108), while red hair is the least common (71).

  • Concentration Effect: A small number of combinations (especially brown/brown, blond/blue) account for a large proportion of the total observations.

  • Low-Frequency Categories: Several combinations have very small counts, reinforcing the idea that some trait combinations are rare in this sample.

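  • Categorical Variation: Unlike continuous data, the variation here reflects differences in category frequency rather than spread or dispersion. The Mean, Max, and Min columns summarize the eight Eye-by-Sex cell counts within each hair color, not measurements on individuals.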
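
To put rough numbers on the uneven distribution and the concentration effect, the counts can be turned into shares of the sample. This is a sketch in base R under the assumption that combinations are counted with Sex summed out; `hec_long`, `hair_share`, and `top2_share` are names made up for this example.

```r
hec_long <- as.data.frame(HairEyeColor)

# Share of the sample for each hair color.
hair_totals <- tapply(hec_long$Freq, hec_long$Hair, sum)
hair_share <- round(hair_totals / sum(hair_totals), 3)
print(hair_share)

# Share of the sample held by the two largest Hair x Eye combinations.
pair_counts <- xtabs(Freq ~ Hair + Eye, data = hec_long)
top2 <- sort(as.vector(pair_counts), decreasing = TRUE)[1:2]
top2_share <- sum(top2) / sum(pair_counts)
print(round(top2_share, 3))
```

Proportions are often easier to compare than raw totals when the categories differ this much in size.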

Statistical Analysis: Chi-Square Test

## 
##  Pearson's Chi-squared test
## 
## data:  table_hec
## X-squared = 138.29, df = 9, p-value < 2.2e-16
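
The output above refers to `table_hec`, but the code that built it is not shown. One plausible way to construct it, assuming it is the Hair-by-Eye table with Sex summed out, is sketched below; `chi_result` is a name introduced for this example.

```r
# Assumption: table_hec is the Hair x Eye table with Sex summed out.
table_hec <- margin.table(HairEyeColor, margin = c(1, 2))

chi_result <- chisq.test(table_hec)
print(chi_result)

# Expected counts under independence; the chi-square approximation is
# usually considered reasonable when these are not too small.
print(round(chi_result$expected, 1))
```

Checking the expected counts is a common sanity step before relying on the p-value.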

Chi-Square Test: Interpretation

Comprehensive Analysis:

  • Statistical Significance: The chi-square test produces a very low p-value (p < 2.2e-16, X-squared = 138.29, df = 9), indicating that the relationship between hair color and eye color is statistically significant.

  • Dependence Between Variables: Hair color and eye color are not independent. Certain combinations occur more often than would be expected if the two traits were unrelated.

  • Strength of Association: The p-value shows that an association exists but not how strong it is. The observed clustering (again, blond/blue, brown/brown) shows where the association is concentrated.

  • Interpretation in Context: This is consistent with underlying biological or genetic relationships influencing how these traits are expressed together, although the test itself shows association, not cause.
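
The chi-square statistic alone does not say how strong the association is or which cells drive it. As a hedged follow-up, not part of the original analysis, Cramér's V and the standardized residuals from `chisq.test()` can be computed as below; `table_hec` is again assumed to be the Hair-by-Eye table with Sex summed out, and `cramers_v` is a name introduced here.

```r
table_hec <- margin.table(HairEyeColor, margin = c(1, 2))
chi_result <- chisq.test(table_hec)

# Cramer's V: the chi-square statistic rescaled to a 0-1 effect size.
n <- sum(table_hec)
k <- min(dim(table_hec)) - 1
cramers_v <- sqrt(unname(chi_result$statistic) / (n * k))
print(round(cramers_v, 3))

# Standardized residuals: large positive values mark combinations that are
# more common than independence predicts, large negative values less common.
print(round(chi_result$stdres, 2))
```

Reading the residual table next to the plots helps connect the test result to the specific clusters described earlier.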

Key Insights and Conclusions

Major Findings:

First, trait combinations are highly structured, with brown hair/brown eyes and blond hair/blue eyes dominating the dataset.

Second, hair color and eye color show a statistically significant relationship, meaning their pairing is not random.

Third, rare traits such as red hair and green eyes occur at consistently low frequencies, emphasizing uneven distribution across categories.

Study Implications and Future Directions

Practical Applications:

This dataset provides a strong example of how categorical data can reveal meaningful relationships through visualization and statistical testing. It is particularly useful for understanding trait association patterns, demonstrating chi-square analysis, and introducing the basics of how traits are distributed in a population.

Study Limitations:

This dataset is aggregated and does not include individual-level observations, which limits the analysis to cell counts. Additionally, it represents a specific population (statistics students at the University of Delaware, per the R documentation for HairEyeColor) and may not generalize broadly.

Future Research Directions:

  • Expanding to larger and more diverse populations
  • Incorporating additional traits (such as skin tone or ancestry)
  • Using probabilistic models to predict trait combinations

Thank You!

Dataset Source: R datasets package (HairEyeColor)

Tools Used: R, ggplot2, plotly, dplyr