Simple Exploratory Analysis of R’s US women’s dataset

1 - Brief Introduction

This dataset contains 15 height and weight observations for American women aged 30 to 39 (the R documentation describes the values as average heights and weights). Here I perform simple visual, univariate and multivariate analysis on its variables and summarise my findings. For more info on this dataset, please visit: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/women.html.

2 - Load the required libraries

In R, as in other statistical programming languages, you have to load the packages you need. As a wise person once said, we are standing on the shoulders of giants. Thanks to the amazing R community for these super awesome packages.

library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(skimr)
library(knitr)

3 - Open the dataset and understand the initial shape of it

Quick data preview

It’s always good to use the head function. It gives you a quick visual feel of the dataset.

Remember, height is in inches and weight is in lbs.

head(women)

##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129
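If you are more comfortable with metric units, here is a minimal sketch of a conversion using dplyr's mutate. The names women_metric, height_cm and weight_kg are my own choices for illustration and are not part of the dataset.

```r
library(dplyr)

# Conversion factors: 1 inch = 2.54 cm, 1 lb = 0.45359237 kg
women_metric <- women %>%
  mutate(
    height_cm = height * 2.54,
    weight_kg = weight * 0.45359237
  )

# Preview the original columns next to the converted ones
head(women_metric)
```

The original columns are kept, so the rest of the analysis can carry on in inches and lbs.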

Now a check on the number of rows, columns and data types of the variables/columns

glimpse is my favorite function for getting a simple, clear and concise overview of the dataset’s dimensions, its variables and their types. It comes with the dplyr package, which is part of the tidyverse.

class simply tells you what kind of data structure the dataset is stored in.

glimpse(women)

## Observations: 15
## Variables: 2
## $ height <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
## $ weight <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, ...

class(women)

## [1] "data.frame"
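If you would rather not depend on dplyr for this step, a rough base R equivalent could look like the sketch below. It also checks for missing values, which glimpse does not report directly.

```r
# Base R checks; `women` ships with the built-in datasets package
dim(women)               # number of rows, then number of columns
sapply(women, class)     # storage class of each column
colSums(is.na(women))    # count of missing values per column
```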

Now a univariate summary of all the variables

Since both variables are numerical, using the summary function gives you a descriptive statistical summary of them

summary(women)

##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0
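summary covers the centre and the quartiles but not the standard deviation. As a small sketch, the spread of each variable can be added with base R's sd and IQR functions. The data frame name spread is just an illustrative choice.

```r
# Standard deviation and interquartile range for each column
spread <- data.frame(
  variable = names(women),
  sd       = sapply(women, sd),
  iqr      = sapply(women, IQR),
  row.names = NULL
)

spread
```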

4 - Visual univariate and multivariate analysis of the variables

The box and whisker plot gives you a simple visualisation of the central tendency and variability of each numeric variable, while the scatterplot shows the strength of the relationship/association between both variables.

I have also added the height and weight data points of the 15 women measured to their corresponding height and weight box and whisker plots for more visual context.

Box and whiskers plot of the Women’s heights

ggplot(women, aes(x = 1, y = height)) +
  geom_boxplot(fill = "white", colour = "red", outlier.colour = "red",
               outlier.shape = 1, width = 0.1) +
  # height = 0 jitters the points sideways only, so each point keeps its true value
  geom_jitter(width = 0.1, height = 0) +
  labs(
    title = "Summary of US Women's Heights",
    subtitle = "Height in inches",
    y = "Height (inches)",
    caption = "Data source: The World Almanac and Book of Facts, 1975.") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank())

Box and whiskers plot of the Women’s weights

ggplot(women, aes(x = 1, y = weight)) +
  geom_boxplot(fill = "white", colour = "blue", outlier.colour = "red",
               outlier.shape = 1, width = 0.1) +
  # height = 0 jitters the points sideways only, so each point keeps its true value
  geom_jitter(width = 0.1, height = 0) +
  labs(
    title = "Summary of US Women's Weights",
    subtitle = "Weight in lbs",
    y = "Weight (lbs)",
    caption = "Data source: The World Almanac and Book of Facts, 1975.") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank())
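The box plots can also be checked numerically. Here is a minimal sketch using base R's boxplot.stats, which returns the five box plot statistics and any points beyond the whiskers. The variable names height_stats and weight_stats are illustrative.

```r
# Five-number box plot statistics plus any outliers for each variable
height_stats <- boxplot.stats(women$height)
weight_stats <- boxplot.stats(women$weight)

height_stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
weight_stats$stats

# Points lying beyond the whiskers (empty vectors mean no outliers)
height_stats$out
weight_stats$out
```

Both out vectors come back empty for this dataset, which matches the absence of outlier markers in the plots.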

Plot of multivariate relationship between both variables

The points in the scatterplot below rise almost linearly from left to right: as height increases, so does weight, which signifies a strong positive correlation between the two variables.

ggplot(women, aes(x = height, y = weight)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(
  title = "Women's Weight as a function of Height", 
  y = "Weight (lbs)",
  x = "Height (inches)",
  subtitle = "Weight in lbs, Height in inches",
  caption = "datasource: The World Almanac and Book of Facts, 1975.") +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank())
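The purple line drawn by geom_smooth(method = "lm") is a least squares fit. To put numbers on it, here is a rough sketch that fits the same kind of model with base R's lm. The object name fit is my own.

```r
# Simple linear regression of weight on height
fit <- lm(weight ~ height, data = women)

coef(fit)                # intercept and slope (lbs gained per extra inch)
summary(fit)$r.squared   # share of the variance in weight explained by height
```

The slope works out to roughly 3.45 lbs per inch of height for these 15 observations.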

Correlation analysis between the two variables

correlation_coefficient <- round(cor(women$height, women$weight), digits = 3)
correlation_coefficient

## [1] 0.995
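A single coefficient says nothing about how uncertain it is with only 15 observations. As a hedged sketch, base R's cor.test adds a confidence interval and a p-value for the Pearson correlation. The object name ct is illustrative.

```r
# Pearson correlation test with a 95% confidence interval
ct <- cor.test(women$height, women$weight)

ct$estimate   # the correlation coefficient itself
ct$conf.int   # 95% confidence interval for the coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation
```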

5 - Summary

- The correlation coefficient of 0.995 shows a very strong positive association between the height and weight values in this dataset.

- This is an extremely small sample. It is nowhere near large enough to represent all the women in the US aged 30 to 39. Therefore, any inference made from this dataset alone would be unreliable.

- Despite the evidence of a strong association between women’s height and weight, as the saying goes, correlation does not equal causation. More variables and factors would have to be added to the data and considered before drawing any causal conclusion.

This brings me to the end of this simple analysis. You can connect with me on LinkedIn: https://www.linkedin.com/in/brightuduji/

Until next time, take care and keep Vizalysing!