This dataset contains the heights and weights of 15 American women aged 30 to 39. Here I perform a simple visual, univariate and bivariate analysis of its variables and summarise my findings. For more info on this dataset, please visit: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/women.html.
In R, as with other statistical programming languages, you have to load the packages you need. As a wise person once said, we are standing on the shoulders of giants: thanks to the amazing R community for these super awesome packages.
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(skimr)
library(knitr)
It’s always good to use the head function. It gives you a quick visual feel for the dataset.
Remember that height is in inches and weight is in lbs.
head(women)
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129
glimpse is my favorite function for getting a simple, clear and concise overview of the dataset's dimensions, its variables and their types. It comes with the dplyr package, which is part of the tidyverse.
class simply tells you which data structure the dataset is stored in.
glimpse(women)
## Observations: 15
## Variables: 2
## $ height <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
## $ weight <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, ...
class(women)
## [1] "data.frame"
Since both variables are numeric, the summary function gives you a descriptive statistical summary of each.
summary(women)
##      height         weight
##  Min.   :58.0   Min.   :115.0
##  1st Qu.:61.5   1st Qu.:124.5
##  Median :65.0   Median :135.0
##  Mean   :65.0   Mean   :136.7
##  3rd Qu.:68.5   3rd Qu.:148.0
##  Max.   :72.0   Max.   :164.0
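summary reports the centre and the quartiles, but not the spread as a single number. As a minimal sketch in base R (the name spread_stats is my own and not part of the original analysis), the standard deviation and interquartile range of each column could be computed like this:

```r
# Illustrative table: mean, standard deviation and IQR for each column of women.
spread_stats <- sapply(women, function(x) c(mean = mean(x), sd = sd(x), IQR = IQR(x)))
round(spread_stats, 2)
```

The IQR values should agree with the gap between the 1st and 3rd quartiles in the summary output above (68.5 - 61.5 = 7 inches for height).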
The box and whisker plots give you a simple visualisation of the central tendency and variability of each numeric variable, while the scatterplot shows the strength of the relationship between the two variables.
I have also overlaid the 15 measured data points on the corresponding height and weight box and whisker plots for more visual context.
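The box plots below set an outlier colour and shape, so it is worth checking whether the data has any outliers at all. A minimal sketch using boxplot.stats from base R, which applies the usual 1.5 x IQR whisker rule (the name outliers is only illustrative):

```r
# For each variable, list the values that fall beyond the 1.5 * IQR whiskers.
outliers <- lapply(women, function(x) boxplot.stats(x)$out)
outliers
```

An empty result for a variable means every point falls inside the whiskers, so no outlier markers would be drawn for it.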
# Box plot of height with the 15 observations overlaid
ggplot(women, aes(x = 1, y = height)) +
  geom_boxplot(fill = "white", colour = "red", outlier.colour = "red", outlier.shape = 1, width = 0.1) +
  # Jitter horizontally only (height = 0) so the plotted values stay exact
  geom_jitter(width = 0.1, height = 0) +
  labs(
    title = "Summary of US Women's Heights", x = "arbitrary", y = "Height (inches)",
    subtitle = "Height in inches",
    caption = "datasource: The World Almanac and Book of Facts, 1975.") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank())
ggplot(women, aes(x = 1, y = weight)) +
  geom_boxplot(fill = "white", colour = "blue", outlier.colour = "red", outlier.shape = 1, width = 0.1) +
  # Jitter horizontally only (height = 0) so the plotted values stay exact
  geom_jitter(width = 0.1, height = 0) +
  labs(
    title = "Summary of US Women's Weights", x = "arbitrary", y = "Weight (lbs)",
    subtitle = "Weight in lbs",
    caption = "datasource: The World Almanac and Book of Facts, 1975.") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank())
In the scatterplot below, the points rise almost linearly from left to right: weight increases steadily with height, and the points sit close to the fitted line, which indicates a strong positive correlation between the two variables.
ggplot(women, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(
    title = "Women's Weight as a function of Height",
    y = "Weight (lbs)",
    x = "Height (inches)",
    subtitle = "Weight in lbs, Height in inches",
    caption = "datasource: The World Almanac and Book of Facts, 1975.") +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_blank())
correlation_coefficient <- round(cor(women$height, women$weight), digits = 3)
correlation_coefficient
## [1] 0.995
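The purple line drawn by geom_smooth(method = "lm") is an ordinary least squares fit. As a hedged sketch (the object name fit is my own), the same model can be fitted directly with lm to read off the slope and R-squared:

```r
# Fit weight as a linear function of height, the same model geom_smooth(method = "lm") draws.
fit <- lm(weight ~ height, data = women)
coef(fit)                 # intercept and slope (lbs per inch of height)
summary(fit)$r.squared    # share of the variance in weight explained by height
```

For a simple regression with a single predictor, R-squared equals the square of the correlation coefficient, so it should match cor(women$height, women$weight)^2.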
This brings me to the end of this simple analysis. You can connect with me on LinkedIn: https://www.linkedin.com/in/brightuduji/
Until next time, take care and keep Vizalysing!
B!