Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
Exploratory data analysis (EDA) methods are often called Descriptive Statistics due to the fact that they simply describe, or provide estimates based on, the data at hand.
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” —John Tukey
EDA consists of:
And can be useful for:
In this lesson, we will explore pulse dataset using R. In addition, we will perform exploratory data analysis.
# Load packages
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(gtsummary)
library(gt)
library(datasets)
# Read Data
data <- read.csv("data/pulse_data.csv", stringsAsFactors = TRUE)
gt(head(data))
| Height | Weight | Age | Gender | Smokes | Alcohol | Exercise | Ran | Pulse1 | Pulse2 | BMI | BMICat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.73 | 57 | 18 | Female | No | Yes | Moderate | No | 86 | 88 | 19.04507 | Underweight |
| 1.79 | 58 | 19 | Female | No | Yes | Moderate | Yes | 82 | 150 | 18.10181 | Underweight |
| 1.67 | 62 | 18 | Female | No | Yes | High | Yes | 96 | 176 | 22.23099 | Normal |
| 1.95 | 84 | 18 | Male | No | Yes | High | No | 71 | 73 | 22.09073 | Normal |
| 1.73 | 64 | 18 | Female | No | Yes | Low | No | 90 | 88 | 21.38394 | Normal |
| 1.84 | 74 | 22 | Male | No | Yes | Low | Yes | 78 | 141 | 21.85728 | Normal |
# Check data structure
glimpse(data)
Rows: 108
Columns: 12
$ Height <dbl> 1.73, 1.79, 1.67, 1.95, 1.73, 1.84, 1.62, 1.69, 1.64, 1.68, 1…
$ Weight <dbl> 57, 58, 62, 84, 64, 74, 57, 55, 56, 60, 75, 58, 68, 59, 72, 1…
$ Age <int> 18, 19, 18, 18, 18, 22, 20, 18, 19, 23, 20, 19, 22, 18, 18, 2…
$ Gender <fct> Female, Female, Female, Male, Female, Male, Female, Female, F…
$ Smokes <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, No, No, …
$ Alcohol <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Ye…
$ Exercise <fct> Moderate, Moderate, High, High, Low, Low, Moderate, Moderate,…
$ Ran <fct> No, Yes, Yes, No, No, Yes, No, No, No, Yes, Yes, No, No, No, …
$ Pulse1 <dbl> 86, 82, 96, 71, 90, 78, 68, 71, 68, 88, 76, 74, 70, 78, 69, 7…
$ Pulse2 <dbl> 88, 150, 176, 73, 88, 141, 72, 77, 68, 150, 88, 76, 71, 82, 6…
$ BMI <dbl> 19.04507, 18.10181, 22.23099, 22.09073, 21.38394, 21.85728, 2…
$ BMICat <fct> Underweight, Underweight, Normal, Normal, Normal, Normal, Nor…
Here is some information that would be interesting to get from these data:
In order to summarize the distribution of a categorical variable, we first create a table of the different values (categories) the variable takes, how many times each value occurs (count) and, more importantly, how often each value occurs (by converting the counts to percentages).
# One-way frequency table
data %>%
group_by(BMICat) %>%
summarise(frequency = n())
# A tibble: 4 × 2
BMICat frequency
<fct> <int>
1 Normal 62
2 Obese 2
3 Overweight 17
4 Underweight 27
data %>%
group_by(BMICat) %>%
summarise(counts = n()) %>%
mutate(percent = counts/sum(counts) *100)
# A tibble: 4 × 3
BMICat counts percent
<fct> <int> <dbl>
1 Normal 62 57.4
2 Obese 2 1.85
3 Overweight 17 15.7
4 Underweight 27 25
There are two simple graphical displays for visualizing the distribution of one categorical variable:
# Visualize one categorical variable; `fct_infreq()` for sorting the bar
data %>%
ggplot(aes(x = BMICat))+
geom_bar(fill = "#97B3C6")
# # Sorting Bar Chart by using `fct_infreq()`
data %>%
ggplot(aes(x = fct_infreq(BMICat)))+
geom_bar(fill = "#97B3C6")
# Summaries(counts) data for visualizing the distribution
df1 <- data %>%
group_by(BMICat) %>%
summarise(counts = n()) %>%
arrange(counts)
df1
# A tibble: 4 × 2
BMICat counts
<fct> <int>
1 Obese 2
2 Overweight 17
3 Underweight 27
4 Normal 62
# Show the observations number on the top the bar
ggplot(df1, aes(x = BMICat, y = counts)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = counts), vjust = -0.3)
# # Sorting bar by using `reorder()`
ggplot(df1, aes(x = reorder(BMICat, counts), y = counts)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = counts), vjust = -0.3)
# # Sorting bar by using `reorder()` and `desc()`
ggplot(df1, aes(x = reorder(BMICat, desc(counts)), y = counts)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = counts), vjust = -0.3)
# Calculate percentage of each category
df2 <- data %>%
group_by(BMICat) %>%
summarise(counts = n()) %>%
arrange(desc(BMICat)) %>%
mutate(prop = round(counts*100/sum(counts), 1))
df2
# A tibble: 4 × 3
BMICat counts prop
<fct> <int> <dbl>
1 Underweight 27 25
2 Overweight 17 15.7
3 Obese 2 1.9
4 Normal 62 57.4
# Sorting the bars using `reorder()`
ggplot(df2, aes(x = reorder(BMICat, counts), y = prop)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = prop), vjust = -0.3)
# Show the percentage(%) on the of the bar
ggplot(df2, aes(x = BMICat, y = prop)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = prop), vjust = -0.3)
# Sorting the bars using `reorder()` and `desc()`
ggplot(df2, aes(x = reorder(BMICat, desc(prop)), y = prop)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = prop), vjust = -0.3)
# Customize the plot
ggplot(df2, aes(x = reorder(BMICat, desc(counts)), y = prop)) +
geom_bar(fill = "#97B3C6", stat = "identity") +
geom_text(aes(label = prop), vjust = -0.3)+
labs(title = "Distribution of BMICat",
x = "BMI Category",
y = "Proportion",
caption = "Data Source: https://bolt.mph.ufl.edu/")
# Create bar chart using ggpubr
ggbarplot(df2, x = "BMICat", y = "counts", fill = "#97B3C6")
# show counts
ggbarplot(df2, x = "BMICat", y = "counts", fill = "#97B3C6", label = TRUE, lab.pos = "out")
# show counts
ggbarplot(df2, x = "BMICat", y = "prop", fill = "#97B3C6", label = TRUE, lab.pos = "out")
ggpie(df2, "prop", label = "BMICat", fill = "BMICat",
color = "white",
palette = c("#00AFBB", "#E7B800", "#FC4E07", "#97B3C6"))
# Show group names and value as labels
labs <- paste0(df2$BMICat, " (", df2$prop, "%)")
ggpie(df2, "prop", label =labs, fill = "BMICat",
color = "white",
palette = c("#00AFBB", "#E7B800", "#FC4E07", "#97B3C6"), lab.pos = "in")
# Change the position and font color of labels
labs <- paste0(df2$BMICat, "(", df2$prop, "%)")
ggpie(df2, "prop", label =labs,
lab.pos = "in", lab.font = "white",
fill = "BMICat",
color = "white",
palette = c("#00AFBB", "#E7B800", "#FC4E07", "#97B3C6"))
In this section, we will explore the data collected from a quantitative variable, and learn how to describe and summarize the important features of its distribution.
We will learn how to display the distribution using graphs and discuss a variety of numerical measures.
# Average BMI
data %>%
summarise(avg_bmi = mean(BMI))
avg_bmi
1 22.03186
# Median BMI
data %>%
summarise(median_bmi = median(BMI))
median_bmi
1 21.57798
Shape: Overall appearance of histogram. Can be symmetric, bell-shaped, left skewed, right skewed, etc.
Center: Mean or Median
Spread: How far our data spreads. Range, Interquartile Range (IQR),standard deviation, variance.
Outliers: Data points that fall far from the bulk of the data
Interpretation: The distribution of height is bell
shaped with a center of about 10.001, a range of 11 inches (5 to 16),
and no apparent outliers.
# Calculate average height
data %>%
summarise(avg_height = mean(Height))
avg_height
1 1.732685
# Show the center in histogram
gghistogram(data, x = "Height", add = "mean")
# Calculate median height
data %>%
summarise(median_height = median(Height))
median_height
1 1.73
# Show the center in histogram
gghistogram(data, x = "Height", add = "median")
# Add mean
gghistogram(data, x = "Height", bins = 15, fill = "#97B3C6", title = "Histogram of Height", xlab = "Height(m)", ylab = "Frequency", add = "mean")
Interpretation: The distribution of height is roughly
bell shaped with a center of about 1.7m, a range of 0.55 meters (1.40 to
1.95), and no apparent outliers.
# Change the bins size
gghistogram(data, x = "Height", bins = 15, fill = "#58508d" , add = "mean")
# Compare mean and median
data %>%
summarise(avg_bmi = mean(BMI),
median_bmi = median(BMI))
avg_bmi median_bmi
1 22.03186 21.57798
# Load and explore diabetes data
diabetes <- read.csv("data/diabetes.csv", stringsAsFactors = TRUE)
gt(head(diabetes))
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
When describing the shape of a distribution, we should consider:
Note that all three distributions are symmetric, but are different in
their modality (peakedness).
# Check distribution of age
gghistogram(diabetes, x = "BMI", fill = "#665191")
A distribution is called skewed right if, as in the histogram above,
the right tail (larger values) is much longer than the left tail (small
values).
# Check distribution of age
gghistogram(diabetes, x = "Age", fill = "#665191")
A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values).
# Check distribution of Glucose
gghistogram(diabetes, x = "Glucose", fill = "#665191")
Comments: - Distributions with more than two peaks are generally called multimodal. - Bimodal or multimodal distributions can be evidence that two distinct groups are represented. - Unimodal, Bimodal, and multimodal distributions may or may not be symmetric.
The center of the distribution is often used to represent a typical value.
# Check distribution of BMI
gghistogram(diabetes, x = "BMI", fill = "#665191", add = "mean", add_density = TRUE)
# Check distribution of BMI
gghistogram(diabetes, x = "BMI", fill = "#665191", add = "median")
One way to measure the spread (also called variability or variation) of the distribution is to use the approximate range covered by the data.
# Check distribution of BloodPressure
gghistogram(diabetes, x = "BloodPressure", fill = "#665191", add = "median")
Outliers are observations that fall outside the overall pattern.
# Check distribution of BloodPressure
gghistogram(diabetes, x = "BloodPressure", fill = "#665191", add = "median")
The range covered by the data is the most intuitive measure of variability. The range is exactly the distance between the smallest data point (min) and the largest one (Max).
Range = Max – min
data %>%
summarise(max_height = max(Height),
min_height = min(Height),
range = max_height - min_height)
max_height min_height range
1 1.95 1.4 0.55
While the range quantifies the variability by looking at the range covered by ALL the data, the Inter-Quartile Range or IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.
data %>%
summarise(
min = fivenum(Weight)[1],
Q1 = fivenum(Weight)[2],
median = fivenum(Weight)[3],
Q3 = fivenum(Weight)[4],
max = fivenum(Weight)[5],
IQR = Q3 - Q1)
min Q1 median Q3 max IQR
1 41 56.5 63 75 110 18.5
data %>%
summarise(avg_height = mean(Height),
std = sd(Height))
avg_height std
1 1.732685 0.1012133
data %>%
summarise(avg_height = mean(Height),
var_height = var(Height))
avg_height var_height
1 1.732685 0.01024412
In general the P-th percentile can be interpreted as a location in the data for which approximately P% of the other values in the distribution fall below the P-th percentile and (100 –P)% fall above the P-th percentile.
data %>%
summarise(
min = fivenum(Weight)[1],
Q1 = fivenum(Weight)[2],
median = fivenum(Weight)[3],
Q3 = fivenum(Weight)[4],
max = fivenum(Weight)[5])
min Q1 median Q3 max
1 41 56.5 63 75 110
Z = (x – mean)/standard deviation
data %>%
mutate(zscore = (BMI - mean(BMI) / sd(BMI))) %>%
head()
Height Weight Age Gender Smokes Alcohol Exercise Ran Pulse1 Pulse2 BMI
1 1.73 57 18 Female No Yes Moderate No 86 88 19.04507
2 1.79 58 19 Female No Yes Moderate Yes 82 150 18.10181
3 1.67 62 18 Female No Yes High Yes 96 176 22.23099
4 1.95 84 18 Male No Yes High No 71 73 22.09073
5 1.73 64 18 Female No Yes Low No 90 88 21.38394
6 1.84 74 22 Male No Yes Low Yes 78 141 21.85728
BMICat zscore
1 Underweight 12.37439
2 Underweight 11.43112
3 Normal 15.56030
4 Normal 15.42004
5 Normal 14.71326
6 Normal 15.18659
Measures of position also allow us to compare values from different distributions. For example, we can present the percentiles or z-scores of an individual’s height and weight. These two measures together would provide a better picture of how the individual fits in the overall population than either would alone.
Although measures of position are not stressed in this course as much as measures of center and spread, we have seen and will see many measures of position used in various aspects of examining the distribution of one variable and it is good to recognize them as measures of position when they appear.
So far we have quantified the idea of center, and we are in the middle of the discussion about measuring spread, but we haven’t really talked about a method or rule that will help us classify extreme observations as outliers. The IQR is commonly used as the basis for a rule of thumb for identifying outliers.
An observation is considered a suspected outlier or potential outlier if it is:
The following picture (not to scale) illustrates this rule:
An observation is considered an EXTREME outlier if it is:
So far, in our discussion about measures of spread, some key players were:
Recall that the combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.
data %>%
summarise(
min = fivenum(Age)[1],
Q1 = fivenum(Age)[2],
median = fivenum(Age)[3],
Q3 = fivenum(Age)[4],
max = fivenum(Age)[5])
min Q1 median Q3 max
1 18 19 20 21 45
4. outliers are marked with asterisks (*).
To summarize: the following information is visually depicted in the
boxplot:
the five number summary (blue) the range and IQR (red) outliers (green)
While it is fundamentally important to know how to describe the distribution of a single variable, most studies pose research questions that involve exploring the relationship between two (or more) variables. These research questions are investigated using a sample from the population of interest.
In most studies involving two variables, each of the variables has a role. We distinguish between:
As we mentioned earlier the variable we wish to predict is commonly called the dependent variable, the outcome variable, or the response variable. Any variable we are using to predict (or explain differences) in the outcome is commonly called an explanatory variable, an independent variable, a predictor variable, or a covariate.
Typically the explanatory variable is denoted by X, and the response variable by Y.
Is performance on the test related to gender?
Is there a gender effect on test scores?
Are there differences in test scores between males and females?
Gender is the explanatory variable
Test score is the response variable
If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the following 4 possibilities for “role-type classification”
Figure Caption
Is performance on the test related to gender?
Is there a gender effect on test scores?
Are there differences in test scores between males and females?
Gender is the explanatory variable
Test score is the response variable
Therefore this is an example of case C → Q.
data %>%
select(Gender, BMI) %>%
group_by(Gender) %>%
summarise(Avg_BMI = mean(BMI))
# A tibble: 2 × 2
Gender Avg_BMI
<fct> <dbl>
1 Female 20.8
2 Male 23.1
data %>%
group_by(Gender) %>%
summarise(n = n(),
min = fivenum(BMI)[1],
Q1 = fivenum(BMI)[2],
median = fivenum(BMI)[3],
Q3 = fivenum(BMI)[4],
max = fivenum(BMI)[5])
# A tibble: 2 × 7
Gender n min Q1 median Q3 max
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 50 16.6 19.0 20.6 22.2 29.0
2 Male 58 16.8 20.2 22.9 25.1 32.1
ggboxplot(data, x = "Gender", y = "BMI")
data %>%
select(Gender, BMICat) %>%
tbl_summary()
| Characteristic | N = 1081 |
|---|---|
| Gender | |
| Female | 50 (46%) |
| Male | 58 (54%) |
| BMICat | |
| Normal | 62 (57%) |
| Obese | 2 (1.9%) |
| Overweight | 17 (16%) |
| Underweight | 27 (25%) |
| 1 n (%) | |
# https://www.statology.org/dplyr-crosstab/
df3 <- data %>%
group_by(Gender, Ran) %>%
tally() %>%
spread(Ran, n)
df3
# A tibble: 2 × 3
# Groups: Gender [2]
Gender No Yes
<fct> <int> <int>
1 Female 28 22
2 Male 35 23
# https://www.statology.org/dplyr-crosstab/
df3 <- data %>%
group_by(Gender, Ran) %>%
tally() %>%
spread(Ran, n)
df3
# A tibble: 2 × 3
# Groups: Gender [2]
Gender No Yes
<fct> <int> <int>
1 Female 28 22
2 Male 35 23
Figure Caption
Figure Caption
A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other.
A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.
Not all relationships can be classified as either positive or negative.
Figure Caption
The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. Here are a couple that are quite common: Relationships with a linear form are most simply described as points scattered about a line:
Figure Caption
A scatterplot in which the points are slightly above or below a line which has been drawn through the points. Overall, the points create a shape that appears to be a fat line. In this example, the points create a negative relationship.Relationships with a non-linear (sometimes called curvilinear) form are most simply described as points dispersed around the same curved line:
Figure Caption
There are many other possible forms for the relationship between two quantitative variables, but linear and curvilinear forms are quite common and easy to identify. Another form-related pattern that we should be aware of is clusters in the data:
Figure Caption
The strength of the relationship is determined by how closely the data follow the form of the relationship. Let’s look, for example, at the following two scatterplots displaying positive, linear relationships:
The strength of the relationship is determined by how closely the data points follow the form. We can see that in the left scatterplot the data points follow the linear pattern quite closely. This is an example of a strong relationship. In the right scatterplot, the points also follow the linear pattern, but much less closely, and therefore we can say that the relationship is weaker. In general, though, assessing the strength of a relationship just by looking at the scatterplot is quite problematic, and we need a numerical measure to help us with that. We will discuss that later in this section.
Figure Caption
Data points that deviate from the pattern of the relationship are called outliers. We will see several examples of outliers during this section. Two outliers are illustrated in the scatterplot below:
Figure Caption