Exploratory analysis of the Iris dataset to assess the structure, distribution, and normality of its features.
1 | Introduction
The Iris dataset, introduced by the statistician and biologist Ronald Fisher in 1936, consists of 150 observations of three different species of the iris flower (Iris setosa, Iris virginica, and Iris versicolor). Each observation includes measurements of four features: sepal length, sepal width, petal length, and petal width.
Objectives of the Analysis
The main goal of this analysis is to perform a thorough exploration of the data to understand the structure and distribution of the measurements of the iris flowers. We will look for distinctive patterns among the species and check if the features follow a normal distribution, using both graphical techniques and statistical tests.
2 | Exploration and Summary of the Dataset
Before starting the exploration, it is necessary to load the Iris dataset along with the R libraries that will facilitate the analysis. We will use tidyverse for data manipulation and ggplot2 for visualization.
# Load necessary libraries
library(tidyverse)
library(psych)
library(knitr)
library(dplyr)
library(kableExtra)
library(ggplot2)
library(plotly)
library(webshot)
library(tidyr)
library(scales)
library(RColorBrewer)
Visualization of the First Rows
To get a first impression of the data, we will visualize the first rows of the Iris dataset. This will help us understand the basic structure of the data we are working with.
# Show the first rows of the Iris dataset
head(iris) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
Statistical Summary of the Variables
A statistical summary will provide an overview of the central tendency and dispersion of the measured features in the dataset. This is essential for identifying initial patterns and possible anomalies in the data.
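The output below corresponds to base R's summary function applied to the dataset; a minimal call that would produce it (the original code chunk is not shown):
# Quartiles and mean for the numeric features, and counts for Species
summary(iris)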
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
3 | Data Visualization
Data visualization is crucial for understanding the distribution and relationships between the different features in the Iris dataset. In this section, we will create histograms, box plots, and scatter plots to visualize the differences and similarities between the species.
Histograms for Each Feature by Species
Histograms are useful for visualizing the distribution of numerical data. We will generate a histogram for each of the four features in the Iris dataset, separated by species.
# Convert the data to a longer format for easier manipulation
iris_long <- iris %>%
pivot_longer(cols = -Species, names_to = "Feature", values_to = "Value")
# Add a count column for tooltips, grouping by feature, species, and value
iris_long <- iris_long %>%
group_by(Feature, Species, Value) %>%
mutate(Count = n()) %>%
ungroup()
# Create the histogram with ggplot
g <- ggplot(iris_long, aes(x = Value, fill = Species, text = paste("Species:", Species, "<br>Measure:", round(Value, 2), "<br>Frequency:", Count))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30, show.legend = FALSE) +
facet_wrap(~Feature, scales = "free") +
labs(title = "Histograms of Iris Features by Species") + # Removed x and y labels
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(
legend.title = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# Convert the ggplot object to an interactive plotly object
p <- ggplotly(g, tooltip = "text")
# Adjust title position upwards and add space between title and chart
p <- p %>% layout(
title = list(
text = "Histograms of Iris Features by Species",
font = list(size = 16),
xanchor = "center",
yanchor = "top",
x = 0.5,
y = 1,
pad = list(b = 50)
),
xaxis = list(title = "", titlefont = list(size = 14)), # Set x-axis label to empty
yaxis = list(title = "", titlefont = list(size = 14)), # Set y-axis label to empty
legend = list(title = list(text = "Species"), font = list(size = 12)),
hoverlabel = list(bgcolor = "white", font = list(size = 12))
)
# Render the plot
p
The histograms shown above provide a comparative view of the distributions of the morphological characteristics of iris flowers for the three species: setosa, versicolor, and virginica.
In conclusion, the differences in petal length and width between species are more pronounced and could be effectively used for classifying iris species. Meanwhile, the measurements of sepals show greater variability and overlap between species, indicating that they are less discriminative for classification.
Box Plots for Each Feature by Species
The boxplot, also known as the box-and-whisker plot, is a standardized method to graphically represent the distribution of a dataset based on a five-number summary: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum.
This type of plot is extremely useful in data exploration as it provides a clear view of centrality, dispersion, skewness, and the presence of outliers.
The central box of the boxplot represents the interquartile range (IQR), the distance between the first and third quartiles, and contains the middle half of the data. The line inside the box indicates the median, the central value of the distribution. The whiskers extend from Q1 and Q3 to the most extreme data points that lie within 1.5 times the IQR; points beyond the whiskers are considered outliers and are drawn individually.
Boxplots are especially convenient in exploratory analysis because they summarize centrality, dispersion, skewness, and the presence of outliers at a glance, and because they make it straightforward to compare these properties across groups such as the three iris species.
For these reasons, boxplots are an indispensable tool in descriptive statistics and offer valuable insight in the early stages of analyzing any dataset.
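As a numeric counterpart to the plot, the five-number summary that a boxplot encodes can be computed directly; a small sketch added here for illustration (not part of the original analysis):
# Five-number summary (minimum, Q1, median, Q3, maximum) of petal length for each species
tapply(iris$Petal.Length, iris$Species, fivenum)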
Below is an interactive boxplot showing the distribution of the morphological characteristics of the three species of Iris present in the dataset:
# Convert the data to a longer format for easier manipulation
iris_long <- iris %>%
pivot_longer(cols = -Species, names_to = "Feature", values_to = "Value")
# Define the colors using the Set1 palette to maintain consistency with the histogram
colors <- brewer.pal(n = 3, name = "Set1")
names(colors) <- levels(iris$Species)
# Create the interactive boxplot with plotly
p <- iris_long %>%
plot_ly(x = ~Feature, y = ~Value, color = ~Species, colors = colors) %>%
add_boxplot() %>%
layout(yaxis = list(title = 'Value'),
xaxis = list(title = ''),
boxmode = 'group',
title = 'Boxplots of Iris Features by Species',
showlegend = TRUE,
legend = list(title = list(text = ""), orientation = 'h', x = 0.3, y = -0.1))
# Render the plot
p
The boxplot chart illustrates the differences in the distribution of iris flower measurements among the three species: setosa, versicolor, and virginica.
From these boxplots, we can infer that petal length and width provide a good basis for distinguishing between species, especially for differentiating Iris setosa from the other two. As for sepal length and width, while some differences are observed, they do not appear to be as distinctive as the petal measurements. The highlighted outliers in the boxplots of virginica and versicolor may be indicative of natural variations or measurement errors, which would require more detailed investigation.
In summary, the boxplot is an effective tool that highlights the key differences in the morphological characteristics of iris species and emphasizes the importance of these measurements in classifying and understanding the variability among iris flower species.
Scatter Plots
The scatter plot, also known as a scatter diagram, is a fundamental tool in exploratory data analysis that allows visualization and determination of whether correlations or patterns exist between two numerical variables. It represents each observation as a point on the Cartesian plane, with one variable assigned to the X-axis and another to the Y-axis.
The use of scatter plots is especially convenient for several reasons:
1. Identification of Correlations: It allows for quick and easy observation of whether a linear or non-linear relationship exists between the two variables.
2. Anomaly Detection: It helps identify outliers that do not follow the general trend of the data.
3. Group Comparison: It facilitates comparison between different groups or data categories by using distinct colors or markers.
4. Predictive Modeling: It serves as an important preliminary step before performing regressions or any other type of statistical modeling.
In the context of the Iris dataset, a scatter plot can reveal how sepal and petal measurements relate to each other and whether these relationships differ between the various Iris species. This visual analysis can be crucial in designing classification strategies or gaining a better understanding of the biology of these species.
Below is an interactive scatter plot that shows the relationship between sepal length and sepal width for the three Iris species included in the dataset:
# Define the colors using the Set1 palette to maintain consistency with previous charts
colors <- brewer.pal(n = 3, name = "Set1")
names(colors) <- levels(iris$Species)
# Create the interactive scatter plot with plotly
p <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
type = 'scatter', mode = 'markers',
color = ~Species, colors = colors,
marker = list(size = 12, opacity = 0.7),
hoverinfo = 'text',
text = ~paste('Sepal Length:', Sepal.Length, '<br>Sepal Width:', Sepal.Width)) %>%
layout(title = 'Scatter Plot of Sepal Width vs. Sepal Length by Species',
xaxis = list(title = 'Sepal Length', titlefont = list(size = 14)),
yaxis = list(title = 'Sepal Width', titlefont = list(size = 14)),
legend = list(
title = list(text = "<b>Species</b>"),
orientation = 'v',
x = 1.05,
y = 1,
font = list(family = "Arial, sans-serif", size = 12, color = "black")
),
margin = list(r = 140)
)
# Render the plot
p
This scatter plot shows the relationship between the sepal length and width of iris flowers for the three species: setosa, versicolor, and virginica.
This visual analysis suggests that sepal measurements, especially when combined with petal measurements, could be useful features for the automated classification of iris species. Additionally, the presence of outliers may require further investigation to better understand their nature or to refine measurement and classification methods.
4 | Shapiro-Wilk Normality Test
The Shapiro-Wilk Test is a statistical test used to evaluate whether a sample of data fits a normal distribution. It is widely used to check the normality of data, which is a common assumption in many parametric statistical tests.
What is the p-value?
The p-value of a statistical test is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. In the context of the Shapiro-Wilk test, the null hypothesis is that the data follow a normal distribution: a p-value below the chosen significance level (0.05 in this analysis) leads to rejecting normality, while a larger p-value means normality cannot be rejected.
Importance of the Test
The importance of performing the Shapiro-Wilk test before proceeding with more complex statistical analyses lies in its ability to validate one of the key assumptions of many statistical methods. Checking for normality can help decide whether parametric methods (which assume normality) or non-parametric methods should be applied, directly impacting the validity of the statistical conclusions obtained.
# Perform the Shapiro-Wilk test for each feature by species
results <- list()
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
species <- unique(iris$Species)
for (feat in features) {
results[[feat]] <- list()
for (spec in species) {
subset_data <- iris[iris$Species == spec, feat]
test_result <- shapiro.test(subset_data)
results[[feat]][[spec]] <- test_result
}
}
# Prepare the results for visualization
results_df <- data.frame(
Feature = rep(features, each = length(species)),
Species = rep(species, times = length(features)),
W_statistic = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$statistic))),
P_value = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$p.value))),
Normality = rep("", length(features) * length(species))
)
# Determine if normality can be assumed
results_df$Normality <- ifelse(results_df$P_value > 0.05, "Yes", "No")
# Style the table for better presentation
results_df %>%
kable("html", caption = "Results of the Shapiro-Wilk Test for Iris Dataset Features by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
column_spec(1, bold = T, color = "black") %>%
column_spec(2, width = "150px") %>%
column_spec(3, width = "100px") %>%
column_spec(4, width = "100px") %>%
column_spec(5, width = "100px") %>%
scroll_box(width = "100%", height = "500px")
| | Feature | Species | W_statistic | P_value | Normality |
|---|---|---|---|---|---|
| Sepal.Length.setosa.W | Sepal.Length | setosa | 0.9776985 | 0.4595132 | Yes |
| Sepal.Length.versicolor.W | Sepal.Length | versicolor | 0.9778357 | 0.4647370 | Yes |
| Sepal.Length.virginica.W | Sepal.Length | virginica | 0.9711794 | 0.2583147 | Yes |
| Sepal.Width.setosa.W | Sepal.Width | setosa | 0.9717195 | 0.2715264 | Yes |
| Sepal.Width.versicolor.W | Sepal.Width | versicolor | 0.9741333 | 0.3379951 | Yes |
| Sepal.Width.virginica.W | Sepal.Width | virginica | 0.9673905 | 0.1808960 | Yes |
| Petal.Length.setosa.W | Petal.Length | setosa | 0.9549768 | 0.0548115 | Yes |
| Petal.Length.versicolor.W | Petal.Length | versicolor | 0.9660044 | 0.1584778 | Yes |
| Petal.Length.virginica.W | Petal.Length | virginica | 0.9621864 | 0.1097754 | Yes |
| Petal.Width.setosa.W | Petal.Width | setosa | 0.7997645 | 0.0000009 | No |
| Petal.Width.versicolor.W | Petal.Width | versicolor | 0.9476263 | 0.0272778 | No |
| Petal.Width.virginica.W | Petal.Width | virginica | 0.9597715 | 0.0869542 | Yes |
5 | Density Function Plot
This plot shows the density function for the characteristics of three iris species: setosa, versicolor, and virginica, illustrating how petal and sepal lengths and widths are distributed in each species.
Importance of the Density Function
The density function offers a continuous and detailed view of data distribution, making it easier to compare species and identify distinctive patterns. It is essential for understanding the variations within each species and between different species, providing key insights for classification studies and biological analysis.
# Define the colors using the Set1 palette
colors <- brewer.pal(n = 3, name = "Set1")
names(colors) <- levels(iris$Species)
# Create a long format for easier plotting with ggplot
iris_long <- gather(iris, key = "feature", value = "value", -Species)
# Enhanced plot with consistent colors and text adjustments
p <- ggplot(iris_long, aes(x = value, fill = Species)) +
geom_density(alpha = 0.7) +
facet_grid(feature ~ Species, scales = "free") +
scale_fill_manual(values = colors) +
theme_minimal() +
theme(
strip.background = element_rect(fill = "lightblue", colour = "deepskyblue", size = 1),
strip.text.x = element_text(size = 10, color = "navy", face = "plain"),
strip.text.y = element_text(size = 10, color = "navy", face = "plain"),
plot.title = element_text(size = 12, face = "plain", hjust = 0.5, margin = margin(t = 10, b = 10)),
axis.text = element_text(size = 8),
axis.title.y = element_text(size = 10, face = "plain", vjust = 2, angle = 90),
legend.position = "none",
plot.margin = margin(10, 10, 10, 10),
panel.border = element_blank()
) +
labs(title = "Distribution of Features by Species",
x = NULL,
y = "Density") +
scale_y_continuous(labels = label_number(scale = 1, accuracy = 0.01)) # Format y-axis numbers with two decimals
# Convert to an interactive plotly object
ggplotly(p)
6 | Standard Z-score Normalization
Z-score normalization (or standardization) involves rescaling the data to have a mean of 0 and a standard deviation of 1. This is done using the formula:
\[ Z = \frac{(X - \mu)}{\sigma} \]
Where \(X\) is the value to be normalized, \(\mu\) is the mean of the feature, and \(\sigma\) is the standard deviation of the feature.
This method is useful for comparing scores between different datasets or features with different units or scales.
Now we will normalize the values of the features that passed the normality test.
# Filter features that passed the normality test
normal_vars <- results_df %>%
filter(Normality == "Yes") %>%
select(Feature, Species)
# Function to normalize the data
normalize <- function(x) {
return((x - mean(x)) / sd(x))
}
# Create a new data frame with normalized data
normalized_data <- iris
for (i in 1:nrow(normal_vars)) {
var_name <- normal_vars$Feature[i]
spec_name <- normal_vars$Species[i]
mask <- iris$Species == spec_name
normalized_data[mask, var_name] <- normalize(normalized_data[mask, var_name])
}
# Display the normalized data
normalized_data %>%
select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species) %>%
head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.26667447 0.1899414 -0.3570112 0.2 setosa
2 -0.30071802 -1.1290958 -0.3570112 0.2 setosa
3 -0.86811050 -0.6014810 -0.9328358 0.2 setosa
4 -1.15180675 -0.8652884 0.2188133 0.2 setosa
5 -0.01702177 0.4537488 -0.3570112 0.2 setosa
6 1.11776320 1.2451711 1.3704625 0.4 setosa
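As a quick sanity check (an addition, not in the original analysis), a feature that was normalized for a given species should now have a mean of 0 and a standard deviation of 1 within that species:
# Verify the z-score normalization for Sepal.Length in the setosa species
setosa_sl <- normalized_data[normalized_data$Species == "setosa", "Sepal.Length"]
c(mean = mean(setosa_sl), sd = sd(setosa_sl))  # mean ~0, sd = 1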
7 | Calculate the probability that the sepal length is greater than 1 for the ‘setosa’ species
Since we have normalized the data in the Iris dataset, assuming a normal distribution (mean = 0, standard deviation = 1) for each feature that passed the Shapiro-Wilk test, we can explore specific probabilities for these normalized features.
Probability Calculation
We want to calculate the probability that an iris of the setosa species has a sepal length (Sepal.Length) greater than 1 on the normalized (z-score) scale. Mathematically, this probability is expressed as:
\[ P(Z > 1) \]
where \(Z\) is the normalized sepal length. This value represents the proportion of flowers of the ‘setosa’ species whose sepal length is more than one standard deviation above the mean of this species.
# Calculate the probability that the sepal length is greater than 1 for the 'setosa' species
subset_data <- normalized_data[normalized_data$Species == "setosa", "Sepal.Length"]
prob_greater_than_1 <- mean(subset_data > 1)
prob_greater_than_1
[1] 0.2
A result of 0.2 means that 20% of the observations of the setosa species have a sepal length more than one standard deviation above the species mean on the z-score scale.
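For comparison, the theoretical probability under a perfect standard normal distribution can be obtained with pnorm; this check is an addition to the original analysis:
# Theoretical probability P(Z > 1) for a standard normal distribution
1 - pnorm(1)  # approximately 0.159, close to the empirical proportion of 0.2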
1 | Covariance Matrix
The covariance matrix is an essential statistical tool for analyzing how variables in a dataset vary together. Each element \(\sigma_{ij}\) of this \(n \times n\) matrix represents the covariance between a pair of variables, calculated as:
\[ \sigma_{ij} = \frac{1}{n-1} \sum_{k=1}^n (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j) \]
where \(X_{ik}\) and \(X_{jk}\) are the values of variables \(i\) and \(j\) for observation \(k\), and \(\bar{X}_i\) and \(\bar{X}_j\) are the means of variables \(i\) and \(j\), respectively.
The sign of the covariance provides valuable information about the type of linear relationship between the variables: a positive covariance indicates that the two variables tend to increase or decrease together, a negative covariance indicates that one tends to increase when the other decreases, and a value close to zero indicates the absence of a clear linear relationship.
Covariance Matrix Calculation
# Load necessary libraries
library(datasets)
library(corrplot)
library(tidyverse)
library(psych)
library(knitr)
library(dplyr)
library(tidyr)
library(kableExtra)
library(ggplot2)
library(GGally)
library(reshape2)
library(paletteer)
library(scales)
# Load the Iris dataset from the datasets library
data(iris)
data <- iris
# Use attach to access columns directly
attach(data)
# Calculate the covariance matrix for the numeric variables in the dataset
# The numeric columns are: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
covariance_matrix <- cov(data[,1:4])
# Print the covariance matrix
print(covariance_matrix)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
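As a quick check (added here for illustration), any entry of the matrix, for example the covariance between Sepal.Length and Petal.Length, can be reproduced directly from the formula given above:
# Covariance between sepal length and petal length computed from the definition
with(iris, sum((Sepal.Length - mean(Sepal.Length)) * (Petal.Length - mean(Petal.Length))) / (nrow(iris) - 1))
# The result, 1.2743154, matches the corresponding entry of the matrix printed above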
# Convert the covariance matrix to long format for plotting
covariance_matrix_long <- melt(covariance_matrix)
colnames(covariance_matrix_long) <- c("Variable1", "Variable2", "Covariance")
# Create the enhanced plot
ggplot(covariance_matrix_long, aes(x = Variable1, y = Variable2, fill = Covariance)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", Covariance)), size = 3.5, color = "black") +
scale_fill_gradient2(low = "skyblue", mid = "gray90", high = "salmon", midpoint = 0,
limits = c(min(covariance_matrix_long$Covariance, na.rm = TRUE),
max(covariance_matrix_long$Covariance, na.rm = TRUE)),
name = "Covariance") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 45, vjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)) +
labs(title = "Covariance Matrix of the Iris Dataset")
The covariance matrix is a representation of how the variables in a dataset vary together. Each off-diagonal element of this matrix is the covariance between a pair of variables:
\[ \text{Cov}(X,Y) = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{n-1} \]
where \(X_i\) and \(Y_i\) are the values of variables \(X\) and \(Y\), respectively, and \(\overline{X}\) and \(\overline{Y}\) are the means of these variables.
The diagonal entries are a special case: the covariance of a variable with itself is simply its variance,
\[ \text{Cov}(X,X) = \text{Var}(X) \]
The formula to calculate the variance of \(X\) is:
\[ \text{Var}(X) = \frac{\sum_{i=1}^n (X_i - \overline{X})^2}{n-1} \]
Note that \((X_i - \overline{X})^2\) is equivalent to \((X_i - \overline{X})(X_i - \overline{X})\), showing that variance is a specific case of covariance where the relationship of the variable with itself is considered.
In summary, the diagonal of the covariance matrix contains the variances of each variable, indicating the spread of each variable around its mean, and it shows how \(\text{Cov}(X,X)\) simplifies mathematically to \(\text{Var}(X)\).
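A one-line check (added for illustration) confirms that the variance of a variable equals its covariance with itself and matches the corresponding diagonal entry of the matrix above:
# Var(Sepal.Length) equals Cov(Sepal.Length, Sepal.Length), i.e., the first diagonal entry
var(iris$Sepal.Length)                     # 0.6856935
cov(iris$Sepal.Length, iris$Sepal.Length)  # 0.6856935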
# Standardize the numeric columns of the dataset to have mean 0 and standard deviation 1
data_norm <- scale(data[, 1:4])
# Compute the covariance matrix for the standardized variables
# Since the variables are standardized, this matrix is equivalent to the correlation matrix
cov(data_norm)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Thus, for standardized variables, the covariance numerically coincides with the correlation. This result not only simplifies the analysis but also ensures direct comparability between different pairs of variables, regardless of their original scales.
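This equivalence can be verified directly; a brief check added here for illustration:
# The covariance matrix of the standardized data equals the correlation matrix of the original data
all.equal(cov(scale(iris[, 1:4])), cor(iris[, 1:4]))  # TRUE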
2 | Correlation Matrix
Correlation is a statistical measure that reflects the strength and direction of a linear relationship between two numerical variables. It is expressed on a scale ranging from -1 to 1: a value of -1 indicates a perfect negative linear relationship, a value of 0 indicates no linear relationship, and a value of 1 indicates a perfect positive linear relationship.
Differences between covariance and correlation: covariance depends on the units and scale of the variables, so its magnitude is difficult to interpret on its own, whereas correlation is dimensionless and bounded between -1 and 1, which makes it directly comparable across pairs of variables (see the short demonstration below).
This understanding facilitates the interpretation and application of statistical analyses, especially in fields where the relationship between variables is crucial, such as finance, data science, and social research.
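A small demonstration of this difference (added for illustration): rescaling a variable, for example converting petal length from centimeters to millimeters, multiplies the covariance by the same factor but leaves the correlation unchanged.
# Covariance depends on the measurement scale; correlation does not
cov(iris$Sepal.Length, iris$Petal.Length)       # 1.2743154
cov(iris$Sepal.Length, iris$Petal.Length * 10)  # 12.743154, ten times larger
cor(iris$Sepal.Length, iris$Petal.Length)       # 0.8717538
cor(iris$Sepal.Length, iris$Petal.Length * 10)  # 0.8717538, unchanged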
Graph of the Correlation Matrix
# Load the dataset
data(iris)
data_numeric <- iris[, 1:4]
# Calculate the correlation matrix
cor_matrix <- cor(data_numeric)
# Create a full correlation plot
corrplot(cor_matrix, method = "color",
type = "full",
diag = TRUE,
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
order = "original",
number.cex = 0.8,
col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))(200)) # Color palette
To avoid duplicating information and simplify visualization, it is common to show only the upper or lower triangle of the matrix. The following image displays only the unique correlation values by including only the upper triangle:
# Load the dataset
data(iris)
data_numeric <- iris[, 1:4]
# Calculate the correlation matrix
cor_matrix <- cor(data_numeric)
# Create a correlation plot showing only the upper triangle
corrplot(cor_matrix, method = "color",
type = "upper",
diag = TRUE,
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
order = "original",
number.cex = 0.8,
col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))(200))
This practice not only helps to simplify the information being visualized but also avoids confusion and facilitates the quick interpretation of relationships between variables, which is particularly useful in presentations or reports where clarity is essential.
Pearson and Spearman Correlation
In statistics, correlation analysis tools commonly use Pearson correlation by default. Pearson correlation measures the linear relationship between two quantitative variables. It is calculated as the ratio of the covariance of the variables to the product of their standard deviations.
When to Use Each
Pearson correlation is appropriate when the variables are approximately normally distributed and the relationship of interest is linear. Spearman correlation, which is computed on the ranks of the observations, captures any monotonic relationship and is preferable when the data are ordinal, contain outliers, or fail the normality assumption, as some of the iris features do. Both techniques provide valuable insights in different contexts, so the analyst can choose the most appropriate measure depending on the nature of the data and the specific goals of the analysis.
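As a brief illustration (an addition to the original analysis), Spearman correlation is computed on ranks, so it is unchanged by any strictly increasing transformation of a variable, whereas Pearson correlation is not:
# Pearson changes under a monotonic non-linear transformation; Spearman does not
cor(iris$Petal.Length, iris$Petal.Width, method = "pearson")        # 0.963
cor(iris$Petal.Length, iris$Petal.Width, method = "spearman")       # 0.938
cor(exp(iris$Petal.Length), iris$Petal.Width, method = "pearson")   # noticeably lower: the relationship is no longer linear
cor(exp(iris$Petal.Length), iris$Petal.Width, method = "spearman")  # 0.938, identical, because the ranks are preserved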
Spearman Correlation
For this part, we will review which features and species do not meet the assumption of normality.
# Perform the Shapiro-Wilk test for each feature by species
results <- list()
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
species <- unique(iris$Species)
for (feat in features) {
results[[feat]] <- list()
for (spec in species) {
subset_data <- iris[iris$Species == spec, feat]
test_result <- shapiro.test(subset_data)
results[[feat]][[spec]] <- test_result
}
}
# Prepare the results for visualization
results_df <- data.frame(
Feature = rep(features, each = length(species)),
Species = rep(species, times = length(features)),
Statistic_W = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$statistic))),
P_value = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$p.value))),
Normality = rep("", length(features) * length(species))
)
# Determine if normality can be assumed
results_df$Normality <- ifelse(results_df$P_value > 0.05, "Yes", "No")
# Style the table for better presentation
results_df %>%
kable("html", caption = "Shapiro-Wilk Test Results for the Iris Dataset Features by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
column_spec(1, bold = T, color = "black") %>%
column_spec(2, width = "150px") %>%
column_spec(3, width = "100px") %>%
column_spec(4, width = "100px") %>%
column_spec(5, width = "100px") %>%
scroll_box(width = "100%", height = "500px")
| | Feature | Species | Statistic_W | P_value | Normality |
|---|---|---|---|---|---|
| Sepal.Length.setosa.W | Sepal.Length | setosa | 0.9776985 | 0.4595132 | Yes |
| Sepal.Length.versicolor.W | Sepal.Length | versicolor | 0.9778357 | 0.4647370 | Yes |
| Sepal.Length.virginica.W | Sepal.Length | virginica | 0.9711794 | 0.2583147 | Yes |
| Sepal.Width.setosa.W | Sepal.Width | setosa | 0.9717195 | 0.2715264 | Yes |
| Sepal.Width.versicolor.W | Sepal.Width | versicolor | 0.9741333 | 0.3379951 | Yes |
| Sepal.Width.virginica.W | Sepal.Width | virginica | 0.9673905 | 0.1808960 | Yes |
| Petal.Length.setosa.W | Petal.Length | setosa | 0.9549768 | 0.0548115 | Yes |
| Petal.Length.versicolor.W | Petal.Length | versicolor | 0.9660044 | 0.1584778 | Yes |
| Petal.Length.virginica.W | Petal.Length | virginica | 0.9621864 | 0.1097754 | Yes |
| Petal.Width.setosa.W | Petal.Width | setosa | 0.7997645 | 0.0000009 | No |
| Petal.Width.versicolor.W | Petal.Width | versicolor | 0.9476263 | 0.0272778 | No |
| Petal.Width.virginica.W | Petal.Width | virginica | 0.9597715 | 0.0869542 | Yes |
Now let’s compare the two measures.
# Calculate the correlation matrices
pearson_corr <- cor(iris[,1:4], method = "pearson")
spearman_corr <- cor(iris[,1:4], method = "spearman")
# Convert matrices to data frames and rename columns
pearson_df <- as.data.frame(as.table(pearson_corr)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Pearson = Freq)
spearman_df <- as.data.frame(as.table(spearman_corr)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Spearman = Freq)
# Merge the Pearson and Spearman data frames and filter the results
comparison_df <- pearson_df %>%
inner_join(spearman_df, by = c("Variable1", "Variable2")) %>%
mutate(combined_vars = case_when(as.integer(Variable1) < as.integer(Variable2) ~ paste(Variable1, Variable2, sep = ""),
TRUE ~ paste(Variable2, Variable1, sep = ""))) %>%
distinct(combined_vars, .keep_all = TRUE) %>%
select(-combined_vars) %>%
filter(Variable1 != Variable2) %>%
mutate(Difference = abs(Pearson - Spearman))
# Create a diverging color palette
diverging_palette <- colorRampPalette(paletteer_c("ggthemes::Temperature Diverging", 30))(30)
# Render the table with conditional formatting
comparison_df %>%
kable("html", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
column_spec(3, color = "blue") %>%
column_spec(4, color = "red") %>%
column_spec(5, background = scales::col_numeric(palette = diverging_palette, domain = NULL)(comparison_df$Difference),
bold = TRUE, color = "black")
| Variable1 | Variable2 | Pearson | Spearman | Difference |
|---|---|---|---|---|
| Sepal.Width | Sepal.Length | -0.1175698 | -0.1667777 | 0.0492079 |
| Petal.Length | Sepal.Length | 0.8717538 | 0.8818981 | 0.0101444 |
| Petal.Width | Sepal.Length | 0.8179411 | 0.8342888 | 0.0163476 |
| Petal.Length | Sepal.Width | -0.4284401 | -0.3096351 | 0.1188050 |
| Petal.Width | Sepal.Width | -0.3661259 | -0.2890317 | 0.0770942 |
| Petal.Width | Petal.Length | 0.9628654 | 0.9376668 | 0.0251986 |
Finally, we will look at each of these measures together with its p-value.
# Perform Pearson and Spearman correlation tests
pearson_test <- corr.test(iris[, 1:4], method = "pearson")
spearman_test <- corr.test(iris[, 1:4], method = "spearman")
# Convert correlation matrices and p-values to data frames
pearson_df <- as.data.frame(as.table(pearson_test$r)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Pearson = Freq) %>%
mutate(Pearson_p_value = as.vector(pearson_test$p))
spearman_df <- as.data.frame(as.table(spearman_test$r)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Spearman = Freq) %>%
mutate(Spearman_p_value = as.vector(spearman_test$p))
# Merge Pearson and Spearman data frames and filter results
comparison_df <- pearson_df %>%
inner_join(spearman_df, by = c("Variable1", "Variable2")) %>%
mutate(combined_vars = case_when(as.integer(Variable1) < as.integer(Variable2) ~ paste(Variable1, Variable2, sep = ""),
TRUE ~ paste(Variable2, Variable1, sep = ""))) %>%
distinct(combined_vars, .keep_all = TRUE) %>%
select(-combined_vars) %>%
filter(Variable1 != Variable2)
# Create a high-contrast color palette for the p-values
high_contrast_palette <- colorRampPalette(c("#ffb3c1", "white", "#e9edc9"))(30)
# Render the table with conditional formatting for the p-values
comparison_df %>%
kable("html", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
column_spec(3, color = "black") %>%
column_spec(4, background = scales::col_numeric(palette = high_contrast_palette, domain = NULL)
(comparison_df$Pearson_p_value),
bold = TRUE, color = "black") %>%
column_spec(5, color = "black") %>%
column_spec(6, background = scales::col_numeric(palette = high_contrast_palette, domain = NULL)
(comparison_df$Spearman_p_value),
bold = TRUE, color = "black")
| Variable1 | Variable2 | Pearson | Pearson_p_value | Spearman | Spearman_p_value |
|---|---|---|---|---|---|
| Sepal.Width | Sepal.Length | -0.1175698 | 0.1518983 | -0.1667777 | 0.0413680 |
| Petal.Length | Sepal.Length | 0.8717538 | 0.0000000 | 0.8818981 | 0.0000000 |
| Petal.Width | Sepal.Length | 0.8179411 | 0.0000000 | 0.8342888 | 0.0000000 |
| Petal.Length | Sepal.Width | -0.4284401 | 0.0000000 | -0.3096351 | 0.0001154 |
| Petal.Width | Sepal.Width | -0.3661259 | 0.0000041 | -0.2890317 | 0.0003343 |
| Petal.Width | Petal.Length | 0.9628654 | 0.0000000 | 0.9376668 | 0.0000000 |
3 | Random Sampling with Replacement
Random sampling with replacement is a sampling technique where each element of the dataset has the same probability of being selected in each draw. After an element is selected for the sample, it is “replaced” back into the dataset, meaning it can be selected again in subsequent draws. This allows the same element to appear multiple times in the final sample.
This type of sampling is useful when you want to maintain independence between samples and when the desired sample size is equal to or larger than the original dataset size.
Below is an implementation of random sampling with replacement in R using the Iris dataset, a classic dataset in R containing information on 150 iris flowers, classified into three species, with measurements of petal and sepal lengths and widths.
# Load the Iris dataset
data(iris)
# Set seed for reproducibility
set.seed(123)
# Perform random sampling with replacement
# Here, we are taking a sample of 15 elements from the original dataset
sampled_iris <- iris[sample(1:nrow(iris), 15, replace = TRUE), ]
# Display the first rows of the sample with better formatting
kable(head(sampled_iris), format = "html", table.attr = "style='width:100%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 118 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 14.1 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 118.1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
4 | Random Sampling without Replacement
Random sampling without replacement is a sampling technique where each element in the population can only be selected once for the sample. Once an element is selected, it is not placed back into the population for future selections. This ensures that there are no duplicates in the sample and that each element has the same probability of being selected until it is chosen.
# Load the Iris dataset
data(iris)
# Set seed for reproducibility
set.seed(123)
# Perform random sampling without replacement
# Here, we are taking a sample of 15 elements from the original dataset
sampled_iris <- iris[sample(1:nrow(iris), 15, replace = FALSE), ]
# Display the first rows of the sample with better formatting
kable(head(sampled_iris), format = "html", table.attr = "style='width:100%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 118 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
5 | Systematic Random Sampling
Systematic random sampling is a type of probability sampling where the first element is selected randomly, and subsequent elements are chosen following a fixed interval. This method ensures that the sample is representative of the population and reduces the risk of selection bias.
To apply systematic random sampling in R with the iris dataset, we first determine the population size and desired sample size, then calculate the sampling interval and select the individuals.
# Load the Iris dataset
data(iris)
# Population size
n <- nrow(iris)
# Desired sample size
k <- 15
# Calculate the sampling interval
intervalo <- floor(n / k)
# Select a random starting point between 1 and the interval
inicio <- sample(1:intervalo, 1)
# Create the systematic sample
muestra_sistematica <- iris[seq(inicio, n, by=intervalo), ]
# Display the first rows of the sample with better formatting
kable(head(muestra_sistematica), format = "html", table.attr = "style='width:100%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
| 40 | 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 60 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
6 | Generate 10,000 Samples of Size 40 with Replacement
We will use the replicate function to repeat a sampling process 10,000 times. Within this function, we will use sample_n from the dplyr package to select 40 rows from the dataset with replacement during each repetition.
library(dplyr)
# Generate 10,000 samples of size 40 with replacement
muestras <- replicate(10000, sample_n(iris, 40, replace = TRUE), simplify = FALSE)
# View the tenth sample
muestras[[10]]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1 7.2 3.0 5.8 1.6 virginica
2 6.4 2.7 5.3 1.9 virginica
3 7.7 2.6 6.9 2.3 virginica
4 5.8 2.8 5.1 2.4 virginica
5 4.9 3.1 1.5 0.1 setosa
6 5.8 2.8 5.1 2.4 virginica
7 6.6 2.9 4.6 1.3 versicolor
8 5.3 3.7 1.5 0.2 setosa
9 6.1 2.6 5.6 1.4 virginica
10 5.0 2.0 3.5 1.0 versicolor
11 7.3 2.9 6.3 1.8 virginica
12 5.0 3.2 1.2 0.2 setosa
13 6.5 3.0 5.5 1.8 virginica
14 4.6 3.6 1.0 0.2 setosa
15 5.8 2.8 5.1 2.4 virginica
16 6.3 2.3 4.4 1.3 versicolor
17 7.9 3.8 6.4 2.0 virginica
18 4.6 3.2 1.4 0.2 setosa
19 7.4 2.8 6.1 1.9 virginica
20 5.2 4.1 1.5 0.1 setosa
21 6.7 3.1 4.4 1.4 versicolor
22 6.1 2.9 4.7 1.4 versicolor
23 5.1 3.8 1.5 0.3 setosa
24 6.7 3.0 5.0 1.7 versicolor
25 5.6 2.7 4.2 1.3 versicolor
26 5.9 3.0 5.1 1.8 virginica
27 7.7 2.8 6.7 2.0 virginica
28 6.0 3.4 4.5 1.6 versicolor
29 7.3 2.9 6.3 1.8 virginica
30 4.8 3.0 1.4 0.3 setosa
31 5.1 3.3 1.7 0.5 setosa
32 5.2 2.7 3.9 1.4 versicolor
33 5.4 3.9 1.3 0.4 setosa
34 4.6 3.6 1.0 0.2 setosa
35 5.1 3.8 1.9 0.4 setosa
36 7.7 3.0 6.1 2.3 virginica
37 6.0 2.9 4.5 1.5 versicolor
38 5.6 3.0 4.5 1.5 versicolor
39 4.4 3.0 1.3 0.2 setosa
40 5.2 3.5 1.5 0.2 setosa
7 | Estimate the Proportion of Setosa Species Flowers
Using the samples calculated earlier:
# Calculate the proportion of setosa in each sample
proporciones_setosa <- sapply(muestras, function(muestra) sum(muestra$Species == "setosa") / nrow(muestra))
# Calculate the mean and standard deviation of the proportions
media_proporciones <- mean(proporciones_setosa)
desviacion_estandar_proporciones <- sd(proporciones_setosa)
# Create a histogram of the sample proportions and add a line for the theoretical normal distribution
hist(proporciones_setosa,
freq = FALSE,
main = "Histogram of Sample Proportions from 10,000 Samples\n of Size 40",
xlab = "Sample Proportions",
ylab = "Density",
ylim = c(0,6.5))
# Add a density line for the sample proportions
lines(density(proporciones_setosa),
lty = 2,
lwd = 2,
col = "red")
# Generate values for the x-axis and add a theoretical normal distribution line
x = seq(from = 0,
to = 1,
by = 0.01)
lines(x,
dnorm(x,
1/3,
sqrt((1/3)*(2/3)/40)),
lty = 3,
lwd = 2,
col = "blue")
# Add a legend to the plot
legend("topright",
legend = c("Density","Normal"),
lwd = c(2,2),
lty = c(2,3),
col = c("red","blue"))