Exploratory analysis of the Iris dataset to assess the structure, distribution, and normality of its features.
1 | Introduction
The Iris dataset, introduced by the statistician and biologist Ronald Fisher in 1936, consists of 150 observations of three different species of the iris flower (Iris setosa, Iris virginica, and Iris versicolor). Each observation includes measurements of four features: sepal length, sepal width, petal length, and petal width.
Objectives of the Analysis
The main goal of this analysis is to perform a thorough exploration of the data to understand the structure and distribution of the measurements of the iris flowers. We will look for distinctive patterns among the species and check if the features follow a normal distribution, using both graphical techniques and statistical tests.
2 | Exploration and Summary of the Dataset
Before starting the exploration, it is necessary to load the Iris dataset along with the R libraries that will facilitate the analysis. We will use tidyverse for data manipulation and ggplot2 for visualization.
# Load necessary libraries
library(tidyverse)
library(psych)
library(knitr)
library(dplyr)
library(kableExtra)
library(ggplot2)
library(plotly)
library(webshot)
library(tidyr)
library(scales)
library(RColorBrewer)
Visualization of the First Rows
To get a first impression of the data, we will visualize the first rows of the Iris dataset. This will help us understand the basic structure of the data we are working with.
# Show the first rows of the Iris dataset
head(iris) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
Statistical Summary of the Variables
A statistical summary will provide an overview of the central tendency and dispersion of the measured features in the dataset. This is essential for identifying initial patterns and possible anomalies in the data.
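The output below corresponds to base R's summary function applied to the dataset; a minimal call that would produce it (the original code chunk is not shown):
# Quartiles and mean for the numeric features, and counts for Species
summary(iris)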
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
3 | Data Visualization
Data visualization is crucial for understanding the distribution and relationships between the different features in the Iris dataset. In this section, we will create histograms, box plots, and scatter plots to visualize the differences and similarities between the species.
Histograms for Each Feature by Species
Histograms are useful for visualizing the distribution of numerical data. We will generate a histogram for each of the four features in the Iris dataset, separated by species.
# Convert the data to a longer format for easier manipulation
iris_long <- iris %>%
pivot_longer(cols = -Species, names_to = "Feature", values_to = "Value")
# Add a count column for tooltips, grouping by feature, species, and value
iris_long <- iris_long %>%
group_by(Feature, Species, Value) %>%
mutate(Count = n()) %>%
ungroup()
# Create the histogram with ggplot
g <- ggplot(iris_long, aes(x = Value, fill = Species, text = paste("Species:", Species, "<br>Measure:", round(Value, 2), "<br>Frequency:", Count))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30, show.legend = FALSE) +
facet_wrap(~Feature, scales = "free") +
labs(title = "Histograms of Iris Features by Species") + # Removed x and y labels
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(
legend.title = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# Convert the ggplot object to an interactive plotly object
p <- ggplotly(g, tooltip = "text")
# Adjust title position upwards and add space between title and chart
p <- p %>% layout(
title = list(
text = "Histograms of Iris Features by Species",
font = list(size = 16),
xanchor = "center",
yanchor = "top",
x = 0.5,
y = 1,
pad = list(b = 50)
),
xaxis = list(title = "", titlefont = list(size = 14)), # Set x-axis label to empty
yaxis = list(title = "", titlefont = list(size = 14)), # Set y-axis label to empty
legend = list(title = list(text = "Species"), font = list(size = 12)),
hoverlabel = list(bgcolor = "white", font = list(size = 12))
)
# Render the plot
p
The histograms shown above provide a comparative view of the distributions of the morphological characteristics of iris flowers for the three species: setosa, versicolor, and virginica.
In conclusion, the differences in petal length and width between species are more pronounced and could be effectively used for classifying iris species. Meanwhile, the measurements of sepals show greater variability and overlap between species, indicating that they are less discriminative for classification.
Box Plots for Each Feature by Species
The boxplot, also known as the box-and-whisker plot, is a standardized method to graphically represent the distribution of a dataset based on a five-number summary: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum.
This type of plot is extremely useful in data exploration as it provides a clear view of centrality, dispersion, skewness, and the presence of outliers.
The central box of the boxplot represents the interquartile range (IQR), the distance between the first and third quartiles, and contains the middle half of the data. The line inside the box indicates the median, the central value of the distribution. The whiskers extend from Q1 and Q3 to the most extreme data points that lie within 1.5 times the IQR; points beyond the whiskers are considered outliers and are drawn individually.
Boxplots are especially convenient in exploratory analysis because they summarize centrality, dispersion, skewness, and the presence of outliers at a glance, and because they make it straightforward to compare these properties across groups such as the three iris species.
For these reasons, boxplots are an indispensable tool in descriptive statistics and offer valuable insight in the early stages of analyzing any dataset.
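As a numeric counterpart to the plot, the five-number summary that a boxplot encodes can be computed directly; a small sketch added here for illustration (not part of the original analysis):
# Five-number summary (minimum, Q1, median, Q3, maximum) of petal length for each species
tapply(iris$Petal.Length, iris$Species, fivenum)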
Below is an interactive boxplot showing the distribution of the morphological characteristics of the three species of Iris present in the dataset:
# Convert the data to a longer format for easier manipulation
iris_long <- iris %>%
pivot_longer(cols = -Species, names_to = "Feature", values_to = "Value")
# Define the colors using the Set1 palette to maintain consistency with the histogram
colors <- brewer.pal(n = 3, name = "Set1")
names(colors) <- levels(iris$Species)
# Create the interactive boxplot with plotly
p <- iris_long %>%
plot_ly(x = ~Feature, y = ~Value, color = ~Species, colors = colors) %>%
add_boxplot() %>%
layout(yaxis = list(title = 'Value'),
xaxis = list(title = ''),
boxmode = 'group',
title = 'Boxplots of Iris Features by Species',
showlegend = TRUE,
legend = list(title = list(text = ""), orientation = 'h', x = 0.3, y = -0.1))
# Render the plot
p
The boxplot chart illustrates the differences in the distribution of iris flower measurements among the three species: setosa, versicolor, and virginica.
From these boxplots, we can infer that petal length and width provide a good basis for distinguishing between species, especially for differentiating Iris setosa from the other two. As for sepal length and width, while some differences are observed, they do not appear to be as distinctive as the petal measurements. The highlighted outliers in the boxplots of virginica and versicolor may be indicative of natural variations or measurement errors, which would require more detailed investigation.
In summary, the boxplot is an effective tool that highlights the key differences in the morphological characteristics of iris species and emphasizes the importance of these measurements in classifying and understanding the variability among iris flower species.
Scatter Plots
The scatter plot, also known as a scatter diagram, is a fundamental tool in exploratory data analysis that allows visualization and determination of whether correlations or patterns exist between two numerical variables. It represents each observation as a point on the Cartesian plane, with one variable assigned to the X-axis and another to the Y-axis.
The use of scatter plots is especially convenient for several reasons:
1. Identification of Correlations: It allows for quick and easy observation of whether a linear or non-linear relationship exists between the two variables.
2. Anomaly Detection: It helps identify outliers that do not follow the general trend of the data.
3. Group Comparison: It facilitates comparison between different groups or data categories by using distinct colors or markers.
4. Predictive Modeling: It serves as an important preliminary step before performing regressions or any other type of statistical modeling.
In the context of the Iris dataset, a scatter plot can reveal how sepal and petal measurements relate to each other and whether these relationships differ between the various Iris species. This visual analysis can be crucial in designing classification strategies or gaining a better understanding of the biology of these species.
Below is an interactive scatter plot that shows the relationship between sepal length and sepal width for the three Iris species included in the dataset:
# Define the colors using the Set1 palette to maintain consistency with previous charts
colors <- brewer.pal(n = 3, name = "Set1")
names(colors) <- levels(iris$Species)
# Create the interactive scatter plot with plotly
p <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
type = 'scatter', mode = 'markers',
color = ~Species, colors = colors,
marker = list(size = 12, opacity = 0.7),
hoverinfo = 'text',
text = ~paste('Sepal Length:', Sepal.Length, '<br>Sepal Width:', Sepal.Width)) %>%
layout(title = 'Scatter Plot of Sepal Width vs. Sepal Length by Species',
xaxis = list(title = 'Sepal Length', titlefont = list(size = 14)),
yaxis = list(title = 'Sepal Width', titlefont = list(size = 14)),
legend = list(
title = list(text = "<b>Species</b>"),
orientation = 'v',
x = 1.05,
y = 1,
font = list(family = "Arial, sans-serif", size = 12, color = "black")
),
margin = list(r = 140)
)
# Render the plot
p
This scatter plot shows the relationship between the sepal length and width of iris flowers for the three species: setosa, versicolor, and virginica.
This visual analysis suggests that sepal measurements, especially when combined with petal measurements, could be useful features for the automated classification of iris species. Additionally, the presence of outliers may require further investigation to better understand their nature or to refine measurement and classification methods.
4 | Shapiro-Wilk Normality Test
The Shapiro-Wilk Test is a statistical test used to evaluate whether a sample of data fits a normal distribution. It is widely used to check the normality of data, which is a common assumption in many parametric statistical tests.
What is the p-value?
The p-value of a statistical test is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. In the context of the Shapiro-Wilk test, the null hypothesis is that the data follow a normal distribution: a p-value below the chosen significance level (0.05 in this analysis) leads to rejecting normality, while a larger p-value means normality cannot be rejected.
Importance of the Test
The importance of performing the Shapiro-Wilk test before proceeding with more complex statistical analyses lies in its ability to validate one of the key assumptions of many statistical methods. Checking for normality can help decide whether parametric methods (which assume normality) or non-parametric methods should be applied, directly impacting the validity of the statistical conclusions obtained.
# Perform the Shapiro-Wilk test for each feature by species
results <- list()
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
species <- unique(iris$Species)
for (feat in features) {
results[[feat]] <- list()
for (spec in species) {
subset_data <- iris[iris$Species == spec, feat]
test_result <- shapiro.test(subset_data)
results[[feat]][[spec]] <- test_result
}
}
# Prepare the results for visualization
results_df <- data.frame(
Feature = rep(features, each = length(species)),
Species = rep(species, times = length(features)),
W_statistic = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$statistic))),
P_value = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$p.value))),
Normality = rep("", length(features) * length(species))
)
# Determine if normality can be assumed
results_df$Normality <- ifelse(results_df$P_value > 0.05, "Yes", "No")
# Style the table for better presentation
results_df %>%
kable("html", caption = "Results of the Shapiro-Wilk Test for Iris Dataset Features by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
column_spec(1, bold = T, color = "black") %>%
column_spec(2, width = "150px") %>%
column_spec(3, width = "100px") %>%
column_spec(4, width = "100px") %>%
column_spec(5, width = "100px") %>%
scroll_box(width = "100%", height = "500px")
| | Feature | Species | W_statistic | P_value | Normality |
|---|---|---|---|---|---|
| Sepal.Length.setosa.W | Sepal.Length | setosa | 0.9776985 | 0.4595132 | Yes |
| Sepal.Length.versicolor.W | Sepal.Length | versicolor | 0.9778357 | 0.4647370 | Yes |
| Sepal.Length.virginica.W | Sepal.Length | virginica | 0.9711794 | 0.2583147 | Yes |
| Sepal.Width.setosa.W | Sepal.Width | setosa | 0.9717195 | 0.2715264 | Yes |
| Sepal.Width.versicolor.W | Sepal.Width | versicolor | 0.9741333 | 0.3379951 | Yes |
| Sepal.Width.virginica.W | Sepal.Width | virginica | 0.9673905 | 0.1808960 | Yes |
| Petal.Length.setosa.W | Petal.Length | setosa | 0.9549768 | 0.0548115 | Yes |
| Petal.Length.versicolor.W | Petal.Length | versicolor | 0.9660044 | 0.1584778 | Yes |
| Petal.Length.virginica.W | Petal.Length | virginica | 0.9621864 | 0.1097754 | Yes |
| Petal.Width.setosa.W | Petal.Width | setosa | 0.7997645 | 0.0000009 | No |
| Petal.Width.versicolor.W | Petal.Width | versicolor | 0.9476263 | 0.0272778 | No |
| Petal.Width.virginica.W | Petal.Width | virginica | 0.9597715 | 0.0869542 | Yes |
5 | Density Function Plot
This plot shows the density function for the characteristics of three iris species: setosa, versicolor, and virginica, illustrating how petal and sepal lengths and widths are distributed in each species.
Importance of the Density Function
The density function offers a continuous and detailed view of data distribution, making it easier to compare species and identify distinctive patterns. It is essential for understanding the variations within each species and between different species, providing key insights for classification studies and biological analysis.
# Define the colors using the Set1 palette
colors <- brewer.pal(n = 3, name = "Set1")
names(colors) <- levels(iris$Species)
# Create a long format for easier plotting with ggplot
iris_long <- gather(iris, key = "feature", value = "value", -Species)
# Enhanced plot with consistent colors and text adjustments
p <- ggplot(iris_long, aes(x = value, fill = Species)) +
geom_density(alpha = 0.7) +
facet_grid(feature ~ Species, scales = "free") +
scale_fill_manual(values = colors) +
theme_minimal() +
theme(
strip.background = element_rect(fill = "lightblue", colour = "deepskyblue", size = 1),
strip.text.x = element_text(size = 10, color = "navy", face = "plain"),
strip.text.y = element_text(size = 10, color = "navy", face = "plain"),
plot.title = element_text(size = 12, face = "plain", hjust = 0.5, margin = margin(t = 10, b = 10)),
axis.text = element_text(size = 8),
axis.title.y = element_text(size = 10, face = "plain", vjust = 2, angle = 90),
legend.position = "none",
plot.margin = margin(10, 10, 10, 10),
panel.border = element_blank()
) +
labs(title = "Distribution of Features by Species",
x = NULL,
y = "Density") +
scale_y_continuous(labels = label_number(scale = 1, accuracy = 0.01)) # Format y-axis numbers with two decimals
# Convert to an interactive plotly object
ggplotly(p)
6 | Standard Z-score Normalization
Z-score normalization (or standardization) involves rescaling the data to have a mean of 0 and a standard deviation of 1. This is done using the formula:
\[ Z = \frac{(X - \mu)}{\sigma} \]
Where \(X\) is the value to be normalized, \(\mu\) is the mean of the feature, and \(\sigma\) is the standard deviation of the feature.
This method is useful for comparing scores between different datasets or features with different units or scales.
Now we will normalize the values of the features that passed the normality test.
# Filter features that passed the normality test
normal_vars <- results_df %>%
filter(Normality == "Yes") %>%
select(Feature, Species)
# Function to normalize the data
normalize <- function(x) {
return((x - mean(x)) / sd(x))
}
# Create a new data frame with normalized data
normalized_data <- iris
for (i in 1:nrow(normal_vars)) {
var_name <- normal_vars$Feature[i]
spec_name <- normal_vars$Species[i]
mask <- iris$Species == spec_name
normalized_data[mask, var_name] <- normalize(normalized_data[mask, var_name])
}
# Display the normalized data
normalized_data %>%
select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species) %>%
head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.26667447 0.1899414 -0.3570112 0.2 setosa
2 -0.30071802 -1.1290958 -0.3570112 0.2 setosa
3 -0.86811050 -0.6014810 -0.9328358 0.2 setosa
4 -1.15180675 -0.8652884 0.2188133 0.2 setosa
5 -0.01702177 0.4537488 -0.3570112 0.2 setosa
6 1.11776320 1.2451711 1.3704625 0.4 setosa
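As a quick sanity check (an addition, not in the original analysis), a feature that was normalized for a given species should now have a mean of 0 and a standard deviation of 1 within that species:
# Verify the z-score normalization for Sepal.Length in the setosa species
setosa_sl <- normalized_data[normalized_data$Species == "setosa", "Sepal.Length"]
c(mean = mean(setosa_sl), sd = sd(setosa_sl))  # mean ~0, sd = 1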
7 | Calculate the probability that the sepal length is greater than 1 for the ‘setosa’ species
Since we have normalized the data in the Iris dataset, assuming a normal distribution (mean = 0, standard deviation = 1) for each feature that passed the Shapiro-Wilk test, we can explore specific probabilities for these normalized features.
Probability Calculation
We want to calculate the probability that an iris of the setosa species has a sepal length (Sepal.Length) greater than 1 on the normalized (z-score) scale. Mathematically, this probability is expressed as:
\[ P(Z > 1) \]
where \(Z\) is the normalized sepal length. This value represents the proportion of flowers of the ‘setosa’ species whose sepal length is more than one standard deviation above the mean of this species.
# Calculate the probability that the sepal length is greater than 1 for the 'setosa' species
subset_data <- normalized_data[normalized_data$Species == "setosa", "Sepal.Length"]
prob_greater_than_1 <- mean(subset_data > 1)
prob_greater_than_1
[1] 0.2
A result of 0.2 means that 20% of the observations of the setosa species have a sepal length more than one standard deviation above the species mean on the z-score scale.
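For comparison, the theoretical probability under a perfect standard normal distribution can be obtained with pnorm; this check is an addition to the original analysis:
# Theoretical probability P(Z > 1) for a standard normal distribution
1 - pnorm(1)  # approximately 0.159, close to the empirical proportion of 0.2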
1 | Covariance Matrix
The covariance matrix is an essential statistical tool for analyzing how variables in a dataset vary together. Each element \(\sigma_{ij}\) of this \(n \times n\) matrix represents the covariance between a pair of variables, calculated as:
\[ \sigma_{ij} = \frac{1}{n-1} \sum_{k=1}^n (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j) \]
where \(X_{ik}\) and \(X_{jk}\) are the values of variables \(i\) and \(j\) for observation \(k\), and \(\bar{X}_i\) and \(\bar{X}_j\) are the means of variables \(i\) and \(j\), respectively.
The sign of the covariance provides valuable information about the type of linear relationship between the variables: a positive covariance indicates that the two variables tend to increase or decrease together, a negative covariance indicates that one tends to increase when the other decreases, and a value close to zero indicates the absence of a clear linear relationship.
Covariance Matrix Calculation
# Load necessary libraries
library(datasets)
library(corrplot)
library(tidyverse)
library(psych)
library(knitr)
library(dplyr)
library(tidyr)
library(kableExtra)
library(ggplot2)
library(GGally)
library(reshape2)
library(paletteer)
library(scales)
# Load the Iris dataset from the datasets library
data(iris)
data <- iris
# Use attach to access columns directly
attach(data)
# Calculate the covariance matrix for the numeric variables in the dataset
# The numeric columns are: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
covariance_matrix <- cov(data[,1:4])
# Print the covariance matrix
print(covariance_matrix)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
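As a quick check (added here for illustration), any entry of the matrix, for example the covariance between Sepal.Length and Petal.Length, can be reproduced directly from the formula given above:
# Covariance between sepal length and petal length computed from the definition
with(iris, sum((Sepal.Length - mean(Sepal.Length)) * (Petal.Length - mean(Petal.Length))) / (nrow(iris) - 1))
# The result, 1.2743154, matches the corresponding entry of the matrix printed above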
# Convert the covariance matrix to long format for plotting
covariance_matrix_long <- melt(covariance_matrix)
colnames(covariance_matrix_long) <- c("Variable1", "Variable2", "Covariance")
# Create the enhanced plot
ggplot(covariance_matrix_long, aes(x = Variable1, y = Variable2, fill = Covariance)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", Covariance)), size = 3.5, color = "black") +
scale_fill_gradient2(low = "skyblue", mid = "gray90", high = "salmon", midpoint = 0,
limits = c(min(covariance_matrix_long$Covariance, na.rm = TRUE),
max(covariance_matrix_long$Covariance, na.rm = TRUE)),
name = "Covariance") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 45, vjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)) +
labs(title = "Covariance Matrix of the Iris Dataset")
The covariance matrix is a representation of how the variables in a dataset vary together. Each off-diagonal element of this matrix is the covariance between a pair of variables:
\[ \text{Cov}(X,Y) = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{n-1} \]
where \(X_i\) and \(Y_i\) are the values of variables \(X\) and \(Y\), respectively, and \(\overline{X}\) and \(\overline{Y}\) are the means of these variables.
The diagonal entries are a special case: the covariance of a variable with itself is simply its variance,
\[ \text{Cov}(X,X) = \text{Var}(X) \]
The formula to calculate the variance of \(X\) is:
\[ \text{Var}(X) = \frac{\sum_{i=1}^n (X_i - \overline{X})^2}{n-1} \]
Note that \((X_i - \overline{X})^2\) is equivalent to \((X_i - \overline{X})(X_i - \overline{X})\), showing that variance is a specific case of covariance where the relationship of the variable with itself is considered.
In summary, the diagonal of the covariance matrix contains the variances of each variable, indicating the spread of each variable around its mean, and it shows how \(\text{Cov}(X,X)\) simplifies mathematically to \(\text{Var}(X)\).
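A one-line check (added for illustration) confirms that the variance of a variable equals its covariance with itself and matches the corresponding diagonal entry of the matrix above:
# Var(Sepal.Length) equals Cov(Sepal.Length, Sepal.Length), i.e., the first diagonal entry
var(iris$Sepal.Length)                     # 0.6856935
cov(iris$Sepal.Length, iris$Sepal.Length)  # 0.6856935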
# Standardize the numeric columns of the dataset to have mean 0 and standard deviation 1
data_norm <- scale(data[, 1:4])
# Compute the covariance matrix for the standardized variables
# Since the variables are standardized, this matrix is equivalent to the correlation matrix
cov(data_norm)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Thus, for standardized variables, the covariance numerically coincides with the correlation. This result not only simplifies the analysis but also ensures direct comparability between different pairs of variables, regardless of their original scales.
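This equivalence can be verified directly; a brief check added here for illustration:
# The covariance matrix of the standardized data equals the correlation matrix of the original data
all.equal(cov(scale(iris[, 1:4])), cor(iris[, 1:4]))  # TRUE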
2 | Correlation Matrix
Correlation is a statistical measure that reflects the strength and direction of a linear relationship between two numerical variables. It is expressed on a scale ranging from -1 to 1: a value of -1 indicates a perfect negative linear relationship, a value of 0 indicates no linear relationship, and a value of 1 indicates a perfect positive linear relationship.
Differences between covariance and correlation: covariance depends on the units and scale of the variables, so its magnitude is difficult to interpret on its own, whereas correlation is dimensionless and bounded between -1 and 1, which makes it directly comparable across pairs of variables (see the short demonstration below).
This understanding facilitates the interpretation and application of statistical analyses, especially in fields where the relationship between variables is crucial, such as finance, data science, and social research.
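A small demonstration of this difference (added for illustration): rescaling a variable, for example converting petal length from centimeters to millimeters, multiplies the covariance by the same factor but leaves the correlation unchanged.
# Covariance depends on the measurement scale; correlation does not
cov(iris$Sepal.Length, iris$Petal.Length)       # 1.2743154
cov(iris$Sepal.Length, iris$Petal.Length * 10)  # 12.743154, ten times larger
cor(iris$Sepal.Length, iris$Petal.Length)       # 0.8717538
cor(iris$Sepal.Length, iris$Petal.Length * 10)  # 0.8717538, unchanged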
Graph of the Correlation Matrix
# Load the dataset
data(iris)
data_numeric <- iris[, 1:4]
# Calculate the correlation matrix
cor_matrix <- cor(data_numeric)
# Create a full correlation plot
corrplot(cor_matrix, method = "color",
type = "full",
diag = TRUE,
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
order = "original",
number.cex = 0.8,
col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))(200)) # Color palette
To avoid duplicating information and simplify visualization, it is common to show only the upper or lower triangle of the matrix. The following image displays only the unique correlation values by including only the upper triangle:
# Load the dataset
data(iris)
data_numeric <- iris[, 1:4]
# Calculate the correlation matrix
cor_matrix <- cor(data_numeric)
# Create a correlation plot showing only the upper triangle
corrplot(cor_matrix, method = "color",
type = "upper",
diag = TRUE,
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
order = "original",
number.cex = 0.8,
col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))(200))
This practice not only helps to simplify the information being visualized but also avoids confusion and facilitates the quick interpretation of relationships between variables, which is particularly useful in presentations or reports where clarity is essential.
Pearson and Spearman Correlation
In statistics, correlation analysis tools commonly use Pearson correlation by default. Pearson correlation measures the linear relationship between two quantitative variables. It is calculated as the ratio of the covariance of the variables to the product of their standard deviations.
When to Use Each
Pearson correlation is appropriate when the variables are approximately normally distributed and the relationship of interest is linear. Spearman correlation, which is computed on the ranks of the observations, captures any monotonic relationship and is preferable when the data are ordinal, contain outliers, or fail the normality assumption, as some of the iris features do. Both techniques provide valuable insights in different contexts, so the analyst can choose the most appropriate measure depending on the nature of the data and the specific goals of the analysis.
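As a brief illustration (an addition to the original analysis), Spearman correlation is computed on ranks, so it is unchanged by any strictly increasing transformation of a variable, whereas Pearson correlation is not:
# Pearson changes under a monotonic non-linear transformation; Spearman does not
cor(iris$Petal.Length, iris$Petal.Width, method = "pearson")        # 0.963
cor(iris$Petal.Length, iris$Petal.Width, method = "spearman")       # 0.938
cor(exp(iris$Petal.Length), iris$Petal.Width, method = "pearson")   # noticeably lower: the relationship is no longer linear
cor(exp(iris$Petal.Length), iris$Petal.Width, method = "spearman")  # 0.938, identical, because the ranks are preserved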
Spearman Correlation
For this part, we will review which features and species do not meet the assumption of normality.
# Perform the Shapiro-Wilk test for each feature by species
results <- list()
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
species <- unique(iris$Species)
for (feat in features) {
results[[feat]] <- list()
for (spec in species) {
subset_data <- iris[iris$Species == spec, feat]
test_result <- shapiro.test(subset_data)
results[[feat]][[spec]] <- test_result
}
}
# Prepare the results for visualization
results_df <- data.frame(
Feature = rep(features, each = length(species)),
Species = rep(species, times = length(features)),
Statistic_W = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$statistic))),
P_value = unlist(lapply(results, function(feat) sapply(feat, function(spec) spec$p.value))),
Normality = rep("", length(features) * length(species))
)
# Determine if normality can be assumed
results_df$Normality <- ifelse(results_df$P_value > 0.05, "Yes", "No")
# Style the table for better presentation
results_df %>%
kable("html", caption = "Shapiro-Wilk Test Results for the Iris Dataset Features by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
column_spec(1, bold = T, color = "black") %>%
column_spec(2, width = "150px") %>%
column_spec(3, width = "100px") %>%
column_spec(4, width = "100px") %>%
column_spec(5, width = "100px") %>%
scroll_box(width = "100%", height = "500px")
| | Feature | Species | Statistic_W | P_value | Normality |
|---|---|---|---|---|---|
| Sepal.Length.setosa.W | Sepal.Length | setosa | 0.9776985 | 0.4595132 | Yes |
| Sepal.Length.versicolor.W | Sepal.Length | versicolor | 0.9778357 | 0.4647370 | Yes |
| Sepal.Length.virginica.W | Sepal.Length | virginica | 0.9711794 | 0.2583147 | Yes |
| Sepal.Width.setosa.W | Sepal.Width | setosa | 0.9717195 | 0.2715264 | Yes |
| Sepal.Width.versicolor.W | Sepal.Width | versicolor | 0.9741333 | 0.3379951 | Yes |
| Sepal.Width.virginica.W | Sepal.Width | virginica | 0.9673905 | 0.1808960 | Yes |
| Petal.Length.setosa.W | Petal.Length | setosa | 0.9549768 | 0.0548115 | Yes |
| Petal.Length.versicolor.W | Petal.Length | versicolor | 0.9660044 | 0.1584778 | Yes |
| Petal.Length.virginica.W | Petal.Length | virginica | 0.9621864 | 0.1097754 | Yes |
| Petal.Width.setosa.W | Petal.Width | setosa | 0.7997645 | 0.0000009 | No |
| Petal.Width.versicolor.W | Petal.Width | versicolor | 0.9476263 | 0.0272778 | No |
| Petal.Width.virginica.W | Petal.Width | virginica | 0.9597715 | 0.0869542 | Yes |
Now let’s compare the two measures.
# Calculate the correlation matrices
pearson_corr <- cor(iris[,1:4], method = "pearson")
spearman_corr <- cor(iris[,1:4], method = "spearman")
# Convert matrices to data frames and rename columns
pearson_df <- as.data.frame(as.table(pearson_corr)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Pearson = Freq)
spearman_df <- as.data.frame(as.table(spearman_corr)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Spearman = Freq)
# Merge the Pearson and Spearman data frames and filter the results
comparison_df <- pearson_df %>%
inner_join(spearman_df, by = c("Variable1", "Variable2")) %>%
mutate(combined_vars = case_when(as.integer(Variable1) < as.integer(Variable2) ~ paste(Variable1, Variable2, sep = ""),
TRUE ~ paste(Variable2, Variable1, sep = ""))) %>%
distinct(combined_vars, .keep_all = TRUE) %>%
select(-combined_vars) %>%
filter(Variable1 != Variable2) %>%
mutate(Difference = abs(Pearson - Spearman))
# Create a diverging color palette
diverging_palette <- colorRampPalette(paletteer_c("ggthemes::Temperature Diverging", 30))(30)
# Render the table with conditional formatting
comparison_df %>%
kable("html", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
column_spec(3, color = "blue") %>%
column_spec(4, color = "red") %>%
column_spec(5, background = scales::col_numeric(palette = diverging_palette, domain = NULL)(comparison_df$Difference),
bold = TRUE, color = "black")
| Variable1 | Variable2 | Pearson | Spearman | Difference |
|---|---|---|---|---|
| Sepal.Width | Sepal.Length | -0.1175698 | -0.1667777 | 0.0492079 |
| Petal.Length | Sepal.Length | 0.8717538 | 0.8818981 | 0.0101444 |
| Petal.Width | Sepal.Length | 0.8179411 | 0.8342888 | 0.0163476 |
| Petal.Length | Sepal.Width | -0.4284401 | -0.3096351 | 0.1188050 |
| Petal.Width | Sepal.Width | -0.3661259 | -0.2890317 | 0.0770942 |
| Petal.Width | Petal.Length | 0.9628654 | 0.9376668 | 0.0251986 |
Finally, we will look at each of these measures together with its p-value.
# Perform Pearson and Spearman correlation tests
pearson_test <- corr.test(iris[, 1:4], method = "pearson")
spearman_test <- corr.test(iris[, 1:4], method = "spearman")
# Convert correlation matrices and p-values to data frames
pearson_df <- as.data.frame(as.table(pearson_test$r)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Pearson = Freq) %>%
mutate(Pearson_p_value = as.vector(pearson_test$p))
spearman_df <- as.data.frame(as.table(spearman_test$r)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Spearman = Freq) %>%
mutate(Spearman_p_value = as.vector(spearman_test$p))
# Merge Pearson and Spearman data frames and filter results
comparison_df <- pearson_df %>%
inner_join(spearman_df, by = c("Variable1", "Variable2")) %>%
mutate(combined_vars = case_when(as.integer(Variable1) < as.integer(Variable2) ~ paste(Variable1, Variable2, sep = ""),
TRUE ~ paste(Variable2, Variable1, sep = ""))) %>%
distinct(combined_vars, .keep_all = TRUE) %>%
select(-combined_vars) %>%
filter(Variable1 != Variable2)
# Create a high-contrast color palette for the p-values
high_contrast_palette <- colorRampPalette(c("#ffb3c1", "white", "#e9edc9"))(30)
# Render the table with conditional formatting for the p-values
comparison_df %>%
kable("html", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
column_spec(3, color = "black") %>%
column_spec(4, background = scales::col_numeric(palette = high_contrast_palette, domain = NULL)
(comparison_df$Pearson_p_value),
bold = TRUE, color = "black") %>%
column_spec(5, color = "black") %>%
column_spec(6, background = scales::col_numeric(palette = high_contrast_palette, domain = NULL)
(comparison_df$Spearman_p_value),
bold = TRUE, color = "black")
| Variable1 | Variable2 | Pearson | Pearson_p_value | Spearman | Spearman_p_value |
|---|---|---|---|---|---|
| Sepal.Width | Sepal.Length | -0.1175698 | 0.1518983 | -0.1667777 | 0.0413680 |
| Petal.Length | Sepal.Length | 0.8717538 | 0.0000000 | 0.8818981 | 0.0000000 |
| Petal.Width | Sepal.Length | 0.8179411 | 0.0000000 | 0.8342888 | 0.0000000 |
| Petal.Length | Sepal.Width | -0.4284401 | 0.0000000 | -0.3096351 | 0.0001154 |
| Petal.Width | Sepal.Width | -0.3661259 | 0.0000041 | -0.2890317 | 0.0003343 |
| Petal.Width | Petal.Length | 0.9628654 | 0.0000000 | 0.9376668 | 0.0000000 |
3 | Random Sampling with Replacement
Random sampling with replacement is a sampling technique where each element of the dataset has the same probability of being selected in each draw. After an element is selected for the sample, it is “replaced” back into the dataset, meaning it can be selected again in subsequent draws. This allows the same element to appear multiple times in the final sample.
This type of sampling is useful when you want to maintain independence between samples and when the desired sample size is equal to or larger than the original dataset size.
Below is an implementation of random sampling with replacement in R using the Iris dataset, a classic dataset in R containing information on 150 iris flowers, classified into three species, with measurements of petal and sepal lengths and widths.
# Load the Iris dataset
data(iris)
# Set seed for reproducibility
set.seed(123)
# Perform random sampling with replacement
# Here, we are taking a sample of 15 elements from the original dataset
sampled_iris <- iris[sample(1:nrow(iris), 15, replace = TRUE), ]
# Display the first rows of the sample with better formatting
kable(head(sampled_iris), format = "html", table.attr = "style='width:100%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 118 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 14.1 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 118.1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
4 | Random Sampling without Replacement
Random sampling without replacement is a sampling technique where each element in the population can only be selected once for the sample. Once an element is selected, it is not placed back into the population for future selections. This ensures that there are no duplicates in the sample and that each element has the same probability of being selected until it is chosen.
# Load the Iris dataset
data(iris)
# Set seed for reproducibility
set.seed(123)
# Perform random sampling without replacement
# Here, we are taking a sample of 15 elements from the original dataset
sampled_iris <- iris[sample(1:nrow(iris), 15, replace = FALSE), ]
# Display the first rows of the sample with better formatting
kable(head(sampled_iris), format = "html", table.attr = "style='width:100%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 118 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
5 | Systematic Random Sampling
Systematic random sampling is a type of probability sampling where the first element is selected randomly, and subsequent elements are chosen following a fixed interval. This method ensures that the sample is representative of the population and reduces the risk of selection bias.
To apply systematic random sampling in R with the iris dataset, we first determine the population size and desired sample size, then calculate the sampling interval and select the individuals.
# Load the Iris dataset
data(iris)
# Population size
n <- nrow(iris)
# Desired sample size
k <- 15
# Calculate the sampling interval
intervalo <- floor(n / k)
# Select a random starting point between 1 and the interval
inicio <- sample(1:intervalo, 1)
# Create the systematic sample
muestra_sistematica <- iris[seq(inicio, n, by=intervalo), ]
# Display the first rows of the sample with better formatting
kable(head(muestra_sistematica), format = "html", table.attr = "style='width:100%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
| 40 | 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 60 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
6 | Generate 10,000 Samples of Size 40 with Replacement
We will use the replicate function to repeat a sampling process 10,000 times. Within this function, we will use sample_n from the dplyr package to select 40 rows from the dataset with replacement during each repetition.
library(dplyr)
# Generate 10,000 samples of size 40 with replacement
muestras <- replicate(10000, sample_n(iris, 40, replace = TRUE), simplify = FALSE)
# View the tenth sample
muestras[[10]]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1 7.2 3.0 5.8 1.6 virginica
2 6.4 2.7 5.3 1.9 virginica
3 7.7 2.6 6.9 2.3 virginica
4 5.8 2.8 5.1 2.4 virginica
5 4.9 3.1 1.5 0.1 setosa
6 5.8 2.8 5.1 2.4 virginica
7 6.6 2.9 4.6 1.3 versicolor
8 5.3 3.7 1.5 0.2 setosa
9 6.1 2.6 5.6 1.4 virginica
10 5.0 2.0 3.5 1.0 versicolor
11 7.3 2.9 6.3 1.8 virginica
12 5.0 3.2 1.2 0.2 setosa
13 6.5 3.0 5.5 1.8 virginica
14 4.6 3.6 1.0 0.2 setosa
15 5.8 2.8 5.1 2.4 virginica
16 6.3 2.3 4.4 1.3 versicolor
17 7.9 3.8 6.4 2.0 virginica
18 4.6 3.2 1.4 0.2 setosa
19 7.4 2.8 6.1 1.9 virginica
20 5.2 4.1 1.5 0.1 setosa
21 6.7 3.1 4.4 1.4 versicolor
22 6.1 2.9 4.7 1.4 versicolor
23 5.1 3.8 1.5 0.3 setosa
24 6.7 3.0 5.0 1.7 versicolor
25 5.6 2.7 4.2 1.3 versicolor
26 5.9 3.0 5.1 1.8 virginica
27 7.7 2.8 6.7 2.0 virginica
28 6.0 3.4 4.5 1.6 versicolor
29 7.3 2.9 6.3 1.8 virginica
30 4.8 3.0 1.4 0.3 setosa
31 5.1 3.3 1.7 0.5 setosa
32 5.2 2.7 3.9 1.4 versicolor
33 5.4 3.9 1.3 0.4 setosa
34 4.6 3.6 1.0 0.2 setosa
35 5.1 3.8 1.9 0.4 setosa
36 7.7 3.0 6.1 2.3 virginica
37 6.0 2.9 4.5 1.5 versicolor
38 5.6 3.0 4.5 1.5 versicolor
39 4.4 3.0 1.3 0.2 setosa
40 5.2 3.5 1.5 0.2 setosa
7 | Estimate the Proportion of Setosa Species Flowers
Using the samples calculated earlier:
# Calculate the proportion of setosa in each sample
proporciones_setosa <- sapply(muestras, function(muestra) sum(muestra$Species == "setosa") / nrow(muestra))
# Calculate the mean and standard deviation of the proportions
media_proporciones <- mean(proporciones_setosa)
desviacion_estandar_proporciones <- sd(proporciones_setosa)
# Create a histogram of the sample proportions and add a line for the theoretical normal distribution
hist(proporciones_setosa,
freq = FALSE,
main = "Histogram of Sample Proportions from 10,000 Samples\n of Size 40",
xlab = "Sample Proportions",
ylab = "Density",
ylim = c(0,6.5))
# Add a density line for the sample proportions
lines(density(proporciones_setosa),
lty = 2,
lwd = 2,
col = "red")
# Generate values for the x-axis and add a theoretical normal distribution line
x = seq(from = 0,
to = 1,
by = 0.01)
lines(x,
dnorm(x,
1/3,
sqrt((1/3)*(2/3)/40)),
lty = 3,
lwd = 2,
col = "blue")
# Add a legend to the plot
legend("topright",
legend = c("Density","Normal"),
lwd = c(2,2),
lty = c(2,3),
col = c("red","blue"))