Introduction

In the modern world of data, where meaningful insights can be extracted at scale, real-world datasets have become increasingly important across many areas of research. Real-world data is often messy: missing values, outliers, and non-linear patterns demand rigorous analytical methods for sound interpretation. Data analysis plays a key role in turning raw data into actionable knowledge, enabling better decision-making and strategic planning. This analysis therefore focuses on exploring and modeling a real-world dataset. Many techniques and steps are used during data analysis, including data cleaning, exploratory data analysis (EDA), and fitting appropriate statistical and machine learning models. The aim is not only to understand the data but also to generate predictive insights that can inform policymakers, research, and business decisions. In this report we present a complete analysis of a publicly available dataset, demonstrating a range of techniques and tools applied to plant-related data. The analysis emphasizes data visualization (data visualization), comparison, classification, statistical modeling, and principal component analysis (PCA). Applying these techniques allows us to extract valuable insights from online datasets and to use data science to address real-life questions and challenges.

The dataset we analyze is the iris flower dataset (iris data), which consists of measurements of several features of iris flowers. It is widely used in teaching statistics and machine learning, and it provides an excellent example for demonstrating a variety of data analysis techniques, including exploration, preparation, and classification. The iris dataset contains 150 samples from three species of iris (Iris setosa, Iris versicolor, and Iris virginica). Each sample has four features: the lengths and widths of the sepals and petals. The dataset is commonly used for testing machine learning algorithms and serves as a benchmark for classification techniques.

By applying a structured and methodical approach to analyzing real-world data, this project demonstrates the practical use of data science tools and techniques in addressing real-life questions and challenges. To analyze data, we first need to understand it; this is the first step in any data analysis process. In this report, we cover the following topics:

Iris Dataset

Iris is a dataset that consists of 150 samples from three species of iris flowers (Iris setosa, Iris versicolor, and Iris virginica). Four features were measured from each sample: the lengths and the widths of the sepals and petals. The dataset is often used for testing machine learning algorithms, and it is available in the R programming language as a built-in dataset. Below, we load it and inspect the first few rows:

attach(iris)          # attach columns for direct access by name
iris_dataset = iris   # keep a working copy of the built-in dataset
head(iris)            # first six rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Summary of iris dataset

To understand the data, we need to summarize it. The summary shows each variable's range, quartiles, and mean, and helps identify potential issues such as implausible values or class imbalance.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Histogram - Distribution

Histograms are used to visualize the distribution of a dataset. They are useful for understanding the shape of the data and for identifying any potential outliers.

library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p = ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.1, fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Histogram of Sepal Length",
       x = "Sepal Length",
       y = "Frequency")


# ggplotly(p)
p

Scatter plot

Scatter plots are used to visualize the relationship between two continuous variables. They are useful for identifying any potential correlations between the variables.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Scatter Plot of Sepal Length vs. Sepal Width",
       x = "Sepal Length",
       y = "Sepal Width")

Box plot

Box plots are used to visualize the distribution of a dataset. They are useful for identifying any potential outliers and for comparing the distributions of different groups.

data(iris)
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Box Plot of Sepal Length by Species",
       x = "Species",
       y = "Sepal Length")

Violin plot

Violin plots are used to visualize the distribution of a dataset. They are similar to box plots, but they also show the density of the data.
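No violin plot chunk appears in the original; below is a minimal sketch with ggplot2's geom_violin(), where the narrow box plot overlaid inside each violin is an added convention, not from the original.

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white") +   # narrow box plot inside each violin
  labs(title = "Violin Plot of Sepal Length by Species",
       x = "Species",
       y = "Sepal Length")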

Correlation Matrix

Correlation matrices are used to visualize the correlation between different variables. They are useful for identifying any potential relationships between the variables.

cor_matrix <- cor(iris[, 1:4])
print(cor_matrix)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Heat map of correlation matrix 1

Heat maps are used to visualize the correlation between different variables. They are useful for identifying any potential relationships between the variables.
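The heat-map chunk itself was not echoed; a plausible sketch using ggplot2's geom_tile() on the melted correlation matrix follows (the use of reshape2::melt is an assumption).

library(reshape2)
melted_cor <- melt(cor_matrix)   # long format: Var1, Var2, value
ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2))) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  labs(title = "Heat Map of Correlation Matrix", x = NULL, y = NULL)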

Heat map of correlation matrix 2

In this plot, the correlation matrix is shown in a different way: colors represent the strength of the correlation between the variables.

Heat map of correlation matrix 3

This plot is similar to the previous one but uses a different color scheme to represent the strength of the correlations.
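Again the original chunk is hidden; one plausible rendering uses the corrplot package (the method and label color below are assumptions).

library(corrplot)
corrplot(cor_matrix, method = "color", addCoef.col = "black")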

Pair plot

Pair plots are used to visualize the relationship between multiple variables. They are useful for identifying any potential correlations between the variables.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(iris, aes(colour = Species))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# https://ggobi.github.io/ggally/reference/ggpairs.html

Regression

Regression is used to model the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.

Linear Regression

Simple linear regression models the relationship between a dependent variable and a single independent variable by fitting a straight line, which can then be used to predict the dependent variable from the independent variable.

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Linear Regression: Sepal Length vs. Petal Length",
       x = "Petal Length",
       y = "Sepal Length")
## `geom_smooth()` using formula = 'y ~ x'

lm_model <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(lm_model)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24675 -0.29657 -0.01515  0.27676  1.00269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.30660    0.07839   54.94   <2e-16 ***
## Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
# https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

Polynomial Regression

Polynomial regression models the relationship between a dependent variable and an independent variable using polynomial terms, allowing it to capture curvature that a straight line cannot. The plot shows the polynomial regression curve fitted to the data: the blue line represents the fitted curve, and the shaded area represents the confidence interval around it. The independent variable is the petal length, and the dependent variable is the sepal length.

The polynomial regression model explores the relationship between a dependent variable y and an independent variable x by including polynomial terms up to the third degree (i.e., \(x\), \(x^2\), and \(x^3\)).

#define data
x <- iris$Petal.Length
y <- iris$Sepal.Length
 
 
#fit polynomial regression model
fit <- lm(y ~ x + I(x^2) + I(x^3))
summary(fit) 
## 
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06434 -0.24523  0.00707  0.19869  0.92755 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.64817    0.45873  10.133   <2e-16 ***
## x            0.27811    0.48046   0.579    0.564    
## I(x^2)      -0.04428    0.13454  -0.329    0.743    
## I(x^3)       0.01055    0.01123   0.939    0.349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.364 on 146 degrees of freedom
## Multiple R-squared:  0.8106, Adjusted R-squared:  0.8067 
## F-statistic: 208.3 on 3 and 146 DF,  p-value: < 2.2e-16

The model appears to fit the data well, with a Multiple R-squared value of 0.8106, indicating that approximately 81% of the variability in y is explained by the model. The Adjusted R-squared value of 0.8067 confirms that the model maintains strong explanatory power even after adjusting for the number of predictors. Additionally, the F-statistic of 208.3 and the associated p-value (< 2.2e-16) indicate that the model is statistically significant overall.

Despite the model’s good overall fit, none of the individual predictors (x, \(x^2\), or \(x^3\)) is statistically significant at the 0.05 level. The p-values for the linear term x (p = 0.564), the quadratic term I(x^2) (p = 0.743), and the cubic term I(x^3) (p = 0.349) are all relatively large, suggesting that, individually, these terms do not significantly contribute to explaining the variation in y. This is a common symptom of multicollinearity: raw polynomial terms (x, \(x^2\), \(x^3\)) are highly correlated with one another, which inflates the standard errors of the individual coefficients even when their combined effect captures the pattern in the data, and this explains the high R-squared value. (Orthogonal polynomials, as produced by poly(), avoid this issue and are used in the next section.) The residual standard error is 0.364, and the residuals are reasonably small and centered around zero, indicating a decent model fit. Overall, the model captures a non-linear trend in the data, but further investigation may be needed to simplify the model or confirm the relevance of each polynomial term.

Multivariate Polynomial Regression

Multivariate polynomial regression is an extension of linear regression that models non-linear relationships between multiple input variables and a single output variable by including polynomial terms. Fitting a polynomial function to the data allows more complex relationships to be captured than with standard linear regression.

# Load necessary libraries
library(stats)

# Load the iris dataset (it's built-in)
data(iris)

# Perform polynomial regression (quadratic model)
poly_model <- lm(Sepal.Length ~ poly(Sepal.Width, 2) + poly(Petal.Length, 2) + poly(Petal.Width, 2), data = iris)

# Print summary of polynomial regression
summary(poly_model)
## 
## Call:
## lm(formula = Sepal.Length ~ poly(Sepal.Width, 2) + poly(Petal.Length, 
##     2) + poly(Petal.Width, 2), data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85830 -0.21065  0.00061  0.19278  0.77325 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.84333    0.02509 232.877  < 2e-16 ***
## poly(Sepal.Width, 2)1   2.99803    0.40359   7.428 9.12e-12 ***
## poly(Sepal.Width, 2)2   0.34547    0.31951   1.081  0.28141    
## poly(Petal.Length, 2)1 12.74168    1.78665   7.132 4.54e-11 ***
## poly(Petal.Length, 2)2  1.59442    0.58991   2.703  0.00771 ** 
## poly(Petal.Width, 2)1  -2.82015    1.72498  -1.635  0.10427    
## poly(Petal.Width, 2)2  -0.95176    0.67450  -1.411  0.16040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3073 on 143 degrees of freedom
## Multiple R-squared:  0.8678, Adjusted R-squared:  0.8623 
## F-statistic: 156.5 on 6 and 143 DF,  p-value: < 2.2e-16

The multiple linear regression model aims to predict Sepal.Length using second-degree polynomial transformations of Sepal.Width, Petal.Length, and Petal.Width from the Iris dataset. The use of polynomial terms allows the model to capture non-linear relationships between the predictors and the response variable. The results show that the model fits the data well, with a Multiple R-squared value of 0.8678, indicating that approximately 86.8% of the variation in Sepal.Length is explained by the model. The Adjusted R-squared value of 0.8623 confirms the model’s strong explanatory power even after accounting for the number of predictors.

Among the predictors, the first-degree polynomial terms of Sepal.Width and Petal.Length are highly significant, with very low p-values (< 0.001), suggesting a strong influence on Sepal.Length. The second-degree term of Petal.Length is also statistically significant (p = 0.0077), indicating some non-linear effects. However, both polynomial terms of Petal.Width are not statistically significant, with p-values greater than 0.1, suggesting that Petal.Width may not have a meaningful contribution in this form. The residual standard error is 0.3073, and the residuals are relatively small and centered around zero, indicating a good model fit. Overall, the model demonstrates a strong ability to predict Sepal.Length, particularly due to the influence of Sepal.Width and Petal.Length, while Petal.Width appears to be a weaker predictor in this context.

Clustering

Clustering is an unsupervised learning technique that groups similar data points together. It is used to discover patterns and structure in data without relying on predefined labels.

K-Means Clustering

K-Means clustering is a method used to group similar data points together. It is an unsupervised learning algorithm that is used to partition a dataset into K clusters.

# Load necessary libraries
library(stats)
library(ggplot2)  # For data visualization

# Load the iris dataset (it's built-in)
data(iris)

# Select only the numeric columns for clustering
data_for_clustering <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Perform K-Means clustering with 3 clusters
# (kmeans() starts from random centers, so results vary between runs;
# using set.seed() and nstart > 1 makes the solution reproducible)
k <- 3  # Number of clusters
kmeans_result <- kmeans(data_for_clustering, centers = k)

Clustering result analysis

The output below shows the result of the K-Means clustering algorithm. In the corresponding plot, the clusters are represented by different colors, and the centroids of the clusters are shown as black crosses.
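The printing step was not echoed in the original; evaluating the fitted object reproduces the output below.

kmeans_result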

## K-means clustering with 3 clusters of sizes 33, 96, 21
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.175758    3.624242     1.472727   0.2727273
## 2     6.314583    2.895833     4.973958   1.7031250
## 3     4.738095    2.904762     1.790476   0.3523810
## 
## Clustering vector:
##   [1] 1 3 3 3 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 1 1 1 3 1 1
##  [38] 1 3 1 1 3 3 1 1 3 1 3 1 1 2 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2
## 
## Within cluster sum of squares by cluster:
## [1]   6.432121 118.651875  17.669524
##  (between_SS / total_SS =  79.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The K-means clustering algorithm was applied to the Iris dataset, specifying three clusters; this particular run produced clusters of sizes 33, 96, and 21. The cluster centers (the means of each variable within each cluster) show that Clusters 1 and 3 both have small petal lengths and widths (mean Petal.Length of 1.47 and 1.79, Petal.Width of 0.27 and 0.35), so together they largely split the Iris setosa observations, which are known for these small measurements, between them. Cluster 2, with higher values in all dimensions, especially Petal.Length (4.97) and Petal.Width (1.70), pools most of the Iris versicolor and Iris virginica observations.

The between-cluster differences account for 79.0% of the total variance (between_SS / total_SS = 79.0%), so the clusters are reasonably well separated, although this run does not recover the known three-species structure. The within-cluster sum of squares shows that Cluster 1 is the most compact (6.43) while Cluster 2 is by far the most dispersed (118.65), reflecting the two species pooled into it. The clustering vector confirms that all 150 observations were assigned to one of the three clusters. Because kmeans() was run with a single random start and no fixed seed, this solution is a local optimum; rerunning with nstart > 1 typically yields the familiar grouping that closely matches the actual species.

Classification

Classification is a supervised learning task that predicts the class of a given data point, assigning observations to discrete categories based on the values of their features.

SVM Model

Support Vector Machines (SVM) is a supervised learning algorithm that is used for classification and regression tasks. It works by finding the hyperplane that best separates the data points of different classes.

# Load necessary libraries
library(e1071)  # For SVM
library(caret)   # For model evaluation
## Loading required package: lattice
library(ggplot2) 
library(lattice)

# Load the iris dataset (it's built-in)
data(iris)
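# NOTE: the chunk that trained the model was not echoed in the original.
# The following is a plausible reconstruction: an 80/20 stratified split
# matches the 30 test observations in the output below, but the seed and
# default SVM settings are assumptions.
set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Fit the SVM and evaluate on the held-out test set
svm_model <- svm(Species ~ ., data = train_data)
predictions <- predict(svm_model, test_data)

conf_matrix <- confusionMatrix(predictions, test_data$Species)
print("Confusion Matrix:")
print(conf_matrix)
print(paste("Accuracy:", conf_matrix$overall["Accuracy"]))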
## [1] "Confusion Matrix:"
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500
## [1] "Accuracy: 0.966666666666667"

The Support Vector Machine (SVM) model demonstrated excellent performance in classifying the three Iris species—Setosa, Versicolor, and Virginica. The overall accuracy of the model is 96.67%, meaning that 29 out of 30 observations in the test set were correctly classified. The 95% confidence interval for the accuracy is (82.78%, 99.92%), indicating strong and reliable performance. The Kappa statistic is 0.95, suggesting a very high level of agreement between predicted and actual classifications beyond chance.

The confusion matrix shows that all 10 instances of Setosa were correctly identified, reflecting perfect performance for that class. Versicolor also achieved perfect sensitivity (1.00), although one Virginica was misclassified as Versicolor, resulting in a slight drop in sensitivity for Virginica (0.90). Nevertheless, both Versicolor and Virginica maintained positive predictive values of over 90%, indicating that when the model predicts a class, it is highly likely to be correct.

Class-wise, the model achieved perfect balanced accuracy (1.00) for Setosa, and near-perfect scores for Versicolor (0.975) and Virginica (0.95). These high scores in sensitivity, specificity, and precision across classes indicate that the model effectively differentiates between species, particularly distinguishing Setosa with complete accuracy. Overall, the SVM classifier proved to be highly accurate, robust, and reliable for species classification in the Iris dataset.

SVM Classification confusion matrix

cm = conf_matrix

plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels=rev(levels(plt$Prediction)))

ggplot(plt, aes(Prediction,Reference, fill= Freq)) +
        geom_tile() + 
        geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="skyblue") +
        labs(x = "Prediction",
             y = "Reference") 

Statistical Analysis

Statistical analysis is the process of collecting, organizing, exploring, interpreting, and presenting data to uncover underlying patterns, trends, relationships, and insights.

Data Distribution

Normal distribution

The normal distribution is a continuous probability distribution that is symmetric about the mean. It is often used to model real-world data that follows a bell-shaped curve.

# Normal distribution
hist(iris$Sepal.Length, probability = TRUE, main = "Histogram of Sepal Length")
lines(density(iris$Sepal.Length), col = "blue")
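To complement the visual check, a formal normality test can be run; here is a brief sketch using base R's shapiro.test() and a Q-Q plot (not part of the original analysis).

# Shapiro-Wilk test: a small p-value suggests departure from normality
shapiro.test(iris$Sepal.Length)

# Q-Q plot: points close to the line indicate approximate normality
qqnorm(iris$Sepal.Length)
qqline(iris$Sepal.Length, col = "blue")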

T-test

A t-test is used to determine if there is a significant difference between the means of two groups. It is commonly used to compare the means of two independent samples.

setosa <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]
t_test_result <- t.test(setosa, virginica)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  setosa and virginica
## t = -15.386, df = 76.516, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.78676 -1.37724
## sample estimates:
## mean of x mean of y 
##     5.006     6.588

The Welch Two Sample t-test was conducted to determine whether there is a significant difference in mean Sepal.Length between the Setosa and Virginica species. The test produced a t-value of -15.386 with approximately 76.5 degrees of freedom, and a highly significant p-value < 2.2e-16. This extremely small p-value provides strong evidence against the null hypothesis, indicating that the true difference in means between Setosa and Virginica is statistically significant. The 95% confidence interval for the difference in means ranges from -1.79 to -1.38, suggesting that Virginica has, on average, a mean Sepal.Length greater than Setosa's by approximately 1.38 to 1.79 units. The sample means reinforce this difference, with Setosa averaging 5.006 and Virginica averaging 6.588. Overall, this analysis confirms a clear and meaningful difference between the two species in this trait.

ANOVA

ANOVA (Analysis of Variance) is used to determine if there are any statistically significant differences between the means of three or more independent groups. It is commonly used to compare the means of multiple groups.

anova_result <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_result)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A one-way ANOVA was conducted to examine whether there are statistically significant differences in Sepal.Length among the three Iris species. The results indicate a highly significant effect of Species on Sepal.Length, with an F-value of 119.3 and a p-value less than 2e-16. This extremely small p-value strongly suggests that the mean Sepal.Length differs significantly across at least two of the species groups. The sum of squares between groups (Species) is 63.21, much larger than the residual (within-group) sum of squares of 38.96, reinforcing that the variation in Sepal.Length is primarily due to differences between species rather than random variation within groups. Given these results, we reject the null hypothesis that all species have the same mean Sepal.Length. A post-hoc test (such as Tukey’s HSD) would be appropriate next to determine specifically which species differ from each other.

Tukey’s post-hoc test

Tukey’s post-hoc test (TukeyHSD) is used to determine which specific groups differ after a significant ANOVA result. It compares all pairs of group means while controlling the family-wise error rate.

posthoc <- TukeyHSD(anova_result)
print(posthoc)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
## 
## $Species
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0

The Tukey’s HSD test was performed to determine which specific species’ mean Sepal.Length differ from each other. The results show significant differences between all pairwise comparisons of species, with the following findings:

  • Versicolor vs. Setosa: The mean difference in Sepal.Length is 0.930 (with a 95% confidence interval of 0.69 to 1.17), and this difference is statistically significant (p = 0).
  • Virginica vs. Setosa: The mean difference is 1.582 (with a 95% confidence interval of 1.34 to 1.83), also significant (p = 0).
  • Virginica vs. Versicolor: The mean difference is 0.652 (with a 95% confidence interval of 0.41 to 0.90), and this difference is also statistically significant (p = 0).

All comparisons are highly significant, indicating that there are clear differences in Sepal.Length between the species. Setosa is the smallest in Sepal.Length, followed by Versicolor, and Virginica has the largest Sepal.Length on average.

Chi-square test

A chi-square test is used to determine if there is a significant association between two categorical variables. It is commonly used to test the independence of two variables.

# Load necessary libraries
library(stats)

# Load the iris dataset (it's built-in)
data(iris)

# Chi-square test (testing independence between species and petal length)
chisq.test(table(iris$Species, cut(iris$Petal.Length, breaks = c(1, 2, 3, 4, 5))))
## Warning in chisq.test(table(iris$Species, cut(iris$Petal.Length, breaks = c(1,
## : Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(iris$Species, cut(iris$Petal.Length, breaks = c(1, 2, 3,     4, 5)))
## X-squared = 114.49, df = 6, p-value < 2.2e-16

A Chi-square test of independence was conducted to assess whether there is an association between the species of the Iris flower and petal length, categorized into 4 intervals (1–2, 2–3, 3–4, 4–5; note that cut() codes petal lengths outside these breaks, including values above 5, as NA, so those observations are dropped from the table). The test yielded a Chi-squared statistic of 114.49 with 6 degrees of freedom, and a p-value less than 2.2e-16, indicating a highly significant result. This suggests that there is a strong association between species and petal length categories, meaning that the distribution of petal lengths varies significantly across different species.

However, the warning indicates that the Chi-squared approximation may not be valid, which could be due to small expected frequencies in some cells. In such cases, it is often recommended to use Fisher’s Exact Test, or check if expected frequencies are sufficiently large (typically greater than 5). Despite this, the p-value is so small that the association between species and petal length categories is still clearly significant.

Principal component analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that is used to reduce the number of variables in a dataset while retaining as much information as possible. It is commonly used to visualize high-dimensional data and to identify patterns in the data.

# Load required library
library(ggplot2)
library(stats)

attach(iris)
## The following objects are masked from iris (pos = 11):
## 
##     Petal.Length, Petal.Width, Sepal.Length, Sepal.Width, Species
# Apply PCA on the four numeric columns (centered and scaled to unit variance)
iris_pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)
iris_pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
summary(iris_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
# Extract PC scores
pc_scores <- as.data.frame(iris_pca$x[, 1:2])


# Combine PC scores with Species
pc_data <- cbind(pc_scores, Species = iris$Species)


# Plot PCA (2D)
ggplot(pc_data, aes(PC1, PC2, color = Species)) +
  geom_point() +
  labs(title = "PCA (2D) of Iris Dataset",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

Principal Component Analysis (PCA) was performed on the Iris dataset, excluding the species labels, to reduce the dimensionality of the data. The PCA summary indicates that the first two principal components (PC1 and PC2) explain 95.81% of the total variance in the data, with PC1 accounting for 72.96% and PC2 contributing 22.85%. This suggests that the first two components alone capture most of the variation in the dataset, which is ideal for visualization and further analysis.

The standard deviations of the components show that PC1 has the highest variability (1.7084), followed by PC2 (0.9560), PC3 (0.3831), and PC4 (0.1439), indicating the order of importance of the components in explaining the data’s variance. The rotation matrix provides the weights (loadings) for each variable on the components. For PC1, Petal.Length (0.5804), Petal.Width (0.5649), and Sepal.Length (0.5211) have the highest positive loadings, while Sepal.Width has a negative loading (-0.2693), suggesting that PC1 is primarily influenced by petal and sepal dimensions. For PC2, Sepal.Width (negative loading of -0.9233) is the most influential variable, followed by Petal.Length and Petal.Width, which suggests that PC2 captures a contrast between sepal width and the other variables.

The PC scores (the transformed data in the new component space) were extracted for the first two components and combined with the species labels for further analysis. This will allow for a visualization of the data in two dimensions, making it easier to identify patterns or separations between the species based on the principal components.

PCA Dimension Contribution

The proportion of variance explained by each principal component can be visualized with a scree plot. This helps to decide how many components are worth retaining.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(iris_pca, addlabels = TRUE)

PCA Dimension Contribution (Heat map)

Plotting the contributions of variables to the principal components using a heat map can help visualize the importance of each variable in the PCA analysis.
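The chunk for this figure was not echoed (only the corrplot loading message below survives); a plausible sketch uses factoextra's get_pca_var() together with corrplot.

library(corrplot)
var_contrib <- get_pca_var(iris_pca)$contrib   # variable contributions (%) per component
corrplot(var_contrib, is.corr = FALSE)         # heat map of contributions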

## corrplot 0.92 loaded

PCA Dimension Contribution (Vector map)

PCA variable contributions can also be visualized using a vector map. This helps to understand the direction and magnitude of each variable’s contribution to the principal components.
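A sketch of the variable map (correlation circle) with factoextra's fviz_pca_var() follows; the contribution-based color gradient is an assumption about the original figure.

fviz_pca_var(iris_pca,
             col.var = "contrib",                        # color by contribution
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE)                               # avoid label overlap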

PCA 2D plot (scatter with marked data point)
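The original figure for this section is not shown; a minimal sketch with factoextra's fviz_pca_ind() plots individual observations in PC space (coloring by species is an assumption about what was "marked").

fviz_pca_ind(iris_pca,
             geom.ind = "point",
             habillage = iris$Species,   # color points by species
             addEllipses = FALSE)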

PCA Clustering

PCA clustering can be visualized using a scatter plot with ellipses representing the clusters. This helps to understand the distribution of the data points in the PCA space.
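Below is a sketch of k-means clusters drawn in PCA space with factoextra's fviz_cluster(); the choice of k = 3, the seed, and the ellipse type are assumptions.

set.seed(123)                                   # for reproducible clusters
km_pca <- kmeans(pc_scores, centers = 3, nstart = 25)
fviz_cluster(km_pca, data = pc_scores, ellipse.type = "norm")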

Interactive plots

The interactive plots are created using the plotly library, which makes it possible to build interactive graphics in R with just a few lines of code.

In this plot, a 3D scatter plot is created using the plotly library. The x-axis represents the sepal length, the y-axis represents the sepal width, and the z-axis represents the petal length. The points are colored based on the species of the iris flower.
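The chunk that produced the 3D scatter was not echoed; here is a minimal sketch with plotly's plot_ly() (marker styling left at defaults).

library(plotly)
plot_ly(iris,
        x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Length,
        color = ~Species,
        type = "scatter3d", mode = "markers")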

Bonus!

3D sine curve

A 3D sine curve, or a sinusoidal helix, is a 3-dimensional representation of a sine wave, extending into the third dimension, often visualized as a spiral or a wave-like structure oscillating in three-dimensional space. It is commonly used in mathematics, physics, and engineering to represent periodic phenomena in three dimensions. The sine function oscillates between -1 and 1, and when plotted in 3D, it creates a wave-like structure that can be visualized from different angles.
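The original chunk is hidden; a minimal sketch of a sinusoidal helix uses plotly and the standard parameterization x = cos(t), y = sin(t), z = t (the range and resolution below are assumptions).

library(plotly)
theta <- seq(0, 8 * pi, length.out = 500)   # parameter along the helix
plot_ly(x = cos(theta), y = sin(theta), z = theta,
        type = "scatter3d", mode = "lines")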

Support Vector Regression Surface Curve

SVR is a type of Support Vector Machine (SVM) that is used for regression tasks. It works by finding the hyperplane that best fits the data points while minimizing the error. In this example, we will use SVR to predict the petal width based on the sepal length and sepal width. The surface plot shows the predicted values of the petal width based on the sepal length and sepal width.
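The chunk behind this figure is not shown (its loading messages below mention tidymodels, kernlab, and pracma); a simpler sketch of the same idea with e1071::svm and plotly follows, where the grid resolution and the default radial kernel are assumptions.

library(e1071)
library(plotly)

# Fit an SVR model: predict petal width from sepal length and width
svr_fit <- svm(Petal.Width ~ Sepal.Length + Sepal.Width, data = iris)

# Evaluate the fitted model on a regular grid of the two predictors
grid_x <- seq(min(iris$Sepal.Length), max(iris$Sepal.Length), length.out = 40)
grid_y <- seq(min(iris$Sepal.Width),  max(iris$Sepal.Width),  length.out = 40)
grid   <- expand.grid(Sepal.Length = grid_x, Sepal.Width = grid_y)

# plotly expects z[row, col] with rows matching y and columns matching x,
# so transpose the prediction matrix
z <- t(matrix(predict(svr_fit, grid), nrow = length(grid_x)))

plot_ly(x = grid_x, y = grid_y, z = z, type = "surface")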

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## 
## ✔ broom        1.0.6     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.2
## ✔ recipes      1.1.0     
## 
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks plotly::filter(), stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ purrr::lift()            masks caret::lift()
## ✖ rsample::permutations()  masks e1071::permutations()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## ✖ tune::tune()             masks parsnip::tune(), e1071::tune()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
## 
## 
## Attaching package: 'kernlab'
## 
## 
## The following object is masked from 'package:scales':
## 
##     alpha
## 
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## 
## 
## 
## Attaching package: 'pracma'
## 
## 
## The following objects are masked from 'package:kernlab':
## 
##     cross, eig, size
## 
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## 
## The following object is masked from 'package:e1071':
## 
##     sigmoid