Introduction
In today's data-driven world, where the ability to extract meaningful insights has become essential, real-world datasets have grown increasingly important across many areas of research. Real-world data is often messy: it contains missing values, outliers, and non-linear patterns that require rigorous analytical methods to interpret correctly. Data analysis plays a central role in turning raw data into actionable knowledge, enabling better decision-making and strategic planning.

This analysis therefore focuses on exploring and modeling a real-world dataset. Many techniques and steps are used during data analysis, including data cleaning, exploratory data analysis (EDA), and the application of appropriate statistical and machine learning models. The goal is not only to understand the data but also to generate predictive insights that can inform policy, research, and business decisions.

In this report we present a comprehensive analysis of a publicly available dataset, demonstrating the use of a variety of techniques and tools on plant-related data. The analysis emphasizes data visualization, comparison, classification, statistical modeling, and principal component analysis (PCA). Applying these techniques allows us to extract valuable insights from publicly available data and to use data science to address real-life questions and challenges.

The dataset we analyze is the iris flower dataset (iris data), which consists of measurements of several attributes of iris flowers. It is widely used for teaching statistics and machine learning, and it provides an excellent example for demonstrating a range of data analysis techniques, including exploration, preprocessing, and classification. The iris data consists of 150 samples from three species of iris (Iris setosa, Iris versicolor, and Iris virginica). Each sample has four features: the lengths and widths of the sepals and petals. The dataset is often used for testing machine learning algorithms and serves as a benchmark for classification techniques.
By applying a structured and methodical approach to analyzing
real-world data, this project demonstrates the practical use of data
science tools and techniques in addressing real-life questions and
challenges. To analyze data, we need to understand the data first. This
is the first step in any data analysis process. In this report, we will
cover the following topics:
- Summary of the dataset
- Data visualization
- Regression
- Clustering
- Classification
- Statistical analysis
- Principal component analysis (PCA)
- Interactive plots
- 3D sine curve and Support Vector Regression surface curve
Iris Dataset
Iris is a dataset that consists of 150 samples from three species of iris flowers (Iris setosa, Iris versicolor, and Iris virginica). Four features were measured from each sample: the lengths and the widths of the sepals and petals. The dataset is often used for testing machine learning algorithms, and it is available in the R programming language as a built-in dataset. Here is a brief preview of the dataset:
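The preview below can be reproduced with a call like the following (a minimal sketch):
data(iris)  # load the built-in dataset
head(iris)  # first six rows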
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Summary of iris dataset
To understand the data, we need to summarize it. This will help us to
understand the data better and to identify any potential issues with the
data.
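The summary shown below was presumably generated with:
summary(iris)  # five-number summary and mean for numeric columns, counts for Species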
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Histogram - Distribution
Histograms are used to visualize the distribution of a dataset. They
are useful for understanding the shape of the data and for identifying
any potential outliers.
p = ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(binwidth = 0.1, fill = "lightblue", color = "black") +
theme_minimal() +
labs(title = "Histogram of Sepal Length",
x = "Sepal Length",
y = "Frequency")
# ggplotly(p)
p
Scatter plot
Scatter plots are used to visualize the relationship between two continuous variables. They are useful for identifying any potential correlations between the variables.
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
labs(title = "Scatter Plot of Sepal Length vs. Sepal Width",
x = "Sepal Length",
y = "Sepal Width")
Box plot
Box plots are used to visualize the distribution of a dataset. They are useful for identifying any potential outliers and for comparing the distributions of different groups.
data(iris)
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
labs(title = "Box Plot of Sepal Length by Species",
x = "Species",
y = "Sepal Length")
Violin plot
Violin plots are used to visualize the distribution of a dataset.
They are similar to box plots, but they also show the density of the
data.
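A violin plot in the same style as the box plot above can be sketched as follows:
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin() +
  labs(title = "Violin Plot of Sepal Length by Species",
       x = "Species",
       y = "Sepal Length")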
Correlation Matrix
Correlation matrices are used to visualize the correlation between different variables. They are useful for identifying any potential relationships between the variables.
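The correlation matrix below can be reproduced with:
cor(iris[, 1:4])  # pairwise Pearson correlations of the four numeric columns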
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Heat map of correlation matrix 1
Heat maps are used to visualize the correlation between different
variables. They are useful for identifying any potential relationships
between the variables.
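A minimal base-R sketch of such a heat map, assuming the correlation matrix from above:
cor_matrix <- cor(iris[, 1:4])
heatmap(cor_matrix, Rowv = NA, Colv = NA, symm = TRUE,
        main = "Heat Map of Correlation Matrix")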
Heat map of correlation matrix 2
In this plot, the correlation matrix is shown in a different way. The
colors are used to represent the strength of the correlation between the
variables.
Heat map of correlation matrix 3
This plot is similar to the previous one, but it uses a different
color scheme. The colors are used to represent the strength of the
correlation between the variables.
Pair plot
Pair plots are used to visualize the relationship between multiple
variables. They are useful for identifying any potential correlations
between the variables.
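A pair plot colored by species can be drawn with the GGally package, roughly as follows (a sketch):
library(GGally)
ggpairs(iris, aes(color = Species, alpha = 0.5))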
Regression
Regression is used to model the relationship between a dependent
variable and one or more independent variables. It is used to predict
the value of the dependent variable based on the values of the
independent variables.
Linear Regression
Linear regression is used to model the relationship between a
dependent variable and one independent variable. It is used to predict
the value of the dependent variable based on the value of the
independent variable.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Linear Regression: Sepal Length vs. Petal Length",
x = "Petal Length",
y = "Sepal Length")
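The model summary below was presumably produced with a call like this (the object name fit_lm is an assumption; the formula matches the Call shown in the output):
fit_lm <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(fit_lm)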
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24675 -0.29657 -0.01515 0.27676 1.00269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
## Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
Polynomial Regression
Polynomial regression models the relationship between a dependent variable and one independent variable using polynomial terms, which allows curved (non-linear) trends to be captured. This plot shows the polynomial regression line fitted to the data. The blue line represents the fitted polynomial regression curve, and the shaded area represents the confidence interval around it. The independent variable is the petal length, and the dependent variable is the sepal length.
The polynomial regression model explores the relationship between a
dependent variable y and an independent variable
x by including polynomial terms up to the third degree
(i.e., \(x\), \(x^2\), and \(x^3\)).
#define data
x <- iris$Petal.Length
y <- iris$Sepal.Length
#fit polynomial regression model
fit <- lm(y ~ x + I(x^2) + I(x^3))
summary(fit)
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06434 -0.24523 0.00707 0.19869 0.92755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.64817 0.45873 10.133 <2e-16 ***
## x 0.27811 0.48046 0.579 0.564
## I(x^2) -0.04428 0.13454 -0.329 0.743
## I(x^3) 0.01055 0.01123 0.939 0.349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.364 on 146 degrees of freedom
## Multiple R-squared: 0.8106, Adjusted R-squared: 0.8067
## F-statistic: 208.3 on 3 and 146 DF, p-value: < 2.2e-16
The model appears to fit the data well, with a Multiple
R-squared value of 0.8106, indicating that approximately
81% of the variability in y is explained
by the model. The Adjusted R-squared value of
0.8067 confirms that the model maintains strong
explanatory power even after adjusting for the number of predictors.
Additionally, the F-statistic of 208.3
and the associated p-value (< 2.2e-16) indicate that
the model is statistically significant overall.
Despite the model’s good overall fit, none of the individual
predictors (x, \(x^2\), or \(x^3\)) are statistically significant at the
0.05 level. The p-values for the linear term x (p = 0.564),
the quadratic term I(x^2) (p = 0.743), and the cubic term
I(x^3) (p = 0.349) are all relatively large, suggesting
that individually, these terms do not significantly contribute to
explaining the variation in y. However, it’s possible that
the combined effect of the polynomial terms still
captures the pattern in the data, which explains the high R-squared
value. The residual standard error is 0.364, and the
residuals are reasonably small and centered around zero, indicating a
decent model fit. Overall, the model captures a non-linear trend in the
data, but further investigation may be needed to simplify the model or
confirm the relevance of each polynomial term.
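One way to check whether the higher-order terms are worth keeping is a nested-model F-test, sketched below (object names are assumptions):
fit_linear <- lm(Sepal.Length ~ Petal.Length, data = iris)
fit_cubic <- lm(Sepal.Length ~ poly(Petal.Length, 3), data = iris)
anova(fit_linear, fit_cubic)  # tests whether the quadratic and cubic terms jointly improve the fit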
Multivariate Polynomial Regression
Multivariate polynomial regression extends linear regression to model non-linear relationships between multiple input variables and a single output variable by including polynomial terms. Fitting a polynomial function to the data allows more complex relationships to be captured than with standard linear regression.
# Load necessary libraries
library(stats)
# Load the iris dataset (it's built-in)
data(iris)
# Perform polynomial regression (quadratic model)
poly_model <- lm(Sepal.Length ~ poly(Sepal.Width, 2) + poly(Petal.Length, 2) + poly(Petal.Width, 2), data = iris)
# Print summary of polynomial regression
summary(poly_model)
##
## Call:
## lm(formula = Sepal.Length ~ poly(Sepal.Width, 2) + poly(Petal.Length,
## 2) + poly(Petal.Width, 2), data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85830 -0.21065 0.00061 0.19278 0.77325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.84333 0.02509 232.877 < 2e-16 ***
## poly(Sepal.Width, 2)1 2.99803 0.40359 7.428 9.12e-12 ***
## poly(Sepal.Width, 2)2 0.34547 0.31951 1.081 0.28141
## poly(Petal.Length, 2)1 12.74168 1.78665 7.132 4.54e-11 ***
## poly(Petal.Length, 2)2 1.59442 0.58991 2.703 0.00771 **
## poly(Petal.Width, 2)1 -2.82015 1.72498 -1.635 0.10427
## poly(Petal.Width, 2)2 -0.95176 0.67450 -1.411 0.16040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3073 on 143 degrees of freedom
## Multiple R-squared: 0.8678, Adjusted R-squared: 0.8623
## F-statistic: 156.5 on 6 and 143 DF, p-value: < 2.2e-16
The multiple linear regression model aims to predict Sepal.Length using second-degree polynomial transformations of Sepal.Width, Petal.Length, and Petal.Width from the Iris dataset. The use of polynomial terms allows the model to capture non-linear relationships between the predictors and the response variable. The results show that the model fits the data well, with a Multiple R-squared value of 0.8678, indicating that approximately 86.8% of the variation in Sepal.Length is explained by the model. The Adjusted R-squared value of 0.8623 confirms the model’s strong explanatory power even after accounting for the number of predictors.
Among the predictors, the first-degree polynomial terms of Sepal.Width and Petal.Length are highly significant, with very low p-values (< 0.001), suggesting a strong influence on Sepal.Length. The second-degree term of Petal.Length is also statistically significant (p = 0.0077), indicating some non-linear effects. However, both polynomial terms of Petal.Width are not statistically significant, with p-values greater than 0.1, suggesting that Petal.Width may not have a meaningful contribution in this form. The residual standard error is 0.3073, and the residuals are relatively small and centered around zero, indicating a good model fit. Overall, the model demonstrates a strong ability to predict Sepal.Length, particularly due to the influence of Sepal.Width and Petal.Length, while Petal.Width appears to be a weaker predictor in this context.
Clustering
Clustering is used to group similar data points together. It is an unsupervised learning technique that identifies patterns in the data without relying on predefined labels.
K-Means Clustering
K-Means clustering is a method used to group similar data points
together. It is an unsupervised learning algorithm that is used to
partition a dataset into K clusters.
# Load necessary libraries
library(stats)
library(ggplot2) # For data visualization
# Load the iris dataset (it's built-in)
data(iris)
# Select only the numeric columns for clustering
data_for_clustering <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Perform K-Means clustering with 3 clusters
k <- 3 # Number of clusters
kmeans_result <- kmeans(data_for_clustering, centers = k)
Clustering result analysis
The plot shows the result of the K-Means clustering algorithm: the clusters are represented by different colors, and the centroids of the clusters are shown as black crosses.
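Such a plot could be produced along these lines (a sketch using the objects defined above; the petal dimensions are chosen here for readability):
ggplot(data_for_clustering,
       aes(x = Petal.Length, y = Petal.Width,
           color = factor(kmeans_result$cluster))) +
  geom_point() +
  geom_point(data = as.data.frame(kmeans_result$centers),
             aes(x = Petal.Length, y = Petal.Width),
             color = "black", shape = 4, size = 4, inherit.aes = FALSE) +
  labs(title = "K-Means Clusters", color = "Cluster")
The printed kmeans_result object is shown below.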
## K-means clustering with 3 clusters of sizes 33, 96, 21
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.175758 3.624242 1.472727 0.2727273
## 2 6.314583 2.895833 4.973958 1.7031250
## 3 4.738095 2.904762 1.790476 0.3523810
##
## Clustering vector:
## [1] 1 3 3 3 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 1 1 1 3 1 1
## [38] 1 3 1 1 3 3 1 1 3 1 3 1 1 2 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2
##
## Within cluster sum of squares by cluster:
## [1] 6.432121 118.651875 17.669524
## (between_SS / total_SS = 79.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The K-means clustering algorithm was applied to the Iris dataset, specifying three clusters; the run shown above produced cluster sizes of 33, 96, and 21 observations. Note that kmeans starts from random centroids, so cluster labels and sizes can vary between runs unless a seed is fixed with set.seed(). The cluster means show that Clusters 1 and 3 both contain small-petaled flowers (mean Petal.Length of 1.47 and 1.79, Petal.Width of 0.27 and 0.35), so together they largely correspond to Iris setosa, which this run split into two groups. Cluster 2, with larger values in all dimensions, especially Petal.Length (4.97) and Petal.Width (1.70), merges the larger-flowered Iris versicolor and Iris virginica.
The between-cluster differences explain 79.0% of the total variance (between_SS / total_SS = 79.0%), indicating reasonably well-separated clusters. The within-cluster sum of squares shows that Cluster 1 is the most compact (6.43), while Cluster 2 is the most dispersed (118.65), reflecting the variability of the merged versicolor/virginica group. The clustering vector confirms that all 150 observations were assigned to one of the three clusters. Overall, this unsupervised approach recovered much of the structure in the morphological measurements without using the species labels, although this particular initialization did not reproduce the three-species split exactly.
Classification
Classification is used to predict the class of a given data point. It is used to classify the data points into different classes based on the values of the independent variables.
SVM Model
Support Vector Machines (SVM) is a supervised learning algorithm that is used for classification and regression tasks. It works by finding the hyperplane that best separates the data points of different classes.
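The confusion matrix below was presumably produced by a workflow like the following sketch (the split proportion, seed, and object names are assumptions; conf_matrix is the object referenced in the plotting code further down):
library(e1071)
library(caret)
set.seed(123)  # hypothetical seed; the original train/test split is not shown
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[train_idx, ]
test <- iris[-train_idx, ]
svm_model <- svm(Species ~ ., data = train)
pred <- predict(svm_model, test)
conf_matrix <- confusionMatrix(pred, test$Species)
print("Confusion Matrix:")
print(conf_matrix)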
## [1] "Confusion Matrix:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 1
## virginica 0 0 9
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.9000
## Specificity 1.0000 0.9500 1.0000
## Pos Pred Value 1.0000 0.9091 1.0000
## Neg Pred Value 1.0000 1.0000 0.9524
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3000
## Detection Prevalence 0.3333 0.3667 0.3000
## Balanced Accuracy 1.0000 0.9750 0.9500
## [1] "Accuracy: 0.966666666666667"
The Support Vector Machine (SVM) model demonstrated excellent performance in classifying the three Iris species—Setosa, Versicolor, and Virginica. The overall accuracy of the model is 96.67%, meaning that 29 out of 30 observations in the test set were correctly classified. The 95% confidence interval for the accuracy is (82.78%, 99.92%), indicating strong and reliable performance. The Kappa statistic is 0.95, suggesting a very high level of agreement between predicted and actual classifications beyond chance.
The confusion matrix shows that all 10 instances of Setosa were correctly identified, reflecting perfect performance for that class. Versicolor also achieved perfect sensitivity (1.00), although one Virginica was misclassified as Versicolor, resulting in a slight drop in sensitivity for Virginica (0.90). Nevertheless, both Versicolor and Virginica maintained positive predictive values of over 90%, indicating that when the model predicts a class, it is highly likely to be correct.
Class-wise, the model achieved perfect balanced accuracy (1.00) for Setosa, and near-perfect scores for Versicolor (0.975) and Virginica (0.95). These high scores in sensitivity, specificity, and precision across classes indicate that the model effectively differentiates between species, particularly distinguishing Setosa with complete accuracy. Overall, the SVM classifier proved to be highly accurate, robust, and reliable for species classification in the Iris dataset.
SVM Classification confusion matrix
cm = conf_matrix
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels=rev(levels(plt$Prediction)))
ggplot(plt, aes(Prediction,Reference, fill= Freq)) +
geom_tile() +
geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="skyblue") +
labs(x = "Prediction",
y = "Reference")
Statistical Analysis
Statistical analysis is the process of collecting, organizing, exploring, interpreting, and presenting data to uncover underlying patterns, trends, relationships, and insights.
Data Distribution
Normal distribution
The normal distribution is a continuous probability distribution that is symmetric about the mean. It is often used to model real-world data that follows a bell-shaped curve.
# Normal distribution
hist(iris$Sepal.Length, probability = TRUE, main = "Histogram of Sepal Length")
lines(density(iris$Sepal.Length), col = "blue")
T-test
A t-test is used to determine if there is a significant difference between the means of two groups. It is commonly used to compare the means of two independent samples.
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]
t_test_result <- t.test(setosa, virginica)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: setosa and virginica
## t = -15.386, df = 76.516, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.78676 -1.37724
## sample estimates:
## mean of x mean of y
## 5.006 6.588
The Welch Two Sample t-test was conducted to determine whether there is a significant difference in the mean values between the Setosa and Virginica species (likely for a variable such as Sepal.Length, based on the means provided). The test produced a t-value of -15.386 with approximately 76.5 degrees of freedom, and a highly significant p-value < 2.2e-16. This extremely small p-value provides strong evidence against the null hypothesis, indicating that the true difference in means between Setosa and Virginica is statistically significant. The 95% confidence interval for the difference in means ranges from -1.79 to -1.38, suggesting that Virginica has, on average, a significantly greater mean than Setosa by approximately 1.38 to 1.79 units. The sample means reinforce this difference, with Setosa averaging 5.006 and Virginica averaging 6.588. Overall, this analysis confirms a clear and meaningful difference between the two species in the measured trait.
ANOVA
ANOVA (Analysis of Variance) is used to determine if there are any statistically significant differences between the means of three or more independent groups. It is commonly used to compare the means of multiple groups.
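The ANOVA table below corresponds to a model of the following form (the object name is an assumption; the formula is confirmed by the Fit line in the Tukey output further down):
anova_model <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_model)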
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A one-way ANOVA was conducted to examine whether there are statistically significant differences in Sepal.Length among the three Iris species. The results indicate a highly significant effect of Species on Sepal.Length, with an F-value of 119.3 and a p-value less than 2e-16. This extremely small p-value strongly suggests that the mean Sepal.Length differs significantly across at least two of the species groups. The sum of squares between groups (Species) is 63.21, much larger than the residual (within-group) sum of squares of 38.96, reinforcing that the variation in Sepal.Length is primarily due to differences between species rather than random variation within groups. Given these results, we reject the null hypothesis that all species have the same mean Sepal.Length. A post-hoc test (such as Tukey’s HSD) would be appropriate next to determine specifically which species differ from each other.
Tukey’s post-hoc test
Tukey’s post-hoc test is used to determine which specific groups differ after performing ANOVA. It is commonly used for pairwise comparisons of group means.
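The pairwise comparisons below come from applying TukeyHSD to the fitted ANOVA model (object name as in the sketch above):
TukeyHSD(anova_model)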
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
The Tukey’s HSD test was performed to determine which specific species’ mean Sepal.Length differ from each other. The results show significant differences between all pairwise comparisons of species, with the following findings:
- Versicolor vs. Setosa: The mean difference in Sepal.Length is 0.930 (with a 95% confidence interval of 0.69 to 1.17), and this difference is statistically significant (p = 0).
- Virginica vs. Setosa: The mean difference is 1.582 (with a 95% confidence interval of 1.34 to 1.83), also significant (p = 0).
- Virginica vs. Versicolor: The mean difference is 0.652 (with a 95% confidence interval of 0.41 to 0.90), and this difference is also statistically significant (p = 0).
All comparisons are highly significant, indicating that there are clear differences in Sepal.Length between the species. Setosa is the smallest in Sepal.Length, followed by Versicolor, and Virginica has the largest Sepal.Length on average.
Chi-square test
A chi-square test is used to determine if there is a significant
association between two categorical variables. It is commonly used to
test the independence of two variables.
# Load necessary libraries
library(stats)
# Load the iris dataset (it's built-in)
data(iris)
# Chi-square test (testing independence between species and petal length)
chisq.test(table(iris$Species, cut(iris$Petal.Length, breaks = c(1, 2, 3, 4, 5))))
## Warning in chisq.test(table(iris$Species, cut(iris$Petal.Length, breaks = c(1,
## : Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(iris$Species, cut(iris$Petal.Length, breaks = c(1, 2, 3, 4, 5)))
## X-squared = 114.49, df = 6, p-value < 2.2e-16
A Chi-square test of independence was conducted to assess whether there is an association between the species of the Iris flower and petal length (categorized into 4 intervals: 1–2, 2–3, 3–4, 4–5). The test yielded a Chi-squared statistic of 114.49 with 6 degrees of freedom, and a p-value less than 2.2e-16, indicating a highly significant result. This suggests that there is a strong association between species and petal length categories—meaning that the distribution of petal lengths varies significantly across different species.
However, the warning indicates that the Chi-squared approximation may not be valid, which could be due to small expected frequencies in some cells. In such cases, it is often recommended to use Fisher’s Exact Test, or check if expected frequencies are sufficiently large (typically greater than 5). Despite this, the p-value is so small that the association between species and petal length categories is still clearly significant.
Principal component analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique that is used to reduce the number of variables in a dataset while retaining as much information as possible. It is commonly used to visualize high-dimensional data and to identify patterns in the data.
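The output below matches a standardized PCA of the four numeric columns, along these lines (the object name iris_pca is reused later in the report):
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)
print(iris_pca)    # standard deviations and rotation (loadings)
summary(iris_pca)  # importance of components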
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
# Extract PC scores
pc_scores <- as.data.frame(iris_pca$x[, 1:2])
# Combine PC scores with Species
pc_data <- cbind(pc_scores, Species = iris$Species)
# Plot PCA (2D)
ggplot(pc_data, aes(PC1, PC2, color = Species)) +
geom_point() +
labs(title = "PCA (2D) of Iris Dataset",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
Principal Component Analysis (PCA) was performed on the Iris dataset, excluding the species labels, to reduce the dimensionality of the data. The PCA summary indicates that the first two principal components (PC1 and PC2) explain 95.81% of the total variance in the data, with PC1 accounting for 72.96% and PC2 contributing 22.85%. This suggests that the first two components alone capture most of the variation in the dataset, which is ideal for visualization and further analysis.
The standard deviations of the components show that PC1 has the highest variability (1.7084), followed by PC2 (0.9560), PC3 (0.3831), and PC4 (0.1439), indicating the order of importance of the components in explaining the data’s variance. The rotation matrix provides the weights (loadings) for each variable on the components. For PC1, Petal.Length (0.5804), Petal.Width (0.5649), and Sepal.Length (0.5211) have the highest positive loadings, while Sepal.Width has a negative loading (-0.2693), suggesting that PC1 is primarily influenced by petal and sepal dimensions. For PC2, Sepal.Width (negative loading of -0.9233) is the most influential variable, followed by Petal.Length and Petal.Width, which suggests that PC2 captures a contrast between sepal width and the other variables.
The PC scores (the transformed data in the new component space) were extracted for the first two components and combined with the species labels for further analysis. This will allow for a visualization of the data in two dimensions, making it easier to identify patterns or separations between the species based on the principal components.
PCA Dimension Contribution
The contribution of each variable to the principal components can be visualized using a bar plot. This helps to understand which variables are most important in explaining the variance in the data.
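A bar plot of the variable contributions can be drawn with the factoextra package, roughly as follows (a sketch):
library(factoextra)
fviz_contrib(iris_pca, choice = "var", axes = 1)  # contributions to PC1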
PCA Dimension Contribution (Heat map)
Plotting the contributions of variables to the principal components using a heat map can help visualize the importance of each variable in the PCA analysis.
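One way to draw such a heat map is corrplot applied to the contribution matrix from factoextra (a sketch):
library(corrplot)
var_contrib <- get_pca_var(iris_pca)$contrib  # variables x components
corrplot(var_contrib, is.corr = FALSE)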
PCA Dimension Contribution (Vector map)
PCA variable contributions can also be visualized using a vector map.
This helps to understand the direction and magnitude of each variable’s
contribution to the principal components.
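A sketch of such a vector (correlation-circle) plot with factoextra:
fviz_pca_var(iris_pca, col.var = "contrib")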
PCA 2D plot (scatter with marked data point)
PCA Clustering
PCA clustering can be visualized using a scatter plot with ellipses
representing the clusters. This helps to understand the distribution of
the data points in the PCA space.
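A sketch of a PCA scatter plot with concentration ellipses around each species group:
fviz_pca_ind(iris_pca, habillage = iris$Species, addEllipses = TRUE, label = "none")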
Interactive plots
The interactive plots are created using the plotly library. The plotly library is used to create interactive plots in R. It is a powerful library that allows you to create interactive plots with just a few lines of code.
In this plot, the 3D scatter plot is created using the plotly
library. The x-axis represents the sepal length, the y-axis represents
the sepal width, and the z-axis represents the petal length. The points
are colored based on the species of the iris flower.
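The 3D scatter plot described above can be created roughly as follows (a sketch):
library(plotly)
plot_ly(iris, x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Length,
        color = ~Species, type = "scatter3d", mode = "markers")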
Bonus!
3D sine curve
A 3D sine curve, or a sinusoidal helix, is a 3-dimensional
representation of a sine wave, extending into the third dimension, often
visualized as a spiral or a wave-like structure oscillating in
three-dimensional space. It is commonly used in mathematics, physics,
and engineering to represent periodic phenomena in three dimensions. The
sine function oscillates between -1 and 1, and when plotted in 3D, it
creates a wave-like structure that can be visualized from different
angles.
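A helix of this kind can be plotted interactively with plotly (a sketch):
t <- seq(0, 6 * pi, length.out = 500)
plot_ly(x = sin(t), y = cos(t), z = t, type = "scatter3d", mode = "lines")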
Support Vector Regression Surface Curve
SVR is a type of Support Vector Machine (SVM) that is used for regression tasks. It works by finding the hyperplane that best fits the data points while minimizing the error. In this example, we will use SVR to predict the petal width based on the sepal length and sepal width. The surface plot shows the predicted values of the petal width based on the sepal length and sepal width.
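One way to build such a surface is to fit an SVR model with e1071 and predict over a grid of sepal measurements (a sketch; the grid resolution and object names are assumptions):
library(e1071)
library(plotly)
svr_model <- svm(Petal.Width ~ Sepal.Length + Sepal.Width, data = iris)
sl <- seq(min(iris$Sepal.Length), max(iris$Sepal.Length), length.out = 40)
sw <- seq(min(iris$Sepal.Width), max(iris$Sepal.Width), length.out = 40)
grid <- expand.grid(Sepal.Length = sl, Sepal.Width = sw)
pred <- matrix(predict(svr_model, grid), nrow = length(sl))  # rows follow sl
plot_ly(x = sl, y = sw, z = t(pred), type = "surface")  # transposed so z rows follow the y (Sepal.Width) axis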