data(iris)
# Set the color of the plot for each species
colors <- c("skyblue", "lightgreen", "lightpink")
# Draw the boxplot, showing each species in a different color
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length by Species",
        xlab = "Species",
        ylab = "Sepal Length (cm)",
        col = colors,          # Set boxplot colors
        border = "darkblue",   # Box border color
        notch = TRUE,          # Notches give an approximate confidence interval around the median
        varwidth = TRUE,       # Box widths proportional to the square root of the number of observations
        outline = TRUE)        # Draw outliers as individual points
# Improve readability by adding a grid
grid(nx = NULL, ny = NULL, lty = 2, col = "lightgray", lwd = 0.8)
WEEK 5
LOADING PACKAGES
LOADING DATA
LOOK AT THE GRAPH AND IDENTIFY THE TESTS:
The image appears to be a boxplot showing the distribution of sepal lengths across three flower species: setosa, versicolor, and virginica.
Several statistical tests can be associated with the data shown:
- ANOVA (Analysis of Variance): The boxplot compares a continuous variable (sepal length) across three groups (the species). An ANOVA test is appropriate for evaluating whether mean sepal length differs significantly among the species.
- Kruskal-Wallis Test: If the data do not meet the assumptions of ANOVA, such as normality or homogeneity of variance, the Kruskal-Wallis test can be used. It is a non-parametric alternative that compares the distributions between the species.
- Tukey’s Honest Significant Difference (HSD) Test: If ANOVA indicates significant differences, a post-hoc test such as Tukey’s HSD can be applied to determine which specific pairs of species differ significantly in sepal length.
- Levene’s Test: This test checks for homogeneity of variances between the groups and may be run before performing ANOVA.
Which of these tests to use depends on the assumptions and the properties of the data. In each case the goal is to analyze differences in sepal length between the species.
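The tests named for the boxplot can be sketched in base R as follows. This is a minimal illustration, not a prescribed workflow. The object names (`fit_aov`, `kw`, `tukey`, `abs_dev`, `levene_like`) are chosen here for illustration, and the Levene-type check is written out by hand as an ANOVA on absolute deviations from each species' median (an assumption about which variant is wanted) so that no extra package is needed.

```r
# Sketch only: base-R versions of the tests discussed for the boxplot
data(iris)

# One-way ANOVA: do mean sepal lengths differ among species?
fit_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(fit_aov)

# Kruskal-Wallis: rank-based alternative when ANOVA assumptions are doubtful
kw <- kruskal.test(Sepal.Length ~ Species, data = iris)
kw

# Tukey's HSD: pairwise comparisons following the ANOVA
tukey <- TukeyHSD(fit_aov)
tukey

# Levene-type check written out by hand (assumed median-centred variant):
# an ANOVA on the absolute deviations from each species' median
abs_dev <- abs(iris$Sepal.Length -
               ave(iris$Sepal.Length, iris$Species, FUN = median))
levene_like <- anova(lm(abs_dev ~ iris$Species))
levene_like
```

If the `car` package is available, `car::leveneTest()` offers a packaged version of the last step.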
The image provided is a density plot of petal length for three species: setosa, versicolor, and virginica. The distribution of petal length varies by species.
- ANOVA (Analysis of Variance): The plot compares a continuous variable, petal length, across three groups (the flower species). ANOVA can be used to test for significant differences in mean petal length among the species.
- Kruskal-Wallis Test: If the data do not meet the assumptions of ANOVA (normality and homogeneity of variance), the Kruskal-Wallis test can be used as a non-parametric alternative to compare the distributions across species.
- Shapiro-Wilk Test (or other normality tests): This test is used to verify whether the petal length data for each species are normally distributed, which is an assumption of ANOVA.
- Levene’s Test or Bartlett’s Test: These tests evaluate whether variances are equal (homogeneity of variance) across the species groups. If the variances are not equal, the assumptions of ANOVA are not met and an alternative such as Welch’s ANOVA may be more fitting.
- Tukey’s HSD (Honestly Significant Difference): If ANOVA shows significant differences between species, Tukey’s HSD test can be used for post-hoc pairwise comparisons to find which species differ significantly in petal length.
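The assumption checks listed for the density plot can be sketched in base R like this. It is only an illustration. The object names (`shapiro_by_species`, `bart`, `welch`) are made up for this example, and Bartlett's test is shown rather than Levene's because it ships with base R.

```r
# Sketch only: assumption checks and Welch's ANOVA for Petal.Length
data(iris)

# Shapiro-Wilk normality test, run separately within each species
shapiro_by_species <- lapply(split(iris$Petal.Length, iris$Species),
                             shapiro.test)
sapply(shapiro_by_species, function(res) res$p.value)

# Bartlett's test for equal variances across species
bart <- bartlett.test(Petal.Length ~ Species, data = iris)
bart

# Welch's ANOVA: compares means without assuming equal variances
welch <- oneway.test(Petal.Length ~ Species, data = iris, var.equal = FALSE)
welch
```

A small p-value from Bartlett's test would point toward the Welch version rather than the classical ANOVA.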
The image is a scatter plot with a fitted regression line, showing the relationship between petal length and petal width across three flower species: setosa, versicolor, and virginica. Each species is depicted with a different marker and color. Outliers are visible in the plot and may reveal interesting information about the relationship between the variables.
The scatter plot represents the relationship between two variables in the data set, petal length and petal width. Examining it reveals patterns that indicate whether the two variables are related.
- Pearson’s Correlation: This statistical test is used to analyze the relationship between two variables. In this case, the variables are petal length and petal width. The test provides a correlation coefficient that measures the strength of their linear relationship and ranges from -1 to 1. A value near 1 indicates a strong positive linear relationship, a value near -1 indicates a strong negative linear relationship, and a value near 0 indicates no linear correlation between the two variables.
- Simple Linear Regression: Simple linear regression is a statistical method for modeling a linear relationship between two variables. If petal length increases as petal width increases, this suggests a linear relationship between them. For instance, a linear regression model can be used to predict petal width from petal length.
- Multiple Linear Regression (Analysis by Species): To see how species affects the relationship between petal length and petal width, you can include species as a categorical variable in the regression analysis. This extends simple linear regression to multiple regression by using dummy variables.
- ANOVA for Linear Regression: ANOVA can be used to judge the significance of the linear regression model. It assesses whether the slope of the regression line differs significantly from zero.
- Comparing Slopes Across Species: To examine whether the relationship between petal length and petal width differs across species, ANCOVA can be used. It allows the linear relationships to be compared while taking species into account.
- Residual Analysis: After fitting a linear regression model, you can examine the residuals, which are the differences between the observed and predicted values. They are used to evaluate goodness of fit, can reveal signs of heteroscedasticity, and help check that the model assumptions are not violated.
These tests and methods help to explore the linear relationship between petal length and petal width shown in the scatter plot, and to compare this relationship across species.
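The correlation and regression methods for the scatter plot can be sketched in base R as below. The model names (`fit_simple`, `fit_species`, `fit_slopes`) are illustrative. Comparing a common-slope model against an interaction model is one reasonable way to ask whether slopes differ across species, not the only one.

```r
# Sketch only: correlation, regression, and residual checks for the petals
data(iris)

# Pearson correlation between petal length and petal width
pearson <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")
pearson$estimate

# Simple linear regression: predict petal width from petal length
fit_simple <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(fit_simple)
anova(fit_simple)  # F-test of whether the slope differs from zero

# Multiple regression: Species enters as dummy variables (common slope)
fit_species <- lm(Petal.Width ~ Petal.Length + Species, data = iris)

# Interaction model: a separate slope for each species
fit_slopes <- lm(Petal.Width ~ Petal.Length * Species, data = iris)
anova(fit_species, fit_slopes)  # do the slopes differ across species?

# Residual analysis for the simple model: look for curvature or a funnel shape
res <- residuals(fit_simple)
plot(fitted(fit_simple), res,
     xlab = "Fitted petal width (cm)", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A funnel shape in the residual plot would suggest heteroscedasticity, and a curved pattern would suggest that a straight line is not an adequate fit.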
These visuals depict bar charts showing the distribution of one categorical variable, species, across the levels of another categorical variable, size. Given these data, certain statistical tests can be associated with the graphics.
- Chi-Square Test of Independence: This test assesses whether there is a significant association between two categorical variables.
• Null hypothesis: The distribution of size does not differ between the three species.
• Alternative hypothesis: The distribution of size differs between the three species.
- Fisher’s Exact Test: This test is similar to the chi-square test but is meant for small sample sizes. It calculates the exact probability of obtaining the observed data under the assumption of independence.
- Cochran-Mantel-Haenszel Test: This test assesses the association between two categorical variables while adjusting for a third categorical variable, the stratum. In this scenario, with location as the third categorical variable, it would test the association between species and size across different locations.
Note: The choice of test hinges on the specific research question and the data assumptions. Large sample sizes with expected frequencies greater than 5 suggest the chi-square test. Small sample sizes with expected frequencies less than 5 suggest Fisher’s exact test.
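The two tests of independence can be sketched in base R as follows. The `size` variable is built here with the rule used in the fourth-graph code of this report (petal width greater than 2 counts as "big"). The object names (`tab`, `chi`, `fisher`) are illustrative. The Cochran-Mantel-Haenszel test is not shown because iris has no stratifying variable such as location. With one, `mantelhaen.test()` from base R would be the function to look at.

```r
# Sketch only: tests of independence between species and size
data(iris)

# Assumption: size follows the report's rule, Petal.Width > 2 is "big"
size <- factor(ifelse(iris$Petal.Width > 2, "big", "small"),
               levels = c("small", "big"))
tab <- table(Species = iris$Species, Size = size)
tab

# Chi-square test of independence; $expected holds the expected counts
chi <- chisq.test(tab)
chi$expected

# Fisher's exact test, preferred when expected counts are small
fisher <- fisher.test(tab)
fisher$p.value
```

Inspecting `chi$expected` is a direct way to apply the rule of thumb in the note above.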
Reproduce the graphs using the dataset “iris”:
First graph:
Second graph:
# Load ggplot2 if it is not already attached
library(ggplot2)
# Create the density plot with improvements
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.6, color = "darkblue", linewidth = 1) + # Adjust transparency and add border
  labs(title = "Density Plot of Petal Length by Species",
       x = "Petal Length (cm)", # Add unit for clarity
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "lightgreen", "lightpink")) + # Custom colors for species
  theme_minimal(base_size = 15) + # Use a clean minimal theme
  theme(legend.position = "top", # Position legend on top for better layout
        plot.title = element_text(hjust = 0.5, face = "bold", size = 18), # Center and style the title
        axis.title = element_text(face = "bold")) # Bold axis titles
Third graph:
# Create the scatter plot with enhancements
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species)) +
  geom_point(size = 3, alpha = 0.7) + # Adjust point size and transparency for better visibility
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "solid",
              color = "black", linewidth = 0.8) + # Add linear fit lines for trend
  geom_abline(intercept = 0, slope = 1, color = "gray", linetype = "dashed") + # Reference line (slope = 1)
  labs(title = "Scatter Plot of Petal Width vs. Petal Length by Species",
       x = "Petal Length (cm)", # Add unit for clarity
       y = "Petal Width (cm)") +
  scale_color_manual(values = c("skyblue", "lightgreen", "lightpink")) + # Custom colors for species
  theme_minimal(base_size = 15) + # Clean minimal theme with larger base size
  theme(legend.position = "top", # Move legend to the top
        plot.title = element_text(hjust = 0.5, face = "bold", size = 18), # Center and bold the title
        axis.title = element_text(face = "bold")) # Bold the axis titles
Fourth graph:
# Create a new column for "size" based on petal width.
# The factor levels are ordered small, big so that the colors and labels below match.
iris$size <- factor(ifelse(iris$Petal.Width > 2, "big", "small"),
                    levels = c("small", "big"))
# Create the bar plot with improved aesthetics
ggplot(iris, aes(x = Species, fill = size)) +
  geom_bar(stat = "count", position = "dodge", width = 0.7) + # Use "dodge" for side-by-side bars
  scale_fill_manual(values = c("lightblue", "tomato"), labels = c("Small", "Big")) + # Custom colors and labels, in level order
  labs(title = "Number of Irises by Species and Size",
       x = "Species",
       y = "Count",
       fill = "Size") + # Add a clear label for the legend
  theme_minimal(base_size = 15) + # Minimal theme with larger text size
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 18), # Center and bold the title
        axis.title = element_text(face = "bold"),
        legend.position = "top", # Move the legend to the top
        legend.title = element_text(face = "bold")) # Bold the legend title