Name: Karolina Grodzinska
Class: ALY6010: Probability Theory and Introductory Statistics
INTRODUCTION
This assignment required us to produce summary statistics for the whole dataset, as well as, per category of a categorical variable. Thus, the first dataset that came to my mind was the iris dataset that has been widely used for classification problems. I will start by describing the variables and descriptive statistics related to them, and then I will present some insights about this dataset using graphs. My final steps in this analysis will include correlation and regression analysis.
I’ll start by loading the dataset and seeing the summary of variables:
iris = data.frame(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The iris dataset consists of 5 variables - 4 numerical ones describing characteristics of petals and sepals of flowers, and one categorical variable including species.
From the summary, I can see that the dataset includes 150 observations - 50 for each species. Petal Length seems to have the highest range, so for the next part of the analysis, I will be using this variable in my visualizations.
describeBy(iris, iris$Species)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11 -0.45
## Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.60
## Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10 0.65
## Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18 1.26
## Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## se
## Sepal.Length 0.05
## Sepal.Width 0.05
## Petal.Length 0.02
## Petal.Width 0.01
## Species* 0.00
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 5.94 0.52 5.90 5.94 0.52 4.9 7.0 2.1 0.10 -0.69
## Sepal.Width 2 50 2.77 0.31 2.80 2.78 0.30 2.0 3.4 1.4 -0.34 -0.55
## Petal.Length 3 50 4.26 0.47 4.35 4.29 0.52 3.0 5.1 2.1 -0.57 -0.19
## Petal.Width 4 50 1.33 0.20 1.30 1.32 0.22 1.0 1.8 0.8 -0.03 -0.59
## Species* 5 50 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN NaN
## se
## Sepal.Length 0.07
## Sepal.Width 0.04
## Petal.Length 0.07
## Petal.Width 0.03
## Species* 0.00
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis
## Sepal.Length 1 50 6.59 0.64 6.50 6.57 0.59 4.9 7.9 3.0 0.11 -0.20
## Sepal.Width 2 50 2.97 0.32 3.00 2.96 0.30 2.2 3.8 1.6 0.34 0.38
## Petal.Length 3 50 5.55 0.55 5.55 5.51 0.67 4.5 6.9 2.4 0.52 -0.37
## Petal.Width 4 50 2.03 0.27 2.00 2.03 0.30 1.4 2.5 1.1 -0.12 -0.75
## Species* 5 50 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN NaN
## se
## Sepal.Length 0.09
## Sepal.Width 0.05
## Petal.Length 0.08
## Petal.Width 0.04
## Species* 0.00
This table presents descriptive statistics, represented by measures that show either central points of the data (such as mean and median), variability of the data (for example the range and standard deviation). The first summary for the whole dataset also included the measures of position - showing the values for different quartiles, which aren’t present in the second table divided by species.
I will present the distribution of petal length in the next sections but even from this table, we can already see some characteristics. For example, when the median is higher than the mean, the data might be negatively skewed. Similarly, when the mean is higher than the median, the data might be positively skewed. This table also includes a measure called “skewness”, which represents the asymmetry in the distribution. When the value of skewness is lower than 0, it corresponds to negatively skewed data. Values higher than 0, on the contrary, show positively skewed data (Bluman, 2018). However, I will look at the plots to see what is the actual shape of the distribution.
Kurtosis also shows the shape of the data - the higher the value, the higher the “peak” of the distribution. So the lower values of kurtosis are also characterized by longer tails on both sides. The value of kurtosis when the data is normally distributed is 3, so from the table above we can already see that none of the variables have values that high, most of them are negative or lower than 1.
I can also see the spread of the data points by looking at the minimum and maximum values, and their ranges. It seems that the virginica species is characterized by the highest dispersion of all variables except for sepal width.
The table with summary statistics is really informative but it has been said that for most people it is easier to comprehend data presented graphically than to understand numerical data shown in tables. Additionally, graphs are helpful in discovering any patterns in data (Bluman, 2018). Therefore, this next section will focus on plots.
To avoid confusion - setosa will always be shown in red, versicolor in green, and virginica in blue.
# Box plot
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
geom_boxplot() +
labs(title = "Petal length for different species shown on a box plot", y = "Petal Length") +
theme_minimal()
Above, we can see 3 box plots showing the distribution of the variable petal length for each species. They consist of boxes drawn between the first and the third quartile with the median represented as a bold line in the middle, and whiskers showing the range of data. Any points drawn above or below the whiskers represent outliers. Box plots are a good way of presenting the measures of position.
This is a graphical representation of what has been shown in the table above - we can see that the distribution of data points is the least spread out for setosa. We also see that for setosa the data points in the first and third quartile are almost equally spread out (corresponding to the low value of skewness in the table). However, for both versicolor and virginica, the size of boxes on both sides of the median are different, which corresponds to higher values for skewness, and suggests that the data points are more spread out in one of the quartiles. Virginica represents the highest median and generally longer petals, while setosa is associated with the lowest values. Additionally, the box plots show that setosa and versicolor include outliers.
It turns out that most people struggle to interpret box plots and it is more challenging for them than understanding other chart types. That also makes box plots more prone to misinterpretation. When people see that one part of the box is wider than the other, they automatically think that it consists of more data points, while in fact the data is just more spread out in one of the quartiles (the data is equally divided based on the number of observations and each of the quartiles represents the same number of observations, their values are just different). Similarly, different lengths of the whiskers can lead to the same misinterpretation. Thus, other graphical representations of the distribution of data can be more useful for the audience that isn’t data literate (or isn’t particularly familiar with statistics). One of the alternative charts could be a histogram. Another example would be a jitter plot (Desbarats, 2021).
Firstly, I wanted to see the same data on a jitter plot.
A jitter plot shows data points as dots, which can be grouped into separate categories. Below, I have visualized exactly the same data as above but using a jitter plot instead of a box plot.
# Jitter chart
ggplot(iris, aes(x = Species, y = Petal.Length, color = Species)) +
geom_jitter(position=position_jitter(0.2)) +
labs(title = "Petal length for different species shown on a jitter plot", y = "Petal Length") +
theme_minimal()
The first main difference is that we can’t see the quartiles or outliers anymore, just the data points placed where the corresponding values on the axis are. It is still visible that the virginica species has the longest petals and setosa is characterized by the shortest petals.
# Let's see how the it looks like on histograms:
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_histogram(alpha = 0.75, position = "identity") +
labs(title = "Petal length for different species shown on a histogram", x = "Petal Length", y = "Frequency") +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Lastly, I showed the distribution on histograms but I decided to show one graph with the distribution of all 3 species. From this plot, I can see the clearest that there is an overlap between versicolor and virginica, while setosa seems to be quite different from the two other species.
We can see that setosa has a shape the closest to a normal distribution with the highest peak. Versicolor is negatively skewed, and virginica seems to have 2 peaks, however, it is not bimodal because one of them is higher.
The table and all 3 graphs show the same information but in a different way and they can complement each other, as it is easier to observe different insights on different plots.
Now I’d like to see the relations between different variables on scatter plots. I have found a great example of how to show them as small multiples on the website of the University of Warwick.
pairs(iris[1:4],
main = "Relations between variables in iris data",
pch = 21,
bg = c("firebrick1", "green3", "royalblue2")[unclass(iris$Species)],
lower.panel=NULL,
labels = c("Sepal Length","Sepal Width","Petal Length","Petal Width"),
font.labels = 2,
cex.labels = 2)
Scatter plots are types of plots that show ordered pairs of numerical variables, showing the relation (or the lack of) between an independent and dependent variable (Bluman, 2018).
The graph above shows the variables (for each pair of variables the corresponding scatter plot lays on the intersection, such as the scatter plot in the top right corner is showing sepal length and petal width). The data points are colored based on the species.
The plots show that there is a positive linear relationship between petal length and petal width. On the contrary, sepal length and sepal width show no relationship at all. Generally, setosa (shown in red) seems to differ from the other two species.
#I'd like to take a look at only the two variables showing a clear relationship
p <- lm(Petal.Width ~ Petal.Length, data=iris)
plot(iris$Petal.Length, iris$Petal.Width,
pch = 21, cex=1.5,
xlab = "Petal Length",
ylab = "Petal Width",
main = "Positive linear relationship between petal length and width",
bg = c("firebrick1", "green3", "royalblue2")[unclass(iris$Species)])
abline(p, col="black")
The scatter plot above allows taking a closer look at the positive linear relationship between petal length and width. So when the values of the independent variable (petal length) increase, the values of a dependent variable (petal width) also increase. Again, the colors correspond to different species and I added a trend line in the middle to show how the data points follow the linear trend.
From the graph above, I can see relationships between all the numerical variables shown on scatter plots (colors represent different species). I can already conclude that it seems that there is no relationship between the sepal length and sepal width and probably petal length and width have the strongest linear relationship.
The obvious next step in my analysis would be to look at correlations to check that.
cor(iris[1:4], method = "pearson")
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Pearson’s correlation coefficient shows how strong the linear relationship between two variables is. The values vary between -1 and 1, where a coefficient close to -1 shows a strong negative relationship, coefficient close to 0 indicates the lack of linear relationship, and coefficient close to 1 suggests a strong positive relationship (Bluman, 2018).
This table is already informative but it would look nicer if I showed a graphical representation of it:
# Let's see the correlation matrix
ggcorr(iris[1:4], label = TRUE, label_size = 5, label_round = 2, label_alpha = TRUE)
We can see that there is indeed a strong positive correlation between petal length and width, as well as, petal length and sepal length, and sepal length with petal width. Sepal width is negatively correlated to the other variables and there’s almost no relationship between the sepal length and width.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
labs(title = "Linear relationship between the length and width of petals", x = "Petal Width", y = "Petal Length") +
geom_point() +
geom_smooth(method = "lm", se = TRUE) + # showing a straight line and standard error
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
The two studied variables are called the independent variable (it cannot be controlled or manipulated) and the dependent variable. It is not always easy to determine which variable should be the explanatory one and which one should be a response variable (Bluman, 2018). In my case, I decided that petal width is independent and petal length is the response variable, dependent on the width of petals.
Above, there is a scatter plot showing a clear positive linear relationship between the two variables - the petal length increases when the width of petals increases.
Now, I am going to look at the regression line equation:
# Let's see the values associated with this linear regression
lm(Petal.Length ~ Petal.Width, data = iris)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Coefficients:
## (Intercept) Petal.Width
## 1.084 2.230
So the length would be calculated using a formula:
Petal Length = 1.084 + 2.23 * Petal Width
That means that if the petal width increases by 1, then the length of a petal is expected to increase by 2.23.
linear = lm(Petal.Length ~ Petal.Width, data = iris)
summary(linear)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
There is a lot of information in this summary. The first part of the output is simply the formula for the linear model. It is followed the values corresponding to the quartiles for the residuals. To assess that the model fits the data well, the distribution across these points should be symmetrical with the mean value zero (Rego, 2015). In my case, the mean for those values is equal to -0.00323 and the median to -0.02955, so the model fits this data quite well. I will visualize the model’s fit in the next section.
The next section shows coefficients of the model, together with their standard errors (which measure the average amount that the coefficient estimates vary from the actual average value of the response variable), coefficient’s t-value (how many standard deviations the coefficient estimate is far away from 0), and p-values. Small values for the p-value mean that it is unlikely that the relationship between the explanatory and response variables is due to chance (Rego, 2015).
The last part of the summary shows residual standard error, R-squared, and the F-statistic.
The Residual Standard Error measures the quality of a linear regression fit. Every linear model is assumed to contain an error and because of the presence of this error, it is not possible to perfectly predict the response variable from the predictor. The Residual Standard Error shows the average amount that the response variable deviates from the true regression line. In my model, petal length will deviate by 0.48 centimeters (Rego, 2015).
The the coefficient of determination (R-squared) explains how well the model is fitting the actual data. R-squared shows how much of the variation of the dependent variable is explained by the linear model and the independent variable. Its values vary between 0 and 1, where 1 means that the regression line explains 100% of the variation in the response variable (Bluman, 2018). In this model, R2 is equal to 0.93, so 93 % of the variance found in the response variable (petal length) can be explained by the predictor (petal width).
# let's visualize the model's fit
autoplot(linear,
which = 1:3,
nrow = 3,
ncol = 1)
There are several plots included in R that can quantify the performance of a model. I have plotted 3 graphs to look at my model. The first one show residuals versus fitted values, which means the distance between the actual and predicted value. If residuals meet the assumption about being normally distributed with the mean equal to zero, then the trend line should closely follow the y = 0 line on the plot (Cotton). This is mostly true for my model, the line only strays away from the zero line for the higher values corresponding to petal width (which could be seen on the scatter plot, as the values were plotted further away from the line of best fit).
The Q-Q plot shows whether the residuals follow a normal distribution. The x-axis represent quantiles from the normal distribution and the y-axis show the standardized residuals (the residuals divided by their standard deviation). If plotted points follow along a straight line, they are normally distributed (Cotton). It seems that it is mostly true for my model.
Finally, the last plot shows the square root of the standardized residuals versus the fitted values. It helps to check whether the size of the residuals gets bigger or smaller (Cotton). If the line is roughly horizontal across the plot, then the assumption of homoscedasticity is likely to be satisfied for this regression model. The residuals should also be randomly scattered around the line, without any clear patterns (Zach, 2020). In this case, it seems that the line is indeed horizontal and the residuals seem to be randomly scattered around it.
An additional task:
From the box plot included in my module 2 assignment, I found out that the petal length also differs between species.
# Box plot
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
geom_boxplot() +
labs(title = "Petal length for different species shown on a box plot", y = "Petal Length") +
theme_minimal()
I’d like to see whether it would be possible to create a linear model for a categorical variable.
lm(Petal.Length ~ Species, data = iris)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Coefficients:
## (Intercept) Speciesversicolor Speciesvirginica
## 1.462 2.798 4.090
The intercept is the mean length of setosa and the coefficients for each category are calculated as relative to the intercept. Thus, if I added 2.798 to 1.462, I would have gotten the average petal length of a versicolor flower. If I wanted to avoid this relativity, I could specify my formula as Petal.Length ~ Species + 0. However, the resulting numbers would only tell me what are the average values for petal lengths of each iris species.
SUMMARY
This assignment has focused on correlation and regression. I have been able to find a strong linear relationship between petal length and width of iris flowers, and create a model that has been a good fit for the data. Thus, I can conclude that it is possible to predict the petal length of iris flowers, based on the width of their petals.
On the other hand, the analysis based on the categorical variable would have been better as a classification problem, not regression.
REFERENCES
Bluman, A. G. (2018). Elementary statistics: A Step by Step Approach: A Brief Version. New York: McGraw-Hill Higher Education; 10th edition.
Cotton, R. Introduction to Regression in R. Datacamp, Retrieved from: https://app.datacamp.com/learn/courses/introduction-to-regression-in-r
Desbarats, N. (2021). I’ve Stopped Using Box Plots. Should You?. Nightingale. Journal of the Data Visualization Society. Retrieved from: https://nightingaledvs.com/ive-stopped-using-box-plots-should-you/
Rego, F. (2015). Quick Guide: Interpreting Simple Linear Model Outpu in R. Retrieved from: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
University of Warwick. Plotting the Iris Data. Retrieved from: https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_plots/
Zach. (2020). How to Interpret a Scale-Location Plot (With Examples). Statology, Retrieved from: https://www.statology.org/scale-location-plot/