ASSIGNMENT WEEK 2 : DESCRIPTIVE STATISTICS
library(datasets)
data("iris")
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Table 1 : Important summary statistics for sepal length, sepal width, petal length and petal width
| Statistic | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | Count |
|---|---|---|---|---|---|---|
| Minimum | 4.3 | 2 | 1 | 0.1 | Setosa | 50 |
| 1st Quartile | 5.1 | 2.8 | 1.6 | 0.3 | Versicolor | 50 |
| Median | 5.8 | 3 | 4.35 | 1.3 | Virginica | 50 |
| Mean | 5.843 | 3.057 | 3.758 | 1.199 | | |
| 3rd Quartile | 6.4 | 3.3 | 5.1 | 1.8 | | |
| Maximum | 7.9 | 4.4 | 6.9 | 2.5 | | |
The iris dataset contains measurements for three species of iris: sepal length, sepal width, petal length and petal width. Each species contributes an equal share of the observations (50 each, or 1/3 of the data), for a total of 150 observations.
The sepal length has a minimum of 4.300 and a maximum of 7.900. The median is 5.800, so 50% of the data lie at or below this value. The 1st and 3rd quartiles are 5.100 and 6.400 respectively, meaning that 25% and 75% of the data lie at or below these values. The mean sepal length is 5.843.
The sepal width has a minimum of 2.000 and a maximum of 4.400. The median is 3.000, so 50% of the data lie at or below this value. The 1st and 3rd quartiles are 2.800 and 3.300 respectively, meaning that 25% and 75% of the data lie at or below these values. The mean sepal width is 3.057.
The petal length has a minimum of 1.000 and a maximum of 6.900. The median is 4.350, so 50% of the data lie at or below this value. The 1st and 3rd quartiles are 1.600 and 5.100 respectively, meaning that 25% and 75% of the data lie at or below these values. The mean petal length is 3.758.
The petal width has a minimum of 0.100 and a maximum of 2.500. The median is 1.300, so 50% of the data lie at or below this value. The 1st and 3rd quartiles are 0.300 and 1.800 respectively, meaning that 25% and 75% of the data lie at or below these values. The mean petal width is 1.199.
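The quartiles and means quoted above can also be recomputed directly instead of being read off the summary output. The following is a minimal sketch, assuming R's default quantile definition (the one summary() uses); the name num_stats is only illustrative and does not appear elsewhere in this assignment.

```r
library(datasets)

# Columns 1 to 4 of iris are the numeric measurements; column 5 is Species.
num_stats <- sapply(datasets::iris[, 1:4], function(x) {
  c(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)), Mean = mean(x))
})

# Rows: 0%, 25%, 50%, 75%, 100%, Mean; columns: the four measurements.
print(round(num_stats, 3))
```

Reading down the Sepal.Length column gives the minimum, quartiles, maximum and mean in the same order as Table 1.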
attach(iris)
hist(Petal.Length, breaks = 25, col = "blue", xlab = "Petal Length", main = "Histogram of the Iris's petal length")
hist(Petal.Length, breaks = 7, col = "blue", xlab = "Petal Length", main = "Histogram of the Iris's petal length")
This histogram of the iris petal length shows petal length on the X axis and the frequency of observations on the Y axis. The distribution is bimodal (possibly multimodal), that is, it has two or more peaks. The main peak is at roughly 1 to 2 and the lower peak is around 4, close to the median of 4.350 reported in Table 1. Two separate peaks suggest that the data come from two or more groups that differ in this characteristic and can be distinguished from each other. The iris dataset is known to contain observations for three species, so at least two of the three species evidently differ in petal length. For an unknown dataset, a histogram like this can serve as an initial indication that two or more distinct groups are present and that the difference between them should be investigated.
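Because the species labels are known for this dataset, the idea that the two peaks come from different groups can be checked by looking at the petal length range within each species. This is only a sketch; the name pl_range is illustrative.

```r
library(datasets)

# Smallest and largest petal length within each species.
pl_range <- sapply(
  split(datasets::iris$Petal.Length, datasets::iris$Species),
  range
)
rownames(pl_range) <- c("min", "max")
print(pl_range)
```

If the largest value of one species is below the smallest value of another, the two species do not overlap at all in petal length, which would produce an empty stretch in the histogram.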
hist(Petal.Length, freq = FALSE, breaks = 25, col = "green", xlab = "Petal Length", main = "Histogram of the Iris's petal length")
lines(density(Petal.Length), col="red")
lines(seq(1, 7, by=.5), dnorm(seq(1, 7, by=.5), mean(Petal.Length), sd(Petal.Length)), col="blue")
legend(5, 0.55, legend=c("Empirical distribution", "Normal distribution"),
col=c("red", "blue"), lty=1:1, cex=0.8)
The normal distribution (bell curve) is drawn as a blue line that spans both peaks. Petal length is on the X axis and density is on the Y axis. The normal curve is centered on the sample mean, which is 3.758 according to the summary table (the point that corresponds to zero on a standardized scale). In the empirical distribution, which reflects two groups with different characteristics, the higher peak would have a mean of approximately 1.5 and the lower peak a mean of nearly 5.
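The approximate means of the two peaks can be compared with group means computed from the data. The sketch below assumes that the higher peak corresponds to setosa and the lower peak to the other two species combined; that grouping is an assumption made for illustration, not something the histogram alone establishes.

```r
library(datasets)

pl <- datasets::iris$Petal.Length
is_setosa <- datasets::iris$Species == "setosa"

# Mean petal length for setosa versus versicolor and virginica together.
group_means <- c(
  setosa = mean(pl[is_setosa]),
  others = mean(pl[!is_setosa])
)
print(round(group_means, 3))
```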
qqnorm(Petal.Length, pch = 1, frame = FALSE)
qqline(Petal.Length, col = "red", lwd = 2)
The Q-Q plot shows an S-shaped curve, which is consistent with a bimodal distribution. There is a gap in the values between about 2 and 3. The distribution is not normal and has lower kurtosis than a normal distribution.
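The visual impression from the Q-Q plot can be backed up with numbers. The sketch below uses the Shapiro-Wilk test from base R and a simple moment-based excess kurtosis; this particular kurtosis formula is one of several in common use and is chosen here only for illustration.

```r
library(datasets)

pl <- datasets::iris$Petal.Length

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(pl)
print(sw)

# Moment-based excess kurtosis: about 0 for a normal distribution,
# negative when the distribution is flatter than normal.
z <- (pl - mean(pl)) / sqrt(mean((pl - mean(pl))^2))
excess_kurtosis <- mean(z^4) - 3
print(excess_kurtosis)
```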
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris <- iris %>% mutate(Sepal_WL = Sepal.Width/Sepal.Length,Petal_WL = Petal.Width/Petal.Length)
boxplot(iris$Sepal_WL~Species, main="The Ratio of Sepal width and Sepal length among Iris's species",
xlab="Species", ylab="Sepal width by Sepal length")
The box plot shows that the ratio of sepal width to sepal length is distributed symmetrically for setosa and virginica, while versicolor is positively (right) skewed and has a smaller interquartile range than the other two species. Setosa and versicolor show possible outliers, while virginica does not. The medians of versicolor and virginica are almost equal. Although the 3rd quartiles of these two species are equal, virginica has a lower 1st quartile and therefore a wider interquartile range. The values for versicolor and virginica overlap each other, whereas setosa is clearly separated from them: only a single outlier, which is also its minimum value, overlaps the other two groups.
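The quartiles and interquartile ranges read from the box plot can be checked numerically. This is a minimal sketch that recomputes the ratio from the original dataset; the names sepal_ratio, ratio_q and ratio_iqr are illustrative. Note that quantile() and boxplot() use slightly different definitions for the box edges, so small differences from the plot are possible.

```r
library(datasets)

iris_raw <- datasets::iris
sepal_ratio <- iris_raw$Sepal.Width / iris_raw$Sepal.Length

# Quartiles of the ratio within each species (rows: 25%, 50%, 75%).
ratio_q <- sapply(
  split(sepal_ratio, iris_raw$Species),
  quantile, probs = c(0.25, 0.5, 0.75)
)
print(round(ratio_q, 3))

# Interquartile range per species.
ratio_iqr <- ratio_q["75%", ] - ratio_q["25%", ]
print(round(ratio_iqr, 3))
```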
library(dplyr)
iris <- iris %>% mutate(Sepal_WL = Sepal.Width/Sepal.Length,Petal_WL = Petal.Width/Petal.Length)
boxplot(iris$Petal_WL~Species, main="The Ratio of Petal width and Petal length among Iris's species",
xlab="Species", ylab="Petal width by Petal length")
Versicolor has a symmetrical distribution, while setosa and virginica are skewed right and left respectively. Only setosa has outliers. The range of setosa overlaps the other two species, and its largest non-outlier value is approximately equal to the median of versicolor. Versicolor has the narrowest range of the three; its maximum equals the median of virginica and its 3rd quartile is a little higher than the 1st quartile of virginica.
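Which points are drawn as outliers can be listed explicitly rather than judged by eye. The sketch below uses boxplot.stats(), which applies the same 1.5 times interquartile range rule that boxplot() uses by default; the names petal_ratio and petal_out are illustrative.

```r
library(datasets)

iris_raw <- datasets::iris
petal_ratio <- iris_raw$Petal.Width / iris_raw$Petal.Length

# Values beyond the whiskers for each species (empty if there are none).
petal_out <- lapply(
  split(petal_ratio, iris_raw$Species),
  function(x) boxplot.stats(x)$out
)
print(petal_out)
```

An empty numeric vector for a species means that no point of that species is drawn beyond the whiskers.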