In OpenStats Chapter 1, Exercises, Problem 1.9, there is a reference to Fisher’s iris data. (You can find the OpenStats textbook in our Dropbox folder {link in Course Introduction and Materials}, or see the Syllabus on how to access it. The problem is on Page 20 of the 4th edition). Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R.
To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., describe commandLinks to an external site. (on irisLinks to an external site.) in psych packageLinks to an external site., or stargazer packageLinks to an external site.). Share your code and analysis in the discussion forum. Google or Youtube for help with keywords such as “R” and a description of what you are trying to do if you get stuck. EG - “psych package in R”.
Solution: There is a total of 150 cases in this dataset.
library(dplyr) #required to use the count function
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris.data <- iris #call and name file
count(iris.data) #n=?
## n
## 1 150
Solution: There are five variables in this dataset, in which four of them are numeric (i.e., Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width). As shown by the plots, these four variables ares all continuous. The fifth variable is categorical (Species) with three levels (i.e., setosa, versicolor, and virginica).
iris.data <- iris
#how many categorical and numerical variables in total?
total_numerical_variables <- sum(
if (is.numeric(iris.data$Sepal.Length) == TRUE) 1 else 0,
if (is.numeric(iris.data$Sepal.Width) == TRUE) 1 else 0,
if (is.numeric(iris.data$Petal.Length) == TRUE) 1 else 0,
if (is.numeric(iris.data$Petal.Width) == TRUE) 1 else 0,
if (is.numeric(iris.data$Species) == TRUE) 1 else 0)
print(total_numerical_variables)
## [1] 4
total_categorical_variables <- sum(
if (is.factor(iris.data$Sepal.Length) == TRUE) 1 else 0,
if (is.factor(iris.data$Sepal.Width) == TRUE) 1 else 0,
if (is.factor(iris.data$Petal.Length) == TRUE) 1 else 0,
if (is.factor(iris.data$Petal.Width) == TRUE) 1 else 0,
if (is.factor(iris.data$Species) == TRUE) 1 else 0)
print(total_categorical_variables)
## [1] 1
#visually check whether variables are continuous vs discrete (and what the levels are)
plot(iris.data$Sepal.Length)
plot(iris.data$Sepal.Width)
plot(iris.data$Petal.Length)
plot(iris.data$Petal.Width)
plot(iris.data$Species)
#another way to check the categorical levels
Species_levels <- c(iris.data$Species)
Species_levels
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
library(dplyr)
library(psych)
iris.data <- iris
summary(iris.data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
describeBy(iris.data$Sepal.Length, group = iris.data$Species)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 5.01 0.35 5 5 0.3 4.3 5.8 1.5 0.11 -0.45 0.05
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 5.94 0.52 5.9 5.94 0.52 4.9 7 2.1 0.1 -0.69 0.07
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 6.59 0.64 6.5 6.57 0.59 4.9 7.9 3 0.11 -0.2 0.09
describeBy(iris.data$Sepal.Width, group = iris.data$Species)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04 0.6 0.05
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 2.77 0.31 2.8 2.78 0.3 2 3.4 1.4 -0.34 -0.55 0.04
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 2.97 0.32 3 2.96 0.3 2.2 3.8 1.6 0.34 0.38 0.05
describeBy(iris.data$Petal.Length, group = iris.data$Species)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 1.46 0.17 1.5 1.46 0.15 1 1.9 0.9 0.1 0.65 0.02
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 4.26 0.47 4.35 4.29 0.52 3 5.1 2.1 -0.57 -0.19 0.07
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 5.55 0.55 5.55 5.51 0.67 4.5 6.9 2.4 0.52 -0.37 0.08
describeBy(iris.data$Petal.Width, group = iris.data$Species)
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 0.25 0.11 0.2 0.24 0 0.1 0.6 0.5 1.18 1.26 0.01
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 1.33 0.2 1.3 1.32 0.22 1 1.8 0.8 -0.03 -0.59 0.03
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 2.03 0.27 2 2.03 0.3 1.4 2.5 1.1 -0.12 -0.75 0.04
anova_sepal_length<- aov(iris.data$Sepal.Length~iris.data$Species, data=iris.data)
summary(anova_sepal_length)
## Df Sum Sq Mean Sq F value Pr(>F)
## iris.data$Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_sepal_length, conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = iris.data$Sepal.Length ~ iris.data$Species, data = iris.data)
##
## $`iris.data$Species`
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
anova_petal_length<- aov(iris.data$Petal.Length~iris.data$Species, data=iris.data)
summary(anova_petal_length)
## Df Sum Sq Mean Sq F value Pr(>F)
## iris.data$Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_petal_length, conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = iris.data$Petal.Length ~ iris.data$Species, data = iris.data)
##
## $`iris.data$Species`
## diff lwr upr p adj
## versicolor-setosa 2.798 2.59422 3.00178 0
## virginica-setosa 4.090 3.88622 4.29378 0
## virginica-versicolor 1.292 1.08822 1.49578 0
anova_sepal_width<- aov(iris.data$Sepal.Width~iris.data$Species, data=iris.data)
summary(anova_sepal_width)
## Df Sum Sq Mean Sq F value Pr(>F)
## iris.data$Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_sepal_width, conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = iris.data$Sepal.Width ~ iris.data$Species, data = iris.data)
##
## $`iris.data$Species`
## diff lwr upr p adj
## versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802
anova_petal_width<- aov(iris.data$Petal.Width~iris.data$Species, data=iris.data)
summary(anova_petal_width)
## Df Sum Sq Mean Sq F value Pr(>F)
## iris.data$Species 2 80.41 40.21 960 <2e-16 ***
## Residuals 147 6.16 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_petal_width, conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = iris.data$Petal.Width ~ iris.data$Species, data = iris.data)
##
## $`iris.data$Species`
## diff lwr upr p adj
## versicolor-setosa 1.08 0.9830903 1.1769097 0
## virginica-setosa 1.78 1.6830903 1.8769097 0
## virginica-versicolor 0.70 0.6030903 0.7969097 0
boxplot(iris.data$Sepal.Length~iris.data$Species, main="Sepal Length", xlab="Type of Flower", ylab="Centimeters (cm)")
boxplot(iris.data$Sepal.Width~iris.data$Species, main="Sepal Width", xlab="Type of Flower", ylab="Centimeters (cm)")
boxplot(iris.data$Petal.Length~iris.data$Species, main="Petal Length", xlab="Type of Flower", ylab="Centimeters (cm)")
boxplot(iris.data$Petal.Width~iris.data$Species, main="Petal Width", xlab="Type of Flower", ylab="Centimeters (cm)")
#ForSepal
cor.test(iris.data$Sepal.Length, iris.data$Sepal.Width, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: iris.data$Sepal.Length and iris.data$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
plot(iris.data$Sepal.Length, iris.data$Sepal.Width, pch=17, main="Sepals", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=17, col=factor(levels(factor(iris.data$Species))))
#ForPetal **
cor.test(iris.data$Petal.Length, iris.data$Petal.Width, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: iris.data$Petal.Length and iris.data$Petal.Width
## t = 43.387, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9490525 0.9729853
## sample estimates:
## cor
## 0.9628654
plot(iris.data$Petal.Length, iris.data$Petal.Width, pch=16, main="Petals", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=16, col=factor(levels(factor(iris.data$Species))))
#For length**
cor.test(iris.data$Sepal.Length, iris.data$Petal.Length, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: iris.data$Sepal.Length and iris.data$Petal.Length
## t = 21.646, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8270363 0.9055080
## sample estimates:
## cor
## 0.8717538
plot(iris.data$Sepal.Length, iris.data$Petal.Length, pch=19, main="Flower Length", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=19, col=factor(levels(factor(iris.data$Species))))
#For width
cor.test(iris.data$Sepal.Width, iris.data$Petal.Width, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: iris.data$Sepal.Width and iris.data$Petal.Width
## t = -4.7865, df = 148, p-value = 4.073e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4972130 -0.2186966
## sample estimates:
## cor
## -0.3661259
plot(iris.data$Sepal.Width, iris.data$Petal.Width, pch=18, main="Flower Width", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=18, col=factor(levels(factor(iris.data$Species))))
We talk about three different types of data in Economics (cross sectional, time series and panel data). If you type “data()” in R, you will find some internal datasets preloaded in R (including iris dataset). Pick any dataset and tell/show us what category of data it belongs to with an appropriate chart/summary statistics.
See some different types of charts in base R, Links to an external site.and some simple code in base RLinks to an external site. to create these charts.
You will find lots of code online, as well as some resources in Tools for Success.
Try the commands in the help file to see if you have any quick visualizations to support your answer.
Read up on descriptive statistics blogs online that can give you some ideas. STHDALinks to an external site., Stats and R. Links to an external site.
Draw a scatterplotLinks to an external site. if cross-sectional data.
Show us the timeplotsLinks to an external site. if time series data,
Use a two-way table command on the ID variables if panel data.
data("ToothGrowth")
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
dose_lens <- plot(ToothGrowth$dose, ToothGrowth$len, pch=19, main="Amount of Dosage", xlab="Dosage (mg)", ylab="Tooth Length", col=factor(levels(factor(ToothGrowth$dose))))
legend("topleft", legend = levels(factor(ToothGrowth$dose)),pch=19, col=factor(levels(factor(ToothGrowth$dose))))
VC_lens <- plot(ToothGrowth$supp, ToothGrowth$len, pch=18, main="Type of Supplement", xlab="Supplement Type", ylab="Tooth Length", col=factor(levels(factor(ToothGrowth$VC))))
interaction.plot(x.factor= ToothGrowth$dose,
trace.factor= ToothGrowth$supp,
response=ToothGrowth$len, fun= mean, legend=TRUE)