I.  Part 1: OpenStats Problem 1.9

In OpenStats Chapter 1, Exercises, Problem 1.9, there is a reference to Fisher’s iris data. (You can find the OpenStats textbook in our Dropbox folder {link in Course Introduction and Materials}, or see the Syllabus on how to access it.  The problem is on Page 20 of the 4th edition).  Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R.  

To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., describe commandLinks to an external site. (on irisLinks to an external site.) in psych packageLinks to an external site., or stargazer packageLinks to an external site.). Share your code and analysis in the discussion forum.  Google or Youtube for help with keywords such as “R” and a description of what you are trying to do if you get stuck.  EG - “psych package in R”.

(a) How many cases were included in the data?

Solution: There is a total of 150 cases in this dataset.

library(dplyr) #required to use the count function
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
iris.data <- iris #call and name file 
count(iris.data) #n=?
##     n
## 1 150

(b+c) How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete. How many categorical variables and what are they?

Solution: There are five variables in this dataset, in which four of them are numeric (i.e., Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width). As shown by the plots, these four variables ares all continuous. The fifth variable is categorical (Species) with three levels (i.e., setosa, versicolor, and virginica).

iris.data <- iris

#how many categorical and numerical variables in total?
total_numerical_variables <- sum(
  if (is.numeric(iris.data$Sepal.Length) == TRUE) 1 else 0,
  if (is.numeric(iris.data$Sepal.Width) == TRUE) 1 else 0,
  if (is.numeric(iris.data$Petal.Length) == TRUE) 1 else 0,
  if (is.numeric(iris.data$Petal.Width) == TRUE) 1 else 0,
  if (is.numeric(iris.data$Species) == TRUE) 1 else 0)
print(total_numerical_variables)
## [1] 4
total_categorical_variables <- sum(
  if (is.factor(iris.data$Sepal.Length) == TRUE) 1 else 0,
  if (is.factor(iris.data$Sepal.Width) == TRUE) 1 else 0,
  if (is.factor(iris.data$Petal.Length) == TRUE) 1 else 0,
  if (is.factor(iris.data$Petal.Width) == TRUE) 1 else 0,
  if (is.factor(iris.data$Species) == TRUE) 1 else 0)
print(total_categorical_variables)
## [1] 1
#visually check whether variables are continuous vs discrete (and what the levels are)
plot(iris.data$Sepal.Length)

plot(iris.data$Sepal.Width)

plot(iris.data$Petal.Length)

plot(iris.data$Petal.Width)

plot(iris.data$Species)

#another way to check the categorical levels
Species_levels <- c(iris.data$Species)
Species_levels
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Descriptive Analysis

Descriptive Statistics Summary

library(dplyr)
library(psych)
iris.data <- iris

summary(iris.data)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

How do the sepal/petal lengths and widths change based on type of species?

describeBy(iris.data$Sepal.Length, group = iris.data$Species)
## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 50 5.01 0.35      5       5 0.3 4.3 5.8   1.5 0.11    -0.45 0.05
## ------------------------------------------------------------ 
## group: versicolor
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 5.94 0.52    5.9    5.94 0.52 4.9   7   2.1  0.1    -0.69 0.07
## ------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 6.59 0.64    6.5    6.57 0.59 4.9 7.9     3 0.11     -0.2 0.09
describeBy(iris.data$Sepal.Width, group = iris.data$Species)
## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 3.43 0.38    3.4    3.42 0.37 2.3 4.4   2.1 0.04      0.6 0.05
## ------------------------------------------------------------ 
## group: versicolor
##    vars  n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 50 2.77 0.31    2.8    2.78 0.3   2 3.4   1.4 -0.34    -0.55 0.04
## ------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 50 2.97 0.32      3    2.96 0.3 2.2 3.8   1.6 0.34     0.38 0.05
describeBy(iris.data$Petal.Length, group = iris.data$Species)
## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 1.46 0.17    1.5    1.46 0.15   1 1.9   0.9  0.1     0.65 0.02
## ------------------------------------------------------------ 
## group: versicolor
##    vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 50 4.26 0.47   4.35    4.29 0.52   3 5.1   2.1 -0.57    -0.19 0.07
## ------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 5.55 0.55   5.55    5.51 0.67 4.5 6.9   2.4 0.52    -0.37 0.08
describeBy(iris.data$Petal.Width, group = iris.data$Species)
## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 50 0.25 0.11    0.2    0.24   0 0.1 0.6   0.5 1.18     1.26 0.01
## ------------------------------------------------------------ 
## group: versicolor
##    vars  n mean  sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 50 1.33 0.2    1.3    1.32 0.22   1 1.8   0.8 -0.03    -0.59 0.03
## ------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 50 2.03 0.27      2    2.03 0.3 1.4 2.5   1.1 -0.12    -0.75 0.04

Are there significant mean differences of petal/sepal lengths and/or widths, depending on the type of species?

anova_sepal_length<- aov(iris.data$Sepal.Length~iris.data$Species, data=iris.data)
summary(anova_sepal_length)
##                    Df Sum Sq Mean Sq F value Pr(>F)    
## iris.data$Species   2  63.21  31.606   119.3 <2e-16 ***
## Residuals         147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_sepal_length, conf.level = .95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = iris.data$Sepal.Length ~ iris.data$Species, data = iris.data)
## 
## $`iris.data$Species`
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0
anova_petal_length<- aov(iris.data$Petal.Length~iris.data$Species, data=iris.data)
summary(anova_petal_length)
##                    Df Sum Sq Mean Sq F value Pr(>F)    
## iris.data$Species   2  437.1  218.55    1180 <2e-16 ***
## Residuals         147   27.2    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_petal_length, conf.level = .95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = iris.data$Petal.Length ~ iris.data$Species, data = iris.data)
## 
## $`iris.data$Species`
##                       diff     lwr     upr p adj
## versicolor-setosa    2.798 2.59422 3.00178     0
## virginica-setosa     4.090 3.88622 4.29378     0
## virginica-versicolor 1.292 1.08822 1.49578     0
anova_sepal_width<- aov(iris.data$Sepal.Width~iris.data$Species, data=iris.data)
summary(anova_sepal_width)
##                    Df Sum Sq Mean Sq F value Pr(>F)    
## iris.data$Species   2  11.35   5.672   49.16 <2e-16 ***
## Residuals         147  16.96   0.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_sepal_width, conf.level = .95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = iris.data$Sepal.Width ~ iris.data$Species, data = iris.data)
## 
## $`iris.data$Species`
##                        diff         lwr        upr     p adj
## versicolor-setosa    -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa     -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor  0.204  0.04314472  0.3648553 0.0087802
anova_petal_width<- aov(iris.data$Petal.Width~iris.data$Species, data=iris.data)
summary(anova_petal_width)
##                    Df Sum Sq Mean Sq F value Pr(>F)    
## iris.data$Species   2  80.41   40.21     960 <2e-16 ***
## Residuals         147   6.16    0.04                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_petal_width, conf.level = .95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = iris.data$Petal.Width ~ iris.data$Species, data = iris.data)
## 
## $`iris.data$Species`
##                      diff       lwr       upr p adj
## versicolor-setosa    1.08 0.9830903 1.1769097     0
## virginica-setosa     1.78 1.6830903 1.8769097     0
## virginica-versicolor 0.70 0.6030903 0.7969097     0

What does the within-group variability look like?

boxplot(iris.data$Sepal.Length~iris.data$Species, main="Sepal Length", xlab="Type of Flower", ylab="Centimeters (cm)")

boxplot(iris.data$Sepal.Width~iris.data$Species, main="Sepal Width", xlab="Type of Flower", ylab="Centimeters (cm)")

boxplot(iris.data$Petal.Length~iris.data$Species, main="Petal Length", xlab="Type of Flower", ylab="Centimeters (cm)")

boxplot(iris.data$Petal.Width~iris.data$Species, main="Petal Width", xlab="Type of Flower", ylab="Centimeters (cm)")

What’s the relation between the lengths and widths for Sepals? What about for Petals?

#ForSepal
cor.test(iris.data$Sepal.Length, iris.data$Sepal.Width, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  iris.data$Sepal.Length and iris.data$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698
plot(iris.data$Sepal.Length, iris.data$Sepal.Width, pch=17, main="Sepals", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=17, col=factor(levels(factor(iris.data$Species))))

#ForPetal **
cor.test(iris.data$Petal.Length, iris.data$Petal.Width, method=c("pearson")) 
## 
##  Pearson's product-moment correlation
## 
## data:  iris.data$Petal.Length and iris.data$Petal.Width
## t = 43.387, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9490525 0.9729853
## sample estimates:
##       cor 
## 0.9628654
plot(iris.data$Petal.Length, iris.data$Petal.Width, pch=16, main="Petals", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=16, col=factor(levels(factor(iris.data$Species))))

What’s the relation between Sepal and Petal lengths? What about their widths?

#For length**
cor.test(iris.data$Sepal.Length, iris.data$Petal.Length, method=c("pearson")) 
## 
##  Pearson's product-moment correlation
## 
## data:  iris.data$Sepal.Length and iris.data$Petal.Length
## t = 21.646, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8270363 0.9055080
## sample estimates:
##       cor 
## 0.8717538
plot(iris.data$Sepal.Length, iris.data$Petal.Length, pch=19, main="Flower Length", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=19, col=factor(levels(factor(iris.data$Species))))

#For width
cor.test(iris.data$Sepal.Width, iris.data$Petal.Width, method=c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  iris.data$Sepal.Width and iris.data$Petal.Width
## t = -4.7865, df = 148, p-value = 4.073e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4972130 -0.2186966
## sample estimates:
##        cor 
## -0.3661259
plot(iris.data$Sepal.Width, iris.data$Petal.Width, pch=18, main="Flower Width", col=factor(levels(factor(iris.data$Species))))
legend("bottomright", legend = levels(factor(iris.data$Species)),pch=18, col=factor(levels(factor(iris.data$Species))))

II. Part 2: Data Exploration

We talk about three different types of data in Economics (cross sectional, time series and panel data).  If you type “data()” in R, you will find some internal datasets preloaded in R (including iris dataset).  Pick any dataset and tell/show us what category of data it belongs to with an appropriate chart/summary statistics.

Data(ToothGrowth)

Descriptive Statistics, Visualizations (Boxplots and Scatterplots)

data("ToothGrowth")
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
dose_lens <- plot(ToothGrowth$dose, ToothGrowth$len, pch=19, main="Amount of Dosage", xlab="Dosage (mg)", ylab="Tooth Length", col=factor(levels(factor(ToothGrowth$dose))))
legend("topleft", legend = levels(factor(ToothGrowth$dose)),pch=19, col=factor(levels(factor(ToothGrowth$dose))))

VC_lens <- plot(ToothGrowth$supp, ToothGrowth$len, pch=18, main="Type of Supplement", xlab="Supplement Type", ylab="Tooth Length", col=factor(levels(factor(ToothGrowth$VC))))

interaction.plot(x.factor= ToothGrowth$dose, 
                 trace.factor= ToothGrowth$supp,
                 response=ToothGrowth$len, fun= mean, legend=TRUE)