Section 1

Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor, and virginica). There were 50 flowers from each species in the data set.

a) How many cases were included in the data?

The data contain 150 cases (50 flowers from each of 3 species).

nrow(iris)

## [1] 150
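As a quick cross-check of the 50-per-species breakdown, a minimal sketch using base R (`species_counts` is an illustrative name, not part of the original analysis):

```r
# Count the flowers recorded for each species
species_counts <- table(iris$Species)
species_counts

# The per-species counts should add up to the number of rows in the data
sum(species_counts) == nrow(iris)
```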

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

b) How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.

There are four numerical variables: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. All four are continuous measurements (in cm), as displayed above.
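One way to count the numerical variables programmatically is sketched below. The names `is_num` and `num_vars` are illustrative. Note that `is.numeric()` only reports how a column is stored; whether a variable is continuous or discrete is still a judgement about what was measured.

```r
# Flag which columns of iris are stored as numeric
is_num <- sapply(iris, is.numeric)

# Keep the names of the numeric columns and count them
num_vars <- names(iris)[is_num]
num_vars
length(num_vars)
```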

unique(iris[c("Species")])

##        Species
## 1       setosa
## 51  versicolor
## 101  virginica

c) How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

There is only one categorical variable, Species, with three corresponding levels: setosa, versicolor, and virginica.

Analysis

Virginica has the largest petal length, while setosa has the smallest.
Virginica also has the largest sepal length, while setosa generally has the shortest but widest sepals.
Versicolor mostly falls between virginica and setosa in length and width; sepal width is the exception, where versicolor tends to be the narrowest of the three.
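The visual comparisons above can be cross-checked against per-species means. This is a minimal sketch using base R's `aggregate()`; `species_means` is an illustrative name, and the mean is only one possible summary (medians would serve equally well).

```r
# Mean of every measurement, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means

# Species with the largest and smallest average petal length
species_means$Species[which.max(species_means$Petal.Length)]
species_means$Species[which.min(species_means$Petal.Length)]
```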

require(psych)

## Loading required package: psych

Reference: https://www.statology.org/iris-dataset-r/

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

describe(iris)

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

plot(iris$Petal.Width, iris$Petal.Length,
     col=iris$Species,
     main='Petal Length vs. Petal Width by Species',
     xlab='Petal Width in cm',
     ylab='Petal Length in cm',
     pch=19)

#Adding a legend (top left, so it does not cover the large-petal points)
legend("topleft", legend = unique(iris$Species),
       col = unique(iris$Species), pch = 19)

plot(iris$Sepal.Width, iris$Sepal.Length,
     col=iris$Species,
     main='Sepal Length vs. Sepal Width by Species',
     xlab='Sepal Width in cm',
     ylab='Sepal Length in cm',
     pch=19)

#Adding a legend
legend("topright", legend = unique(iris$Species), 
       col = unique(iris$Species), pch = 19)

Reference: https://datascienceplus.com/box-plots-identify-outliers/

#Boxplot for comparing petal length in species
boxplot(Petal.Length ~ Species, data=iris,
     main="Petal Length for each Species",
     col = seq_len(nlevels(iris$Species)),  #one colour per species box
     xlab="Species",
     ylab="Petal Length in cm")
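The points a boxplot draws as outliers can also be listed directly. The sketch below assumes the default boxplot rule (points beyond 1.5 times the interquartile range from the box) and uses `boxplot.stats()` from grDevices; `petal_outliers` is an illustrative name.

```r
# For each species, return the petal lengths that the default
# boxplot rule would draw as individual outlier points
petal_outliers <- tapply(iris$Petal.Length, iris$Species,
                         function(x) boxplot.stats(x)$out)
petal_outliers
```

An empty numeric vector for a species means no points fall outside its whiskers.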

Section 2

About the dataset:

Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) is explained by per-capita disposable income, the percentage rate of change in per-capita disposable income, and two demographic variables: the percentage of population less than 15 years old and the percentage of the population over 75 years old. The data are averaged over the decade 1960–1970 to remove the business cycle or other short-term fluctuations.

This is a cross-sectional dataset.
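The relationship described above, with the savings ratio explained by the two income variables and the two demographic variables, can be sketched as a linear regression. This is an illustrative fit under the assumption of a simple additive linear model, not necessarily the specification used in the original study; `savings_fit` is a hypothetical name.

```r
# Regress the savings ratio (sr) on the four explanatory variables
savings_fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

# Coefficient estimates, standard errors, and overall fit
summary(savings_fit)
```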

describe(LifeCycleSavings)

##       vars  n    mean     sd median trimmed    mad   min     max   range  skew
## sr       1 50    9.67   4.48  10.51    9.68   4.07  0.60   21.10   20.50 -0.01
## pop15    2 50   35.09   9.15  32.58   35.15  13.24 21.44   47.64   26.20  0.00
## pop75    3 50    2.29   1.29   2.17    2.22   1.61  0.56    4.70    4.14  0.31
## dpi      4 50 1106.76 990.87 695.66  980.85 713.94 88.94 4001.89 3912.95  0.95
## ddpi     5 50    3.76   2.87   3.00    3.33   1.75  0.22   16.71   16.49  2.14
##       kurtosis     se
## sr       -0.32   0.63
## pop15    -1.68   1.29
## pop75    -1.33   0.18
## dpi      -0.09 140.13
## ddpi      6.40   0.41

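#Plotting a scatterplot, coloured by binned percentage of population over 75
#(pop75 is continuous, so it is binned before being used as a colour index)
pop75_bin <- cut(LifeCycleSavings$pop75, breaks = c(0, 1, 2, 3, 5))
plot(LifeCycleSavings$sr, LifeCycleSavings$dpi,
     main = "Per-capita disposable income vs. savings ratio by % of population over 75",
     xlab = "Savings ratio (sr)",
     ylab = "Per-capita disposable income (dpi)",
     pch = 19,
     col = pop75_bin
     )
legend("topright", legend = levels(pop75_bin), title = "pop75 (%)",
       col = seq_along(levels(pop75_bin)), pch = 19)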

Discussion_Week_2

Aritra

2023-09-11