Part 1 Iris

1a. How many cases were included in the data?

library(psych)
dim(iris)
## [1] 150   5

There are 150 rows and 5 columns in this data frame. As we learned in earlier readings, each row represents a unique observational unit. Therefore, 150 cases were included in this data.
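As a quick cross-check, the two numbers returned by dim() can also be pulled out one at a time. This is a minimal sketch using only base R on the built-in iris data; the names n_cases and n_vars are my own and do not come from the assignment.

```r
# Number of rows (cases) and columns (variables) in the built-in iris data
n_cases <- nrow(iris)
n_vars  <- ncol(iris)

n_cases
n_vars
```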

1b. How many numerical variables are included in the data, and what are they?

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

We can see from the table above that there are 4 numerical variables: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. These are all continuous, as each one is a measured quantity.
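The same count can be confirmed without reading the summary table by eye. The sketch below asks R which columns are stored as numeric; is_num and num_vars are illustrative names I chose, not part of the assignment.

```r
# Flag each column of iris as numeric (TRUE) or not (FALSE)
is_num <- sapply(iris, is.numeric)

# Names of the numerical variables, and how many there are
num_vars <- names(iris)[is_num]
num_vars
length(num_vars)
```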

1c. How many categorical variables are included in the data, and what are they?

unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica

We knew from the summary in the previous question that Species is the only categorical variable. The code snippet above shows that the Species variable has 3 levels: setosa, versicolor, and virginica.
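One way to double-check this is to look for columns stored as factors and then tabulate the levels. This is a sketch in base R; cat_vars and species_counts are names I introduced for illustration.

```r
# Columns stored as factors are the categorical variables
cat_vars <- names(iris)[sapply(iris, is.factor)]
cat_vars

# Number of distinct levels, and the count of flowers in each level
nlevels(iris$Species)
species_counts <- table(iris$Species)
species_counts
```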

Part 2 Penguins

I chose the penguins dataset because penguins are my second-favorite animal (behind pandas).

summary(penguins)
##       species          island       bill_len        bill_dep    
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##   flipper_len      body_mass        sex           year     
##  Min.   :172.0   Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0   Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9   Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0   Max.   :6300                Max.   :2009  
##  NA's   :2       NA's   :2

The summary above suggests we are working with cross-sectional data, as this is a study of many subjects (penguins). The dataset has 8 variables: species, island, bill_len, bill_dep, flipper_len, body_mass, sex, and year.
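The summary also reports several NA's, so it may help to count the cases and the missing values per column directly. This sketch assumes R 4.5 or later, where penguins ships in the built-in datasets package with the column names shown above; n_penguins and na_counts are my own names.

```r
# Assumes R >= 4.5, where `penguins` is part of the built-in datasets package
n_penguins <- nrow(penguins)

# Count of missing values (NA) in each column
na_counts <- colSums(is.na(penguins))

n_penguins
na_counts
```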

Now let’s see if there’s any kind of relationship between the length of a penguin’s bill (in millimeters) and how much that penguin weighs (in grams).

plot(
     x = penguins$bill_len,    # bill length on the horizontal axis
     y = penguins$body_mass,   # body mass on the vertical axis
     xlab = 'Bill Length of Penguin (millimeters)',
     ylab = 'Body Mass of Penguin (grams)'
)

The scatterplot above suggests that the longer a penguin’s bill is, the more that penguin weighs.
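A correlation coefficient is one way to put a number on the pattern in the scatterplot. The sketch below is my own addition rather than part of the assignment: it assumes R 4.5 or later (for the built-in penguins data) and uses use = "complete.obs" so that the two penguins with missing measurements are dropped. The name r is just an illustrative variable.

```r
# Assumes R >= 4.5, where `penguins` is part of the built-in datasets package.
# Pearson correlation between bill length and body mass,
# ignoring rows where either measurement is missing.
r <- cor(penguins$bill_len, penguins$body_mass, use = "complete.obs")
r
```

A value above 0 would agree with the upward trend seen in the plot, although correlation alone does not say that a longer bill causes a higher weight.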