Part 1. Exercise 1.9 of OpenIntro Statistics - Discuss the solutions to this problem, and then conduct a descriptive analysis of the data
Sir Ronald Aylmer Fisher, a statistician, evolutionary biologist, and geneticist, worked on a dataset describing the sepal length and width and the petal length and width of three species of iris. The exercise asks three questions:
Q1. How many cases were included in the data?
A1. To answer this, we need a function that counts the number of cases (rows) in the iris dataset, which ships with R and is available as soon as R is installed. A quick search turned up https://www.digitalocean.com/community/tutorials/get-number-of-rows-and-columns-in-r, which describes the nrow() function. Using nrow() shows there are 150 cases:
nrow(iris)
## [1] 150
Q2. How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.
A2. Typing ‘iris’ at the console prints all of the data, with the column names at the top. These columns show that the data contains:
Sepal.Length - a continuous numerical variable
Sepal.Width - a continuous numerical variable
Petal.Length - a continuous numerical variable
Petal.Width - a continuous numerical variable
Species - a nominal categorical variable
In summary, there are 4 numerical variables, all continuous, plus 1 unordered categorical variable.
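The count of numerical variables can also be checked programmatically rather than by reading the printed data. The following is a minimal sketch using only base R; the names numeric_cols, num_numeric, and numeric_names are illustrative and not part of the exercise.

```r
# Flag which columns of the built-in iris data frame are stored as numbers
numeric_cols <- sapply(iris, is.numeric)

# Count the numeric columns and list their names
num_numeric <- sum(numeric_cols)
numeric_names <- names(iris)[numeric_cols]

print(num_numeric)
print(numeric_names)
```

Note that is.numeric() only reports how a column is stored. Deciding whether a numeric variable is continuous or discrete is still a judgment about what was measured.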
Q3. How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).
A3. There is 1 categorical variable, Species. To find its categories, we need a function that returns the unique values in the Species column. A quick search led me to https://www.digitalocean.com/community/tutorials/unique-function-r-programming, which describes the unique() function, which seems obvious to me now. Running this function on the iris$Species column produces the 3 distinct values shown here:
unique(iris$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
In summary, the categories are setosa, versicolor, and virginica.
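Because Species is stored as a factor, its categories can also be read from the factor itself. This is a small sketch using base R's levels(), nlevels(), and table(); the variable names are illustrative.

```r
# Read the categories directly from the factor definition
species_levels <- levels(iris$Species)

# Number of categories
n_levels <- nlevels(iris$Species)

# Number of cases in each category
species_counts <- table(iris$Species)

print(species_levels)
print(n_levels)
print(species_counts)
```

One difference worth knowing: unique() reports the values that actually occur in the data, while levels() reports every level the factor defines, even one with no cases. For iris the two agree.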
To understand the data a bit more, I used describe() from the psych package, which I had previously learned about in one of the course resources. It shows summary statistics for every column in the dataset.
library(psych)
describe(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
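Two caveats about the describe() output: it pools all three species together, and the asterisk on Species* marks a categorical variable that was converted to numeric codes, so its mean and sd are not meaningful measurements. As a complement, here is a minimal sketch that summarizes each measurement within each species, assuming only base R's aggregate(); the name species_means is illustrative.

```r
# Mean of each numeric measurement, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

print(species_means)
```

Swapping FUN = mean for FUN = median or FUN = sd gives the corresponding per-species summaries.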
Part 2. Pick a dataset by exploring the preloaded data in R with the data() function, and explain the types of data it contains, along with an appropriate chart and summary statistics.
As a fan of automobiles/cars, I’ve picked the mtcars data set to review and analyze. Here’s an analysis of the data:
Number of Cases = 32, shown by using the nrow() function, as we did with the iris dataset.
nrow(mtcars)
## [1] 32
Categories:
mpg - Miles Per Gallon / Continuous Numerical
cyl - Number of Cylinders / Discrete Numerical
disp - Engine Displacement (volume in cubic inches) / Continuous Numerical
hp - Horsepower / Continuous Numerical
drat - Rear axle ratio / Continuous Numerical
wt - Weight, in thousands of pounds / Continuous Numerical
qsec - Quarter mile time in seconds / Continuous Numerical
vs - Engine Shape (0 = v-shaped, 1 = straight) / Nominal Categorical
am - Transmission (0 = automatic, 1 = manual) / Nominal Categorical
gear - Number of Gears (forward, doesn’t count reverse) / Discrete Numerical
carb - Number of Carburetors / Discrete Numerical
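One thing to watch for: R stores every mtcars column as a number, including the two categorical ones (vs and am). The sketch below works on a copy of the data and relabels those two columns as factors so that summaries treat them as categories. The copy name cars and the label text are my own choices, taken from the 0/1 coding listed above.

```r
# Every column in mtcars is stored as numeric, even the 0/1 categorical ones
all_numeric <- all(sapply(mtcars, is.numeric))

# Work on a copy so the built-in dataset is left unchanged
cars <- mtcars

# Relabel the 0/1 codes as named categories
cars$vs <- factor(cars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Count the cars in each category
vs_counts <- table(cars$vs)
am_counts <- table(cars$am)

print(all_numeric)
print(vs_counts)
print(am_counts)
```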
The dataset is cross-sectional: an observation of a population at a single point in time. According to R’s documentation for mtcars, the data come from the 1974 Motor Trend US magazine and cover 1973 to 1974 models. Having the model year of each car would add an interesting dimension, although it doesn’t really matter here, since the car name is just a label when comparing columns. I’m interested to see how horsepower relates to quarter mile time. The scatter plot shows that as horsepower increases, quarter mile time decreases (that is, the car covers the distance faster):
plot(mtcars$hp, mtcars$qsec, xlab = "Horsepower", ylab = "Quarter Mile Time (in seconds)", main = "Scatter plot of Horsepower vs. Quarter Mile Time")
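To put a number on the relationship between horsepower and quarter mile time, a correlation coefficient is a natural companion to the scatter plot. This is a minimal sketch using base R's cor(), which computes the Pearson correlation by default; the name r_hp_qsec is illustrative.

```r
# Pearson correlation between horsepower and quarter mile time
r_hp_qsec <- cor(mtcars$hp, mtcars$qsec)

print(round(r_hp_qsec, 2))
```

A negative value means higher horsepower tends to go with a shorter quarter mile time. It describes association in these 32 cars only and says nothing about cause.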
That makes sense and aligns with my knowledge that more horsepower usually results in more speed. Since heavier cars may be relatively slower, I also plotted horsepower per unit of weight against quarter mile time. That plot suggests that, for these cars, the more horsepower per unit of weight, the faster the car is in the quarter mile:
plot(mtcars$hp/mtcars$wt, mtcars$qsec, xlab = "Horsepower / Weight", ylab = "Quarter Mile Time (in seconds)", main = "Scatter plot of Horsepower/Weight vs. Quarter Mile Time")
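The same idea can be checked numerically for the horsepower-to-weight ratio. The sketch below adds the ratio as a column on a copy of the data and fits a simple straight line with lm(). The column name pw and the choice of a linear fit are my own assumptions for illustration, not part of the original analysis.

```r
# Add horsepower per 1000 lbs as a new column on a copy of the data
cars_pw <- transform(mtcars, pw = hp / wt)

# Correlation between power-to-weight ratio and quarter mile time
r_pw_qsec <- cor(cars_pw$pw, cars_pw$qsec)

# Simple linear fit: quarter mile time as a function of power-to-weight ratio
fit <- lm(qsec ~ pw, data = cars_pw)
slope <- coef(fit)[["pw"]]

print(round(r_pw_qsec, 2))
print(slope)
```

The slope estimates the change in quarter mile time, in seconds, for each additional unit of horsepower per 1000 lbs. Calling abline(fit) right after the scatter plot above would draw the fitted line on it.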