I. OpenStats Chapter 1, Exercises, Problem 1.9: Discuss the solutions to this problem, and then conduct a descriptive analysis of the data

Load the datasets package and the iris data.

library(datasets)
data("iris")

(a) How many cases were included in the data?

dim(iris)
## [1] 150   5

There are 150 cases included in the data.
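
As a cross-check, the row and column counts can also be read separately. This is a minimal sketch using base R's nrow() and ncol(); each row of iris is one case (one measured flower):

```r
library(datasets)
data("iris")

# Each row is one case (one measured flower)
nrow(iris)   # 150
# Each column is one variable
ncol(iris)   # 5
```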

(b) How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
sapply(iris, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

There are 4 numerical variables: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, and “Petal.Width”. The sapply() output shows that each of them has class "numeric". These numerical variables are all continuous, since they are measurements (in centimeters) that can take any value within a range.
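
The count of numerical variables can also be computed rather than read off the output. This is a small sketch using base R only; the name num_vars is introduced here just for illustration:

```r
library(datasets)
data("iris")

# Logical vector: TRUE for each numeric column
num_vars <- sapply(iris, is.numeric)

sum(num_vars)            # number of numerical variables
names(iris)[num_vars]    # their names

# Rough check for continuity: does each column contain non-integer values?
sapply(iris[num_vars], function(x) any(x != round(x)))
```

Non-integer values are consistent with continuous measurements, although the data are recorded to only one decimal place.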

(c) How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"

The only categorical variable is “Species”. Its levels are “setosa”, “versicolor”, and “virginica”.
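
For a categorical variable, a frequency table is the natural summary. A short sketch using base R's nlevels() and table():

```r
library(datasets)
data("iris")

nlevels(iris$Species)   # number of categories
table(iris$Species)     # number of cases in each category
```

In iris, each of the three species has 50 cases, so the groups are balanced.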

Descriptive Analysis

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
iris %>% 
  describe()
##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

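The describe() output shows that Sepal.Length and Sepal.Width are both positively skewed (skew = 0.31), while Petal.Length (skew = -0.27) and Petal.Width (skew = -0.10) are negatively skewed. The Species* row can be ignored here: the asterisk marks a factor that describe() converted to numeric codes, so its statistics are not meaningful.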
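
To make the sign of the skew concrete, the statistic can be computed by hand. This is a sketch that assumes the moment-based definition sum((x - mean)^3) / (n * sd^3), which I believe is what describe() reports by default; the helper name skew_by_hand is made up for this example:

```r
library(datasets)
data("iris")

# Hypothetical helper: third central moment divided by the cubed standard deviation
skew_by_hand <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}

# Apply to the four numerical columns (Species is left out)
round(sapply(iris[, 1:4], skew_by_hand), 2)
```

A positive value means a longer right tail, and a negative value means a longer left tail.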

Graph

Referencing page 48 in IPSUR, this creates a graph similar to Figure 3.10, but with different colors:

# Scatterplot of petal width against petal length, with points colored by species
plot(iris$Petal.Width ~ iris$Petal.Length,
     xlab = "Petal Length",
     ylab = "Petal Width",
     main = "Petal Length vs Petal Width",
     col = iris$Species)
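
Because col = iris$Species uses the factor's integer codes to choose colors, a legend is needed to tell the species apart. This sketch repeats the plot and adds one with legend(); the position "topleft" and the variable name species_levels are my own choices:

```r
library(datasets)
data("iris")

species_levels <- levels(iris$Species)

plot(iris$Petal.Width ~ iris$Petal.Length,
     xlab = "Petal Length",
     ylab = "Petal Width",
     main = "Petal Length vs Petal Width",
     col = iris$Species)

# Colors 1, 2, 3 match the integer codes of the factor levels
legend("topleft",
       legend = species_levels,
       col = seq_along(species_levels),
       pch = 1)

# Strength of the linear relationship between the two petal measurements
cor(iris$Petal.Length, iris$Petal.Width)
```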

II. Pick any dataset and tell/show us what category of data it belongs to with an appropriate chart/summary statistics.

Show us the timeplots if time series data.

One example of time series data is the AirPassengers dataset.

The time plot below shows that the number of airline passengers clearly increases over time. It also has very consistent peaks and lows within each year, which I assume reflects a seasonal effect on passenger numbers.

data("AirPassengers")
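
Before plotting, it helps to confirm that the object really is a time series. A brief sketch using base R accessors:

```r
data("AirPassengers")

class(AirPassengers)       # "ts"
start(AirPassengers)       # 1949 1
end(AirPassengers)         # 1960 12
frequency(AirPassengers)   # 12 observations per year, i.e. monthly data
```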

Referencing https://rpubs.com/vivekkashyap043/airpassengers:

plot(AirPassengers, 
     main = "Airline Passengers Over Time",
     xlab = "Year-Month", 
     ylab = "Number of Passengers (thousands)")
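
The seasonal pattern can be checked numerically by averaging the series within each calendar month. This is a sketch using cycle() and tapply(); monthly_mean is a name introduced here for the example:

```r
data("AirPassengers")

# cycle() gives the month position (1 = January, ..., 12 = December)
monthly_mean <- tapply(AirPassengers, cycle(AirPassengers), mean)
names(monthly_mean) <- month.abb
round(monthly_mean, 1)

# Bar chart of the average passenger count by month
barplot(monthly_mean,
        main = "Average Passengers by Month",
        xlab = "Month",
        ylab = "Mean Number of Passengers (thousands)")
```

The summer months (July and August) have the highest averages, which fits the yearly peaks in the time plot.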

Referencing help("AirPassengers"):

(fit <- StructTS(AirPassengers, type = "BSM"))
## 
## Call:
## StructTS(x = AirPassengers, type = "BSM")
## 
## Variances:
##   level    slope     seas  epsilon  
##    0.00   160.98    29.85     0.00
plot(cbind(AirPassengers,fitted(fit)), plot.type = "single")
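
fitted(fit) holds the estimated components of the structural model, so they can also be viewed in separate panels instead of on one set of axes. A sketch, assuming the same StructTS() call as above; the name components is my own:

```r
data("AirPassengers")

fit <- StructTS(AirPassengers, type = "BSM")

# One column per estimated component of the basic structural model
components <- fitted(fit)
colnames(components)

# Each component in its own panel
plot(components, main = "Components of the Basic Structural Model")
```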