week1-Discussion

Part 1:

Pick any two different datasets from base R packages like “?datasets()”, from an add-on package like “AER”, or even actual data that you may have.

#Dataset 1: iris
data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

#Dataset 2: AirPassengers
data(AirPassengers)
head(AirPassengers)

## [1] 112 118 132 129 121 135

Describe the data to somebody who has never seen it (in less than 3 sentences). This may include elaborating upon the key variables so that the reader can follow along / guess what information the data contains.

Iris Dataset: The Iris dataset is a collection of measurements from three species of iris flowers: setosa, versicolor, and virginica. For each flower, there are four numeric variables: sepal length, sepal width, petal length, and petal width, all measured in centimeters. These measurements provide insights into the physical characteristics of the flowers, allowing for analysis of differences and similarities between the iris species based on these features.

AirPassengers Dataset: The AirPassengers dataset records the monthly totals of international airline passengers, in thousands, from 1949 to 1960. It is structured as a time series, where each observation represents the passenger count for a specific month over this 12-year period. It enables the examination of trends and seasonal patterns in air travel, making it useful for studying historical international travel and forecasting future passenger demand.

Finally, tell/show us what is the type of data and why.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# Display the structure of the dataset
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(AirPassengers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

# Display the structure of the dataset
str(AirPassengers)

##  Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...

Iris dataset:

Type: Cross-sectional data.

Cross-sectional data are collected at a single point in time or over a short period, typically from many different subjects (in this case, different flowers). Each observation in the Iris dataset represents a single flower, and its measurements (sepal length, sepal width, petal length, petal width, and species) were recorded once, with no time dimension.

Why:

The Iris dataset meets the criteria for cross-sectional data because it captures observations from multiple iris flowers at a single moment in time, allowing for comparative analysis of their physical characteristics and species classification at that specific instance. This type of data is commonly used in statistical analysis to understand and classify characteristics across different subjects simultaneously.
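
As a quick check of the classification above, the following sketch inspects how R stores the data. It only uses base R functions; the comments about row order are an interpretation, not something R reports.

```r
# Load the built-in iris data
data(iris)

# iris is a plain data frame, not a time series object
class(iris)
is.ts(iris)

# One row per flower, grouped by species; the row order has no time meaning
nrow(iris)
table(iris$Species)
```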

AirPassengers dataset:

Type: Time series data.

This dataset is structured as a time series because it consists of observations (monthly counts of airline passengers) collected at regular intervals (monthly from January 1949 to December 1960). Time series data are characterized by their temporal ordering and typically exhibit trends, seasonality, and possibly cyclical patterns.

Why:

Each observation in this dataset represents the number of international airline passengers for a specific month, making it inherently temporal. Time series analysis techniques, such as decomposition, forecasting, and identifying trends or seasonal effects, are applicable and useful for understanding patterns in airline passenger traffic over time.
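
The temporal structure and the decomposition mentioned above can be sketched with base R. The choice of a multiplicative decomposition is an assumption made here because the seasonal swings appear to grow with the level of the series; an additive model is the other option `decompose()` offers.

```r
# Load the built-in monthly passenger series
data(AirPassengers)

# Confirm the time series attributes: monthly frequency, start and end dates
is.ts(AirPassengers)
frequency(AirPassengers)
start(AirPassengers)
end(AirPassengers)

# Split the series into trend, seasonal, and random components
# (type = "multiplicative" is an assumption, see the note above)
parts <- decompose(AirPassengers, type = "multiplicative")
plot(parts)
```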

Scatterplot for cross-sectional data.

#iris dataset
data(iris)

# Scatterplot of Petal.Length vs. Petal.Width
plot(iris$Petal.Width, iris$Petal.Length,
     main = "Scatterplot of Petal Length vs. Petal Width",
     xlab = "Petal Width (cm)",
     ylab = "Petal Length (cm)",
     col = "blue")

Time plot for time series data.

# Load AirPassengers dataset
data(AirPassengers)

# AirPassengers is already a ts object; ts() here only makes the monthly frequency and start date explicit
passengers <- ts(AirPassengers, frequency = 12, start = c(1949, 1))

# Plotting time series
plot(passengers, 
     main = "International Airline Passengers Over Time",
     ylab = "Number of Passengers (thousands)",
     xlab = "Year",
     col = "blue")

Using ggplot2 package for creating your charts.

#iris dataset 
data(iris)

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.1

# Create a scatterplot of Petal.Length vs. Petal.Width using ggplot
ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(color = "blue") +
  labs(title = "Scatterplot of Petal Length vs. Petal Width",
       x = "Petal Width (cm)",
       y = "Petal Length (cm)")
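
The same package can draw the time plot for AirPassengers. This is a minimal sketch: since `ggplot()` expects a data frame, the series is first converted, and the names `air_df`, `year`, and `passengers` are illustrative choices, not part of the dataset.

```r
library(ggplot2)

# Load the built-in monthly passenger series
data(AirPassengers)

# ggplot() needs a data frame, so pull the time index and values out of the ts object
air_df <- data.frame(
  year = as.numeric(time(AirPassengers)),     # fractional year, e.g. 1949.083 for Feb 1949
  passengers = as.numeric(AirPassengers)      # monthly totals in thousands
)

# Line chart of passengers over time
p <- ggplot(data = air_df, aes(x = year, y = passengers)) +
  geom_line(color = "blue") +
  labs(title = "International Airline Passengers Over Time",
       x = "Year",
       y = "Number of Passengers (thousands)")
print(p)
```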

Part 2:

What is covariance?

Covariance: Covariance is a statistical measure that indicates the extent to which two random variables change in tandem. In simpler terms, it measures how much two variables vary together. If the covariance between two variables is positive, it means that when one variable increases, the other tends to increase as well. Conversely, a negative covariance indicates that when one variable increases, the other tends to decrease. If the covariance is zero, it suggests that there is no linear relationship between the variables.
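
To make the definition concrete, here is a small sketch that computes the sample covariance by hand and compares it with R's `cov()`. The choice of sepal length and petal length from iris is only for illustration.

```r
# Two numeric variables from the built-in iris data
data(iris)
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)

# Sample covariance: average product of deviations from the means (n - 1 denominator)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Compare with the built-in function
cov_builtin <- cov(x, y)
all.equal(cov_manual, cov_builtin)  # TRUE
```

A positive result here means longer sepals tend to go with longer petals.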

What is variance?

Variance: Variance, on the other hand, is a measure of how much a set of observations differ from the mean (average) value. It quantifies the dispersion of data points around the mean. A high variance indicates that the data points are spread out widely from the mean, while a low variance indicates that the data points are clustered closely around the mean.
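
The same idea can be checked for variance. This sketch computes the sample variance by hand and also shows that variance is the covariance of a variable with itself; sepal length is used only as an example.

```r
# One numeric variable from the built-in iris data
data(iris)
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: average squared deviation from the mean (n - 1 denominator)
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Matches the built-in function
all.equal(var_manual, var(x))   # TRUE

# Variance is the covariance of a variable with itself
all.equal(var(x), cov(x, x))    # TRUE
```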

Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?

In a simple linear regression with one independent variable X and one dependent variable Y, the slope coefficient β1 represents the change in Y per unit change in X. This coefficient can be derived from the covariance of X and Y divided by the variance of X.

Dividing the covariance of Y and X by the variance of X gives the slope coefficient β1.

Covariance and Variance Relationship:

The covariance Cov(X,Y) measures the linear association between X and Y. The variance Var(X) measures the spread or variability of X.

Slope Coefficient in Simple Linear Regression:

In simple linear regression, the slope coefficient β1 is defined as:

β1 = Cov(X, Y) / Var(X)

This formula indicates that β1 represents how much Y changes on average for each one-unit increase in X.
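
The reason the ratio gives the slope can be sketched from the least-squares criterion. The notation below (x_i, y_i for the observations, bars for sample means, β0 for the intercept) is introduced here for illustration and is not taken from the text above.

```latex
% Least squares chooses the intercept and slope that minimise the sum of squared residuals
S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

% Setting the derivative with respect to \beta_0 to zero gives the intercept
\frac{\partial S}{\partial \beta_0} = 0
\;\Rightarrow\;
\beta_0 = \bar{y} - \beta_1 \bar{x}

% Substituting \beta_0 and setting the derivative with respect to \beta_1 to zero
\sum_{i=1}^{n} (x_i - \bar{x}) \left[ (y_i - \bar{y}) - \beta_1 (x_i - \bar{x}) \right] = 0

% Solving for \beta_1; the 1/(n-1) factors cancel between numerator and denominator
\beta_1
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\tfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\tfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
```

In words: the covariance measures how much X and Y move together, and dividing by the variance of X rescales that co-movement into units of Y per one unit of X.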

# Iris dataset
data(iris)

# Extracting sepal length (X) and petal length (Y) for simplicity
X <- iris$Sepal.Length
Y <- iris$Petal.Length

# Calculate covariance of X and Y
cov_xy <- cov(X, Y)

# Calculate variance of X
var_x <- var(X)

# Calculate the slope coefficient beta1
beta1 <- cov_xy / var_x

# Print the slope coefficient
print(beta1)

## [1] 1.858433

The value of β1 (beta1) is the slope of the regression line that best fits the relationship between X and Y in the dataset.

Please show this in R. You just need to match the two outputs (one from regression, one from the formula) like we did in class, but for a dataset of your choice.

# Load Iris dataset
data(iris)

# Extracting sepal length (X) and petal length (Y) for simplicity
X <- iris$Sepal.Length
Y <- iris$Petal.Length

# Perform simple linear regression using lm()
model <- lm(Y ~ X)

# Extract the slope coefficient from the regression model
beta1_regression <- coef(model)[2]

# Calculate covariance of X and Y
cov_xy <- cov(X, Y)

# Calculate variance of X
var_x <- var(X)

# Calculate the slope coefficient beta1 using covariance and variance
beta1_formula <- cov_xy / var_x

# Print the slope coefficient from regression and formula
cat("Slope coefficient from regression:", beta1_regression, "\n")

## Slope coefficient from regression: 1.858433

cat("Slope coefficient from formula:", beta1_formula, "\n")

## Slope coefficient from formula: 1.858433

The output demonstrates that the slope coefficient β1 obtained from the linear regression (1.858433) matches the slope coefficient calculated using the covariance and variance approach (1.858433). This confirms that both methods give the same estimate of the relationship between sepal length and petal length in the Iris dataset.
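
Because printed values are rounded, a tolerance-based comparison is a slightly stronger check than reading the two numbers side by side. The sketch below uses the formula interface of `lm()` and `all.equal()`; the variable names `fit`, `slope_lm`, and `slope_formula` are illustrative.

```r
# Fit the same regression using the data frame columns directly
data(iris)
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Slope from the regression, extracted by coefficient name
slope_lm <- coef(fit)[["Sepal.Length"]]

# Slope from the covariance / variance formula
slope_formula <- cov(iris$Sepal.Length, iris$Petal.Length) / var(iris$Sepal.Length)

# Compare up to floating-point tolerance
all.equal(slope_lm, slope_formula)  # TRUE
```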

Reuben

2024-07-12