Week_1_Disc

1. Pick any two different datasets from base R packages like “?datasets()”, from an add-on package like “AER”, or even actual data that you may have.

AirPassengers Dataset:

Type of Data: Time Series Data

# Load the dataset
data(AirPassengers)

faithful Dataset:

Type of Data: Continuous Numerical Data

data(faithful)

Describe the data to somebody who has never seen it (in less than 3 sentences).

AirPassengers dataset: This dataset contains monthly totals of international airline passengers, recorded in thousands, from January 1949 to December 1960. Each observation represents a single month, and the main variable of interest is the number of passengers. This dataset is commonly used for time series analysis to understand trends and seasonality in air travel over time.

faithful: The faithful dataset provides information about the Old Faithful geyser in Yellowstone National Park. It includes measurements of the waiting time between eruptions and the duration of each eruption. This dataset allows exploration of patterns in the timing and duration of natural phenomena, often used to demonstrate statistical concepts like distributions and correlations.

AirPassengers Dataset (Time Series Data):

# Summary statistics
summary(AirPassengers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

# Display the structure of the dataset
str(AirPassengers)

##  Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...

Time Plot:

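# AirPassengers is already a monthly ts object starting in January 1949, so use it directly.
# (Calling ts(AirPassengers, frequency = 12) would reset the start to 1 and lose the calendar years on the x-axis.)
passengers <- AirPassengers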

# Plot time series
plot(passengers, main = "Monthly International Airline Passengers (1949-1960)",
     ylab = "Number of Passengers", xlab = "Year")

It is clearly a time series dataset, as evidenced by its structure and the time plot we generated, showing monthly passenger counts over a span of years.
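As a quick check on the claim above, the time-series attributes can be inspected directly. This is a minimal sketch that uses only base R functions (`is.ts`, `start`, `end`, `frequency`); the name `ap_info` is introduced here purely for illustration.

```r
# AirPassengers ships with base R (package "datasets") as a ts object
data(AirPassengers)

# Collect the time-series attributes in one list (ap_info is an illustrative name)
ap_info <- list(
  is_ts     = is.ts(AirPassengers),     # TRUE if the object is a time series
  start     = start(AirPassengers),     # first period as c(year, month)
  end       = end(AirPassengers),       # last period as c(year, month)
  frequency = frequency(AirPassengers)  # observations per year
)
print(ap_info)
```

A frequency of 12 together with a start in January 1949 and an end in December 1960 accounts for the 144 monthly observations reported by `str()`.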

faithful Dataset (Continuous Numerical Data):

# Summary statistics
summary(faithful)

##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0

# Display the structure of the dataset
str(faithful)

## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

Scatterplot:

# Using ggplot2 for scatterplot
library(ggplot2)

# Scatterplot
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  labs(title = "Relationship between Waiting Time and Eruption Duration",
       x = "Waiting Time (minutes)",
       y = "Eruption Duration (minutes)")

This dataset contains continuous numerical data, as indicated by the variables representing measurements (eruptions and waiting times). The scatterplot visualizes the relationship between these two continuous variables.
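The two statements above can be checked numerically. The sketch below confirms that both columns are stored as numeric vectors and summarizes the scatterplot's linear pattern with a Pearson correlation; the names `col_is_numeric` and `r_wait_erupt` are illustrative, not part of the dataset.

```r
data(faithful)

# Confirm that both columns are stored as numeric vectors
col_is_numeric <- sapply(faithful, is.numeric)
print(col_is_numeric)

# Pearson correlation between waiting time and eruption duration
r_wait_erupt <- cor(faithful$waiting, faithful$eruptions)
print(r_wait_erupt)
```

A correlation close to 1 would agree with the upward trend visible in the scatterplot.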

II.

What is covariance?

Covariance measures joint variability — the extent of variation between two random variables. It is similar to variance, but while variance quantifies the variability of a single variable, covariance quantifies how two variables vary together. The measure can be positive, negative, or zero [1]:

Positive covariance = an overall tendency for the variables to move together: when one is above its mean, the other tends to be above its mean too. Data points will trend upward on a graph.

Negative covariance = an overall tendency for one variable to decrease when the other increases. Data points will trend downward on a graph.

Zero covariance = no linear tendency in either direction.

The sign of the covariance gives the direction of the linear relationship, but its magnitude depends on the units of both variables, so a large value does not by itself indicate a strong relationship. Unlike the correlation coefficient, which ranges from -1 to 1, covariance has no bounds on its values, which can make it challenging to interpret.
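To illustrate why the size of a covariance is hard to interpret, the sketch below expresses the faithful waiting times in seconds instead of minutes. The conversion factor of 60 and the variable names are choices made for this example only.

```r
data(faithful)

x_min <- faithful$waiting    # waiting time in minutes
y     <- faithful$eruptions  # eruption duration in minutes
x_sec <- x_min * 60          # the same waiting times expressed in seconds

# Covariance changes with the units of x ...
cov_min <- cov(x_min, y)
cov_sec <- cov(x_sec, y)

# ... while correlation is unit-free and always lies in [-1, 1]
cor_min <- cor(x_min, y)
cor_sec <- cor(x_sec, y)

cat("cov (minutes):", cov_min, " cov (seconds):", cov_sec, "\n")
cat("cor (minutes):", cor_min, " cor (seconds):", cor_sec, "\n")
```

The covariance is multiplied by 60 although the underlying relationship is unchanged, whereas the correlation is the same in both cases.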

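The Covariance Formula

For random variables X and Y with E(X) = μ and E(Y) = ν, the covariance is defined as:

$$\mathrm{Cov}(X, Y) = E\big[(X - \mu)(Y - \nu)\big]$$

For a sample of n paired observations, it is estimated by:

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where:

X and Y are random variables, with observed values x_i and y_i
E(X) = μ is the expected value (the mean) of X, estimated by the sample mean x̄
E(Y) = ν is the expected value (the mean) of Y, estimated by the sample mean ȳ
n = the number of items in the data set
Σ is summation notation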
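The sample version of the covariance formula (the sum of products of deviations from the means, divided by n - 1) can be sketched in a few lines of R and compared with the built-in `cov()`. The faithful dataset and the variable names are chosen here for illustration.

```r
data(faithful)

x <- faithful$waiting
y <- faithful$eruptions
n <- length(x)

# Sample covariance: sum of products of deviations from the means, divided by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Built-in equivalent
cov_builtin <- cov(x, y)

cat("Manual:", cov_manual, " Built-in:", cov_builtin, "\n")
```

Both values should agree up to floating-point rounding, since `cov()` uses the same n - 1 denominator.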

What is variance?

Variance is a measure of variability in statistics. It assesses the average squared difference between data values and the mean. Unlike some other statistical measures of variability, it incorporates all data points in its calculations by contrasting each value to the mean.
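As a small illustration, the variance can be computed by hand and compared with R's `var()`. Note that `var()` returns the sample variance, which divides the sum of squared deviations by n - 1 rather than n. The variable names below are illustrative.

```r
data(faithful)

x <- faithful$waiting
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Built-in equivalent
var_builtin <- var(x)

# Variance is the covariance of a variable with itself
cov_self <- cov(x, x)

cat("Manual:", var_manual, " var():", var_builtin, " cov(x, x):", cov_self, "\n")
```

The last line shows the link between the two concepts: variance is the special case of covariance in which both variables are the same.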

Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?

In the context of simple linear regression with one predictor variable 𝑥 and one response variable 𝑦, the slope coefficient 𝛽1 represents the change in 𝑦 for a one-unit change in 𝑥. This coefficient can be derived using the covariance and variance of 𝑥 and 𝑦.

Slope Coefficient 𝛽1: In simple linear regression, the slope coefficient 𝛽1 is given by:

$$\beta_1 = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$

Interpretation: The slope 𝛽1 represents how much 𝑦 changes on average when 𝑥 increases by one unit. Cov(𝑥,𝑦) measures how 𝑥 and 𝑦 vary together, in units of 𝑥 times units of 𝑦. Dividing by Var(𝑥), which is in squared units of 𝑥, leaves a quantity in units of 𝑦 per unit of 𝑥.

Effect of Scaling: Dividing by Var(𝑥) expresses the joint variation of 𝑥 and 𝑦 relative to the variation in 𝑥 alone. The resulting coefficient 𝛽1 still depends on the units of both variables: if 𝑥 is multiplied by a constant c, the slope is divided by c. The scale-free measure of linear association is the correlation coefficient, which divides the covariance by the standard deviations of both 𝑥 and 𝑦.

Least Squares Derivation: The formula 𝛽1=Cov(𝑥,𝑦)/Var(𝑥) arises from the algebraic derivation of the least squares estimator for 𝛽1. By minimizing the sum of squared residuals, we find that 𝛽1 is optimally estimated using this ratio of covariance to variance.
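The algebra behind this can be sketched as follows. This is the standard textbook derivation, written with sample means x̄ and ȳ and hats for the estimates; the notation is chosen here and does not come from the original post.

```latex
\begin{aligned}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial S}{\partial \beta_0} = 0 &\;\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\frac{\partial S}{\partial \beta_1} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} x_i \,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \\
&\;\Rightarrow\; \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \,/\, (n - 1)}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \,/\, (n - 1)} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}
\end{aligned}
```

The fourth line follows by substituting the expression for the intercept into the third line and using the fact that deviations from a mean sum to zero, which allows x_i to be replaced by (x_i - x̄). The n - 1 factors cancel, so the ratio is the same whether sample or population versions of the covariance and variance are used.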

To illustrate, we use the AirPassengers dataset, a time series of monthly international airline passenger counts from 1949 to 1960. We will calculate the slope coefficient using both linear regression and the formula 𝛽1 = Cov(𝑥,𝑦) / Var(𝑥).

# Load the dataset
data(AirPassengers)

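# Re-index the series so that time starts at 1 and advances by 1/12 per month
# (AirPassengers is already a ts object; ts() with only frequency = 12 resets the start to 1)
passengers <- ts(AirPassengers, frequency = 12)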

# Create a time index
time_index <- time(passengers)

# Calculate covariance and variance
cov_xy <- cov(time_index, passengers)  # Covariance between time index and passenger counts
var_x <- var(time_index)  # Variance of time index

# Calculate beta1 using the formula
beta1_formula <- cov_xy / var_x

# Perform linear regression of passengers on time index
model <- lm(passengers ~ time_index)

# Print summary of the regression model
summary(model)

## 
## Call:
## lm(formula = passengers ~ time_index)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.858 -30.727  -5.757  24.489 164.999 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   58.424      8.612   6.784 2.92e-10 ***
## time_index    31.886      1.108  28.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.06 on 142 degrees of freedom
## Multiple R-squared:  0.8536, Adjusted R-squared:  0.8526 
## F-statistic: 828.2 on 1 and 142 DF,  p-value: < 2.2e-16

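From the summary output, the estimated slope coefficient 𝛽^1 for time_index is approximately 31.886. Because time_index advances by 1 per year, this is the average increase in the monthly passenger count per year.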

Now, let’s compare this with the coefficient calculated using the formula:

# Print the results
cat("Regression Coefficient (Estimate):", coef(model)["time_index"], "\n")

## Regression Coefficient (Estimate): 31.88621

cat("Coefficient from Formula:", beta1_formula, "\n")

## Coefficient from Formula: 31.88621

In this case, the regression coefficient (estimate) for time_index obtained from the linear regression model is approximately 31.886. The coefficient calculated using the formula 𝛽1 = Cov(𝑥,𝑦) / Var(𝑥) is exactly the same, at approximately 31.886.

This confirms that dividing the covariance of the time index and passenger counts by the variance of the time index indeed gives us the slope coefficient 𝛽1 from a simple linear regression model.
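The same check can be repeated on the faithful dataset as a second example. This is a minimal sketch; the choice of waiting as the predictor and eruptions as the response, and the variable names, are assumptions made for illustration.

```r
data(faithful)

# Slope from the covariance / variance ratio
slope_formula <- cov(faithful$waiting, faithful$eruptions) / var(faithful$waiting)

# Slope from lm(); unname() drops the coefficient label for a clean comparison
fit <- lm(eruptions ~ waiting, data = faithful)
slope_lm <- unname(coef(fit)["waiting"])

cat("Slope from formula:", slope_formula, "\n")
cat("Slope from lm():   ", slope_lm, "\n")

# all.equal() compares the two values within floating-point tolerance
print(all.equal(slope_formula, slope_lm))
```

Comparing with `all.equal()` rather than `==` avoids false mismatches caused by floating-point rounding in the two computations.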

Ganesh Kumar

2024-07-09