Introduction

In this assignment, I examine two datasets and classify each by type of data. I also explain what covariance and variance measure, and show that the slope in a simple linear regression of y on x equals Cov(y, x) / Var(x).

Part I: Types of Data

Data set 1: mtcars

The mtcars dataset contains information on 32 cars, including miles per gallon (mpg), weight in thousands of pounds (wt), and gross horsepower (hp). Each row represents a different car model.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Scatter plot

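# Scatter plot of fuel economy against weight, one point per car
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     pch = 19)  # pch = 19 draws solid circles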

Type of Data

This is cross-sectional data: it records many different cars at a single point in time rather than tracking the same cars over time. A scatterplot suits this data because it shows how two variables relate across the different cars.
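As a quick check on this classification, the sketch below looks at how the data is stored. It uses only base R functions; the variable names (n_cars, n_dup_names, has_time_index) are my own and are just for illustration.

```r
# mtcars ships with base R's datasets package
n_cars <- nrow(mtcars)                           # number of rows (cars)
n_dup_names <- anyDuplicated(rownames(mtcars))   # 0 means every car appears once
has_time_index <- is.ts(mtcars)                  # TRUE only for time series objects

n_cars
n_dup_names
has_time_index
```

If each car appears only once and the object carries no time index, that is consistent with a cross-section rather than repeated observations over time.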

Data set 2: AirPassengers

The AirPassengers dataset records the monthly total of international airline passengers, in thousands, from 1949 to 1960. It tracks a single quantity over time.

head(AirPassengers)
##      Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135

Time Plot

# Time plot: plot() draws a line chart for a ts object
plot(AirPassengers,
     main = "Airline Passengers Over Time",
     ylab = "Passengers (thousands)",
     xlab = "Year")

Type of Data

This is time series data because it records one variable over many consecutive time periods. The time plot shows how the number of passengers changes from month to month and year to year, which is the main idea of time series data.
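The time structure can also be read directly from the object. The sketch below uses base R's class(), start(), end(), and frequency(); the variable names are my own.

```r
# AirPassengers ships with base R's datasets package
ap_class <- class(AirPassengers)      # class of the object
ap_start <- start(AirPassengers)      # first observation: year, period
ap_end   <- end(AirPassengers)        # last observation: year, period
ap_freq  <- frequency(AirPassengers)  # observations per year
ap_n     <- length(AirPassengers)     # total number of observations

ap_class
ap_start
ap_end
ap_freq
ap_n
```

A class of "ts" with a frequency of 12 means R stores the values as a monthly series, which matches the time series classification.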

Part II: Slope Parameter Interpretation

What is covariance?

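Covariance measures how two variables move together. It is positive when the two variables tend to be above (or below) their averages at the same time, and negative when one tends to be above its average while the other is below. Its size depends on the units of both variables, so the sign is easier to interpret than the magnitude.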
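To make the definition concrete, here is a small sketch that computes the sample covariance of mpg and weight by hand and compares it with R's cov(). The names x, y, n, and cov_manual are my own.

```r
x <- mtcars$wt    # weight
y <- mtcars$mpg   # miles per gallon
n <- length(x)

# Sample covariance: sum of the products of deviations from the means,
# divided by n - 1 (the same denominator cov() uses)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

cov_manual
cov(y, x)
all.equal(cov_manual, cov(y, x))
```

The result is negative here, because heavier cars tend to have lower mpg.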

What is variance?

Variance measures how spread out a variable is around its average: it is roughly the average squared distance from the mean. If the values tend to be far from the average, the variance is high. If they are close to the average, the variance is low.
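The same kind of check works for variance. This sketch computes the sample variance of weight by hand and compares it with var(); the names x, n, and var_manual are my own.

```r
x <- mtcars$wt   # weight
n <- length(x)

# Sample variance: sum of squared deviations from the mean,
# divided by n - 1 (the same denominator var() uses)
var_manual <- sum((x - mean(x))^2) / (n - 1)

var_manual
var(x)
all.equal(var_manual, var(x))

# Variance is the covariance of a variable with itself
all.equal(var(x), cov(x, x))
```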

Why does Cov(y, x) / Var(x) give the slope?

The slope tells us how much y is predicted to change when x increases by one unit. Covariance measures how much x and y move together, and variance measures how much x itself varies. Dividing the covariance by the variance of x rescales that joint movement into units of y per unit of x, which is the slope of the least-squares line.
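For readers who want the algebra, here is a sketch of the standard least-squares derivation. The symbols b_0 (intercept), b_1 (slope), and the bars for sample means are my own notation, not something defined earlier in this assignment.

```latex
\begin{align*}
&\text{Choose } b_0, b_1 \text{ to minimize } \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 . \\
&\text{Setting the derivative with respect to } b_0 \text{ to zero gives } b_0 = \bar{y} - b_1 \bar{x} . \\
&\text{Substituting into the derivative with respect to } b_1 \text{ gives }
  \sum_{i=1}^{n} (x_i - \bar{x}) \bigl[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \bigr] = 0 . \\
&\text{Solving for } b_1 : \quad
  b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
      = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
      = \frac{\operatorname{Cov}(y, x)}{\operatorname{Var}(x)} .
\end{align*}
```

The n - 1 terms cancel, so the ratio is the same whether the sample or population versions of covariance and variance are used.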

Showing this in R

Here I use mtcars and regress mpg on weight.

model <- lm(mpg ~ wt, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Slope from regression

coef(model)[2]
##        wt 
## -5.344472

Slope using formula

cov(mtcars$mpg, mtcars$wt) / var(mtcars$wt)
## [1] -5.344472

Compare both values

regression_slope <- coef(model)[2]
formula_slope <- cov(mtcars$mpg, mtcars$wt) / var(mtcars$wt)

regression_slope
##        wt 
## -5.344472
formula_slope
## [1] -5.344472
all.equal(as.numeric(regression_slope), as.numeric(formula_slope))
## [1] TRUE

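The two values agree, which confirms the formula for this regression. The slope of about -5.34 means that each additional 1,000 pounds of weight is associated with about 5.34 fewer miles per gallon, on average.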
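As a further check, the intercept can be recovered the same way. This sketch assumes the standard least-squares result that the fitted line passes through the point of means, so the intercept is mean(y) minus the slope times mean(x). The names slope, intercept, and fit are my own.

```r
x <- mtcars$wt    # weight
y <- mtcars$mpg   # miles per gallon

# Slope from the covariance / variance formula
slope <- cov(y, x) / var(x)

# Intercept: the least-squares line passes through (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

# Compare with the coefficients reported by lm()
fit <- lm(mpg ~ wt, data = mtcars)
c(intercept, slope)
unname(coef(fit))
all.equal(unname(coef(fit)), c(intercept, slope))
```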

Conclusion

In this assignment, I looked at two datasets and identified their data types. The mtcars dataset is cross-sectional, and AirPassengers is time series data. I also showed that the slope from a regression can be calculated using covariance and variance, and both methods gave the same result.