In this assignment, I examine two datasets and identify what type of data each one is. I also explain what covariance and variance mean, and show that the slope in a simple linear regression of y on x equals Cov(y, x) / Var(x).
mtcars
The mtcars dataset contains information on 32 cars,
including miles per gallon (mpg), weight (wt, in 1000 lbs), and
horsepower (hp). Each row represents a different car.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     pch = 19)
This is cross-sectional data because it records many different cars at a single point in time. It does not track the same cars over time; it only compares them with one another. A scatterplot fits this kind of data because we are looking at how two variables relate across different cars.
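A quick structural check can back up this reading. The sketch below uses only base R functions on the built-in dataset; it is meant as a sanity check, not as a formal test of the data type.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs to be installed.
dim(mtcars)                      # number of rows (cars) and columns (variables)
class(mtcars)                    # an ordinary data frame
is.ts(mtcars)                    # FALSE: no time index is attached to the rows
anyDuplicated(rownames(mtcars))  # 0 means every row is a distinct car
```

Each row being a distinct car with no time index is what we would expect from cross-sectional data.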
AirPassengers
The AirPassengers dataset records the number of international airline
passengers (in thousands) each month from 1949 to 1960, so it tracks the same
quantity over time.
head(AirPassengers)
##      Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135
plot(AirPassengers,
     main = "Airline Passengers Over Time",
     ylab = "Passengers (thousands)",
     xlab = "Year")
This is time series data because it records one variable at regular intervals over many time periods. The time plot shows how the number of passengers changes over time, which is the main idea of time series data.
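The time structure can also be read directly from the object. This is a small sketch using base R accessors for `ts` objects; it only inspects the built-in dataset and does not change it.

```r
# AirPassengers is stored as a 'ts' object, which carries its own time index.
class(AirPassengers)      # "ts"
start(AirPassengers)      # first period: year 1949, month 1
end(AirPassengers)        # last period: year 1960, month 12
frequency(AirPassengers)  # 12 observations per year, i.e. monthly data
length(AirPassengers)     # 12 years x 12 months = 144 observations
```

Because the object stores its start, end, and frequency, `plot()` can draw the time axis without being given a separate date column.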
Covariance shows how two variables move together. If both variables tend to be above (or below) their averages at the same time, the covariance is positive. If one tends to be above its average when the other is below, the covariance is negative.
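To make this concrete, the sample covariance can be computed by hand and compared with R's built-in `cov()`. This is a minimal sketch; the names `x`, `y`, `n`, and `cov_by_hand` are my own and are only used for illustration.

```r
x <- mtcars$wt    # weight
y <- mtcars$mpg   # miles per gallon
n <- length(x)

# Sample covariance: sum of the products of deviations from each mean,
# divided by n - 1 (the same convention cov() uses).
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

cov_by_hand
cov(y, x)   # built-in result, for comparison
```

The result is negative, which matches the idea that heavier cars tend to get fewer miles per gallon.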
Variance measures how spread out a variable is. If the values are very different from the average, the variance is high. If they are close to the average, the variance is low.
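The same kind of by-hand check works for variance. In this sketch, `x` and `var_by_hand` are illustrative names of my own; the calculation assumes the usual sample variance with n - 1 in the denominator, which is what `var()` uses.

```r
x <- mtcars$wt

# Sample variance: average squared deviation from the mean (dividing by n - 1).
var_by_hand <- sum((x - mean(x))^2) / (length(x) - 1)

var_by_hand
var(x)      # built-in result, for comparison
cov(x, x)   # variance is the covariance of a variable with itself
```

The last line is a useful way to connect the two ideas: variance is just the special case of covariance where both variables are the same.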
The slope tells us how much y changes, on average, when x increases by one unit. Covariance tells us how x and y move together, and the variance of x tells us how much x itself varies. When we divide the covariance by the variance of x, we rescale that joint movement into units of y per unit of x, which gives us the slope of the line.
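The reasoning can also be written out as a short derivation. The notation here is my own and not from the assignment: $\hat{\beta}_1$ is the fitted slope, $\bar{x}$ and $\bar{y}$ are sample means, and $n$ is the number of observations. Least squares gives the slope as a ratio of sums, and dividing the top and bottom by $n - 1$ turns those sums into the sample covariance and sample variance.

```latex
\[
\hat{\beta}_1
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\operatorname{Cov}(y, x)}{\operatorname{Var}(x)}
\]
```

The factor $1/(n-1)$ cancels, so the ratio is the same whether it is written with sums or with covariance and variance.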
Here I use mtcars and regress mpg on weight.
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
coef(model)[2]
## wt
## -5.344472
cov(mtcars$mpg, mtcars$wt) / var(mtcars$wt)
## [1] -5.344472
regression_slope <- coef(model)[2]
formula_slope <- cov(mtcars$mpg, mtcars$wt) / var(mtcars$wt)
regression_slope
## wt
## -5.344472
formula_slope
## [1] -5.344472
all.equal(as.numeric(regression_slope), as.numeric(formula_slope))
## [1] TRUE
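The two values are identical (-5.344472), which confirms that the slope from lm() equals Cov(mpg, wt) / Var(wt). In context, each additional 1000 lbs of weight is associated with about 5.34 fewer miles per gallon.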
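The match is not specific to weight. As a rough check, the sketch below repeats the comparison for a few other predictors in mtcars. The helper `slope_from_cov()` and the object names `predictors` and `comparison` are hypothetical names I made up for this example; they are not part of base R.

```r
# Hypothetical helper: slope of y on x from the covariance/variance formula.
slope_from_cov <- function(y, x) cov(y, x) / var(x)

predictors <- c("wt", "hp", "disp")

comparison <- data.frame(
  predictor = predictors,
  # Slope estimated by lm() for mpg regressed on each predictor in turn
  lm_slope = vapply(predictors, function(p) {
    unname(coef(lm(mtcars$mpg ~ mtcars[[p]]))[2])
  }, numeric(1), USE.NAMES = FALSE),
  # Slope from Cov(y, x) / Var(x) for the same pairs
  formula_slope = vapply(predictors, function(p) {
    slope_from_cov(mtcars$mpg, mtcars[[p]])
  }, numeric(1), USE.NAMES = FALSE)
)

comparison
all.equal(comparison$lm_slope, comparison$formula_slope)
```

I used `all.equal()` instead of `==` because the two calculations can differ by tiny floating-point rounding even when they are mathematically the same.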
In this assignment, I looked at two datasets and identified their
data types. The mtcars dataset is cross-sectional, and
AirPassengers is time series data. I also showed that the
slope from a regression can be calculated using covariance and variance,
and both methods gave the same result.