Data Types & Slope Parameter

Week 1 Discussion

I Types of Data

1a.

data(airquality)
head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

2a.

The “airquality” dataset built into base R describes daily measurements of New York’s air quality using time-series data from May-September of 1973. It contains six variables: mean ozone (parts per billion), solar radiation (langleys), average wind speed (mph), maximum daily temperature (°F), month, and day of the month. Ozone concentration is the primary variable; higher ozone indicates worse air quality.
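The description above can be checked against the data itself. This is a minimal sketch using only base R functions; no output is shown because the exact print layout depends on the R session.

```r
data(airquality)

# Dimensions: rows are daily observations, columns are the six variables
dim(airquality)

# Months covered (5 = May, ..., 9 = September)
sort(unique(airquality$Month))

# Number of missing values (NA) in each variable
colSums(is.na(airquality))
```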

1b.

data(mtcars)
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2b.

The “mtcars” dataset built into base R describes Motor Trend car road tests using cross-sectional data from the 1974 Motor Trend US magazine. It contains 32 observations on 11 numeric variables covering characteristics such as weight (1000 lbs), number of cylinders, gross horsepower, and rear axle ratio. The main variable of interest is miles per gallon (mpg), as it represents fuel efficiency.
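The observation and variable counts can be confirmed with a short check. This is a sketch in base R; `all_numeric` is just an illustrative name.

```r
data(mtcars)

# Number of cars (rows) and variables (columns)
dim(mtcars)

# TRUE only if every column is stored as a numeric vector
all_numeric <- all(vapply(mtcars, is.numeric, logical(1)))
all_numeric
```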

3a.

“airquality” is a time series dataset, as it contains daily observations over a specific period of time (May-September 1973).

data(airquality)
summary(airquality$Ozone)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   18.00   31.50   42.13   63.25  168.00      37

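data(airquality)
# Day index 1..153 serves as the time axis (May 1 through September 30)
time <- 1:nrow(airquality)
# Missing Ozone values (NA) show up as gaps in the line
plot(time, airquality$Ozone,
     type = "l",
     xlab = "Time (days)",
     ylab = "Ozone (ppb)",
     main = "1973 New York Ozone Levels Over Time")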

The quality of air in New York is a single unit observed repeatedly over time, rather than multiple units being compared with one another. If the dataset included air quality measurements from multiple cities over multiple time periods, it would instead be pooled cross-sectional data (or panel data, if the same cities were tracked in every period).
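Because the observations are ordered in time, the Month and Day columns can be turned into calendar dates. The sketch below assumes the year 1973 from the dataset description; `dates` is an illustrative name, not a column in the data.

```r
data(airquality)

# Build a calendar date for each row; the year is taken from the dataset description
dates <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "-"))

# TRUE if every observation follows the previous one by exactly one day
all(diff(dates) == 1)

# First and last date in the series
range(dates)
```

A vector like `dates` could stand in for the day index on the x-axis so that the plot shows months rather than row numbers.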

ggplot2 attempt

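library(ggplot2)
data(airquality)
# Store the day index in the data frame so aes() does not rely on a global `time` vector
airquality$time <- seq_len(nrow(airquality))
ggplot(airquality, aes(x = time, y = Ozone)) +
  geom_line() +
  labs(title = "1973 New York Ozone Levels Over Time",
       x = "Time (days)",
       y = "Ozone (ppb)")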

Ozone concentration appears to peak in the later part of the plot, most likely in August. This is consistent with ozone rising during the hotter summer months.
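The visual impression can be cross-checked numerically. This is a rough sketch in base R; `peak` and `monthly_mean` are illustrative names.

```r
data(airquality)

# Row with the highest recorded ozone value (which.max skips NA values)
peak <- airquality[which.max(airquality$Ozone), c("Ozone", "Month", "Day")]
peak

# Mean ozone for each month, dropping missing values
monthly_mean <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
round(monthly_mean, 1)
```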

3b.

“mtcars” contains cross-sectional data because it measures individual cars at a single point in time.

data(mtcars)
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000

data(mtcars)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     main = "Cars MPG vs Weight")

Each point on the scatterplot represents an individual observation, so the fuel efficiency of each car can be compared. As weight increases, miles per gallon tends to decrease, aligning with the intuition that heavier cars are less fuel efficient.
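One simple way to put a number on this pattern is to compare average mpg for heavier and lighter cars. The split at the median weight is an arbitrary choice made for illustration, and `heavy` and `group_means` are illustrative names.

```r
data(mtcars)

# TRUE for cars heavier than the median weight, FALSE otherwise
heavy <- mtcars$wt > median(mtcars$wt)

# Average mpg within each group; the "TRUE" entry is the heavier group
group_means <- tapply(mtcars$mpg, heavy, mean)
group_means
```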

ggplot2 attempt

library(ggplot2)
data(mtcars)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Cars MPG vs Weight",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")

The scatterplot shows the relationship between a car’s weight and its mpg. Since the observations do not span different time periods, the data are cross-sectional.

II Slope Parameter

1a. Covariance measures how two variables move together: whether above-average values of one variable tend to occur with above-average values of the other (positive covariance) or with below-average values (negative covariance). In the mtcars example, the covariance between mpg and weight is negative, because heavier cars tend to have lower mpg.

\[ \text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}\]
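The formula can be evaluated by hand in R. Note that the formula above divides by n, while R's built-in `cov()` divides by n - 1, so the two values differ slightly. The variable names below are illustrative.

```r
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Sum of cross-products of deviations from the means
sxy <- sum((x - mean(x)) * (y - mean(y)))

cov_n  <- sxy / n        # divisor n, as in the formula above
cov_n1 <- sxy / (n - 1)  # divisor n - 1, as used by cov()

c(cov_n = cov_n, cov_n1 = cov_n1, builtin = cov(x, y))
```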

1b. Variance measures how far a variable’s values are spread around its own mean; graphically, it is the dispersion of the data points around the mean of that variable. If there is high variance, the values in a dataset are more spread out.

\[ \text{Var}(X) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]
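The same hand calculation works for the variance. As with `cov()`, R's `var()` divides by n - 1 rather than n. The last line illustrates that variance is measured in squared units, using a conversion of weight from 1000 lbs to lbs as an example.

```r
data(mtcars)
x <- mtcars$wt
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

var_n  <- ss / n        # divisor n, as in the formula above
var_n1 <- ss / (n - 1)  # divisor n - 1, as used by var()

c(var_n = var_n, var_n1 = var_n1, builtin = var(x))

# Rescaling x by 1000 multiplies its variance by 1000^2
scale_ratio <- var(1000 * x) / var(x)
scale_ratio
```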

2. Dividing the covariance of y and x by the variance of x gives the slope coefficient of a simple linear regression of y on x. The slope represents the change in y associated with a one-unit change in x. Covariance captures the direction in which the two variables move together, but its magnitude depends on the units and variability of x, so it acts like an unscaled version of the slope. Dividing by the variance of x normalizes it into the change in y per unit change in x.
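One way to see where this ratio comes from is the standard least-squares argument, sketched here. The symbols \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the fitted intercept and slope are notation introduced for this sketch. The fitted line minimizes the sum of squared residuals:

\[ \min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Setting the derivative with respect to \(\hat{\beta}_0\) to zero gives \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Substituting this into the derivative with respect to \(\hat{\beta}_1\) and solving gives

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{cov}(X, Y)}{\text{Var}(X)} \]

The divisor n appears in both the covariance and the variance, so it cancels in the ratio.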

data(mtcars)
model <- lm(mpg ~ wt, data=mtcars)
coef(model)[2]

       wt 
-5.344472

cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)

[1] -5.344472

The slope coefficient obtained using the lm() function is the same as the value computed manually as the ratio of covariance to variance.
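The check above uses `cov()` and `var()`, which divide by n - 1, while the formulas in this section divide by n. Because the divisor cancels in the ratio, either version gives the same slope. This is a minimal sketch with illustrative variable names.

```r
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Covariance and variance with divisor n, matching the formulas in this section
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n
var_n <- sum((x - mean(x))^2) / n

slope_n  <- cov_n / var_n
slope_lm <- unname(coef(lm(mpg ~ wt, data = mtcars))[2])

# TRUE if the two slopes agree up to floating-point tolerance
all.equal(slope_n, slope_lm)
```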