STA521 Lab 03

Justin Kao

Course Tools

RStudio Online: Your Data Science Playground


  • Browser-based: Access RStudio right from your browser—no setup headaches.
  • Duke VPN or Campus Connection: Seamlessly connect from anywhere using Duke VPN (Cisco) or while on campus.
  • Cloud-Powered: Harness the power of online resources—save your computer’s CPU and GPU for other tasks.
  • Hassle-Free Setup: No need to run install.packages—everything is pre-installed and ready to go!

Source: Dr. Colin Rundel STA 523

Exam Review

Exam 1, Question 2

You want to predict how long someone lives from the number of hours of exercise they do in a week. From a sample of 50 people, you estimate the intercept as 90 and the slope as \(-2\). The coefficient of determination is 0.6, the sd of the residuals is 9, and the mean number of hours spent exercising is 8 with an sd of 4.


Method 1: Since the slope is negative, the correlation coefficient is the negative square root of the coefficient of determination. Hence, with \(\hat{y} = 90 - 2x\), we get \(r = -\sqrt{0.6} \approx -0.77\).

Exam 1, Question 2

Method 2: Alternatively, calculate \(r\) from the reported summary statistics using the formula for the correlation coefficient in terms of sums of squares.

  1. Estimation of the RSS (Residual Sum of Squares): Compute the RSS from the estimate of the residual variance (\(\hat{\sigma}^2 = 9^2\)), multiplied by the degrees of freedom (\(n-2\)). This is standard for a linear regression model with 2 parameters (intercept and slope).

\[ \sum\left(y_i-\hat{y}_i\right)^2 = 9^2 \cdot (n-2) = 9^2 \cdot 48 \]

  2. Reformulation of \(R^2\):

    \[ r^2 = 0.6 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum\left(y_i-\hat{y}_i\right)^2}{\sum\left(y_i-\bar{y}\right)^2} = 1 - \frac{9^2 \times 48}{\sum\left(y_i-\bar{y}\right)^2} \] Hence,

    \[ \sum\left(y_i-\bar{y}\right)^2 = \frac{9^2 \times 48}{0.4} \] Here, we calculate the TSS (\(SS_y\)) by rearranging the formula for \(R^2\) to isolate \(SS_y\). The denominator 0.4 comes from \(1 - R^2 = 1 - 0.6 = 0.4\).

  3. Estimation of \(SS_x\) (the total sum of squares of \(x\)):

    \[ \sqrt{\frac{SS_x}{n-1}} = 4 \Rightarrow SS_x = 4^2 \cdot 49 \] Here, we estimate the total sum of squares of \(x\) from the sample variance formula, where 4 is the standard deviation of \(x\).

  4. Calculation of \(r\):

    \[ r = \frac{SS_{xy}}{\sqrt{SS_x SS_y}} = -2 \sqrt{\frac{SS_x}{SS_y}} = -2 \sqrt{\frac{4^2 \cdot 49}{9^2 \cdot 48 / 0.4}} \approx -0.568 \] The middle step uses the fact that \(\hat{\beta}_1 = SS_{xy}/SS_x = -2\), so \(SS_{xy} = -2\,SS_x\).
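A quick numerical check of both methods in R (a sketch using only the quantities quoted in the question; the variable names are ours):

# Sanity check of both methods, using only the quantities given in the question
n <- 50; slope <- -2; r2 <- 0.6; sd_resid <- 9; sd_x <- 4

r_method1 <- sign(slope) * sqrt(r2)      # Method 1: sign(slope) * sqrt(R^2)

SS_x <- sd_x^2 * (n - 1)                 # from the sample variance of x
RSS  <- sd_resid^2 * (n - 2)             # from the residual sd
SS_y <- RSS / (1 - r2)                   # TSS, from R^2 = 1 - RSS/TSS
r_method2 <- slope * sqrt(SS_x / SS_y)   # Method 2
c(r_method1, r_method2)                  # -0.775 and -0.568

The two values disagree because the quoted summary statistics are not mutually consistent; on actual data, \(r^2 = R^2\) holds exactly in simple linear regression, so both routes would return the same number.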

Slides Review

Curse of Dimensionality Idea

In the context of data analysis and machine learning, dimensions refer to the features or attributes of a dataset. For instance, in the Ozone dataset used in today's lab, the dimensions include variables such as wind speed, temperature, humidity, pressure height, and ozone levels. Each feature adds a new dimension to the dataset, and the analysis becomes more complex as more variables are included.

Curse of Dimensionality: Regression Insights

  • Formula Progression

1 Variable: \(\text{Ozone Level} = \beta_0 + \beta_1 (\text{Wind Speed}) + \epsilon\)

2 Variables: \(\text{Ozone Level} = \beta_0 + \beta_1 (\text{Wind Speed}) + \beta_2 (\text{Humidity}) + \epsilon\)

3 Variables: \(\text{Ozone Level} = \beta_0 + \beta_1 (\text{Wind Speed}) + \beta_2 (\text{Humidity}) + \beta_3 (\text{Temperature}) + \epsilon\)

  • Explanation: With each added variable, we introduce a new dimension. To capture relationships in this higher-dimensional space, we need more data points (rows) to adequately model these features.
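To make the progression concrete, here is a minimal sketch fitting the three nested models in R; it uses the column names assigned later in this lab, with Temp1 (the Sandburg temperature) standing in for the Temperature term:

# Nested models of increasing dimension, using the column names
# assigned later in this lab
library(mlbench)
data(Ozone)
Ozone <- na.omit(Ozone)
names(Ozone) <- c("Month", "Day_of_Month", "Day_of_Week", "Ozone",
                  "Pressure", "Wind", "Humidity", "Temp1", "Temp2",
                  "Inversion_Height", "Pressure_Gradient",
                  "Inversion_Temp", "Visibility")

fit1 <- lm(Ozone ~ Wind, data = Ozone)                     # 1 predictor
fit2 <- lm(Ozone ~ Wind + Humidity, data = Ozone)          # 2 predictors
fit3 <- lm(Ozone ~ Wind + Humidity + Temp1, data = Ozone)  # 3 predictors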

Curse of Dimensionality 1D Space

  • Imagine you start with a line (1D). You only need a few points to cover the length of the line, since it’s just a straight path.

Curse of Dimensionality 2D Space

  • Now, think about a square (2D). Instead of just covering a line, you now need to fill up an entire area. You need more points to cover both the width and height of the square.

Curse of Dimensionality 3D Space

  • Next, picture a cube (3D). To fill the volume of the cube, you need even more points because you’re now working in three directions: length, width, and height.

Problems with High Dimensions

  • Data sparsity: Imagine trying to fill a large room with a handful of marbles—they’d be scattered far apart. In high-dimensional data, the same thing happens: data points are far apart, making it harder for models to find meaningful patterns.
  • Increased computation: More dimensions mean more computational resources and time to process the data.
  • Overfitting (learning the wrong thing): When there are many features relative to observations, models start “memorizing” noise or irrelevant details instead of learning the important patterns. This makes the model less effective at predicting new data correctly.
  • Distances lose meaning: In a simple space, the distance between points matters. But in many dimensions, all points start feeling equally far apart, making it hard to distinguish which data points are truly similar.
  • Algorithms don’t perform as well: Algorithms that rely on distances (like k-nearest neighbors) become less reliable in high-dimensional spaces. They struggle because the concept of “closeness” gets diluted.

Takeaway

As you go up to even more dimensions (4D, 5D, etc.), the space grows exponentially larger, and you need exponentially more points to fill it. But no matter how many points you add, they get spread out thinner and thinner, making it harder to find patterns or meaningful relationships in the data.


So, in higher dimensions, it becomes extremely difficult to “fill” the space with enough data, and the data becomes very sparse. This is why analyzing data in many dimensions gets much harder.
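A small simulation makes the sparsity and distance claims tangible: among random points in the unit hypercube, the ratio between the largest and smallest pairwise distance shrinks toward 1 as the dimension grows (a sketch; the point count and dimensions are arbitrary choices):

# Distance concentration: the max/min pairwise distance ratio
# shrinks as the dimension grows
set.seed(521)
for (d in c(1, 2, 3, 10, 100, 1000)) {
  X <- matrix(runif(100 * d), nrow = 100)  # 100 points in the unit hypercube
  dd <- dist(X)                            # all pairwise Euclidean distances
  cat(sprintf("d = %4d   max/min distance ratio = %.1f\n",
              d, max(dd) / min(dd)))
}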

🚀 Let’s Dive In and Get Started! 🌟

Ozone Data: The Classic Air Quality Challenge! 😀🤟

Locate the Ozone data, which records Los Angeles ozone pollution levels in 1976. Follow the instructions below and submit a PDF of your results to Canvas. Be sure to include relevant figures.

Note

This is a very open-ended project, meaning every response you provide is valuable. Aim to extract as much information from the data as possible. This kind of open-ended exploration builds the ad hoc problem-solving skills that are essential when tackling real-world problems as a Data Scientist, Engineer, or Analyst.

Tasks 🤙🎉

  • The objective is to predict the daily maximum one-hour average ozone reading (V4 or Ozone).
  • Perform Exploratory Data Analysis (EDA): Create and visualize various plots for the data variables to gain insights into the distribution, relationships, and potential issues.
  • Data Wrangling: Review the dataset and decide which variables to remove. For instance, consider removing the variables Day_of_Month and Day_of_Week. Additionally, convert the Month variable into a numerical format if necessary for analysis.
  • Model Comparison: Create and evaluate four models:
    1. Full model: Fit a linear model to the data.
    2. GLM with Gaussian family: Fit a Generalized Linear Model (GLM) using the Gaussian family.
    3. GLM with Gamma family: Fit a Generalized Linear Model (GLM) using the Gamma family.
    4. Final model: Select the final model based on the results, and provide justifications for why this model was chosen over the others.

Load Data

  • You can load the data using either of the following methods:

    library(mlbench)
    data(Ozone)
    View(Ozone)

    or

    data("Ozone", package = "mlbench")
    View(Ozone)
  • mlbench stands for “Machine Learning Benchmark Problems”

  • When you load the data, you’ll notice the variables are named V1, V2, …, V13, which is not very descriptive. Therefore, it’s a good idea to rename the columns for clarity.

Data Wrangling

  • Let’s clean the dataset by removing rows with missing values.

Important

While discarding data with missing values isn’t always the best approach, we’re opting for simplicity here. In practice, techniques like imputation could be used to fill in the gaps, preserving more of the data’s richness—though it’s a bit more complex.

Ozone <- na.omit(Ozone)
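If you wanted to keep the incomplete rows instead, a minimal mean-imputation sketch might look like the following (impute_mean is a hypothetical helper, and mean imputation is the crudest option; packages such as mice implement more principled methods):

# Hypothetical alternative to na.omit(): fill numeric NAs with the column mean
impute_mean <- function(x) {
  if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
# Ozone_imputed <- as.data.frame(lapply(Ozone, impute_mean))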
  • Rename columns to descriptive names
names(Ozone) <- c("Month", "Day_of_Month", "Day_of_Week", "Ozone", 
                        "Pressure", "Wind", "Humidity", "Temp1", "Temp2", 
                        "Inversion_Height", "Pressure_Gradient", 
                        "Inversion_Temp", "Visibility")
  • Use str() to check the data structure, ensuring numerical variables are correctly formatted and categorical variables are properly defined.
str(Ozone)

Data Variables

  1. Month: 1 = January, …, 12 = December
  2. Day of Month
  3. Day of Week: 1 = Monday, …, 7 = Sunday
  4. Daily Maximum One-hour Average Ozone Reading
  5. 500 Millibar Pressure Height (m) measured at Vandenberg AFB
  6. Wind Speed (mph) at Los Angeles International Airport (LAX)
  7. Humidity (%) at LAX
  8. Temperature (°F) measured at Sandburg, CA
  9. Temperature (°F) measured at El Monte, CA
  10. Inversion Base Height (feet) at LAX
  11. Pressure Gradient (mm Hg) from LAX to Daggett, CA
  12. Inversion Base Temperature (°F) at LAX
  13. Visibility (miles) measured at LAX

Source: R Documentation Ozone {mlbench}

Why Not a Normal Distribution 1.0?

The histogram clearly shows that the assumption of normality does not hold. The distribution of the response deviates significantly from a normal distribution, showing skewness and possibly heavy tails. This visual evidence suggests that applying methods reliant on normality assumptions may lead to inaccurate results.

The histogram appears more consistent with a gamma distribution.
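The histogram in question can be reproduced with something like the sketch below; the overlaid gamma curve uses rough moment-based estimates (our choice, not a formal fit):

# Histogram of the response with a moment-matched gamma density overlaid
hist(Ozone$Ozone, breaks = 30, freq = FALSE,
     main = "Daily maximum one-hour ozone", xlab = "Ozone")
m <- mean(Ozone$Ozone); v <- var(Ozone$Ozone)
curve(dgamma(x, shape = m^2 / v, rate = m / v), add = TRUE, lwd = 2)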

Why Not a Normal Distribution 2.0?

If we fit the full linear model

lm_model <- lm(Ozone ~ ., data = Ozone)

Why Not a Normal Distribution 2.1?

  • Residual Spread: The residuals should be evenly spread around the horizontal line (residuals = 0) without any distinct pattern. In the plot, the residuals form a U-shaped pattern rather than random scatter. This suggests that our model might be underfitting: there may be a non-linear relationship in the data that the model doesn’t account for.
  • Heteroscedasticity: If the spread of residuals increases or decreases with fitted values, it would indicate heteroscedasticity (i.e., non-constant variance). In this plot, the spread seems fairly consistent across fitted values, meaning there is no strong indication of heteroscedasticity, though some slight increase in variance can be seen for higher fitted values.
  • Outliers: The numbers next to some points represent influential or outlying observations. These points are distant from the bulk of residuals and could have a disproportionate effect on the model fit. You should investigate these points further to ensure they are not having too much influence on your model.
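These observations refer to the residuals-vs-fitted plot, which is the first of base R’s diagnostic plots for a fitted lm:

# Residuals vs. fitted values for the full linear model
plot(lm_model, which = 1)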

Generalized Linear Models (GLM)

| Aspect | Linear Model (LM) | Generalized Linear Model (GLM) |
|---|---|---|
| Normality assumption | Assumes a normal distribution for the response variable and errors, \(\epsilon \sim N(0, \sigma^2)\) | Does not assume normality; allows non-normal distributions such as binomial, Poisson, etc. |
| Error distribution | Errors follow a normal distribution \(N(0, \sigma^2)\) | Errors can follow distributions from the exponential family (e.g., binomial, Poisson, gamma) |
| Response distribution | The response variable is assumed to be normally distributed | The response variable can follow different distributions, depending on the link function and data type (e.g., binomial for binary data, Poisson for count data) |
| Variance structure | Constant variance \(\sigma^2\) across all observations | Variance is a function of the mean (e.g., \(\mu(1-\mu)\) for binomial, \(\mu\) for Poisson) |
| Use case | Best suited for continuous, normally distributed data | Suitable for a wide range of data types (binary, count, continuous with non-normal errors) |

GLM Math setup

  • Linear Model (LM): \[ Y = X\beta + \epsilon \] Where:
    • \(Y\) is the response variable.
    • \(X\) is the matrix of predictor variables.
    • \(\beta\) is the vector of regression coefficients.
    • \(\epsilon\) is the error term, assumed to follow \(\epsilon \sim N(0, \sigma^2)\).
  • Generalized Linear Model (GLM): \[ g(\mu) = X\beta \] Where:
    • \(g(\mu)\) is the link function that relates the expected value of the response variable \(\mu = \mathbb{E}[Y]\) to the linear predictor \(X\beta\).
    • \(Y\) follows a distribution from the exponential family (e.g., binomial, Poisson, Gamma).
    • \(X\) is the matrix of predictor variables.
    • \(\beta\) is the vector of regression coefficients.
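For example, the Gamma GLM fit below uses the log link \(g(\mu) = \ln(\mu)\), so

\[ \ln(\mu) = X\beta \quad \Longleftrightarrow \quad \mu = e^{X\beta} > 0, \]

which keeps the fitted mean strictly positive, a natural match for a positive, right-skewed response like ozone.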

GLM Model

# Fit the linear model (lm)
lm_model <- lm(Ozone ~ ., data = Ozone)

# Fit the GLM model (for example, using Gamma family)
glm_model <- glm(Ozone ~ ., data = Ozone, family = Gamma(link = "log"))

# Compare AIC
AIC(lm_model, glm_model)
          df      AIC
lm_model  56 1190.588
glm_model 56 1081.454
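The tasks also call for a GLM with the Gaussian family; with its default identity link this reproduces the linear model exactly (same coefficients and same AIC), which is worth verifying:

# Gaussian GLM with the (default) identity link is equivalent to lm()
glm_gauss <- glm(Ozone ~ ., data = Ozone, family = gaussian())
AIC(lm_model, glm_gauss)  # the two AIC values match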
  • The AIC for a given model is calculated as:

\[\text{AIC} = 2k - 2 \ln(L)\] Where:

  • \(k\) is the number of estimated parameters in the model.
  • \(L\) is the maximum likelihood of the model, representing how well the model fits the data.
  • \(\ln(L)\) is the log-likelihood of the model.
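You can verify this formula directly from the fitted model’s log-likelihood (logLik() stores the parameter count in its df attribute):

# AIC by hand: 2k - 2*log-likelihood, matching AIC(lm_model)
k <- attr(logLik(lm_model), "df")  # parameters, including sigma
2 * k - 2 * as.numeric(logLik(lm_model))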

Common examples:

| Variable Type | Common Distribution | Common Link |
|---|---|---|
| Real-valued with a bell-shaped distribution | Normal (Gaussian) | Identity |
| Binary (0/1) | Binomial | Logit |
| Count | Poisson | Log |
| Positive, continuous, right-skewed | Gamma / inverse Gaussian | Log |
| Positive, continuous with a large mass at zero | Tweedie | Log |

(Note: For gamma and inverse Gaussian, the target variable has to be strictly positive. Values of zero are not allowed.)

Stepwise Selection - Idea

  • Selection process: Sequentially add/drop features, one at a time, until there is no improvement in the selection criterion.
| Area | Backward | Forward |
|---|---|---|
| 1. Which model to start with? | Full model | Intercept-only model |
| 2. Add or drop variables? | Drop | Add |
| 3. Which method tends to produce a simpler model? | Forward selection | Forward selection |
  • Code
# Backward stepwise selection on the linear model
backward_lm <- step(lm_model, direction = "backward")
# Backward stepwise selection on the GLM
backward_glm <- step(glm_model, direction = "backward")
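For comparison, a forward-selection sketch starts from the intercept-only model and is allowed to add terms up to the full model (null_model and forward_model are our names):

# Forward stepwise selection, from intercept-only up to the full model
null_model <- lm(Ozone ~ 1, data = Ozone)
forward_model <- step(null_model, direction = "forward",
                      scope = formula(lm_model))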

Selection criteria based on penalized likelihood

  • Idea: Prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount.

  • Two common choices:

    | Criterion | Definition | Penalty per Parameter |
    |---|---|---|
    | AIC | \(-2l + 2(p + 1)\) | 2 |
    | BIC | \(-2l + \ln(n_{tr})(p + 1)\) | \(\ln(n_{tr})\) |
  • AIC vs. BIC:

    • For both, the lower the value, the better.
    • BIC is more conservative and results in simpler models.
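step() uses the AIC penalty (k = 2) by default; passing k = log(n) switches the criterion to BIC, which typically selects a smaller model:

# Backward selection under the BIC penalty instead of AIC
bic_model <- step(lm_model, direction = "backward",
                  k = log(nrow(Ozone)))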

Source

  • Leo Breiman, Department of Statistics, UC Berkeley. Data used in Leo Breiman and Jerome H. Friedman (1985), “Estimating optimal transformations for multiple regression and correlation,” JASA, 80, pp. 580-598.
  • Dr. Colin Rundel, STA 523 slides
  • Dr. David Banks, STA 521 slides
  • Dr. Ambrose Lo, ACTEX
  • An Introduction to Statistical Learning (ISLR)

Thank you!

Remember to submit your work on Canvas. Have a great rest of your day!