There is no need to run install.packages(); everything is pre-installed and ready to go!

You want to predict how long someone lives from the number of hours of exercise they do in a week. From a sample of 50 people, you estimate the intercept as 90 and the slope as \(-2\). The coefficient of determination is 0.6, the standard deviation of the residuals is 9, and the mean number of hours spent exercising is 8 with a standard deviation of 4.
Method 1: Since \(\hat{y} = 90 - 2x\) has a negative slope, \(r\) is the negative square root of the coefficient of determination: \(r = -\sqrt{0.6} \approx -0.77\).
Method 2: Calculate \(r\) from the regression summary statistics. First, recover the residual sum of squares from the residual standard deviation (\(s = 9\), with \(n - 2 = 48\) degrees of freedom):
\[ \sum\left(y_i-\hat{y}_i\right)^2 = 9^2 \cdot (n-2) = 9^2 \cdot 48 \]
Reformulation of \(R^2\):
\[ r^2 = 0.6 = 1- \frac{\text{RSS}}{\text{TSS}}= 1 - \frac{\sum\left(y_i-\hat{y}_i\right)^2}{\sum\left(y_i-\bar{y}\right)^2} = 1 - \frac{9^2 \times 48}{\sum\left(y_i-\bar{y}\right)^2} \] Hence,
\[ \sum\left(y_i-\bar{y}\right)^2 = \frac{9^2 \times 48}{0.4} \] Here we calculate the TSS (\(SS_y\)) by rearranging the formula for \(R^2\) to isolate \(SS_y\); the denominator 0.4 is \(1 - R^2 = 1 - 0.6\).
Estimation of \(SS_x\) (the total sum of squares of \(x\)):
\[ \sqrt{\frac{SS_x}{n-1}} = 4 \Rightarrow SS_x = 4^2 \cdot 49 \] Here we estimate the total sum of squares for \(x\) from the sample variance formula, where 4 is the standard deviation of \(x\) and \(n - 1 = 49\).
Calculation of \(r\):
\[ r = \frac{SS_{xy}}{\sqrt{SS_x SS_y}} = -2 \sqrt{\frac{SS_x}{SS_y}} = -2 \sqrt{\frac{4^2 \cdot 49}{9^2 \cdot 48 / 0.4}} \approx -0.568 \] using \(SS_{xy} = \hat{\beta}_1 SS_x = -2\,SS_x\). Note that the two methods disagree because the supplied summary statistics are not mutually consistent, so the answer depends on which quantities are used.
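As a quick check, the Method 2 arithmetic can be reproduced in R (a sketch using only the values given in the problem statement):

# Numeric check of Method 2, using the values from the problem
n    <- 50
b1   <- -2                    # estimated slope
s    <- 9                     # sd of the residuals
sd_x <- 4                     # sd of x
SS_x <- sd_x^2 * (n - 1)      # 784
RSS  <- s^2 * (n - 2)         # 3888
SS_y <- RSS / (1 - 0.6)       # 9720, from R^2 = 1 - RSS/SS_y
b1 * sqrt(SS_x / SS_y)        # -0.568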
In the context of data analysis and machine learning, dimensions refer to the features or attributes of a dataset. For instance, the Ozone dataset used in today's lab includes variables such as wind speed, temperature, humidity, pressure height, and ozone levels. Each feature adds a new dimension to the dataset, increasing the complexity of the analysis as more variables are included. For example:
1 Variable: \(\text{Ozone Level} = \beta_0 + \beta_1 (\text{Wind Speed}) + \epsilon\)
2 Variables: \(\text{Ozone Level} = \beta_0 + \beta_1 (\text{Wind Speed}) + \beta_2 (\text{Humidity}) + \epsilon\)
3 Variables: \(\text{Ozone Level} = \beta_0 + \beta_1 (\text{Wind Speed}) + \beta_2 (\text{Humidity}) + \beta_3 (\text{Temperature}) + \epsilon\)
As you go up to even more dimensions (4D, 5D, etc.), the space grows exponentially larger, and you need exponentially more points to fill it. But no matter how many points you add, they get spread out thinner and thinner, making it harder to find patterns or meaningful relationships in the data.
So, in higher dimensions, it becomes extremely difficult to “fill” the space with enough data, and the data becomes very sparse. This is why analyzing data in many dimensions gets much harder.
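To see this sparsity numerically, here is a small simulation sketch (the function name is illustrative): with a fixed number of uniform points, the typical distance to the nearest neighbor grows quickly as the dimension increases.

set.seed(521)
# Median nearest-neighbor distance among n uniform points in [0, 1]^d
nn_dist <- function(d, n = 500) {
  X <- matrix(runif(n * d), nrow = n)   # n points in d dimensions
  D <- as.matrix(dist(X))               # pairwise Euclidean distances
  diag(D) <- Inf                        # exclude self-distances
  median(apply(D, 1, min))              # typical gap to the nearest point
}
sapply(c(1, 2, 5, 10, 50), nn_dist)     # grows steeply with d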
Locate the Ozone data, which records Los Angeles ozone pollution levels in 1976. Follow the instructions below and submit a PDF of your results to Canvas. Be sure to include relevant figures.
Note
This is a very open-ended project, so every response you provide is valuable. The best approach is to extract as much information from the data as possible. Practicing this builds the ad-hoc problem-solving skills that are essential when tackling real-world problems as a Data Scientist, Engineer, or Analyst.
The response variable is the ozone level (V4, or Ozone after renaming). Remove Day_of_Month and Day_of_Week. Additionally, convert the Month variable into a numerical format if necessary for analysis.

You can load the data using either of the following methods:
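For example, a minimal sketch of the first method, assuming the mlbench package is already installed:

# Method 1: attach the package, then load the dataset
library(mlbench)
data(Ozone)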
or, equivalently, without attaching the package:
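# Method 2: load the dataset directly from the package
data(Ozone, package = "mlbench")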
mlbench stands for "Machine Learning Benchmark Problems".
When you load the data, you'll notice the variables are named V1, V2, …, V13, which is not very descriptive. Therefore, it's a good idea to rename the columns for clarity.
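One possible renaming sketch; the descriptive names below are illustrative choices based on the variable descriptions in the mlbench documentation:

# Replace V1, ..., V13 with descriptive names (order follows ?Ozone)
colnames(Ozone) <- c("Month", "Day_of_Month", "Day_of_Week", "Ozone",
                     "Pressure_Height", "Wind_Speed", "Humidity",
                     "Temp_Sandburg", "Temp_ElMonte", "Inversion_Height",
                     "Pressure_Gradient", "Inversion_Temp", "Visibility")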
Important
While discarding data with missing values isn’t always the best approach, we’re opting for simplicity here. In practice, techniques like imputation could be used to fill in the gaps, preserving more of the data’s richness—though it’s a bit more complex.
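A one-line sketch of the simple approach described above:

# Keep only complete cases (simple, but discards partially observed rows)
Ozone <- na.omit(Ozone)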
The histogram shows that the assumption of normality does not hold: the distribution of the response deviates markedly from a normal distribution, with clear skewness and possibly heavy tails. This visual evidence suggests that methods relying on normality assumptions may give inaccurate results.
The histogram appears more consistent with a gamma distribution.
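For reference, a sketch of the plot being described, assuming the response column has been renamed to Ozone:

# Right-skewed histogram of the response, suggesting a gamma-like shape
hist(Ozone$Ozone, breaks = 20,
     main = "Daily maximum ozone", xlab = "Ozone level")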
If we fit the full linear model, we risk underfitting, as there might be a non-linear relationship in the data that the model doesn't account for.

Aspect | Linear Model (LM) | Generalized Linear Model (GLM) |
---|---|---|
Normality Assumption | Assumes normal distribution for the response variable and errors \(\epsilon \sim N(0, \sigma^2)\) | Does not assume normality; allows for non-normal distributions such as binomial, Poisson, etc. |
Error Distribution | Errors follow a normal distribution \(N(0, \sigma^2)\) | Errors can follow distributions from the exponential family (e.g., binomial, Poisson, gamma) |
Response Distribution | The response variable is assumed to be normally distributed | The response variable can follow different distributions, depending on the link function and data type (e.g., binomial for binary data, Poisson for count data) |
Variance Structure | Constant variance \(\sigma^2\) across all observations | Variance is a function of the mean (e.g., \(\mu(1-\mu)\) for binomial, \(\mu\) for Poisson) |
Use Case | Best suited for continuous, normally distributed data | Suitable for a wide range of data types (binary, count, continuous with non-normal errors) |
# Fit the linear model (lm)
lm_model <- lm(Ozone ~ ., data = Ozone)
# Fit the GLM model (for example, using Gamma family)
glm_model <- glm(Ozone ~ ., data = Ozone, family = Gamma(link = "log"))
# Compare AIC
AIC(lm_model, glm_model)
df AIC
lm_model 56 1190.588
glm_model 56 1081.454
\[\text{AIC} = 2k - 2 \ln(L)\] where \(k\) is the number of estimated parameters and \(L\) is the maximized value of the model's likelihood. Lower AIC indicates a better trade-off between fit and complexity, so the gamma GLM is preferred here.
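As a check, this formula can be reproduced by hand for the linear model and compared with R's AIC() output:

# Hand computation of AIC for lm_model; k counts all estimated
# parameters (coefficients plus the residual variance)
ll <- logLik(lm_model)
k  <- attr(ll, "df")
2 * k - 2 * as.numeric(ll)   # matches AIC(lm_model)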
Variable Type | Common Dist. | Common Link |
---|---|---|
Real-valued with a bell-shaped dist. | Normal (Gaussian) | Identity |
Binary (0/1) | Binomial | Logit |
Count | Poisson | Log |
+ve, continuous with right skew | Gamma / inverse Gaussian | Log |
+ve, continuous with a large mass at zero | Tweedie | Log |
(Note: For gamma and inverse Gaussian, the target variable has to be strictly positive. Values of zero are not allowed.)
Area | Backward | Forward |
---|---|---|
1. Which model to start with? | Full model | Intercept-only model |
2. Add or drop variables? | Drop | Add |
3. Which method tends to produce a simpler model? | Forward selection | Forward selection |
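A sketch of both directions in R using step(), assuming lm_model from above as the full model:

# Backward: start from the full model and drop variables
backward_fit <- step(lm_model, direction = "backward", trace = 0)

# Forward: start from the intercept-only model and add variables,
# searching up to the full model's formula
null_model  <- lm(Ozone ~ 1, data = Ozone)
forward_fit <- step(null_model, scope = formula(lm_model),
                    direction = "forward", trace = 0)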
Idea: Prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount.
Two common choices:
Criterion | Definition | Penalty per Parameter |
---|---|---|
AIC | \(-2l + 2(p + 1)\) | 2 |
BIC | \(-2l + \ln(n_{tr})(p + 1)\) | \(\ln(n_{tr})\) |
AIC vs. BIC: Since \(\ln(n_{tr}) > 2\) whenever \(n_{tr} > e^2 \approx 7.4\), BIC imposes a heavier penalty per parameter and therefore tends to select simpler models than AIC.
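Both criteria plug into the same search through step()'s k argument: k = 2 (the default) applies the AIC penalty, and k = ln(n_tr) applies the BIC penalty. A sketch:

# BIC-penalized backward search: pass k = log(n_tr) instead of the
# default k = 2 used for AIC
n_tr    <- nrow(Ozone)          # training size after dropping NAs
bic_fit <- step(lm_model, direction = "backward", trace = 0, k = log(n_tr))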
Remember to submit your work on Canvas. Have a great rest of your day!
Sta 521 - Fall 2024