1 Theoretical framework

When modeling data, we often explore various models before selecting the most suitable one. We will apply multiple criteria to evaluate and compare the quality of model fit. In this exercise, we will examine several criteria related to the linear model

\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} + \ldots + \beta_kx_{ik} +\varepsilon_i; \quad \varepsilon_i \overset{i.i.d.}{\sim} (0, \sigma^2), \] for \(i = 1, \ldots, n\), where \(n\) represents the size of the dataset and \(k\) denotes the number of covariates. We also define \(p = k + 1\) to indicate the total number of coefficients \(\beta_j\).

We will use the following models on \(\texttt{fev}\) dataset (see Practical example for details) to demonstrate the criteria:

\[ \text{Model 1: }\\ \texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Sex}_i +\varepsilon_i, \\ \phantom{=}\\ \text{Model 2: }\\ \texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Height}_i^2 + \beta_3\times\texttt{Smoker}_i + \beta_4\times\texttt{Sex}_i +\varepsilon_i. \]

load("fev.RData")
model1 <- lm(FEV ~ Height + Sex, data=fev)
model2 <- lm(FEV ~ Height + Height^2 + Smoker + Sex, data=fev)

1.1 Coefficient of determination

R-squared \(\left( R^2 \right)\), or the coefficient of determination, is a key statistical measure used to evaluate the goodness of fit for a linear model. It indicates the proportion of the variance in the dependent variable that can be explained by the independent variables in the model. To calculate \(R^2\), we first need to evaluate specific sums of squares based on the linear model:

\[ \text{Total Sum of Squares} \quad TSS = \sum_{i = 1}^n(Y_i - \bar Y)^2, \\ \text{Sum of Squares due to Regression} \quad SSR = \sum_{i = 1}^n(\hat Y_i - \bar{\hat Y})^2 = \sum_{i = 1}^n(\hat Y_i - \bar{ Y})^2, \\ \text{Sum of Squared Errors} \quad SSE = \sum_{i = 1}^n(Y_i - \hat Y_i)^2, \]

where \(n\) represents the size of the dataset, \(Y_i\) and \(\hat Y_i\) denote the \(i\)-th outcome and the fitted value, respectively. The means of all outcomes and all fitted values are represented as:

\[ \bar{Y} = \sum_{i = 1}^n Y_i, \qquad \bar{\hat Y} = \sum_{i = 1}^n \hat Y_i. \]

It can be shown that the equality \(\bar{Y} = \bar{\hat Y}\) holds.

In model 1 on \(\texttt{fev}\) dataset we can calculate these sums of squares as shown in the code below.

y <- fev$FEV # outcome
e <- residuals(model1) # residuals
y.hat <- fitted(model1) # fitted values

tss <- sum((y-mean(y))^2) # total sum of squares
ssr <- sum((y.hat-mean(y.hat))^2) # sum of squares due to regression
ssr <- sum((y.hat-mean(y))^2) # sum of squares due to regression
sse <- sum(e^2) # sum of squared errors

TSS	SSR	SSE
490.9198	372.479	118.4409

Using these sums of squares, we can compute \(R^2\) as follows:

\[ R^2 = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS} \]

The formula shows that \(R^2\) takes values within the interval \([0, 1]\). A higher \(R^2\) value suggests a better fit of the model to the data. For example, an \(R^2\) of 0.80 means that 80% of the variability in the dependent variable can be explained by the model.

In model 1 on \(\texttt{fev}\) dataset we can calculate \(R^2\) as shown in the code below.

R2 <- 1-sse/tss # R^2
R2 <- ssr/tss # R^2

\(R^2\)
0.7587368

\(R^2\) is also available in the summary of the model on the prelast line denoted as Multiple R-squared.

summary(model1)

## 
## Call:
## lm(formula = FEV ~ Height + Sex, data = fev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6763 -0.2505  0.0001  0.2347  2.0722 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.390263   0.180082 -29.932  < 2e-16 ***
## Height       0.051272   0.001167  43.933  < 2e-16 ***
## SexMale      0.125123   0.033801   3.702 0.000232 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4265 on 651 degrees of freedom
## Multiple R-squared:  0.7587, Adjusted R-squared:  0.758 
## F-statistic:  1024 on 2 and 651 DF,  p-value: < 2.2e-16

\(R^2\) can be also extracted from the summary using the following code.

summary(model1)$r.squared

## [1] 0.7587368

A disadventage of \(R^2\) is that it does not account for the number of predictors in the model. A model with more predictors will certainly have a higher \(R^2\), even if those predictors do not contribute meaningfully to the model. Because of that, we don’t use \(R^2\) to compare nested models.

For that purpose can be used adjusted \(R^2\), which is defined as follows:

\[ R^2_{adj} = 1 - \frac{\frac{SSE}{n - p}}{\frac{TSS}{n - 1}} \]

This variant adjusts for the number of predictors in the model, providing a more accurate measure when comparing models with different numbers of predictors. The formula shows that models with more predictors are penalized for its complexity. \(R^2_{adj}\) decreases if the additional predictors do not significantly improve the model. This helps to prevent overfitting.

In model 1 on \(\texttt{fev}\) dataset we can calculate \(R^2_{adj}\) as shown in the code below.

R2adj <- 1-sse/(nrow(fev)-summary(model1)$df[1])/tss*(nrow(fev)-1) # R^2_{adj}

\(R^2_{adj}\)
0.7579956

\(R^2_{adj}\) is also available in the summary of the model on the prelast line denoted as Adjusted R-squared.

summary(model1)

## 
## Call:
## lm(formula = FEV ~ Height + Sex, data = fev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6763 -0.2505  0.0001  0.2347  2.0722 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.390263   0.180082 -29.932  < 2e-16 ***
## Height       0.051272   0.001167  43.933  < 2e-16 ***
## SexMale      0.125123   0.033801   3.702 0.000232 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4265 on 651 degrees of freedom
## Multiple R-squared:  0.7587, Adjusted R-squared:  0.758 
## F-statistic:  1024 on 2 and 651 DF,  p-value: < 2.2e-16

\(R^2_{adj}\) can be also extracted from the summary using the following code.

summary(model1)$adj.r.squared

## [1] 0.7579956

The following table summarizes \(R^2\) and \(R^2_{adj}\) for Model 1 and Model 2 on \(\texttt{fev}\) data. We can see that \(R^2\) for Model 2 is higher, as it will always be for nested models with more predictors. However, \(R^2_{adj}\) is lower for Model 2, which indicates that the predictors added to Model 2 do not significantly improve the goodness of fit. We can conclude that Model 1 has better performance than Model 2.

	\(R^2\)	\(R^2_{adj}\)
Model 1	0.7587368	0.7579956
Model 2	0.7588628	0.7577499

1.2 Information criteria

We might also use information criteria to compare quality of model fit. Information criteria help assess how well a model explains the data, balancing goodness of fit with model complexity. They are used to attribute scores to different regression models.

Two of the most commonly used criteria are the Akaike Information Criterion (\(AIC\)) and the Bayesian Information Criterion (\(BIC\)) defined as

\[ AIC = - 2\ln (\hat{L}) + 2p, \\ BIC = - 2\ln (\hat{L}) + p\ln n, \]

where \(\hat{L}\) is the likelihood of the model, \(p\) is the number of coefficients \(\beta_j\) in the model and \(n\) is the size of dataset.

When comparing multiple models, those with lower \(AIC\) or \(BIC\) scores are generally considered better fits to the data. These criteria help prevent overfitting, as they reward models that are simpler. While both \(AIC\) and \(BIC\) are used for model selection, \(BIC\) applies stricter penalties for adding more parameters. This makes \(BIC\) more suitable when you want to choose simpler models.

\(AIC\) and \(BIC\) can be computed as showed in the following code.

AIC_mod1 <- AIC(model1)
BIC_mod1 <- BIC(model1)

\(AIC\)	\(BIC\)
746.4861	764.4185

The following table summarizes \(AIC\) and \(BIC\) for Model 1 and Model 2 on \(\texttt{fev}\) data. We can see that both criteria are in favor of the Model 1.

	\(AIC\)	\(BIC\)
Model 1	746.4861	764.4185
Model 2	748.1445	770.5600

2 Practical example

In this section, we illustrate the model selection procedure from theoretical framework of this seminar on \(\texttt{fev}\) data set. The forced expiratory volume, FEV, is a measure of how much air a person can exhale (in liter) during a forced breath. In this dataset, the FEV of 654 children, between the ages of 3 and 19, were measured. The dataset also provides additional information about these children: their Age, their Height, their gender (variable Sex) and whether the child is a smoker or a non-smoker (variable Smoker).

Our goal is to compare different models that describe the relationship between FEV and various explanatory variables (including their interactions), with a focus on the quality of the model fit.

Examine the various models listed below. What can be concluded about their goodness of fit and which model best captures the dependency of FEV on covariates? Additionally, which criterion (or criteria) appears to be the most suitable for comparing the models in this context, and why?

Model 0: \[\texttt{FEV}_i = \beta_0 + \varepsilon_i;\]

Model 1: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Sex}_i + \beta_2\times\texttt{Height}_i +\varepsilon_i;\]

Model 2: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Sex}_i + \beta_2\times\texttt{Height}_i +\beta_3\times\texttt{Age}_i + \varepsilon_i;\]

Model 3: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Sex}_i + \beta_2\times\texttt{Height}_i +\\ \beta_3\times\texttt{Age}_i +\beta_4\times\texttt{Smoker}_i + \varepsilon_i;\]

Model 4: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Sex}_i + \beta_3\times(\texttt{Sex}\times\texttt{Height})_i +\\ \beta_4\times\texttt{Age}_i +\beta_5\times\texttt{Smoker}_i + \varepsilon_i;\]

Model 5: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Height}^2_i + \varepsilon_i;\]

Model 6: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Height}^2_i +\\ \beta_3\times\texttt{Sex}_i + \varepsilon_i;\]

Model 7: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Height}^2_i +\\ \beta_3\times\texttt{Sex}_i + \beta_4\times\texttt{Age}_i +\beta_5\times\texttt{Smoker}_i + \varepsilon_i;\]

Model 8: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Height}^2_i + \beta_3\times\texttt{Sex}_i +\\ \beta_4\times(\texttt{Sex}\times\texttt{Height})_i + \beta_5\times(\texttt{Sex}\times\texttt{Height}^2)_i + \varepsilon_i;\]

Model 9: \[\texttt{FEV}_i = \beta_0 + \beta_1\times\texttt{Height}_i + \beta_2\times\texttt{Height}^2_i + \beta_3\times\texttt{Sex}_i +\\ \beta_4\times(\texttt{Sex}\times\texttt{Height})_i + \beta_5\times(\texttt{Sex}\times\texttt{Height}^2)_i + \\\beta_6\times\texttt{Smoker}_i + \varepsilon_i.\]

Fit the models using lm() function in R, specifying appropriate formula and data.

Reminder: we use : or * to define interaction between two or more explanatory variables. To define a transformation of a variable in the model formula, we use I() function.

mod0 <- lm(FEV ~ 1, data = fev)
mod1 <- lm(...)
...
mod9 <- lm(...)

To compare the fitted models, we use criteria introduced in the theoretical part. As already stated, \(R^2\) and \(R^2_{adj}\) can be find in the summary() of the model, functions AIC() and BIC() are helpful for calculating criteria based on the log-likelihood.

summary(mod0)$r.squared
summary(mod0)$adj.r.squared
AIC(mod0)
BIC(mod0)

	\(R^2\)	\(R^2_{adj}\)	\(AIC\)	\(BIC\)
Model 0	0.0000000	0.0000000	1672.3871	1681.3533
Model 1	0.7587368	0.7579956	746.4861	764.4185
Model 2	0.7746110	0.7735707	703.9746	726.3902
Model 3	0.7753614	0.7739769	703.7935	730.6922
Model 4	0.7849674	0.7833082	677.2115	708.5932
Model 5	0.7740993	0.7734053	703.4577	721.3901
Model 6	0.7754721	0.7744358	701.4711	723.8867
Model 7	0.7939639	0.7923741	649.2608	680.6426
Model 8	0.7831418	0.7814685	682.7406	714.1223
Model 9	0.7832992	0.7812896	684.2657	720.1306

Which model best captures the dependency of FEV on covariates?
How do you interpret the coefficients \(\beta_j\) of the selected model?
Can you find a model which better describes the relationship between FEV and other variables?

3 Research on \(CO_2\) emissions

Work with the \(CO_2\) emission by vehicle data set, which can be downloaded from interactive outline or directly form Kaggle.

The dataset captures the details of how \(CO_2\) emissions by a vehicle can vary with the different features. This dataset has been taken from Canada Government official open data website. This contains data over a period of 7 years. There are total 7385 rows and 12 columns associated with these features:

Make - Company of the vehicle,
Model - Car Model,
Class - Class of vehicle depending on their utility, capacity and weight,
Engine_size - Size of engine used in Litre,
Cylinders - Number of cylinders,
Transmission - Transmission type with number of gears,
Fuel_type - Type of Fuel used,
Consumption_city - Fuel consumption in city roads (L/100 km),
Consumption_hwy - Fuel consumption in highways (L/100 km),
Consumption_comb - The combined fuel consumption (55% city, 45% highway) is shown in L/100 km,
Consumption_mpg - The combined fuel consumption in both city and highway is shown in mile per gallon(mpg),
CO2 - The tailpipe emissions of carbon dioxide (in grams per kilometre) for combined city and highway driving.

There are few abbreviations that has been used to describe the features. We are listing them out here:

Model

4WD/4X4 = Four-wheel drive
AWD = All-wheel drive
FFV = Flexible-fuel vehicle
SWB = Short wheelbase
LWB = Long wheelbase
EWB = Extended wheelbase

Transmission

A = Automatic
AM = Automated manual
AS = Automatic with select shift
AV = Continuously variable
M = Manual
3 - 10 = Number of gears

Fuel type

X = Regular gasoline
Z = Premium gasoline
D = Diesel
E = Ethanol (E85)
N = Natural gas

Fuel Consumption

City and highway fuel consumption ratings are shown in litres per 100 kilometres (L/100 km) - the combined rating (55% city, 45% hwy) is shown in L/100 km and in miles per gallon (mpg).

CO2 Emissions

The tailpipe emissions of carbon dioxide (in grams per kilometre) for combined city and highway driving. (for details see Kaggle)

3.1 Tasks

As an environmental scientist or sustainability consultant, you are tasked with analyzing this dataset and addressing the following inquiries:

What are the most influential features that significantly impact \(CO_2\) emissions?
Will there be any differences in predicted \(CO_2\) emissions when fuel consumption for city and highway driving is considered separately, as opposed to when their weighted variable interaction is taken into account?

For this analysis, it is imperative to fit linear models incorporating various predictors, compare their goodness of fit, and determine which model is the most appropriate for modelling the relationship between the \(CO_2\) emissions and other characteristics of the vehicle.

3.2 Data preprocessing

As a researcher working with raw data, an important part of your job is to preprocess the data. This includes, for instance, identifying and dealing with missing values. By addressing these issues, you can ensure that your data is clean and reliable, which will help improve the quality of your research results.

First, load the data set, rename the variables and change appropriate ones to type factor. Also delete the Cunsumption_mpg variable (you are working with L units). Check, whether there are any missing values. How can we handle missing data?

data <- read.csv(...)

colnames(data) <- c('Make', 'Model', 'Class', 'Engine_size',
                    'Cylinders', 'Transmission', 'Fuel_type',
                    'Consumption_city', 'Consumption_hwy',
                    'Consumption_comb', 'Cunsumption_mpg',
                    'CO2')
data <- data |> 
  select(!(Cunsumption_mpg)) |>
  mutate(
    ... = factor(...),
    ...
  )

Consider the variables Make, Model and Class. Are these variables suitable for modelling using linear regression?

One potential approach is to cluster the car makes into similar groups, artificial intelligence can be helpful here.

HINT: Our prompt for ChatGPT:

Hi, I have a list of car makes, can you cluster them so 
the clusters contain the most similar car makes 
(for example cluster named luxury car contains aston martin,
bentley, rolls-royce). Cluster all the listed car makes below
and do not add other make which is not listed below. 
Clusters will be disjoint and their union will equal to the whole
list below. Write the car makes in one cluster in capital (SUBARU),
wrap up each car make in quotation marks ("SUBARU") 
and separate them by a comma (,).
Here is the list of car makes: 
ACURA, ALFA ROMEO, ASTON MARTIN , AUDI, BENTLEY, BMW, BUICK,
CADILLAC, CHEVROLET, CHRYSLER, DODGE, FIAT, FORD, GMC, HONDA, 
HYUNDAI, INFINITI, JAGUAR, JEEP, KIA, LAMBORGHINI, LAND ROVER,
LEXUS, LINCOLN, MASERATI, MAZDA, MERCEDES-BENZ, MINI, MITSUBISHI,
NISSAN, PORSCHE, RAM, ROLLS-ROYCE, SCION, SMART, SRT, SUBARU, 
TOYOTA, VOLKSWAGEN, VOLVO, GENESIS, BUGATTI.

According to the answer, cluster the car makes into similar groups, defining new variable Type. It is possible that certain makes do not fit within any of the defined clusters. If appropriate, either remove these makes from the analysis or assign them to the cluster that is most similar, based on your judgment.

Note: When working with outputs from artificial intelligence (AI) systems, it’s important to be careful. You should always check to make sure that the output is correct before using it. Since AI can produce results influenced by things like data quality and how the algorithms are designed, it’s crucial to review the information carefully. By taking the time to verify AI-generated outputs, you can avoid mistakes and improve the reliability of your work. For instance, you might consider asking a friend who is passionate about automobiles to review the final clustering results.

Delete unuseful variables. Functions filter(), mutate(), select() and case_when() can be helpful.

data <- data |> filter(Make != ...) |>
  mutate(Type = case_when(
    Make %in% c("...", ...) ~ "...",
    ...
  )
  ) |>
  select(!c(...))

Investigate the Fuel_type variable. Use table() function. If needed, delete appropriate rows.

data <- data |>
  filter(...) |>
  mutate(... = factor(....))

3.3 Explanatory data analysis

As a researcher, your subsequent step in data analysis is extracting insights regarding the relationships between variables through the use of visualizations. Especially, focus on the following areas:

Identification and handling outliers.
Visualization of significant relationships among the variables.
Examining the correlations between the variables.

The visualization provided below may serve as a source of inspiration. It is also recommended to employ two-dimensional exploratory data analysis (especially including outcome variable) to further enhance your insights.

The correlation structure of the data can be visualized using corrplot.mixed() function from corrplot package.

It is helpful to make graphs interactive. You can use AI to help you convert your ggplot graph into interactive version.

Example of our prompt:

Hi, how can I make ggplot in R interactive?

3.4 Model comparison

Your colleague has developed the model presented in the code below and claims that it represents the best possible fit. Your task is to assess the validity of this statement.

model <- lm(CO2 ~ Engine_size + Cylinders + 
              Fuel_type + Consumption_comb + Type, 
            data = data)

\(R^2\)	\(R^2_{adj}\)	\(AIC\)	\(BIC\)
0.9914716	0.99146	45871.45	45954.33

Fit a range of models to the \(CO_2\) emissions data and evaluate their performance based on the criteria outlined in the theoretical framework of this seminar. Can you identify a model that outperforms the one provided by your fellow researcher? Keep in mind that adding more covariates does not necessarily lead to improved model performance.

Using the models you have explored throughout your analysis, address the primary questions of your main task.

What are the most influential features that significantly impact \(CO_2\) emissions?
Will there be any differences in predicted \(CO_2\) emissions when fuel consumption for city and highway driving is considered separately, as opposed to when their weighted variable interaction is taken into account?

Linear models I

Quality of model fit

Week 8