Supervised learning: For each observation of the predictor measurement(s) \(x_i\), \(i = 1, 2, \dots, n\), there is an associated response measurement \(y_i\). We want to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).
Unsupervised learning: For each observation \(i = 1, 2, \dots, n\), we observe a vector of measurements \(x_i\) but no associated response \(y_i\). Because there is no response variable to predict, we cannot fit a model such as linear regression. The setting is called unsupervised because there is no response variable to supervise the analysis.
Regression models are typically used for problems with a quantitative response. Quantitative variables take on numerical values, such as a person’s income, age, or height.
Classification models are used for problems with a qualitative response. Qualitative variables take on values in one of K different classes, or categories; for example, the brand of product purchased is a qualitative variable.
Two commonly used metrics for regression ML problems: Mean squared error, R-squared
Two commonly used metrics for classification ML problems: Accuracy, Error rate
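As a minimal sketch of how these four metrics are computed, the snippet below uses base R only. The response values, predictions, and class labels are toy numbers invented for illustration; they do not come from any data set in this assignment.

```r
# Toy values, invented purely for illustration.
y     <- c(3.0, 5.0, 7.5, 9.0)   # observed quantitative responses
y_hat <- c(2.5, 5.5, 7.0, 9.5)   # hypothetical model predictions

# Regression metrics
mse       <- mean((y - y_hat)^2)                              # mean squared error
r_squared <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)    # 1 - RSS / TSS

truth     <- c("a", "b", "a", "a", "b")   # observed classes
predicted <- c("a", "b", "b", "a", "b")   # hypothetical predicted classes

# Classification metrics
accuracy   <- mean(predicted == truth)    # share of correct predictions
error_rate <- 1 - accuracy                # share of incorrect predictions

c(mse = mse, r_squared = r_squared, accuracy = accuracy, error_rate = error_rate)
```

In a tidymodels workflow the yardstick package offers equivalent functions, but the hand-written versions make the definitions explicit.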
Descriptive models: The model is chosen to best visually emphasize a trend in the data, for example by drawing a line on a scatter plot.
Inferential models: The aim is to test theories and to state the relationship between the outcome and the predictor(s). They may involve causal claims.
Predictive models: The aim is to predict Y with minimum reducible error; they are not focused on hypothesis tests.
Mechanistic models assume a parametric form for f (e.g. \(\beta_0+\beta_1x_1+\cdots\)). The assumed form will not exactly match the true unknown f. Adding parameters makes the model more flexible, but too many parameters can cause overfitting.
Empirically-driven models make few or no assumptions about f and require a larger number of observations. They are much more flexible by default and can also overfit.
Difference: A mechanistic model makes assumptions about the underlying process, while an empirically-driven model makes almost none and relies on the patterns observed in the data.
Similarity: Both can describe the relationship between variables, and both carry a risk of overfitting.
Mechanistic models are easier to understand because they are mostly based on theory when making predictions, while an empirically-driven model requires us to observe patterns in the data before a theory can be developed.
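The point about adding parameters can be sketched with a parametric model of increasing polynomial degree. This example is an illustration only: it uses the built-in mtcars data set (not the data from this assignment), and the helper name train_mse is hypothetical.

```r
# mtcars ships with base R; it stands in here for any regression data set.
# train_mse() is a hypothetical helper: it fits a polynomial of the given
# degree in wt and returns the mean squared error on the training data.
train_mse <- function(degree) {
  fit <- lm(mpg ~ poly(wt, degree), data = mtcars)
  mean(residuals(fit)^2)
}

degrees       <- 1:6
mse_by_degree <- sapply(degrees, train_mse)

data.frame(degree = degrees, train_mse = mse_by_degree)
```

Because each polynomial contains the lower-degree ones as special cases, the training MSE cannot increase as the degree grows. That is why a low training error alone says nothing about overfitting; error on held-out data is needed to judge it.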
Predictive. We are trying to predict how likely a voter is to vote in favor of the candidate given that voter’s data. The goal is to build a model that predicts this probability, not to test a theory.
Inferential. We are making an inference about how that likelihood changes when the voter has personal contact with the candidate. Here we are seeking a (possibly causal) claim about the relationship between the outcome and a predictor, rather than just making predictions.
Load the packages needed. The mpg data set ships with ggplot2, so it is available once the tidyverse is attached and needs no separate data() call.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(ggthemes)
ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "grey", color = "black") +
  labs(title = "Highway Miles per Gallon Histogram", x = "Highway MPG", y = "Frequency")
Answer: The distribution is bimodal, with two main clusters around 16 mpg and 26 mpg; each cluster looks roughly normal around its center. A few cars also have highway mpg around 40 or higher.
ggplot(data = mpg, aes(hwy, cty)) +
  geom_point() +
  labs(title = "Scatterplot of Highway MPG vs City MPG", x = "Highway MPG", y = "City MPG")
Answer: The points show a positive linear relationship: as highway MPG goes up, city MPG goes up too. This makes sense in reality, because a car with higher highway MPG usually also has higher city MPG. Also, for each individual observation, highway MPG is larger than city MPG. City driving involves more frequent speed changes than highway driving, so it uses more gasoline per mile, which means a lower MPG.
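The two observations read off the scatter plot can also be checked numerically. The following is a small sketch, assuming ggplot2 is installed so that its mpg data set is available; the variable names are chosen for illustration.

```r
# ggplot2::mpg is the data set used in the scatter plot above.
cars <- ggplot2::mpg

# Pearson correlation between highway and city MPG; a value close to 1
# supports the positive linear relationship seen in the plot.
r <- cor(cars$hwy, cars$cty)

# Share of cars whose highway MPG exceeds their city MPG.
share_hwy_higher <- mean(cars$hwy > cars$cty)

c(correlation = r, share_hwy_higher = share_hwy_higher)
```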
# Count the frequency of cars by manufacturer and rearrange the data in order
manufacturer_freq <- mpg %>%
  count(manufacturer) %>%
  arrange(desc(n))
mpg$manufacturer <- factor(mpg$manufacturer, levels = manufacturer_freq$manufacturer)
# Create the bar plot of manufacturer with bars ordered by height
ggplot(data = mpg, aes(x = manufacturer)) +
  geom_bar() +
  labs(title = "Number of Cars Produced by Each Manufacturer", x = "Manufacturer", y = "Number of Cars") +
  coord_flip()
Answer: Dodge has the most cars in the data set, and Lincoln has the fewest.
ggplot(data = mpg, aes(x = factor(cyl), y = hwy, group = cyl)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.7) +
  labs(title = "Highway MPG by Number of Cylinders", x = "Number of Cylinders", y = "Highway MPG")
Answer: From the box plot we notice that cars with more cylinders usually have lower highway MPG. For example, cars with 4 cylinders have a median highway MPG around 28, while cars with 8 cylinders have a median below 20, with a few around 25. Thus, we see a negative relationship between the number of cylinders and highway MPG. We also notice that there are not many cars with 5 cylinders.
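The medians and group sizes read off the box plot can be confirmed with a grouped summary. This is a minimal sketch, assuming dplyr and ggplot2 are installed; the name hwy_by_cyl is chosen for illustration.

```r
library(dplyr)

# ggplot2::mpg is the data set used in the box plot above.
hwy_by_cyl <- ggplot2::mpg %>%
  group_by(cyl) %>%
  summarise(
    n_cars     = n(),          # number of cars with this cylinder count
    median_hwy = median(hwy)   # median highway MPG within the group
  )

hwy_by_cyl
```

The n_cars column shows how few 5-cylinder cars there are, and median_hwy gives the exact values behind the approximate medians quoted in the answer.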