Question 1

Define supervised and unsupervised learning. What are the difference(s) between them?

Supervised learning: For each observation of the predictor measurement(s) \(x_i\), \(i = 1, 2, \dots, n\), there is an associated response measurement \(y_i\). We want to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).

Unsupervised learning: For each observation \(i = 1, 2, \dots, n\), we observe a vector of measurements \(x_i\) but no associated response \(y_i\). Because there is no response variable to predict, we cannot fit a model such as linear regression. The setting is called unsupervised because there is no response variable to supervise the analysis.
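
As a rough illustration of the two settings, the sketch below uses the `mpg` data from the coding part. The choice of `displ` as predictor, of k-means as the unsupervised method, and of 3 clusters are all assumptions made for this example, not part of the question.

```r
library(ggplot2)  # provides the mpg data

# Supervised: hwy is the response, displ is the predictor
fit <- lm(hwy ~ displ, data = mpg)

# Unsupervised: no response; group cars using the measurements alone
set.seed(1)                                   # k-means starts from random centers
x <- scale(mpg[, c("displ", "cty", "hwy")])   # standardize the measurements
clusters <- kmeans(x, centers = 3)            # 3 clusters is an arbitrary choice

coef(fit)                 # fitted intercept and slope
table(clusters$cluster)   # number of cars assigned to each cluster
```

In the first fit there is a `hwy` value to check predictions against; in the second there is nothing to check the cluster labels against.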

Question 2

Explain the difference between a regression model and a classification model, specifically in the context of machine learning.

Regression models are used for problems with a quantitative response. Quantitative variables take on numerical values, such as a person’s income, age, or height.

Classification models are used for problems with a qualitative response. Qualitative variables take on values in one of K different classes, or categories. For example, the brand of product purchased is a qualitative variable.

Question 3

Name two commonly used metrics for regression ML problems. Name two commonly used metrics for classification ML problems.

Two commonly used metrics for regression ML problems: Mean squared error, R-squared

Two commonly used metrics for classification ML problems: Accuracy, Error rate
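
A minimal sketch of how these four metrics can be computed in base R. The vectors `y`, `y_hat`, `truth`, and `pred` are made-up toy values for illustration only, not data from this assignment.

```r
# Hypothetical toy data, for illustration only
y     <- c(3, 5, 7, 9)           # observed quantitative response
y_hat <- c(2.5, 5.5, 6.5, 9.5)   # model predictions

mse <- mean((y - y_hat)^2)                             # mean squared error
r2  <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)   # R-squared

truth <- c("a", "b", "a", "b", "a")   # observed classes
pred  <- c("a", "b", "b", "b", "a")   # predicted classes

accuracy   <- mean(pred == truth)   # share of correct predictions
error_rate <- mean(pred != truth)   # share of incorrect predictions
```

Accuracy and error rate always sum to 1, so they carry the same information from opposite directions.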

Question 4

As discussed, statistical models can be used for different purposes. These purposes can generally be classified into the following three categories. Provide a brief description of each.

Descriptive models: The aim is to choose a model that best visually emphasizes a trend in the data, for example by drawing a line on a scatter plot.

Inferential models: The aim is to test theories and to state the relationship between the outcome and the predictor(s), possibly including causal claims.

Predictive models: The aim is to predict Y with minimum reducible error; they are not focused on hypothesis tests.

Question 5

Predictive models are frequently used in machine learning, and they can usually be described as either mechanistic or empirically-driven. Answer the following questions.

Define mechanistic. Define empirically-driven. How do these model types differ? How are they similar?

Mechanistic models assume a parametric form for f (e.g. \(\beta_0+\beta_1x_1+\cdots\)). The assumed form will not exactly match the true unknown f. We can add parameters to make the model more flexible, but too many parameters can cause overfitting.

Empirically-driven models make few or no assumptions about f and require a larger number of observations. They are much more flexible by default, and they can also overfit.

Difference: A mechanistic model makes an assumption about the underlying process, while an empirically-driven model makes almost no assumptions and relies on the patterns observed in the data.

Similarity: Both can describe the relationship between variables, and both carry a risk of overfitting.
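
The contrast can be sketched with the `mpg` data from the coding part. Here a straight-line `lm()` fit stands in for a mechanistic model and a `loess()` smoother stands in for an empirically-driven one; both stand-ins, the predictor `displ`, and `span = 0.5` are assumptions chosen for illustration.

```r
library(ggplot2)  # provides the mpg data

# Mechanistic-style fit: assumes hwy is a straight-line function of displ
fit_lm <- lm(hwy ~ displ, data = mpg)

# Empirically-driven-style fit: local regression with no global functional form
fit_loess <- loess(hwy ~ displ, data = mpg, span = 0.5)

# Root mean squared error on the training data
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

coef(fit_lm)                                    # two interpretable parameters
c(lm = rmse(fit_lm), loess = rmse(fit_loess))   # training error only
```

Training error alone says nothing about overfitting; comparing the two fits fairly would need held-out data.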

In general, is a mechanistic or empirically-driven model easier to understand? Explain your choice.

Mechanistic models are generally easier to understand because they are based on a theory about how the predictors relate to the outcome, whereas an empirically-driven model requires us to observe patterns in the data before we can develop a theory.

Question 6

A political candidate’s campaign has collected some detailed voter history data from their constituents. The campaign is interested in two questions. Classify each question as either predictive or inferential. Explain your reasoning for each:

Given a voter’s profile/data, how likely is it that they will vote in favor of the candidate?

Predictive. We are trying to predict the likelihood that a voter will vote in favor of the candidate, given that voter’s data. In this case, we want to build a model that predicts this probability rather than test a theory.

How would a voter’s likelihood of support for the candidate change if they had personal contact with the candidate?

Inferential. We are making an inference about how the likelihood of support changes under the effect of personal contact with the candidate. In this case we are seeking a causal claim rather than just making predictions: we want to state the relationship between the outcome and a predictor.

Coding Part

Load the packages and data needed

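# mpg ships with ggplot2, so it is available once the tidyverse is attached;
# calling data(mpg) before library(tidyverse) fails with "data set 'mpg' not found".
library(tidyverse)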
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(ggthemes)

Exercise 1

We are interested in highway miles per gallon, or the hwy variable. Create a histogram of this variable. Describe what you see/learn.

ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "grey", color = "black") +
  labs(title = "Highway Miles per Gallon Histogram", x = "Highway MPG", y = "Frequency")

Answer: The distribution is bimodal, with two main clusters around 16 mpg and 26 mpg; each cluster looks roughly normal around its center. A few cars also have highway MPG of around 40 or more.

Exercise 2

Create a scatterplot. Put hwy on the x-axis and cty on the y-axis. Describe what you notice. Is there a relationship between hwy and cty? What does this mean?

ggplot(data = mpg, aes(hwy, cty)) + geom_point() +
  labs(title = "Scatterplot of Highway MPG vs City MPG", x = "Highway MPG", y = "City MPG")

Answer: The scatterplot shows a positive linear relationship: as highway MPG goes up, city MPG goes up too. This makes sense in practice, because a car with higher highway MPG usually has higher city MPG as well, and vice versa. Also, for a given car, highway MPG is larger than city MPG. City driving involves more frequent speed changes than highway driving, so it uses more gasoline per mile, which means a lower MPG.
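
The visual impression can be checked numerically. This is a small sketch, not part of the exercise; the names `cor_hwy_cty` and `gap` are introduced here for illustration.

```r
library(ggplot2)  # provides the mpg data

cor_hwy_cty <- cor(mpg$hwy, mpg$cty)   # Pearson correlation between the two
gap <- mpg$hwy - mpg$cty               # highway minus city MPG for each car

cor_hwy_cty    # strength of the linear relationship
summary(gap)   # a positive minimum would mean hwy exceeds cty for every car
```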

Exercise 3

Make a bar plot of manufacturer. Flip it so that the manufacturers are on the y-axis. Order the bars by height. Which manufacturer produced the most cars? Which produced the least?

# Count the frequency of cars by manufacturer and rearrange the data in order
manufacturer_freq <- mpg %>% 
  count(manufacturer) %>%
  arrange(desc(n))
mpg$manufacturer <- factor(mpg$manufacturer, levels = manufacturer_freq$manufacturer)

# Create the bar plot of manufacturer with bars ordered by height
ggplot(data = mpg, aes(x = manufacturer)) + geom_bar() +
  labs(title = "Number of Cars Produced by Each Manufacturer", x = "Manufacturer", y = "Number of Cars") +
  coord_flip()

Answer: Dodge produced the most cars in this dataset, and Lincoln produced the fewest.
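
The answer can be read directly from the count table. The sketch below also shows one alternative way to order the bars, using `fct_infreq()` and `fct_rev()` from forcats so that `mpg` itself is not modified; this is an optional variation, not the approach used above.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Cars per manufacturer, most frequent first
manufacturer_counts <- mpg %>% count(manufacturer, sort = TRUE)
manufacturer_counts

# Order the factor inside aes(); fct_rev() puts the most frequent at the top
p <- ggplot(mpg, aes(y = fct_rev(fct_infreq(manufacturer)))) +
  geom_bar() +
  labs(x = "Number of Cars", y = "Manufacturer")
p
```

Mapping the variable to `y` draws horizontal bars directly, so `coord_flip()` is not needed in this version.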

Exercise 4

Make a box plot of hwy, grouped by cyl. Use geom_jitter() and the alpha argument to add points to the plot. Describe what you see. Is there a relationship between hwy and cyl? What do you notice?

ggplot(data = mpg, aes(x = factor(cyl), y = hwy, group = cyl)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.7) +
  labs(title = "Highway MPG by Number of Cylinders", x = "Number of Cylinders", y = "Highway MPG") 

Answer: The box plot shows that cars with more cylinders tend to have lower highway MPG. For example, 4-cylinder cars have a median highway MPG of around 28, while 8-cylinder cars have a median below 20, with only a few around 25. This suggests a negative relationship between hwy and cyl. We also notice that there are very few cars with 5 cylinders.
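
The medians and group sizes read off the plot can be confirmed with a short summary. This is an illustrative sketch; `hwy_by_cyl` is a name introduced here.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Number of cars and median highway MPG for each cylinder count
hwy_by_cyl <- mpg %>%
  group_by(cyl) %>%
  summarise(n = n(), median_hwy = median(hwy), .groups = "drop")

hwy_by_cyl
```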

Exercise 5

Use the corrplot package to make a lower triangle correlation matrix of the mpg dataset. (Hint: You can find information on the package here.) Which variables are positively or negatively correlated with which others? Do these relationships make sense to you? Are there any that surprise you?

mpg %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type = "lower")

Answer: First, engine displacement is positively correlated with the number of cylinders. This makes sense intuitively, since more cylinders mean more displacement. Second, city MPG is positively correlated with highway MPG, because fuel-efficient cars use less fuel on both city roads and highways. Third, engine displacement is negatively correlated with city MPG and highway MPG. Cars with larger engine displacement need more fuel per mile than smaller ones, which results in lower city and highway MPG. Finally, the number of cylinders is also negatively correlated with city MPG and highway MPG. This is expected, since the number of cylinders and engine displacement are positively correlated and displacement is negatively correlated with MPG, as explained above.
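
To read the exact values behind the plot, the correlation matrix can be printed directly. This is a small sketch; `cor_mat` is a name introduced here. Note that `year` is also numeric, so it appears in the matrix alongside the variables discussed above.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Correlation matrix of the numeric columns: displ, year, cyl, cty, hwy
cor_mat <- mpg %>%
  select(where(is.numeric)) %>%
  cor()

round(cor_mat, 2)
```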