Part 1:

1. Identify four real-life applications of supervised and unsupervised problems:

Two supervised examples: online retailers such as Amazon recommend products by predicting what a shopper is likely to buy from past purchases, and navigation apps such as Google Maps use historical traffic data to predict travel conditions and choose routes. Two unsupervised examples: streaming services such as Netflix cluster users with similar viewing habits to personalize movie and TV show suggestions, and marketing teams segment customers into groups for targeted campaigns.

2. What benefits does machine learning bring to these problems/activities? How does machine learning improve your experience with these activities or how would it improve the organization’s capabilities?

3. Explain what makes these problems supervised versus unsupervised.

Supervised learning relies on labeled data with input-output pairs to predict the output variable based on input features, while unsupervised learning identifies patterns or structures within unlabeled data without direct guidance.
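To make the distinction concrete, here is a minimal sketch in base R. It uses the built-in mtcars data purely as a stand-in (an assumption for illustration, not one of the applications above). The supervised model is handed a labeled outcome (mpg), while the clustering step sees only feature columns.

```r
# Supervised: the outcome (mpg) is known for every row, so the model
# learns a mapping from the features (wt, hp) to that labeled outcome.
supervised_fit <- lm(mpg ~ wt + hp, data = mtcars)
predicted_mpg <- predict(supervised_fit, newdata = mtcars)

# Unsupervised: no outcome column is supplied; k-means only groups
# rows that look similar on the (scaled) features.
set.seed(123)
features <- scale(mtcars[, c("wt", "hp")])
clusters <- kmeans(features, centers = 3, nstart = 10)

head(predicted_mpg)       # one numeric prediction per car
table(clusters$cluster)   # number of cars assigned to each group
```

The choice of three clusters is arbitrary here; in an unsupervised problem there is no labeled answer to check the grouping against.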

4. For each problem identify the target variable (if applicable) and potential feature variables that could be used. How do you think this data gets collected?

- In online shopping, the target variable could be purchase behavior, with feature variables including past purchase history, browsing behavior, and demographic information. Data is collected through user interactions on the platform (purchases, clicks, and likes) and through demographic surveys.

5. For each of these applications could you foresee any ethical concerns in using machine learning? Could machine learning (or maybe the data collection process) be misused in any way?

Ethical concerns may include privacy issues, algorithmic biases, and manipulation of user behavior through personalized recommendations. Data collection processes could be misused for unauthorized access to personal information or discriminatory targeting of certain marginalized groups. It’s essential to address these concerns through transparent practices, responsible data usage, and ongoing monitoring of machine learning systems.

Part 2:

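# Install once; these two lines can be commented out after the first run
install.packages("tidymodels")
install.packages("kknn")
# Load the modeling framework and the K-nearest neighbor engine
library(tidymodels)
library(kknn)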

Modeling tasks:

1. Is this a supervised or unsupervised learning problem? Why?

It is a supervised learning problem since it has a target variable and we need to predict a numerical value.

2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?

Response variable (target variable): cmedv

Predictor variables (features): the remaining 15 variables, which are lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, and lstat.

3. Given the type of variable cmedv is, is this a regression or classification problem?

It is a regression problem.

4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?

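# read_csv() comes from readr, which library(tidymodels) does not attach
boston <- readr::read_csv("/Users/jeenapatel/Desktop/BANA 4080/boston.csv")
sum(is.na(boston))       # total number of missing values in the data set
summary(boston$cmedv)    # minimum, maximum, and mean of cmedv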
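sum(is.na(...)) returns a single total for the whole table. If that total were ever non-zero, a per-column count would show where the gaps are. The sketch below uses a small made-up data frame (the values are hypothetical and not taken from boston.csv).

```r
# Toy data frame with two missing values (hypothetical, not from boston.csv)
toy <- data.frame(
  rm    = c(6.5, NA, 5.9, 7.1),
  cmedv = c(24.0, 21.6, NA, 33.4)
)

total_missing  <- sum(is.na(toy))       # one count for the whole table
missing_by_col <- colSums(is.na(toy))   # one count per column

total_missing
missing_by_col

# summary() on a numeric column reports min, max, mean, and the NA count
summary(toy$cmedv)
```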

5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.

set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(boston_split)
test <- testing(boston_split)
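The strata = cmedv argument asks for a split in which both sets cover the range of the outcome. For a numeric variable, rsample bins the values (into quartiles by default) and samples within each bin. The base R sketch below imitates that idea on simulated data; it is a simplified illustration under those assumptions, not the code rsample runs, and the toy data frame and its columns are hypothetical.

```r
set.seed(123)
# Simulated outcome standing in for cmedv (not the real data)
toy <- data.frame(id = 1:200, y = rexp(200, rate = 1 / 22))

# Bin the outcome into quartiles, then sample 70% of the rows in each bin
toy$bin <- cut(toy$y,
               breaks = quantile(toy$y, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)
train_ids <- unlist(lapply(split(toy$id, toy$bin), function(ids) {
  sample(ids, size = round(0.7 * length(ids)))
}))

toy_train <- toy[toy$id %in% train_ids, ]
toy_test  <- toy[!toy$id %in% train_ids, ]

# Row counts of the two sets
c(train = nrow(toy_train), test = nrow(toy_test))
```

Because each quartile contributes the same share of rows, low-priced and high-priced homes end up represented in both sets.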

6. How many observations are in the training set and test set?

boston_split

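There are 352 observations in the training set; the test set holds the remaining rows (154 if the file contains the standard 506 Boston observations).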

7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?

train %>% 
  mutate(id = 'train') %>% 
  bind_rows(test %>% mutate(id = 'test')) %>%
  ggplot(aes(cmedv, color = id)) +
  geom_density()
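A density plot can be backed up with a numeric check. The sketch below lines up a few quantiles of two samples side by side; the vectors are simulated stand-ins for train$cmedv and test$cmedv (an assumption for illustration, not the real data), so the printed numbers say nothing about the actual split.

```r
set.seed(123)
# Simulated stand-ins for train$cmedv and test$cmedv (not the real data)
train_y <- rnorm(350, mean = 22, sd = 9)
test_y  <- rnorm(150, mean = 22, sd = 9)

# Side-by-side quantiles: similar rows suggest similar distributions
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(train = quantile(train_y, probs),
                    test  = quantile(test_y, probs))
round(comparison, 1)
```

With the real data, the same table could be built from train$cmedv and test$cmedv.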

8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?

lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
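For reference, RMSE is the square root of the average squared difference between observed and predicted values: RMSE = sqrt(mean((truth - estimate)^2)). The short sketch below computes it by hand on three made-up pairs (hypothetical values, not from the Boston data) to show what the rmse() call summarizes.

```r
# Hypothetical observed and predicted values (not from the Boston data)
truth <- c(3, 5, 7)
pred  <- c(2, 5, 9)

# RMSE: square the errors, average them, take the square root
# errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones, and the result is in the same units as cmedv.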

9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?

lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?

knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
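The idea behind K-nearest neighbor regression can be sketched in a few lines of base R: find the k training rows closest to the new point and average their outcomes. This is a simplified illustration with one feature, equal weights, and made-up data; knn_predict is a hypothetical helper, and the kknn engine handles many features at once and can weight closer neighbors more heavily.

```r
# Toy training data: one feature (x) and a numeric outcome (y), hypothetical values
train_x <- c(1, 2, 3, 6, 7, 8)
train_y <- c(10, 12, 14, 30, 32, 34)

# Predict for a new point by averaging the outcomes of its k closest neighbors
knn_predict <- function(new_x, k = 3) {
  distances <- abs(train_x - new_x)     # distance to every training row
  nearest   <- order(distances)[1:k]    # indices of the k smallest distances
  mean(train_y[nearest])                # unweighted average of their outcomes
}

knn_predict(2.2)   # neighbors are x = 2, 3, 1, so the prediction is 12
knn_predict(7.1)   # neighbors are x = 7, 8, 6, so the prediction is 32
```

Unlike linear regression, no coefficients are estimated; every prediction is computed directly from the stored training rows.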

R Notebook

4. For each problem identify the target variable (if applicable) and potential feature variables that could be used. How do you think this data gets collected?

- For navigation apps, the target variable may be travel time or traffic conditions, with feature variables including time of day, weather conditions, and historical traffic data. Data is collected from GPS sensors, traffic cameras, crowd-sourced reports, and transportation agencies.

- Customer segmentation for marketing campaigns is unsupervised, so it has no target variable. Its feature variables include purchase history, demographic information, and browsing behavior, collected from customer transactions, surveys, website analytics, and social media interactions.
