1 Last assignment guidelines
For this last assignment you need to pick one of the datasets below and apply what you’ve learned to predict/cluster. It’s up to you to define the question, thoroughly explore the data, do whatever preprocessing is necessary, fit and cross validate models and get average error (error rate or \(MSE\) depending on the question), and determine which model did the best. You must meet the following requirements:
You must clearly define your question along with some general predictions: what is your target, and how and why do you expect each feature to influence it? Exploring the data and googling what some of the features/targets mean will help here.
You must truly explore your data. This means more than a glimpse(): additional exploration via summaries, looking at unique values, and histograms of your target and other features. You need to demonstrate that you understand the data you’re working with and have chosen the right types of models. Some of these datasets have many features, and it might be best to remove some of them or explicitly use a model type that deals well with them.
It’s up to you to figure out the proper type of preprocessing. If there are redundant columns, messy data, outliers, or missing data, you need to find them and fix them. You also have to convert columns to the necessary datatypes for the question.
You must cross validate your models 10x and store your errors from each model and fold number in a data frame. You must then graph these results. The only model that you don’t need to cross validate is a random forest.
You must use at least two model types AND you can’t use a model type if we used it on the dataset originally! You must describe why you’re using these two models for the selected question.
You must describe the output of your models in the context of your question. So if you did a linear regression, you should be talking about which features are important based on SE and p-value, as well as how good the model is based on \(R^2\). This also means explaining how much each feature influences your target in non-technical terms. You should also compare your \(MSE\) from that model with the one from your other model.
Your script must work on my computer. I suggest that after you’re done you wipe your environment (run rm(list = ls())), completely close R, and then rerun it to make sure everything works.
I don’t want extraneous code. Make sure you have just the code you need to answer the questions. You must annotate your logic throughout steps 1-7.
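The cross-validation requirement above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a template for the assignment: it uses the built-in mtcars data as a stand-in for one of the datasets, a linear regression and an rpart regression tree as two placeholder model types, and hypothetical names (fold_id, cv_errors) for the fold assignment and the results data frame.

```r
library(rpart)  # regression trees; rpart ships with standard R installs

set.seed(42)            # make the random fold assignment reproducible
dat <- mtcars           # stand-in data (assumption); the target here is mpg
k <- 10                 # number of folds

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

cv_errors <- data.frame()   # will hold one row per model per fold
for (f in seq_len(k)) {
  train <- dat[fold_id != f, ]   # fit on the other k - 1 folds
  test  <- dat[fold_id == f, ]   # evaluate on the held-out fold

  # Two placeholder model types fit on the same training rows
  fits <- list(
    linear = lm(mpg ~ wt + hp, data = train),
    tree   = rpart(mpg ~ wt + hp, data = train)
  )

  for (m in names(fits)) {
    pred <- predict(fits[[m]], newdata = test)
    cv_errors <- rbind(cv_errors, data.frame(
      model = m,
      fold  = f,
      mse   = mean((test$mpg - pred)^2)   # test MSE for this fold
    ))
  }
}

# Average error per model, then a per-fold comparison plot
avg_errors <- aggregate(mse ~ model, data = cv_errors, FUN = mean)
print(avg_errors)
boxplot(mse ~ model, data = cv_errors, ylab = "Test MSE per fold")
```

For a classification question the same loop would apply, with the MSE line swapped for an error rate such as mean(pred_class != test$target), where pred_class and target are again placeholder names.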
2 Datasets
Below are the datasets you can choose from.
library(tidyverse) # loads readr (read_csv) and dplyr (glimpse)
# Wine quality - regression
wine <- read_csv("https://docs.google.com/spreadsheets/d/1MTgreXSW8rpvbOi7jPSaj1KGPaF_E7sdv5JdzWUru8E/gviz/tq?tqx=out:csv")
# Churn - classification
telco <- read_csv("https://docs.google.com/spreadsheets/d/1DZbq89b7IPXXzi_fjmECATtTm4670N_janeqxYLzQv8/gviz/tq?tqx=out:csv")
# Cancer M = malignant; B = benign - classification
cancer <- read_csv("https://docs.google.com/spreadsheets/d/1bWopVcJ3aWzzvw4Mp8YCX6lffeCOzDEgf5z7cO-G7c0/gviz/tq?tqx=out:csv")
# Bikeshare per-day rides - regression
bikes <- read_csv("https://docs.google.com/spreadsheets/d/1DK8ZSmIgvZ1eVVF33NCNLyLxvYFA8t1jeGrJppaBDpI/gviz/tq?tqx=out:csv")
# Number of shares for Mashable articles - regression
shares <- read_csv("https://docs.google.com/spreadsheets/d/1U3IjxGFh9WWCi155p4pqWmbzqjVNKLDoWnlO77uRpOE/gviz/tq?tqx=out:csv")
# Beer ratings - clustering or regression
beer <- read_csv("https://docs.google.com/spreadsheets/d/1FQvlCVdeGiMttYgBCsUXge3P_KfL8AmI12cvSXxkPuU/gviz/tq?tqx=out:csv")
# Student final grade based on family factors - regression
grades <- read_csv("https://docs.google.com/spreadsheets/d/1DPaEZ2G75lnUXOYjRjPNdZSBvr9O6SbEoYXlRgyKTow/gviz/tq?tqx=out:csv")
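Once a dataset is loaded, a first pass at the exploration and preprocessing steps might look something like the sketch below. It is a minimal example that assumes base R only and uses the built-in airquality data as a stand-in (treating Ozone as the target); the names na_counts and clean are hypothetical, and the right cleanup choices will differ for each dataset above.

```r
dat <- airquality   # built-in stand-in data with some missing values (assumption)

str(dat)            # column types and a preview of each column
summary(dat)        # ranges, quartiles, and NA counts for each column

# Number of distinct values per column; low counts can hint at categorical features
sapply(dat, function(x) length(unique(x)))

# Missing values per column
na_counts <- colSums(is.na(dat))
print(na_counts)

# Distribution of the target
hist(dat$Ozone, main = "Ozone (target)", xlab = "Ozone")

# One possible cleanup: drop rows with a missing target
# and treat Month as a categorical feature rather than a number
clean <- dat[!is.na(dat$Ozone), ]
clean$Month <- factor(clean$Month)
```

Dropping rows is only one way to handle missing data; whether to drop, impute, or keep them depends on how many values are missing and on the model types you choose.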