1 Last assignment guidelines
For this last assignment you need to pick one of the datasets below and apply what you’ve learned to predict/cluster. It’s up to you to define the question, thoroughly explore the data, do whatever preprocessing is necessary, fit and cross validate models, get the average error (error rate or \(MSE\), depending on the question), and determine which model did the best. A couple of requirements:
You must clearly define your question along with some general predictions.
You must truly explore your data. This means more than a glimpse(). So do additional exploration via summaries, looking at unique values, and histograms of your target and other features. You need to demonstrate that you understand the data you’re working with and have chosen the right types of models.
It’s up to you to figure out the proper type of preprocessing. If there are redundant columns/messy data/outliers/missing data, you need to find it and fix it. You also have to convert to the necessary datatypes for the question.
You must cross validate your models 10x and store your errors from each model and fold number in a data frame. You must then graph these results. The only model that you don’t need to cross validate is a random forest.
You must use at least two model types AND you can’t use a model type if we used it on the dataset originally! You must describe why you’re using these two models for the selected question.
You must describe the output of your models in the context of your question. So if you did a linear regression, you should be talking about what features are important based on SE and p-value, as well as how good the model is based on \(R^2\). This means talking about how much each one influences your target in non-technical speak as well. You should also compare your \(MSE\) from that and your other model.
Your script must work on my computer. I suggest after you’re done you completely close R and then rerun it to make sure everything works.
I don’t want extraneous code. Make sure you have just the code you need to answer the questions. Also make sure you annotate.
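As a rough sketch of the cross-validation requirement above, here is one way the bookkeeping could look. Everything specific in it is an assumption for illustration: it uses the built-in mtcars data as a stand-in (not one of the assignment datasets), mpg as a hypothetical target, and lm() and rpart() as placeholder model types, so swap in whatever fits your own question.

```r
library(tidyverse)
library(rpart)   # assumption: rpart is installed (it ships with most R installs)

set.seed(123)    # so the fold assignment is reproducible

dat <- mtcars    # stand-in data, not one of the assignment datasets
k <- 10
# Randomly assign each row to one of k folds of near-equal size
dat$fold <- sample(rep(1:k, length.out = nrow(dat)))

# Placeholder model types: each function takes training data and returns a fit
fitters <- list(
  linear = function(d) lm(mpg ~ wt + hp, data = d),
  tree   = function(d) rpart(mpg ~ wt + hp, data = d)
)

# Build a data frame with one row per model per fold
cv_results <- map_dfr(names(fitters), function(model_name) {
  map_dfr(1:k, function(i) {
    train <- filter(dat, fold != i)   # fit on the other k - 1 folds
    test  <- filter(dat, fold == i)   # score on the held-out fold
    fit   <- fitters[[model_name]](train)
    preds <- predict(fit, newdata = test)
    tibble(model = model_name, fold = i, mse = mean((test$mpg - preds)^2))
  })
})

# Average error per model
cv_results %>%
  group_by(model) %>%
  summarise(mean_mse = mean(mse))

# Error by fold for each model
ggplot(cv_results, aes(x = factor(fold), y = mse, color = model, group = model)) +
  geom_line() +
  geom_point() +
  labs(x = "Fold", y = "MSE")
```

For a classification question the same structure should work with the error rate in place of the squared error.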
This is kind of a lot of work, but you also have no new content this week. So you are essentially spending the 10ish hours of class time/study time on this + homework time. I think that’s fair.
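To illustrate the requirement about describing model output, here is a minimal sketch of pulling the pieces out of a linear regression. It again assumes mtcars as stand-in data with mpg as a hypothetical target; the features wt and hp are placeholders, not a suggestion for your model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in data and features

fit_summary <- summary(fit)
coefs <- fit_summary$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
r2    <- fit_summary$r.squared      # share of variance in mpg explained by the model
print(coefs)
print(r2)

# In-sample MSE; usually optimistic compared with a cross-validated MSE
mse <- mean(residuals(fit)^2)
print(mse)

# Turning a coefficient into plain language; wt is measured in units of 1000 lbs
wt_effect <- coefs["wt", "Estimate"]
cat(sprintf(
  "Each extra 1000 lbs of weight goes with a change of %.2f mpg, holding horsepower fixed.\n",
  wt_effect
))
```

The last line is the kind of non-technical statement the write-up should contain for each important feature, alongside the SE and p-value that back it up.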
2 Datasets
Below are the datasets you can choose from.
library(tidyverse) # read_csv() comes from readr, which tidyverse loads
# Wine quality - regression
wine <- read_csv("https://docs.google.com/spreadsheets/d/1MTgreXSW8rpvbOi7jPSaj1KGPaF_E7sdv5JdzWUru8E/gviz/tq?tqx=out:csv")
# Churn - classification
telco <- read_csv("https://docs.google.com/spreadsheets/d/1DZbq89b7IPXXzi_fjmECATtTm4670N_janeqxYLzQv8/gviz/tq?tqx=out:csv")
# Cancer M = malignant; B = benign - classification or clustering (if you remove the target)
cancer <- read_csv("https://docs.google.com/spreadsheets/d/1bWopVcJ3aWzzvw4Mp8YCX6lffeCOzDEgf5z7cO-G7c0/gviz/tq?tqx=out:csv")
# Bikeshare per-day rides - regression
bikes <- read_csv("https://docs.google.com/spreadsheets/d/1DK8ZSmIgvZ1eVVF33NCNLyLxvYFA8t1jeGrJppaBDpI/gviz/tq?tqx=out:csv")
# Number of shares for Mashable articles - regression
shares <- read_csv("https://docs.google.com/spreadsheets/d/1U3IjxGFh9WWCi155p4pqWmbzqjVNKLDoWnlO77uRpOE/gviz/tq?tqx=out:csv")
# Beer ratings - clustering or regression
beer <- read_csv("https://docs.google.com/spreadsheets/d/1FQvlCVdeGiMttYgBCsUXge3P_KfL8AmI12cvSXxkPuU/gviz/tq?tqx=out:csv")
# Student final grade based on family factors - regression
grades <- read_csv("https://docs.google.com/spreadsheets/d/1DPaEZ2G75lnUXOYjRjPNdZSBvr9O6SbEoYXlRgyKTow/gviz/tq?tqx=out:csv")
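Once a dataset is loaded, a first pass at exploration might look something like the sketch below. The column names in the assignment datasets aren't listed here, so this uses the built-in mtcars data as a stand-in, with mpg as a hypothetical target and am as a hypothetical column that needs a type conversion.

```r
library(tidyverse)

# Stand-in data: mtcars ships with R; swap in the dataset you chose
dat <- as_tibble(mtcars)

glimpse(dat)                # column types and first few values
summary(dat)                # ranges and quartiles (plus NA counts when present)
colSums(is.na(dat))         # missing values per column
map_int(dat, n_distinct)    # number of unique values per column

# Histogram of the (hypothetical) target
ggplot(dat, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Convert a numeric code to a factor when it is really a category
# (in mtcars, am is 0 = automatic, 1 = manual)
dat <- dat %>%
  mutate(am = factor(am, labels = c("automatic", "manual")))
```

A column with very few unique values is often a category stored as a number, which is one thing the unique-value counts are meant to surface.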