Homework #1 - DA 6813

Ashwin Malshé


Access the wine quality data set at https://archive.ics.uci.edu/ml/datasets/Wine+Quality. There are separate CSV files for white and red wine. Combine them into a single, larger file.
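A minimal sketch of the combining step in R. It assumes the two files have been downloaded as winequality-red.csv and winequality-white.csv and that they are semicolon-delimited; check both assumptions against your own copies. The helper combine_wines() and the added color column are illustrative names, not a required part of the assignment. Tiny made-up data frames stand in for the real files so that the snippet runs on its own.

```r
# Hypothetical helper: stack the red and white data frames and record
# which file each row came from in a new `color` column.
combine_wines <- function(red, white) {
  stopifnot(identical(names(red), names(white)))  # same columns, same order
  red$color   <- "red"
  white$color <- "white"
  wine <- rbind(red, white)
  wine$color <- factor(wine$color)
  wine
}

# Toy stand-ins (made-up values) for the real files.
red_toy   <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white_toy <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))
wine_toy  <- combine_wines(red_toy, white_toy)

# With the real files (assumed file names, assumed ";" separator):
# red   <- read.csv("winequality-red.csv",   sep = ";")
# white <- read.csv("winequality-white.csv", sep = ";")
# wine  <- combine_wines(red, white)
# write.csv(wine, "winequality-all.csv", row.names = FALSE)
```

Keeping a color column is a design choice: it lets you check later whether red and white wines behave differently, and it costs nothing if you decide not to use it.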

Data Set Information:

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult the paper by Cortez et al., 2009 (available in the homework 1 folder on Bb). Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
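One way to see the imbalance is to tabulate the quality scores. The sketch below uses a short made-up vector of scores in place of the real quality column; the variable names are illustrative.

```r
# Made-up quality scores standing in for the combined data set.
quality <- c(5, 6, 6, 5, 7, 6, 3, 8, 6, 5, 6, 7)

counts <- table(quality)                # number of wines per score
shares <- round(prop.table(counts), 2)  # share of wines per score
print(counts)
print(shares)

# Ordered factor: keeps the ranking of the scores when quality is
# used as a class label rather than a number.
quality_f <- factor(quality, levels = sort(unique(quality)), ordered = TRUE)
```

On the real data, the same two calls to table() and prop.table() show how few wines fall in the lowest and highest score classes.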

Attribute Information

For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Homework questions

  1. Build a logistic regression model using “quality” as the target variable. Note that you no longer have a binary classification task, so you will have to use multinomial logistic regression. However, you can still interpret the model output (i.e., statistical significance of the coefficients, etc.) exactly the same way as in binary logistic regression. (40 points)
  2. Use a support vector machine (SVM) to estimate the model. Treat quality as a multiclass categorical variable. (40 points)
  3. Treat quality as a continuous variable. Estimate a linear regression model and compare the output with multinomial regression and SVM. (40 points)
  4. Note that so far we have treated quality as either categorical or continuous. Is there any other way you can model it? If so, what is it? (Hint: yes, there is)
  • In case you want to be adventurous, estimate the model using quality as something other than categorical or continuous. This part of question 4 is not graded, so there is nothing to lose. (30 points)
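As a starting point for questions 1 to 3, here is a hedged sketch of the three model calls. It runs on simulated data, not the wine data. The two predictors, their coefficients, and the helper acc() are made up for illustration. It assumes the nnet package is available (it normally ships with R) and fits the SVM only if the e1071 package is installed. It is one possible setup, not the expected solution.

```r
library(nnet)  # multinom(): multinomial logistic regression

set.seed(1)
# Simulated stand-in for the wine data: two predictors, quality in 5..7.
n <- 300
wine <- data.frame(alcohol          = rnorm(n, 10.5, 1.2),
                   volatile.acidity = rnorm(n, 0.34, 0.16))
latent <- 0.8 * wine$alcohol - 3 * wine$volatile.acidity + rnorm(n)
wine$quality <- as.integer(cut(latent, quantile(latent, c(0, 0.3, 0.8, 1)),
                               include.lowest = TRUE)) + 4L
wine$quality_c <- factor(wine$quality)  # categorical copy of the score

# Illustrative helper: in-sample share of correctly predicted scores.
acc <- function(pred) mean(as.character(pred) == as.character(wine$quality))

# Question 1: multinomial logistic regression.
fit_mn <- multinom(quality_c ~ alcohol + volatile.acidity,
                   data = wine, trace = FALSE)
acc_mn <- acc(predict(fit_mn, wine))

# Question 2: SVM classification (only if e1071 is installed).
if (requireNamespace("e1071", quietly = TRUE)) {
  fit_svm <- e1071::svm(quality_c ~ alcohol + volatile.acidity, data = wine)
  acc_svm <- acc(predict(fit_svm, wine))
}

# Question 3: linear regression on the numeric score; predictions are
# rounded to the nearest integer so they can be compared with classes.
fit_lm <- lm(quality ~ alcohol + volatile.acidity, data = wine)
acc_lm <- acc(round(predict(fit_lm, wine)))

summary(fit_lm)  # coefficient table for the linear model
```

In-sample accuracy is used here only to put the three models on a common scale. For the homework, a train/test split or cross-validation gives a fairer comparison.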

Submit your R code. Please comment your code thoroughly. I would prefer that you submit it in R Markdown format. I will demonstrate R Markdown in class.

The deadline for submission is 6:00 pm, Monday, October 31, 2016.