Due: Wednesday, November 6, 2019
Libraries
library(tidyverse)
library(caret)
library(ggplot2)
library(dplyr)
library(GGally)
library(corrplot)
library(rpart)
library(gplots)
library(ggpubr)
Problem 2.11
The dataset ToyotaCorolla.csv contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications.
- Explore the data using the data visualization capabilities of R. Which of the pairs among the variables seem to be correlated?
toy.df <- read.csv("/Users/miketrevathan/OneDrive/Documents/MIT/Data Mining/Boot Camp/R Data Camp/Dataset/ToyotaCorolla.csv")
toy_cor <- subset(toy.df, select = -c(Model, Fuel_Type, Color))
ggcorr(toy_cor, hjust = 1)

toy_cor_matrix <- data.frame(cor(toy_cor))
toy.plot <- ggplot(toy.df, aes(Mfg_Year, Price)) + geom_point()
toy.plot2 <- ggplot(toy.df, aes(Mfg_Year, Age_08_04)) + geom_point()
toy.plot3 <- ggplot(toy.df, aes(Mfg_Year, Boardcomputer)) + geom_col()
toy.plot4 <- ggplot(toy.df, aes(Id, Age_08_04)) + geom_point()
ggarrange(toy.plot, toy.plot2, toy.plot3, toy.plot4, ncol = 2, nrow = 2)

The structure of the data was inspected with str(toy.df); the three non-numeric variables are Fuel_Type, Model, and Color, so they were dropped before computing correlations. The most strongly correlated pairs (dark red or dark blue in the correlation plot) are Price & Mfg_Year, Age_08_04 & Mfg_Year, Boardcomputer & Mfg_Year, and Id & Age_08_04. These four pairs were then plotted to visualize the suggested correlations.
- We plan to analyze the data using various data mining techniques described in future chapters. Prepare the data for use as follows:
- The dataset has two categorical attributes, Fuel Type and Metallic. Describe how you would convert these to binary variables. Confirm this using R’s functions to transform categorical data into dummies.
- Prepare the dataset (as factored into dummies) for data mining techniques of supervised learning by creating partitions in R. Select all the variables and use default values for the random seed and partitioning percentages for training (50%), validation (30%), and test (20%) sets. Describe the roles that these partitions will play in modeling.
set.seed(1)
library(dummies)
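# keep Model as character so dummy.data.frame() leaves it alone (only factors are expanded)
toy.df$Model <- as.character(toy.df$Model)
toy.df.dum <- dummy.data.frame(toy.df, sep = ".", dummy.class = "factor")
# drop one reference dummy per variable: column 10 (Fuel_Type) and column 13 (Color)
toy.df.dum <- toy.df.dum[, -c(10, 13)]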
head(t(t(names(toy.df.dum))),22)
[,1]
[1,] "Id"
[2,] "Model"
[3,] "Price"
[4,] "Age_08_04"
[5,] "Mfg_Month"
[6,] "Mfg_Year"
[7,] "KM"
[8,] "Fuel_Type.CNG"
[9,] "Fuel_Type.Diesel"
[10,] "HP"
[11,] "Met_Color"
[12,] "Color.Black"
[13,] "Color.Blue"
[14,] "Color.Green"
[15,] "Color.Grey"
[16,] "Color.Red"
[17,] "Color.Silver"
[18,] "Color.Violet"
[19,] "Color.White"
[20,] "Color.Yellow"
[21,] "Automatic"
[22,] "CC"
Met_Color is already coded 0/1, so only the factor variables need converting. Model was first cast to character so that dummy.data.frame, which here expands only factor columns, would convert Fuel_Type and Color into one 0/1 column per level; one dummy from each was then dropped to avoid redundancy. An alternative is to remove Model from the data frame, apply dummy.data.frame, and drop the extra dummy columns. If dummy.data.frame is applied to the raw data, Model is also expanded into multiple columns.
set.seed(0)
## partitioning into training (50%), validation (30%), test (20%)
n.rows <- nrow(toy.df.dum)
# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(toy.df.dum), floor(n.rows * 0.5))
# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set (setdiff() finds those records)
valid.rows <- sample(setdiff(rownames(toy.df.dum), train.rows),
floor(n.rows * 0.3))
# the remaining 20% of row IDs serve as the test set
test.rows <- setdiff(rownames(toy.df.dum), union(train.rows, valid.rows))
# create the 3 data frames from the dummy-coded data, keeping all columns
train.data <- toy.df.dum[train.rows, ]
valid.data <- toy.df.dum[valid.rows, ]
test.data <- toy.df.dum[test.rows, ]
The dataset was split into 50% training, 30% validation, and 20% test. Sampling the validation rows only from records outside the training set, and assigning the remainder to test, ensures that no record appears in more than one partition. The training set is used to fit the models, the validation set is used to tune and compare them, and the test set is held back to estimate how the chosen model performs on new data.
Problem 3.3
Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in London in January 2008. This is a subset of the full dataset that includes data for the entire year.
- Create a bar chart, showing the average retail price by store. Which store has the highest average? Which has the lowest?
- To better compare retail prices across stores, create side-by-side boxplots of retail price by store. Now compare the prices in the two stores from (a). Does there seem to be a difference between their price distributions?
lap.df <- read.csv("/Users/miketrevathan/OneDrive/Documents/MIT/Data Mining/Boot Camp/R Data Camp/Dataset/LaptopSalesJanuary2008.csv")
store.avg <- aggregate(Retail.Price ~ Store.Postcode, data = lap.df, FUN = mean)
store.avg
ggplot(store.avg, aes(x = Store.Postcode, y = Retail.Price)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90)) +
coord_cartesian(ylim = c(475, 500))

The lowest average retail price is about 481 dollars, at W4 3PH. The highest is about 494 dollars, at N17 6QA (although three of the stores in the bar chart are within a couple of dollars of one another).
ggplot(lap.df, aes(Store.Postcode, Retail.Price)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90))

Comparing N17 6QA (highest) with W4 3PH (lowest), the medians are similar; however, W4 3PH has a wider range of prices and more outliers beyond the whiskers.
Problem 4.1
Breakfast Cereals. Use the data for the breakfast cereals example in Section 4.8 to explore and summarize the data as follows:
- Which variables are quantitative/numerical? Which are ordinal? Which are nominal?
cereal.df <- read.csv("/Users/miketrevathan/OneDrive/Documents/MIT/Data Mining/Boot Camp/R Data Camp/Dataset/Cereals.csv")
str(cereal.df)
'data.frame': 77 obs. of 16 variables:
$ name : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
$ mfr : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
$ type : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
$ calories: int 70 120 70 50 110 110 110 130 90 90 ...
$ protein : int 4 3 4 4 2 2 2 3 2 3 ...
$ fat : int 1 5 1 0 2 2 0 2 1 0 ...
$ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
$ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
$ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
$ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
$ potass : int 280 135 320 330 NA 70 30 100 125 190 ...
$ vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
$ shelf : int 3 3 3 3 3 1 2 3 1 3 ...
$ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
$ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
$ rating : num 68.4 34 59.4 93.7 34.4 ...
Nominal (unordered categories): name, mfr, type; Ordinal (ordered): shelf; Quantitative/Numerical: all the remaining variables, including rating
- Compute the mean, median, min, max, and standard deviation for each of the quantitative variables. This can be done through R’s sapply() function (e.g., sapply(data, mean, na.rm = TRUE)).
cereal.df.nofac <- subset(cereal.df, select = -c(name, mfr, type))
data.frame(mean=sapply(cereal.df.nofac, mean, na.rm = TRUE),
sd=sapply(cereal.df.nofac, sd, na.rm = TRUE),
min=sapply(cereal.df.nofac, min, na.rm = TRUE),
max=sapply(cereal.df.nofac, max, na.rm = TRUE),
median=sapply(cereal.df.nofac, median, na.rm = TRUE))
- Use R to plot a histogram for each of the quantitative variables. Based on the histograms and summary statistics, answer the following questions:
- Which variables have the largest variability?
- Which variables seem skewed?
- Are there any values that seem extreme?
ggplot(gather(cereal.df.nofac), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')

ggpairs(cereal.df.nofac)

i. The variables with the highest variability are sugars, sodium, carbo, and shelf
ii. The variables that seem skewed are fiber, fat, and potass
iii. The values that seem extreme are vitamins at 100, rating near 100, potass near 300, and fiber near 15.
- Use R to plot a side-by-side boxplot comparing the calories in hot vs. cold cereals. What does this plot show us?
ggplot(cereal.df, aes(type, calories)) +
geom_boxplot()

This shows us that there are only a few hot cereals compared with cold cereals, and that cold cereal has a much more predictable (i.e., less variable) calorie count.
- Use R to plot a side-by-side boxplot of consumer rating as a function of the shelf height. If we were to predict consumer rating from shelf height, does it appear that we need to keep all three categories of shelf height?
cereal.df$shelf <- as.factor(cereal.df$shelf)
ggplot(cereal.df, aes(shelf, rating)) +
geom_boxplot()

Yes, there does appear to be a difference between the shelf heights in terms of consumer ratings. However, it might be acceptable to collapse the variable into two categories: shelf 2 versus not shelf 2.
- Compute the correlation table for the quantitative variables (function cor()). In addition, generate a matrix plot for these variables (function plot(data)).
- Which pair of variables is most strongly correlated?
- How can we reduce the number of variables based on these correlations?
- How would the correlations change if we normalized the data first?
cereal.df$shelf <- as.integer(cereal.df$shelf)
cereal_cor <- subset(cereal.df, select = -c(name, mfr, type))
ggcorr(cereal_cor, hjust = 1)

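# avoid naming the result `cor`, which would mask the cor() function;
# complete.obs drops rows with missing values so the table has no NA entries
cor_table <- as.data.frame(cor(cereal_cor, use = "complete.obs"))
cor_table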
pm <- ggpairs(cereal_cor)
pm

i. The most strongly correlated pairs are fiber & potass, rating & sugars, and rating & calories
ii. Where two variables are highly correlated, one of them can be removed (one at a time) to see the resulting impact on the model. Only one variable from each highly correlated pair needs to be included, but it will take some trial and error to determine which one is best to keep.
iii. The correlations would not change if the data were normalized first. Normalizing subtracts each variable's mean and divides by its standard deviation, and the correlation coefficient is already computed from deviations scaled in exactly this way, so it is unaffected by such shifts and rescalings.
- Consider the first PC of the analysis of the 13 numerical variables in Table 4.11. Describe briefly what this PC represents.
The purpose of PCA (principal component analysis) is to find combinations of the original variables that capture most of the variation in the data; the first PC is the weighted combination that captures the largest share. The idea is to reduce the number of variables to (1) make the model simpler, and (2) prevent overfitting by reducing the multicollinearity in the model.
---
title: "15.062 Homework 1"
author: "Michael Trevathan"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
  word_document: default
editor_options:
  chunk_output_type: inline
---
###### Due: Wednesday, November 6, 2019

---

#### Libraries
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(caret)
library(ggplot2)
library(dplyr)
library(GGally)
library(corrplot)
library(rpart)
library(gplots)
library(ggpubr)
```

---

#### Problem 2.11


The dataset ToyotaCorolla.csv contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications.

1. Explore the data using the data visualization capabilities of R. Which of the pairs among the variables seem to be correlated?

```{r message=FALSE, warning=FALSE}
toy.df <- read.csv("/Users/miketrevathan/OneDrive/Documents/MIT/Data Mining/Boot Camp/R Data Camp/Dataset/ToyotaCorolla.csv")
toy_cor <- subset(toy.df, select = -c(Model, Fuel_Type, Color))
ggcorr(toy_cor, hjust = 1)
toy_cor_matrix <- data.frame(cor(toy_cor))

toy.plot <- ggplot(toy.df, aes(Mfg_Year, Price)) + geom_point()
toy.plot2 <- ggplot(toy.df, aes(Mfg_Year, Age_08_04)) + geom_point()
toy.plot3 <- ggplot(toy.df, aes(Mfg_Year, Boardcomputer)) + geom_col()
toy.plot4 <- ggplot(toy.df, aes(Id, Age_08_04)) + geom_point()
ggarrange(toy.plot, toy.plot2, toy.plot3, toy.plot4, ncol = 2, nrow = 2)
```

<br><p style="color:blue">The structure of the data was inspected with str(toy.df); the three non-numeric variables are Fuel_Type, Model, and Color, so they were dropped before computing correlations. The most strongly correlated pairs (dark red or dark blue in the correlation plot) are Price & Mfg_Year, Age_08_04 & Mfg_Year, Boardcomputer & Mfg_Year, and Id & Age_08_04. These four pairs were then plotted to visualize the suggested correlations.</p><br>
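
As a numeric complement to the correlation plot, the strongest pairs can also be ranked directly. The sketch below is illustrative only: `top_cor_pairs` is a hypothetical helper, and it is run on the built-in `mtcars` data as a stand-in rather than on the Toyota file.

```{r}
# Hypothetical helper: list the n most strongly correlated variable pairs.
top_cor_pairs <- function(df, n = 5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
  pairs <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
  names(pairs) <- c("var1", "var2", "r")
  pairs <- pairs[!is.na(pairs$r), ]
  head(pairs[order(-abs(pairs$r)), ], n)        # sort by absolute correlation
}

# Stand-in data (assumption): a few numeric columns from mtcars
ex_top <- top_cor_pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")], n = 3)
ex_top
```

Sorting by absolute value keeps strong negative correlations (dark blue in the plot) alongside strong positive ones (dark red).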

2. We plan to analyze the data using various data mining techniques described in future chapters. Prepare the data for use as follows:
   + The dataset has two categorical attributes, Fuel Type and Metallic. Describe how you would convert these to binary variables. Confirm this using R’s functions to transform categorical data into dummies.
   + Prepare the dataset (as factored into dummies) for data mining techniques of supervised learning by creating partitions in R. Select all the variables and use default values for the random seed and partitioning percentages for training (50%), validation (30%), and test (20%) sets. Describe the roles that these partitions will play in modeling.

```{r message=FALSE, warning=FALSE}
set.seed(1)
library(dummies)
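# keep Model as character so dummy.data.frame() leaves it alone (only factors are expanded)
toy.df$Model <- as.character(toy.df$Model)
toy.df.dum <- dummy.data.frame(toy.df, sep = ".", dummy.class = "factor")
# drop one reference dummy per variable: column 10 (Fuel_Type) and column 13 (Color)
toy.df.dum <- toy.df.dum[, -c(10, 13)]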
head(t(t(names(toy.df.dum))),22)
```
<br><p style="color:blue">Met_Color is already coded 0/1, so only the factor variables need converting. Model was first cast to character so that dummy.data.frame, which here expands only factor columns, would convert Fuel_Type and Color into one 0/1 column per level; one dummy from each was then dropped to avoid redundancy. An alternative is to remove Model from the data frame, apply dummy.data.frame, and drop the extra dummy columns. If dummy.data.frame is applied to the raw data, Model is also expanded into multiple columns.</p><br>
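
The same kind of dummy coding can be sketched with base R's `model.matrix()`, which avoids the extra package. This is an alternative technique, not the approach used above; `ex_cars` is a made-up four-row stand-in whose column names merely mirror the Toyota file.

```{r}
# Toy stand-in for the Toyota data (values are illustrative assumptions)
ex_cars <- data.frame(
  Price     = c(13500, 13750, 16900, 18600),
  Fuel_Type = factor(c("Diesel", "Petrol", "CNG", "Petrol")),
  Met_Color = c(1, 0, 1, 0)   # already binary, needs no conversion
)

# model.matrix() expands each factor into 0/1 columns; with an intercept,
# the first level (here CNG) is dropped as the reference category.
# The [, -1] removes the intercept column itself.
ex_dummies <- model.matrix(~ ., data = ex_cars)[, -1]
ex_dummies
```

Because the reference level is dropped automatically, there is no need to remove a column by position afterwards.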

```{r message=FALSE, warning=FALSE}

set.seed(0)
## partitioning into training (50%), validation (30%), test (20%)
n.rows <- nrow(toy.df.dum)
# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(toy.df.dum), floor(n.rows * 0.5))
# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set (setdiff() finds those records)
valid.rows <- sample(setdiff(rownames(toy.df.dum), train.rows),
              floor(n.rows * 0.3))
# the remaining 20% of row IDs serve as the test set
test.rows <- setdiff(rownames(toy.df.dum), union(train.rows, valid.rows))
# create the 3 data frames from the dummy-coded data, keeping all columns
train.data <- toy.df.dum[train.rows, ]
valid.data <- toy.df.dum[valid.rows, ]
test.data <- toy.df.dum[test.rows, ]
```

<br><p style="color:blue">The dataset was split into 50% training, 30% validation, and 20% test. Sampling the validation rows only from records outside the training set, and assigning the remainder to test, ensures that no record appears in more than one partition. The training set is used to fit the models, the validation set is used to tune and compare them, and the test set is held back to estimate how the chosen model performs on new data.</p>
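
A quick way to confirm that a three-way split behaves as described is to check the partition sizes and their overlap. The sketch below uses a hypothetical set of 100 row IDs (an assumption, not the Toyota data) and the same `sample()` and `setdiff()` pattern.

```{r}
set.seed(0)
ex_n   <- 100                      # hypothetical number of records
ex_ids <- seq_len(ex_n)

ex_train <- sample(ex_ids, size = floor(0.5 * ex_n))
ex_valid <- sample(setdiff(ex_ids, ex_train), size = floor(0.3 * ex_n))
ex_test  <- setdiff(ex_ids, c(ex_train, ex_valid))

# Partition sizes, then checks that the sets are disjoint and cover every row
c(train = length(ex_train), valid = length(ex_valid), test = length(ex_test))
length(intersect(ex_train, ex_valid)) == 0
setequal(c(ex_train, ex_valid, ex_test), ex_ids)
```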

---

#### Problem 3.3

Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in London in January 2008. This is a subset of the full dataset that includes data for the entire year.

a.  Create a bar chart, showing the average retail price by store. Which store has the highest average? Which has the lowest?
b.  To better compare retail prices across stores, create side-by-side boxplots of retail price by store. Now compare the prices in the two stores from (a). Does there seem to be a difference between their price distributions?

```{r message=FALSE, warning=FALSE}
lap.df <- read.csv("/Users/miketrevathan/OneDrive/Documents/MIT/Data Mining/Boot Camp/R Data Camp/Dataset/LaptopSalesJanuary2008.csv")
store.avg <- aggregate(Retail.Price ~ Store.Postcode, data = lap.df, FUN = mean)
store.avg
ggplot(store.avg, aes(x = Store.Postcode, y = Retail.Price)) +
    geom_col() +
    theme(axis.text.x = element_text(angle = 90)) +
    coord_cartesian(ylim = c(475, 500))
```
<br><p style="color:blue">The lowest average retail price is about 481 dollars, at W4 3PH. The highest is about 494 dollars, at N17 6QA (although three of the stores in the bar chart are within a couple of dollars of one another).</p><br>

```{r message=FALSE, warning=FALSE}
ggplot(lap.df, aes(Store.Postcode, Retail.Price)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90))

```

<br><p style="color:blue">Comparing N17 6QA (highest) with W4 3PH (lowest), the medians are similar; however, W4 3PH has a wider range of prices and more outliers beyond the whiskers.</p><br>
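
What a boxplot shows for each group can also be read off numerically. The sketch below uses synthetic prices for two hypothetical stores, "A" and "B" (not the laptop data), and base R's `boxplot.stats()` to list the points that fall beyond the whiskers.

```{r}
# Synthetic prices for two hypothetical stores (illustrative assumption)
ex_sales <- data.frame(
  store = rep(c("A", "B"), each = 6),
  price = c(480, 485, 490, 495, 500, 505,
            470, 485, 490, 495, 500, 700)
)
ex_by_store <- split(ex_sales$price, ex_sales$store)

# The median is the line inside each box
ex_medians <- sapply(ex_by_store, median)
ex_medians

# Points beyond the whiskers (1.5 times the box height past the hinges)
ex_outliers <- lapply(ex_by_store, function(p) boxplot.stats(p)$out)
ex_outliers
```

Two groups can share a median and still differ in spread and in the number of flagged points, which is the comparison the side-by-side boxplots make visible.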

---

#### Problem 4.1

Breakfast Cereals. Use the data for the breakfast cereals example in Section 4.8 to explore and summarize the data as follows:

a.  Which variables are quantitative/numerical? Which are ordinal? Which are nominal?
```{r message=FALSE, warning=FALSE}
cereal.df <- read.csv("/Users/miketrevathan/OneDrive/Documents/MIT/Data Mining/Boot Camp/R Data Camp/Dataset/Cereals.csv")
str(cereal.df)
```
<br><p style="color:blue">
Nominal (unordered categories): name, mfr, type; 
Ordinal (ordered): shelf; 
Quantitative/Numerical: all the remaining variables, including rating</p><br>
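
The nominal versus ordinal distinction can be made explicit in R through ordered factors. A minimal sketch with made-up values (`ex_shelf` and `ex_mfr` are hypothetical vectors, not columns read from the cereal file):

```{r}
# Ordinal: levels have a meaningful order, so comparisons are allowed
ex_shelf <- factor(c(3, 1, 2, 3, 1), levels = 1:3, ordered = TRUE)
# Nominal: levels are just labels
ex_mfr <- factor(c("K", "G", "K", "N", "G"))

is.ordered(ex_shelf)
is.ordered(ex_mfr)
ex_shelf[1] > ex_shelf[2]   # shelf 3 compared with shelf 1
```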

b.  Compute the mean, median, min, max, and standard deviation for each of the quantitative variables. This can be done through R’s sapply() function (e.g., sapply(data, mean, na.rm = TRUE)).
```{r message=FALSE, warning=FALSE}
cereal.df.nofac <- subset(cereal.df, select = -c(name, mfr, type))
data.frame(mean=sapply(cereal.df.nofac, mean, na.rm = TRUE), 
           sd=sapply(cereal.df.nofac, sd, na.rm = TRUE),
           min=sapply(cereal.df.nofac, min, na.rm = TRUE), 
           max=sapply(cereal.df.nofac, max, na.rm = TRUE), 
           median=sapply(cereal.df.nofac, median, na.rm = TRUE))
```
<br><font color="blue"></font><br>

c.  Use R to plot a histogram for each of the quantitative variables. Based on the histograms and summary statistics, answer the following questions:
    i. Which variables have the largest variability?
    ii. Which variables seem skewed?
    iii. Are there any values that seem extreme?
    
```{r message=FALSE, warning=FALSE}
ggplot(gather(cereal.df.nofac), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

ggpairs(cereal.df.nofac)

```

<br><font color="blue">
   i. The variables with the highest variability are sugars, sodium, carbo, and shelf
   ii. The variables that seem skewed are fiber, fat, and potass
   iii. The values that seem extreme are vitamins at 100, rating near 100, potass near 300, and fiber near 15.
</font><br>
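
Judgments about variability and skew from histograms can be backed by simple numbers. The sketch below is one possible approach, not the method used above: `ex_spread` is a hypothetical helper that reports the standard deviation, the coefficient of variation (standard deviation divided by the mean, which is scale-free), and a moment-based skewness. It is run on built-in `mtcars` columns as a stand-in.

```{r}
# Hypothetical helper: spread and skewness of one numeric vector
ex_spread <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  c(sd = s, cv = s / m, skew = mean((x - m)^3) / s^3)
}

# Stand-in data (assumption): three numeric columns from mtcars
ex_stats <- sapply(mtcars[, c("mpg", "hp", "wt")], ex_spread)
round(ex_stats, 2)
```

The raw standard deviation depends on each variable's units, so the coefficient of variation is often the fairer comparison across variables measured on different scales. A skewness well above zero indicates a long right tail.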

d.  Use R to plot a side-by-side boxplot comparing the calories in hot vs. cold cereals.
What does this plot show us?
```{r}
ggplot(cereal.df, aes(type, calories)) + 
    geom_boxplot()
    
```
<br><font color="blue">
This shows us that there are only a few hot cereals compared with cold cereals, and that cold cereal has a much more predictable (i.e., less variable) calorie count. 
</font><br>

e. Use R to plot a side-by-side boxplot of consumer rating as a function of the shelf height. If we were to predict consumer rating from shelf height, does it appear that we need to keep all three categories of shelf height?
```{r}
cereal.df$shelf <- as.factor(cereal.df$shelf)
ggplot(cereal.df, aes(shelf, rating)) + 
    geom_boxplot()
```

<br><font color="blue">
Yes, there does appear to be a difference between the shelf heights in terms of consumer ratings. However, it might be acceptable to collapse the variable into two categories: shelf 2 versus not shelf 2.  
</font><br>
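
Collapsing shelf height into two categories could look like the sketch below. The ratings are made-up numbers for illustration (not the cereal data), and `shelf2` is a hypothetical column name.

```{r}
# Made-up ratings for illustration only
ex_cereal <- data.frame(
  shelf  = c(1, 1, 2, 2, 3, 3),
  rating = c(46, 44, 35, 33, 47, 43)
)

# Collapse the three shelf levels into a two-level factor
ex_cereal$shelf2 <- factor(ifelse(ex_cereal$shelf == 2, "shelf 2", "shelf 1 or 3"))

# Mean rating per collapsed category
ex_means <- aggregate(rating ~ shelf2, data = ex_cereal, FUN = mean)
ex_means
```

Merging levels this way is reasonable only when the merged categories have similar rating distributions, which is what the boxplot is used to judge.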

f. Compute the correlation table for the quantitative variables (function cor()). In addition, generate a matrix plot for these variables (function plot(data)).
    i. Which pair of variables is most strongly correlated?
    ii. How can we reduce the number of variables based on these correlations?
    iii. How would the correlations change if we normalized the data first?
    
```{r message=FALSE, warning=FALSE}
cereal.df$shelf <- as.integer(cereal.df$shelf)
cereal_cor <- subset(cereal.df, select = -c(name, mfr, type))
ggcorr(cereal_cor, hjust = 1)
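# avoid naming the result `cor`, which would mask the cor() function;
# complete.obs drops rows with missing values so the table has no NA entries
cor_table <- as.data.frame(cor(cereal_cor, use = "complete.obs"))
cor_table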
pm <- ggpairs(cereal_cor)
pm

```

<br><font color="blue">
   i. The most strongly correlated pairs are fiber & potass, rating & sugars, and rating & calories
   ii. Where two variables are highly correlated, one of them can be removed (one at a time) to see the resulting impact on the model. Only one variable from each highly correlated pair needs to be included, but it will take some trial and error to determine which one is best to keep.
   iii. The correlations would not change if the data were normalized first. Normalizing subtracts each variable's mean and divides by its standard deviation, and the correlation coefficient is already computed from deviations scaled in exactly this way, so it is unaffected by such shifts and rescalings. 
</font><br>
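
The effect of normalization on correlations can be checked empirically. The sketch below uses built-in `mtcars` columns as a stand-in (an assumption, not the cereal file) and compares the correlation matrix before and after z-score scaling with `scale()`.

```{r}
# Stand-in data (assumption): three numeric columns from mtcars
ex_raw    <- mtcars[, c("mpg", "hp", "wt")]
ex_scaled <- as.data.frame(scale(ex_raw))   # subtract mean, divide by sd

# Compare the two correlation matrices up to numerical tolerance
ex_same <- all.equal(cor(ex_raw), cor(ex_scaled))
ex_same
```

If the comparison returns `TRUE`, the two matrices agree, meaning the scaling step left every pairwise correlation as it was.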

g. Consider the first PC of the analysis of the 13 numerical variables in Table 4.11.
Describe briefly what this PC represents.

<br><font color="blue">
The purpose of PCA (principal component analysis) is to find combinations of the original variables that capture most of the variation in the data; the first PC is the weighted combination that captures the largest share. The idea is to reduce the number of variables to (1) make the model simpler, and (2) prevent overfitting by reducing the multicollinearity in the model. 
</font><br>
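
To make the idea of a first principal component concrete, the sketch below runs `prcomp()` on a few built-in `mtcars` columns. This is a stand-in under stated assumptions: it does not reproduce Table 4.11 or use the cereal variables.

```{r}
# Stand-in data (assumption): four correlated numeric columns from mtcars
ex_pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)

# Share of total variance captured by each component
ex_var_share <- ex_pca$sdev^2 / sum(ex_pca$sdev^2)
round(ex_var_share, 3)

# Weights (loadings) that define the first principal component
ex_pc1 <- ex_pca$rotation[, "PC1"]
round(ex_pc1, 2)
```

The loadings show how much each original variable contributes to the first component, and the variance shares show how much information is retained if only the leading components are kept.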
