library("dplyr")
library("tidyr")
library("ggplot2")
library("ROCR")
library("rpart")
library("rpart.plot")
library("caret")
library("randomForest")
library("tidyverse")
library("tm")
library("SnowballC")
library("softImpute")
library("glmnet")
library("Hmisc")
library("dummies")
library('tinytex')
library('GGally')
library('ggpubr')
library('gplots')

2.11) The dataset ToyotaCorolla.csv contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. a) Explore the data using the data visualization capabilities of R. Which of the pairs among the variables seem to be correlated?

rm(list=ls())
ToyotaCorolla=read.csv("/Users/kayhanbabakan/OneDrive/MIT/Data Mining/Data_export/ToyotaCorolla.csv")
drops = c("Id","Model","Fuel_Type","Color","Cylinders")
TCData = ToyotaCorolla[,!colnames(ToyotaCorolla) %in% drops]
cormatrix = suppressWarnings(cor(TCData))
TCData$Boardcomputer = as.factor(TCData$Boardcomputer)
TCData$Powered_Windows = as.factor(TCData$Powered_Windows)
TCData$Central_Lock = as.factor(TCData$Central_Lock)
TCData$Radio = as.factor(TCData$Radio)
TCData$Radio_cassette = as.factor(TCData$Radio_cassette)

#heatmap
heatmap(cormatrix, Rowv = NA, Colv = NA)

data.frame(cormatrix)
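
Reading the strongest pairs off a large correlation matrix by eye is error-prone. The following is a minimal sketch, on made-up toy data rather than the Toyota file, of one way to rank variable pairs by absolute correlation. The names toy, cm, pairs_df and top_pairs are illustrative only.

```r
# Toy data standing in for the numeric columns (values are made up)
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a * 2 + rnorm(50, sd = 0.1)   # strongly related to a
toy$c <- rnorm(50)                          # unrelated noise

cm <- cor(toy)
# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA
pairs_df <- as.data.frame(as.table(cm))     # columns: Var1, Var2, Freq
pairs_df <- pairs_df[!is.na(pairs_df$Freq), ]
# Sort so the pairs with the largest absolute correlation come first
top_pairs <- pairs_df[order(-abs(pairs_df$Freq)), ]
head(top_pairs, 3)
```

Applied to cormatrix, the same steps would list the candidate pairs to inspect in the plots that follow.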

#price/age/mfg_year
prcageyr=ggplot(ToyotaCorolla,aes(Age_08_04,Price,colour=Mfg_Year))+
  geom_point()
#mfg_year/boardcomputer
mfgbcmp=ggplot(TCData,aes(x=Mfg_Year,fill=Boardcomputer==1))+
  geom_histogram(binwidth=1,position='fill')
#powerwindow/centrallock
powcent=ggplot(TCData,aes(x=Powered_Windows,y=Central_Lock))+
  geom_col()
#radio/cassette
radcas=ggplot(TCData,aes(x=Radio,y=Radio_cassette))+
  geom_col()
ggarrange(prcageyr,mfgbcmp,powcent,radcas,nrow=2,ncol=2)

  1. We plan to analyze the data using various data mining techniques described in future chapters. Prepare the data for use as follows:
    1. The dataset has two categorical attributes, Fuel Type and Color. Describe how you would convert these to binary variables. Confirm this using R’s functions to transform categorical data into dummies.
    2. Prepare the dataset (as factored into dummies) for data mining techniques of supervised learning by creating partitions in R. Select all the variables and use default values for the random seed and partitioning percentages for training (50%), validation (30%), and test (20%) sets. Describe the roles that these partitions will play in modeling.
ToyotaCorolla$Model = as.character(ToyotaCorolla$Model)
ToyotaCorollaDummy=dummy.data.frame(ToyotaCorolla,dummy.class="factor")
non-list contrasts argument ignored
ToyotaCorollaDummy=ToyotaCorollaDummy[,!colnames(ToyotaCorollaDummy) %in% "Fuel_TypeCNG"]
ToyotaCorollaDummy=ToyotaCorollaDummy[,!colnames(ToyotaCorollaDummy) %in% "ColorBeige"]
as.array(names(ToyotaCorollaDummy))
 [1] "Id"                "Model"             "Price"            
 [4] "Age_08_04"         "Mfg_Month"         "Mfg_Year"         
 [7] "KM"                "Fuel_TypeDiesel"   "Fuel_TypePetrol"  
[10] "HP"                "Met_Color"         "ColorBlack"       
[13] "ColorBlue"         "ColorGreen"        "ColorGrey"        
[16] "ColorRed"          "ColorSilver"       "ColorViolet"      
[19] "ColorWhite"        "ColorYellow"       "Automatic"        
[22] "CC"                "Doors"             "Cylinders"        
[25] "Gears"             "Quarterly_Tax"     "Weight"           
[28] "Mfr_Guarantee"     "BOVAG_Guarantee"   "Guarantee_Period" 
[31] "ABS"               "Airbag_1"          "Airbag_2"         
[34] "Airco"             "Automatic_airco"   "Boardcomputer"    
[37] "CD_Player"         "Central_Lock"      "Powered_Windows"  
[40] "Power_Steering"    "Radio"             "Mistlamps"        
[43] "Sport_Model"       "Backseat_Divider"  "Metallic_Rim"     
[46] "Radio_cassette"    "Parking_Assistant" "Tow_Bar"          
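
The dummies package used above is no longer maintained on CRAN, so it may not install on a current R setup. As a hedged alternative, the sketch below shows the same idea with base R's model.matrix() on a small made-up data frame. The object names toy, dummies_mat and toy_dummy are illustrative and not part of the original analysis.

```r
# Toy stand-in for the two categorical columns (values are made up)
toy <- data.frame(
  Price     = c(13500, 13750, 13950, 14950),
  Fuel_Type = factor(c("Diesel", "Petrol", "CNG", "Petrol")),
  Color     = factor(c("Blue", "Silver", "Blue", "Black"))
)

# model.matrix() expands each factor into 0/1 columns and drops the
# first level of each factor as the reference category
dummies_mat <- model.matrix(~ Fuel_Type + Color, data = toy)[, -1]  # drop intercept column
toy_dummy <- cbind(toy["Price"], as.data.frame(dummies_mat))
names(toy_dummy)
```

Dropping one level per factor mirrors the manual removal of Fuel_TypeCNG and ColorBeige above: with m categories, m - 1 binary columns carry all the information.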

Analysis
a) I would convert them into factors using the as.factor function, create dummy variables using the dummy.data.frame function, and remove one dummy column per categorical variable (the reference level), as shown above.
b) We partition the data so that each set has a distinct role: the training set is used to build the models, the validation set is used to compare models and judge how well each one performs on data it has not seen, and the test set is held back to give a final estimate of how the chosen model will predict new records.

set.seed(1)
# partition the dummy-coded data frame, as the exercise asks
n <- nrow(ToyotaCorollaDummy)
train.rows <- sample(rownames(ToyotaCorollaDummy), floor(n*0.5))
valid.rows <- sample(setdiff(rownames(ToyotaCorollaDummy), train.rows), floor(n*0.3))
# everything not drawn for training or validation goes to the test set
test.rows <- setdiff(rownames(ToyotaCorollaDummy), union(train.rows, valid.rows))

train.data <- ToyotaCorollaDummy[train.rows, ]
valid.data <- ToyotaCorollaDummy[valid.rows, ]
test.data <- ToyotaCorollaDummy[test.rows, ]
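
As a quick sanity check on the 50/30/20 split, the sketch below partitions plain row indices instead of the data frame, so it runs without the CSV. It assumes the 1436 rows quoted in the exercise; the names idx, train_idx, valid_idx and test_idx are illustrative.

```r
set.seed(1)                       # one seed is enough for the whole split
n <- 1436                         # row count quoted in the exercise
idx <- sample(seq_len(n))         # a single random permutation of the rows

n_train <- floor(0.5 * n)
n_valid <- floor(0.3 * n)

# Slice the permutation into three non-overlapping pieces
train_idx <- idx[seq_len(n_train)]
valid_idx <- idx[n_train + seq_len(n_valid)]
test_idx  <- idx[(n_train + n_valid + 1):n]

c(train = length(train_idx), valid = length(valid_idx), test = length(test_idx))
```

Slicing one permutation guarantees that the three sets are disjoint and together cover every row.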

3.3) Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in London in January 2008. This is a subset of the full dataset that includes data for the entire year. a. Create a bar chart, showing the average retail price by store. Which store has the highest average? Which has the lowest?


LaptopSales = read.csv("/Users/kayhanbabakan/OneDrive/MIT/Data Mining/Data_export/LaptopSalesJanuary2008.csv")
str(LaptopSales)
'data.frame':   7956 obs. of  17 variables:
 $ Date                  : Factor w/ 7303 levels "1/1/2008 0:01",..: 1 2 3 3 4 5 6 7 8 9 ...
 $ Configuration         : int  163 320 23 169 365 309 75 346 70 351 ...
 $ Customer.Postcode     : Factor w/ 834 levels "BR3 1AG","BR3 3LA",..: 230 563 208 533 229 613 464 320 733 580 ...
 $ Store.Postcode        : Factor w/ 16 levels "CR7 8LE","E2 0RY",..: 9 11 2 9 14 14 10 2 12 14 ...
 $ Retail.Price          : int  455 545 515 395 585 555 465 450 455 620 ...
 $ Screen.Size..Inches.  : int  15 15 15 15 15 15 15 15 15 15 ...
 $ Battery.Life..Hours.  : int  5 6 4 5 6 6 4 6 4 6 ...
 $ RAM..GB.              : int  1 1 1 1 2 1 2 2 2 2 ...
 $ Processor.Speeds..GHz.: num  2 2 2 2 2 2 2 1.5 2 1.5 ...
 $ Integrated.Wireless.  : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 2 1 ...
 $ HD.Size..GB.          : int  80 300 300 40 120 120 80 40 120 300 ...
 $ Bundled.Applications. : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 1 1 2 ...
 $ OS.X.Customer         : int  532041 529240 533095 529902 531684 529207 534575 530461 520898 530298 ...
 $ OS.Y.Customer         : int  180995 175537 181047 179641 180948 180969 168236 186176 180071 177435 ...
 $ OS.X.Store            : int  534057 528739 535652 534057 528924 528924 537175 535652 525155 528924 ...
 $ OS.Y.Store            : int  179682 173080 182961 179682 178440 178440 177885 182961 175180 178440 ...
 $ CustomerStoreDistance : num  2406 2508 3194 4155 3729 ...
ggplot(LaptopSales)+
  geom_bar(aes(Store.Postcode,Retail.Price),stat="summary",fun="mean")+  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  theme(axis.text.x = element_text(angle = 90))


agg.data=aggregate(data = LaptopSales,Retail.Price~Store.Postcode,mean)
agg.data[order(agg.data$Retail.Price),]
Analysis
  • Lowest average retail price: store at postcode W4 3PH
  • Highest average retail price: store at postcode N17 6QA

  1. To better compare retail prices across stores, create side-by-side boxplots of retail price by store. Now compare the prices in the two stores from (a). Does there seem to be a difference between their price distributions?
x=subset(LaptopSales,Store.Postcode %in% c("W4 3PH","N17 6QA"))
ggplot(LaptopSales)+
  geom_boxplot(aes(Store.Postcode,Retail.Price))+
  theme(axis.text.x = element_text(angle = 90))


ggplot(x)+
  geom_boxplot(aes(Store.Postcode,Retail.Price))+
  theme(axis.text.x = element_text(angle = 90))

Analysis
The median, first quartile, and third quartile of retail price are all higher for the store with the higher average (N17 6QA, left) than for W4 3PH. Additionally, store W4 3PH has a few more outliers, which need further investigation.


4.1 Breakfast Cereals. Use the data for the breakfast cereals example in Section 4.8 to explore and summarize the data as follows: a. Which variables are quantitative/numerical? Which are ordinal? Which are nominal?

breakfast = read.csv("/Users/kayhanbabakan/OneDrive/MIT/Data Mining/Data_export/Cereals.csv")
str(breakfast)
'data.frame':   77 obs. of  16 variables:
 $ name    : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ mfr     : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
 $ type    : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
 $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
 $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
 $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
 $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
 $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
 $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
 $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
 $ potass  : int  280 135 320 330 NA 70 30 100 125 190 ...
 $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
 $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
 $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
 $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
 $ rating  : num  68.4 34 59.4 93.7 34.4 ...

Ordinal Variables: Shelf, Rating (treated here as a ranking score, although it is stored as a continuous numeric value)
Nominal Variables: Manufacturer, Type, Name
Quantitative/Numerical: Calories, Protein, Fat, Sodium, Fiber, Carbo, Sugars, Potass, Vitamins, Weight, Cups


  1. Compute the mean, median, min, max, and standard deviation for each of the quantitative variables. This can be done through R’s sapply() function (e.g., sapply(data, mean, na.rm = TRUE))
cereal=breakfast[,c("calories","protein","fat","sodium","fiber","carbo","sugars","potass","vitamins","weight","cups")]
data.frame(mean=sapply(cereal, mean,na.rm=TRUE),
           median=sapply(cereal, median,na.rm=TRUE),
           min=sapply(cereal, min,na.rm=TRUE),
           max=sapply(cereal, max,na.rm=TRUE),
           standardev=sapply(cereal,sd,na.rm=TRUE))
  1. Use R to plot a histogram for each of the quantitative variables. Based on the histograms and summary statistics, answer the following questions:
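cereal %>%
  keep(is.numeric) %>%                      # keep numeric columns only
  gather() %>%                              # long format: key = variable, value = observation
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +    # one panel per variable, each with its own axes
    geom_histogram(bins = 15)               # histograms, as the exercise asks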

  1. Which variables have the largest variability?
    Sodium and potass have the largest variability in absolute terms; both range over hundreds of units (see the str() output above), whereas variables such as fat and protein span only a few units.
  2. Which variables seem skewed?
    Variables that are skewed: fat, fiber, potass, vitamins.
  3. Are there any values that seem extreme?
    Extreme values include vitamins at 100, potass above 300, protein at 6 g, and fiber near 15 g.

  1. Use R to plot a side-by-side boxplot comparing the calories in hot vs. cold cereals. What does this plot show us?
hotvcold=subset(breakfast,type %in% c("C","H"))
ggplot(hotvcold)+
  geom_boxplot(aes(type,calories))

The plot shows that there are only a few hot cereals (type H) and that they all have the same calorie content. For the cold cereals, the median coincides with the third quartile because a large share of them sit at about 110 calories, so the distribution is strongly left-skewed.

  1. Use R to plot a side-by-side boxplot of consumer rating as a function of the shelf height. If we were to predict consumer rating from shelf height, does it appear that we need to keep all three categories of shelf height?
breakfast$shelf=as.factor(breakfast$shelf)
ggplot(breakfast)+
  geom_boxplot(aes(shelf,rating))


Because the rating distributions for shelves 1 and 3 are similar, I believe we could group them together into a single category.
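
Merging two shelf levels into one category can be sketched as follows. The shelf values here are made up, and the level name "1_or_3" is an illustrative choice rather than something taken from the exercise.

```r
shelf <- factor(c(1, 2, 3, 3, 1, 2))            # made-up shelf values
shelf2 <- shelf
# Assigning a named list to levels() merges old levels into new ones
levels(shelf2) <- list("1_or_3" = c("1", "3"), "2" = "2")
table(shelf2)
```

The same assignment applied to breakfast$shelf would leave a two-level factor that can be used directly as a predictor of rating.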

  1. Compute the correlation table for the quantitative variable (function cor()). In addition, generate a matrix plot for these variables (function plot(data)).
zz=data.frame(cor(cereal, use = "complete.obs"))
zz
# matrix plot of the variables themselves, not of the correlation table
plot(cereal)

  1. Which pair of variables is most strongly correlated?
    Potassium and fiber have the strongest correlation.
ggplot(cereal,aes(potass,fiber))+
  geom_point()

  1. How can we reduce the number of variables based on these correlations?
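    Strongly correlated variables carry largely redundant information, so we can either drop one variable of such a pair (for example, potassium or fiber) or perform principal component analysis to combine them and reduce the number of dimensions while retaining most of the variance.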

  2. How would the correlations change if we normalized the data first?
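    The correlations themselves would not change, because correlation is unaffected by shifting or rescaling each variable. Normalizing does, however, greatly affect the PCA, since the variables are not measured on the same scale. Only after normalizing the data and bringing everything to a level playing field will the PCA weight the variables comparably.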
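
The scale invariance of correlation can be checked numerically. The sketch below uses simulated columns that only borrow the cereal variable names; the numbers are made up.

```r
set.seed(1)
# Simulated stand-ins for three cereal variables (not the real data)
x <- data.frame(sodium = rnorm(30, 160, 80), fiber = rnorm(30, 2, 2))
x$potass <- 30 * x$fiber + rnorm(30, 0, 20)

raw_cor    <- cor(x)          # correlations on the original scale
scaled_cor <- cor(scale(x))   # correlations after centering and scaling
all.equal(raw_cor, scaled_cor, check.attributes = FALSE)
```

scale() subtracts each column mean and divides by the column standard deviation, which is exactly the kind of linear rescaling that leaves correlations unchanged.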

  1. Consider the first PC of the analysis of the 13 numerical variables in Table 4.11. Describe briefly what this PC represents.
    PC1 is dominated by sodium and essentially measures the amount of sodium in a given cereal.
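
Why a single large-scale variable can dominate the first component is easiest to see on simulated data. This is a sketch under made-up assumptions (one column with standard deviation 100, two with standard deviation 1), not a rerun of Table 4.11; the column names are hypothetical.

```r
set.seed(1)
toy <- data.frame(
  big    = rnorm(100, sd = 100),   # large-scale variable, playing the role of sodium
  small1 = rnorm(100, sd = 1),
  small2 = rnorm(100, sd = 1)
)

pca_raw    <- prcomp(toy)                 # PCA on the covariance matrix
pca_scaled <- prcomp(toy, scale. = TRUE)  # PCA on the correlation matrix

round(pca_raw$rotation[, 1], 3)     # PC1 loadings without normalization
round(pca_scaled$rotation[, 1], 3)  # PC1 loadings after normalization
```

Without normalization the PC1 loading sits almost entirely on the large-scale column, which is the pattern described above for sodium.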
