As the quiz that was part of the original content was discarded, here is a new assignment. Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Based on your computer's capabilities (memory, CPU), select two files you can handle (recommended: one small, one large). Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the suggested environment of the datasets. Write a short essay explaining your selection.

Then select one of the two algorithms and explore how to analyze and predict an outcome based on the data available. This is an exploratory exercise, so feel free to show errors and warnings raised during the analysis. Test the code with both datasets selected and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis is more prone to errors when using too much data, or when using the least amount possible?

Develop your exploratory analysis of the data and the essay over the following two weeks. You'll have until March 17 to submit both.
# Packages used throughout this analysis
library(dplyr)          # glimpse(), %>%
library(caret)          # createDataPartition()
library(randomForest)   # randomForest(), varImpPlot()

# Small file: 100 records; large file: 10,000 records
# (the variable name df_1000_large is kept for consistency with later chunks)
df_100_small <- read.csv("https://raw.githubusercontent.com/johnm1990/DATA622/main/100%20Sales%20Records.csv")
df_1000_large <- read.csv("https://raw.githubusercontent.com/johnm1990/DATA622/main/10000%20Sales%20Records.csv")
First we start off by getting a glimpse of our data. Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand it better and glean insights from it. There are various steps involved in EDA, but the following are common ones a data analyst can take:
1. Import the data
2. Clean the data
3. Process the data
4. Visualize the data
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
glimpse(df_100_small)
## Rows: 100
## Columns: 14
## $ Region <chr> "Australia and Oceania", "Central America and the Carib~
## $ Country <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe",~
## $ Item.Type <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "Of~
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Online", "Offline", "O~
## $ Order.Priority <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H", ~
## $ Order.Date <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2/1~
## $ Order.ID <int> 669165933, 963881480, 341417157, 514321792, 115456712, ~
## $ Ship.Date <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/6/~
## $ Units.Sold <int> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070, 6~
## $ Unit.Price <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27, 1~
## $ Unit.Cost <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54, 9~
## $ Total.Revenue <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.02~
## $ Total.Cost <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.52,~
## $ Total.Profit <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50, 2~
colnames(df_100_small)
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit"
glimpse(df_1000_large)
## Rows: 10,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Europe", "Middle East and North ~
## $ Country <chr> "Chad", "Latvia", "Pakistan", "Democratic Republic of t~
## $ Item.Type <chr> "Office Supplies", "Beverages", "Vegetables", "Househol~
## $ Sales.Channel <chr> "Online", "Online", "Offline", "Online", "Online", "Off~
## $ Order.Priority <chr> "L", "C", "C", "C", "C", "H", "L", "C", "L", "C", "M", ~
## $ Order.Date <chr> "1/27/2011", "12/28/2015", "1/13/2011", "9/11/2012", "1~
## $ Order.ID <int> 292494523, 361825549, 141515767, 500364005, 127481591, ~
## $ Ship.Date <chr> "2/12/2011", "1/23/2016", "2/1/2011", "10/6/2012", "12/~
## $ Units.Sold <int> 4484, 1075, 6515, 7683, 3491, 9880, 4825, 3330, 2431, 6~
## $ Unit.Price <dbl> 651.21, 47.45, 154.06, 668.27, 47.45, 47.45, 154.06, 25~
## $ Unit.Cost <dbl> 524.96, 31.79, 90.93, 502.54, 31.79, 31.79, 90.93, 159.~
## $ Total.Revenue <dbl> 2920025.64, 51008.75, 1003700.90, 5134318.41, 165647.95~
## $ Total.Cost <dbl> 2353920.64, 34174.25, 592408.95, 3861014.82, 110978.89,~
## $ Total.Profit <dbl> 566105.00, 16834.50, 411291.95, 1273303.59, 54669.06, 1~
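Since EDA also means checking data quality, we can confirm there are no missing values before transforming anything. A minimal sketch (both data frames were loaded above):

# Count missing values per column in each table
colSums(is.na(df_100_small))   # small dataset
colSums(is.na(df_1000_large))  # large dataset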
# Conversions: Region is re-typed in place; the other columns get typed
# copies under new names (with spaces), keeping the dot-named originals.
convert_types <- function(df) {
  # as.character(), not toString(): toString() would collapse the whole
  # column into a single comma-separated string recycled across rows
  df[['Order ID']]       <- as.character(df[['Order.ID']])
  df[['Region']]         <- as.factor(df[['Region']])
  df[['Sales Channel']]  <- as.factor(df[['Sales.Channel']])
  df[['Order Priority']] <- as.factor(df[['Order.Priority']])
  df[['Item Type']]      <- as.factor(df[['Item.Type']])
  df[['Order Date']]     <- as.Date(df[['Order.Date']], "%m/%d/%Y")
  df[['Ship Date']]      <- as.Date(df[['Ship.Date']], "%m/%d/%Y")
  df[['Units Sold']]     <- as.numeric(df[['Units.Sold']])
  df[['Unit Price']]     <- as.numeric(df[['Unit.Price']])
  df[['Unit Cost']]      <- as.numeric(df[['Unit.Cost']])
  df[['Total Revenue']]  <- as.numeric(df[['Total.Revenue']])
  df[['Total Profit']]   <- as.numeric(df[['Total.Profit']])
  df[['Total Cost']]     <- as.numeric(df[['Total.Cost']])
  df
}
df_1000_large <- convert_types(df_1000_large)
df_100_small  <- convert_types(df_100_small)
Next, we get a summary of our data:
summary(df_100_small)
## Region Country Item.Type
## Asia :11 Length:100 Length:100
## Australia and Oceania :11 Class :character Class :character
## Central America and the Caribbean: 7 Mode :character Mode :character
## Europe :22
## Middle East and North Africa :10
## North America : 3
## Sub-Saharan Africa :36
## Sales.Channel Order.Priority Order.Date Order.ID
## Length:100 Length:100 Length:100 Min. :114606559
## Class :character Class :character Class :character 1st Qu.:338922488
## Mode :character Mode :character Mode :character Median :557708561
## Mean :555020412
## 3rd Qu.:790755081
## Max. :994022214
##
## Ship.Date Units.Sold Unit.Price Unit.Cost
## Length:100 Min. : 124 Min. : 9.33 Min. : 6.92
## Class :character 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84
## Mode :character Median :5382 Median :179.88 Median :107.28
## Mean :5129 Mean :276.76 Mean :191.05
## 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :9925 Max. :668.27 Max. :524.96
##
## Total.Revenue Total.Cost Total.Profit Order ID
## Min. : 4870 Min. : 3612 Min. : 1258 Length:100
## 1st Qu.: 268721 1st Qu.: 168868 1st Qu.: 121444 Class :character
## Median : 752314 Median : 363566 Median : 290768 Mode :character
## Mean :1373488 Mean : 931806 Mean : 441682
## 3rd Qu.:2212045 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :5997055 Max. :4509794 Max. :1719922
##
## Sales Channel Order Priority Item Type Order Date
## Offline:50 C:22 Clothes :13 Min. :2010-02-02
## Online :50 H:30 Cosmetics :13 1st Qu.:2012-02-14
## L:27 Office Supplies:12 Median :2013-07-12
## M:21 Fruits :10 Mean :2013-09-16
## Personal Care :10 3rd Qu.:2015-04-07
## Household : 9 Max. :2017-05-22
## (Other) :33
## Ship Date Units Sold Unit Price Unit Cost
## Min. :2010-02-25 Min. : 124 Min. : 9.33 Min. : 6.92
## 1st Qu.:2012-02-24 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84
## Median :2013-08-11 Median :5382 Median :179.88 Median :107.28
## Mean :2013-10-09 Mean :5129 Mean :276.76 Mean :191.05
## 3rd Qu.:2015-04-28 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :2017-06-17 Max. :9925 Max. :668.27 Max. :524.96
##
## Total Revenue Total Profit Total Cost
## Min. : 4870 Min. : 1258 Min. : 3612
## 1st Qu.: 268721 1st Qu.: 121444 1st Qu.: 168868
## Median : 752314 Median : 290768 Median : 363566
## Mean :1373488 Mean : 441682 Mean : 931806
## 3rd Qu.:2212045 3rd Qu.: 635829 3rd Qu.:1613870
## Max. :5997055 Max. :1719922 Max. :4509794
##
We can see that both the large dataset and the small dataset contain orders spanning 2010 (oldest) through 2017 (newest).
Data visualization is the technique of delivering insights from data using visual cues such as graphs, charts, and maps. It supports an intuitive and easy understanding of large quantities of data, and thereby better decisions about it. The various data visualization platforms have different capabilities, functionality, and use cases, and they require different skill sets; here we use R. R is a language designed for statistical computing, graphical data analysis, and scientific research. It is often preferred for data visualization because its packages offer flexibility with minimal coding.
Visualization for the large dataset:
hist(df_1000_large$`Total Profit`, col = 'green')
Visualization for the small dataset:
hist(df_100_small$`Total Profit`, col = 'green')
Splitting the data into training and testing sets is critical when using supervised learning algorithms such as linear regression, random forest, naive Bayes classification, logistic regression, and decision trees. We first train the model on the training set's observations and then use it to predict outcomes for the testing set. Splitting helps to avoid overfitting and gives an honest estimate of performance on unseen data.

Separating data into training and testing sets is an important part of evaluating data mining models. Typically, most of the data is used for training and a smaller portion for testing. Random sampling helps ensure that the testing and training sets are similar; by using similar data for training and testing, you minimize the effects of data discrepancies and better understand the characteristics of the model. After a model has been fit on the training set, you test it by making predictions against the test set. Because the test set already contains known values for the attribute you want to predict, it is easy to determine whether the model's guesses are correct. Finally, since we need a model that performs well on unknown data, we use the test data to evaluate the trained model at the end.
set.seed(555)
df_sample <- sample(nrow(df_100_small), round(nrow(df_100_small)*0.75), replace = FALSE)
df_100_small_train <- df_100_small[df_sample, ]
df_100_small_test <- df_100_small[-df_sample, ]
A big part of machine learning is classification — we want to know what class (a.k.a. group) an observation belongs to. The ability to precisely classify observations is extremely valuable for various business applications like predicting whether a particular user will buy a product or forecasting whether a given loan will default or not.
Data science provides a plethora of classification algorithms such as logistic regression, support vector machine, naive Bayes classifier, and decision trees. But near the top of the classifier hierarchy is the random forest classifier (there is also the random forest regressor but that is a topic for another day).
A random forest is a supervised machine learning technique, constructed from many decision trees, that is used to solve both regression and classification problems. It is applied in industries such as banking and e-commerce to predict behavior and outcomes, and it relies on ensemble learning, a technique that combines many classifiers to provide solutions to complex problems.

The 'forest' generated by the random forest algorithm is trained through bagging (bootstrap aggregating), an ensemble meta-algorithm that improves the accuracy of machine learning algorithms. The random forest establishes its outcome from the predictions of the individual decision trees: for regression it averages the tree outputs, and for classification it takes a majority vote. Increasing the number of trees tends to stabilize and improve the result.
# Splitting the data 80/20
set.seed(444)
df_100_small.partition <- df_100_small$`Sales Channel` %>%
createDataPartition(p = 0.8, list=FALSE)
df_100_small_train.data <- df_100_small[df_100_small.partition,]
df_100_small_test.data <- df_100_small[-df_100_small.partition,]
colnames(df_100_small)
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit" "Order ID" "Sales Channel"
## [17] "Order Priority" "Item Type" "Order Date" "Ship Date"
## [21] "Units Sold" "Unit Price" "Unit Cost" "Total Revenue"
## [25] "Total Profit" "Total Cost"
# na.action = na.omit is the documented argument; na.omit = T would be silently ignored
df_100_small_random <- randomForest(`Sales Channel` ~ Region+Item.Type+Order.ID+Units.Sold+Unit.Price+Unit.Cost+Total.Revenue+Total.Cost+Total.Profit, data = df_100_small, importance = TRUE, na.action = na.omit)
df_100_small_random
##
## Call:
## randomForest(formula = `Sales Channel` ~ Region + Item.Type + Order.ID + Units.Sold + Unit.Price + Unit.Cost + Total.Revenue + Total.Cost + Total.Profit, data = df_100_small, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 52%
## Confusion matrix:
## Offline Online class.error
## Offline 24 26 0.52
## Online 26 24 0.52
varImpPlot(df_100_small_random)
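Note that the forest above was fit on all 100 rows and relies on the out-of-bag (OOB) error, which at 52% is essentially no better than chance for this balanced two-class problem. To evaluate on truly held-out data, one could reuse the 80/20 partition created earlier. A minimal sketch (the variable names assume the chunks above have been run):

# Refit on the training partition only, then score the held-out rows
rf_holdout <- randomForest(`Sales Channel` ~ Region + Item.Type + Units.Sold +
                             Unit.Price + Unit.Cost + Total.Revenue +
                             Total.Cost + Total.Profit,
                           data = df_100_small_train.data, importance = TRUE)
rf_pred <- predict(rf_holdout, newdata = df_100_small_test.data)
table(predicted = rf_pred, actual = df_100_small_test.data$`Sales Channel`)
mean(rf_pred == df_100_small_test.data$`Sales Channel`)  # held-out accuracy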
#set.seed(111)
#df_sample <- sample(nrow(df_100_small), round(nrow(df_100_small)*0.75), replace = FALSE)
#small_train <- df_100_small[df_sample, ]
#small_test <- df_100_small[-df_sample, ]
#df_100_small_small_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID + Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit, method = "class", data = small_train)
#rpart.plot(df_100_small_small_model)
Some of these model fits were too heavy for this PC to handle, so the code above is commented out for knitting purposes.
Logistic regression is useful when you are predicting a binary outcome from a set of continuous predictor variables. It is frequently preferred over discriminant function analysis because its assumptions are less restrictive. It models the log-odds of the event as a linear function of the predictors: log(p/(1−p)) = β0 + β1x1 + … + βkxk.

In statistics, binomial regression is a regression analysis technique in which the response (often referred to as Y) has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, each with the same probability of success. The binomial regression model can be used to predict the odds of seeing an event given a vector of regression variables; for example, one could predict the odds of rain starting in the next 2 hours given the current temperature, humidity, barometric pressure, time of year, geo-location, altitude, etc. In a binomial regression model, the dependent variable y is a discrete random variable that takes on values such as 0, 1, 5, 67, etc., each representing the number of 'successes' observed in m trials; thus y follows the binomial distribution.
# Note: Country alone contributes dozens of dummy variables for only 75
# training rows, which invites perfect separation (see the degenerate fit below)
glm.df.small <- glm(`Sales Channel` ~ Region + Country + Item.Type + Order.Priority +
                      Units.Sold + Unit.Price + Unit.Cost +
                      Total.Cost + Total.Profit + Total.Revenue,
                    data = df_100_small_train, family = binomial)
summary(glm.df.small)
##
## Call:
## glm(formula = `Sales Channel` ~ Region + Country + Item.Type +
## Order.Priority + Units.Sold + Unit.Price + Unit.Cost + Total.Cost +
## Total.Profit + Total.Revenue, family = binomial, data = df_100_small_train)
##
## Deviance Residuals:
## 88 16 93 4 29
## -0.0000028555 0.0000032572 0.0000026763 0.0000024942 -0.0000028555
## 68 32 14 62 49
## -0.0000024086 -0.0000024086 -0.0000022448 0.0000024086 -0.0000024086
## 77 1 25 9 43
## 0.0000024086 -0.0000024086 0.0000024086 -0.0000024086 -0.0000016790
## 52 92 47 55 12
## 0.0000024086 -0.0000023685 0.0000024086 -0.0000033283 -0.0000024086
## 8 50 11 60 59
## 0.0000024086 -0.0000024086 0.0000024086 -0.0000018846 0.0000024086
## 91 51 79 94 30
## -0.0000017493 0.0000024086 -0.0000024086 0.0000024086 -0.0000025631
## 6 80 2 73 24
## 0.0000024086 0.0000024086 0.0000024086 0.0000024086 0.0000024086
## 76 35 90 70 63
## -0.0000024086 0.0000025631 -0.0000024086 -0.0000024086 0.0000029049
## 40 71 48 13 54
## 0.0000011101 0.0000024086 0.0000024086 0.0000024086 -0.0000029412
## 41 46 21 37 64
## 0.0000024086 -0.0000024086 0.0000028555 0.0000024086 -0.0000024086
## 18 82 27 39 7
## -0.0000029049 0.0000024086 0.0000024086 0.0000024086 -0.0000024086
## 28 86 44 99 69
## 0.0000032572 -0.0000035358 0.0000024086 -0.0000000211 -0.0000024086
## 15 3 17 38 36
## -0.0000024086 -0.0000024086 -0.0000024086 0.0000024086 -0.0000024086
## 96 23 42 5 66
## 0.0000011101 0.0000022448 0.0000021118 -0.0000022059 -0.0000026026
## 22 57 87 74 33
## 0.0000024086 -0.0000024086 -0.0000011101 0.0000016790 0.0000023685
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error
## (Intercept) -30.51692511 1959290.89595267
## RegionAustralia and Oceania -60.33031184 975061.88876943
## RegionCentral America and the Caribbean 27.88984319 1114629.11835897
## RegionEurope -41.46305524 1188888.88812300
## RegionMiddle East and North Africa 58.57376119 1089800.60598525
## RegionNorth America -27.38991360 1048454.85054840
## RegionSub-Saharan Africa -26.55271756 975862.36028832
## CountryAngola 9.27147033 706062.37685170
## CountryAustralia 21.56494139 760722.30510812
## CountryAzerbaijan -73.35763525 1288791.29765834
## CountryBangladesh -110.63723081 2196990.07253583
## CountryBelize -148.03320255 2148851.46414269
## CountryBrunei 11.04434847 1161525.20558613
## CountryBulgaria 38.02469164 1038863.58355830
## CountryBurkina Faso 0.83399324 1344975.73527355
## CountryCameroon 28.65481462 868849.40563290
## CountryCape Verde -58.00363263 2005661.76312319
## CountryComoros 15.45117022 748747.33717922
## CountryCosta Rica -55.79724147 789771.38590161
## CountryDemocratic Republic of the Congo 74.94246870 926477.18392942
## CountryDjibouti -28.81424188 1496411.50425276
## CountryFederated States of Micronesia 33.03815475 1025472.04863411
## CountryFiji -162.38290896 2508613.35391304
## CountryGrenada -51.25030535 1092645.77839474
## CountryHonduras NA NA
## CountryIceland -4.85606568 1495687.47190877
## CountryKiribati 79.64155236 1531085.39061856
## CountryKyrgyzstan 97.07027959 1953764.63402352
## CountryLebanon -162.85206941 2538638.24433091
## CountryLesotho -46.52472272 1803735.70492243
## CountryLibya -175.81396779 2310748.30263599
## CountryLithuania 17.58025827 1440096.10496470
## CountryMacedonia -146.63682070 1789145.62335812
## CountryMadagascar -121.76396047 2596411.09526176
## CountryMali 38.43817229 1033527.26553023
## CountryMauritania -73.46211006 1413458.32732949
## CountryMexico NA NA
## CountryMoldova 92.46548447 1700177.90002545
## CountryMonaco -4.84163928 1106540.72526390
## CountryMongolia -30.27033119 1373005.26835453
## CountryNew Zealand 165.52302440 1532240.69418134
## CountryNiger 127.62331630 1255160.51658938
## CountryNorway 49.89661203 1220903.92358574
## CountryPortugal 126.82753160 2144254.48648367
## CountryRepublic of the Congo 2.40595137 951018.64468596
## CountryRomania 38.59376096 1360701.40645861
## CountryRussia -24.65648075 1245390.43778733
## CountryRwanda -23.73615592 1156486.84422851
## CountrySamoa 54.63769366 1438771.58246108
## CountrySan Marino 79.12271283 1686115.78879184
## CountrySao Tome and Principe -43.97916932 1577276.66220425
## CountrySierra Leone -33.75166671 1086573.31526804
## CountrySlovakia 137.80814678 2358295.52030176
## CountrySlovenia 59.20889432 1324752.45245241
## CountrySolomon Islands 85.59623052 1648660.40575036
## CountrySouth Sudan 49.72440828 1278438.21618938
## CountrySri Lanka -67.72259728 1035421.30653760
## CountrySwitzerland 182.04907517 2875363.13650431
## CountrySyria NA NA
## CountryThe Gambia NA NA
## CountryTurkmenistan NA NA
## CountryTuvalu NA NA
## CountryUnited Kingdom 53.94294549 1927788.37471161
## Item.TypeBeverages -110.73772930 1672472.68037832
## Item.TypeCereal 44.68486362 856934.40463009
## Item.TypeClothes 60.47639886 1675636.21305181
## Item.TypeCosmetics 89.04751920 1418742.90620209
## Item.TypeFruits -55.66315396 1871351.66239503
## Item.TypeHousehold 77.88467451 1601556.10094457
## Item.TypeMeat 85.13977309 2394661.56775232
## Item.TypeOffice Supplies 110.13817191 2600892.36402120
## Item.TypePersonal Care -60.13016286 1654756.33681553
## Item.TypeSnacks 25.74133373 1536855.09892741
## Item.TypeVegetables NA NA
## Order.PriorityH -41.90056453 345134.86951404
## Order.PriorityL -31.39411771 538096.03794553
## Order.PriorityM -28.62850865 922422.27843734
## Units.Sold 0.02303059 269.47105130
## Unit.Price NA NA
## Unit.Cost NA NA
## Total.Cost -0.00003703 1.48719734
## Total.Profit -0.00006707 4.66913265
## Total.Revenue NA NA
## z value Pr(>|z|)
## (Intercept) 0 1
## RegionAustralia and Oceania 0 1
## RegionCentral America and the Caribbean 0 1
## RegionEurope 0 1
## RegionMiddle East and North Africa 0 1
## RegionNorth America 0 1
## RegionSub-Saharan Africa 0 1
## CountryAngola 0 1
## CountryAustralia 0 1
## CountryAzerbaijan 0 1
## CountryBangladesh 0 1
## CountryBelize 0 1
## CountryBrunei 0 1
## CountryBulgaria 0 1
## CountryBurkina Faso 0 1
## CountryCameroon 0 1
## CountryCape Verde 0 1
## CountryComoros 0 1
## CountryCosta Rica 0 1
## CountryDemocratic Republic of the Congo 0 1
## CountryDjibouti 0 1
## CountryFederated States of Micronesia 0 1
## CountryFiji 0 1
## CountryGrenada 0 1
## CountryHonduras NA NA
## CountryIceland 0 1
## CountryKiribati 0 1
## CountryKyrgyzstan 0 1
## CountryLebanon 0 1
## CountryLesotho 0 1
## CountryLibya 0 1
## CountryLithuania 0 1
## CountryMacedonia 0 1
## CountryMadagascar 0 1
## CountryMali 0 1
## CountryMauritania 0 1
## CountryMexico NA NA
## CountryMoldova 0 1
## CountryMonaco 0 1
## CountryMongolia 0 1
## CountryNew Zealand 0 1
## CountryNiger 0 1
## CountryNorway 0 1
## CountryPortugal 0 1
## CountryRepublic of the Congo 0 1
## CountryRomania 0 1
## CountryRussia 0 1
## CountryRwanda 0 1
## CountrySamoa 0 1
## CountrySan Marino 0 1
## CountrySao Tome and Principe 0 1
## CountrySierra Leone 0 1
## CountrySlovakia 0 1
## CountrySlovenia 0 1
## CountrySolomon Islands 0 1
## CountrySouth Sudan 0 1
## CountrySri Lanka 0 1
## CountrySwitzerland 0 1
## CountrySyria NA NA
## CountryThe Gambia NA NA
## CountryTurkmenistan NA NA
## CountryTuvalu NA NA
## CountryUnited Kingdom 0 1
## Item.TypeBeverages 0 1
## Item.TypeCereal 0 1
## Item.TypeClothes 0 1
## Item.TypeCosmetics 0 1
## Item.TypeFruits 0 1
## Item.TypeHousehold 0 1
## Item.TypeMeat 0 1
## Item.TypeOffice Supplies 0 1
## Item.TypePersonal Care 0 1
## Item.TypeSnacks 0 1
## Item.TypeVegetables NA NA
## Order.PriorityH 0 1
## Order.PriorityL 0 1
## Order.PriorityM 0 1
## Units.Sold 0 1
## Unit.Price NA NA
## Unit.Cost NA NA
## Total.Cost 0 1
## Total.Profit 0 1
## Total.Revenue NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 103.85204506349423 on 74 degrees of freedom
## Residual deviance: 0.00000000044153 on 2 degrees of freedom
## AIC: 146
##
## Number of Fisher Scoring iterations: 25
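The warning signs here are the near-zero residual deviance on only 2 degrees of freedom, the 10 coefficients not defined because of singularities, and the enormous standard errors: the dummy-coded predictors (mostly Country levels) leave almost no residual degrees of freedom from 75 training rows, so the model separates the training data perfectly and its apparent fit is meaningless. As a sketch of how one might still generate test-set predictions, here is a reduced model that drops Country (an illustrative simplification; if the test split contains factor levels unseen during training, predict() will error and the levels must be aligned first):

glm_reduced <- glm(`Sales Channel` ~ Region + Item.Type + Units.Sold + Unit.Price,
                   data = df_100_small_train, family = binomial)
probs <- predict(glm_reduced, newdata = df_100_small_test, type = "response")
# The second factor level ("Online") is the modeled event
pred <- ifelse(probs > 0.5, "Online", "Offline")
table(predicted = pred, actual = df_100_small_test$`Sales Channel`)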
LARGE DATASET
set.seed(999)
df_sample <- sample(nrow(df_1000_large), round(nrow(df_1000_large)*0.75), replace = FALSE)
df_1000_large_train <- df_1000_large[df_sample, ]
df_1000_large_test <- df_1000_large[-df_sample, ]
#df_1000_large_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID + Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit, method = "class", data = df_1000_large_train, control = rpart.control(minsplit = 2, minbucket = 3, cp = 0.001))
#rpart.plot(df_1000_large_model)
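The tree above is commented out because it was too heavy to knit. A lighter alternative (a sketch, assuming the rpart and rpart.plot packages are installed) is to fit on a modest random subsample of the training set:

library(rpart)
library(rpart.plot)
set.seed(123)
idx <- sample(nrow(df_1000_large_train), 2000)  # subsample to keep knitting light
tree_fit <- rpart(`Order Priority` ~ Region + `Item Type` + `Sales Channel` +
                    `Units Sold` + `Total Revenue` + `Total Cost` + `Total Profit`,
                  method = "class", data = df_1000_large_train[idx, ])
rpart.plot(tree_fit)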
The problem with metrics such as R² and RMSE is that they are sensitive to the inclusion of additional variables in the model, even if those variables don't contribute significantly to explaining the outcome. Put another way, including additional variables will always increase R² and reduce RMSE, so we need a more robust metric to guide the model choice.

Concerning R², there is an adjusted version, called adjusted R-squared, which penalizes R² for having too many variables in the model.

Additionally, four other important metrics (AIC, AICc, BIC, and Mallows' Cp) are commonly used for model evaluation and selection. These estimate the model prediction error (MSE): the lower these metrics, the better the model. A brief sketch of comparing candidate models this way follows.
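For instance, two nested candidate logistic models for the small dataset could be compared like this (the model formulas are illustrative, not the final models):

m1 <- glm(`Sales Channel` ~ Units.Sold + Unit.Price,
          data = df_100_small_train, family = binomial)
m2 <- update(m1, . ~ . + Region)  # add Region and see whether AIC/BIC improve
AIC(m1, m2)
BIC(m1, m2)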
# Accuracy from a confusion table, once the commented-out tree model and its
# test-set predictions exist:
#conf_mat <- table(predict(df_1000_large_model, df_1000_large_test, type = "class"),
#                  df_1000_large_test$Order.Priority)
#sum(diag(conf_mat)) / nrow(df_1000_large_test)
GLMs are useful when the range of your response variable is constrained and/or the variance is not constant or normally distributed. Rather than transforming the response itself, a GLM relates a function of the mean of the response, defined by the link function, to the linear predictor, and the fit is carried out by iteratively reweighted least squares; this transformation of the mean may constrain the range of predictions. The variance function specifies the relationship of the variance to the mean. In R, a family specifies the variance and link functions used in the model fit: for example, the "poisson" family uses the "log" link function and "μ" as the variance function. A GLM model is defined by both the formula and the family.

GLM models can also be used to fit data in which the variance is proportional to one of the defined variance functions. This is done with quasi families, where Pearson's χ² ("chi-squared") is used to scale the variance. An example would be data in which the variance is proportional to the mean; this would use the "quasipoisson" family, which gives a variance function of φμ (with the dispersion φ estimated from the data) instead of μ, as for Poisson-distributed data. The quasi families allow inference when your data is overdispersed or underdispersed, provided that the variance is proportional.

GLM models have a defined relationship between the expected variance and the mean. This relationship can be used to evaluate the model's goodness of fit to the data, and the deviance can be used for this check. Under asymptotic conditions the deviance is expected to be χ² distributed on the residual degrees of freedom. Pearson's χ² can also be used for this measure of goodness of fit, though technically it is the deviance which is minimized when fitting a GLM model. There are some limits to the goodness-of-fit evaluation: when the response data is binary, the deviance approximations are not even approximately correct; the approximations are also not useful when there are small group sizes; and goodness-of-fit tests using deviance or Pearson's χ² are not applicable with a quasi family model.

Residual plots are useful for some GLM models and much less useful for others. When residuals are useful in the evaluation of a GLM model, the plot of Pearson residuals versus the fitted link values is typically the most helpful: the Pearson residuals are normalized by the variance and are expected to be roughly constant across the prediction range. Pearson residuals and the fitted link values are obtained by the extractor functions residuals() and predict(), each of which has a type argument that determines what values are returned, as sketched below.
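For the logistic model fit earlier (glm.df.small), that diagnostic plot could be produced as follows (a sketch; with a fit as degenerate as the one above, the plot is of limited value):

res_p <- residuals(glm.df.small, type = "pearson")  # Pearson residuals
eta   <- predict(glm.df.small, type = "link")       # fitted values on the link scale
plot(eta, res_p, xlab = "Fitted link values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)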
Variable selection for a GLM model is similar to the process for an OLS model. Nested-model tests for the significance of a coefficient are preferred to Wald tests of coefficients, because the standard errors of GLM coefficients are sensitive to even small deviations from the model assumptions; it is also more accurate to obtain p-values for the GLM coefficients from nested-model tests.

The likelihood ratio test (LRT) is typically used to test nested models. For quasi family models (or when the fit is overdispersed or underdispersed) an F-test is used instead; this use of the F statistic is appropriate if the group sizes are approximately equal. Which variables to select for a model may depend on the family being used, so in these cases variable selection is connected with family selection. Variable selection criteria such as AIC and BIC are generally not applicable for selecting between families. A sketch of a nested-model test follows.
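For example, one could test whether Region adds anything beyond Units.Sold in the small-data logistic model (an illustrative comparison, reusing the training split from above; test = "F" would be the choice for a quasi family):

m_reduced <- glm(`Sales Channel` ~ Units.Sold,
                 data = df_100_small_train, family = binomial)
m_full <- update(m_reduced, . ~ . + Region)
anova(m_reduced, m_full, test = "LRT")  # likelihood ratio test of the nested models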