As the quiz that was part of the original content was discarded, here is a new assignment. Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records): https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Based on your computer's capabilities (memory, CPU), select two files you can handle (recommended: one small, one large). Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the suggested environment of the datasets. Write a short essay explaining your selection.

Then select one of the two algorithms and explore how to analyze and predict an outcome based on the data available. This is an exploratory exercise, so feel free to show errors and warnings raised during the analysis. Test the code with both datasets selected and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis is more prone to errors when using too much data, or when using the least amount possible?

Develop your exploratory analysis of the data and the essay over the following two weeks. You'll have until March 17 to submit both.
# Packages used throughout this analysis
library(dplyr)          # glimpse(), %>%
library(caret)          # createDataPartition()
library(randomForest)   # randomForest(), varImpPlot()

# Small file: 100 records; large file: 10,000 records
# (the variable name df_1000_large is kept for consistency with later chunks)
df_100_small <- read.csv("https://raw.githubusercontent.com/johnm1990/DATA622/main/100%20Sales%20Records.csv")
df_1000_large <- read.csv("https://raw.githubusercontent.com/johnm1990/DATA622/main/10000%20Sales%20Records.csv")
First we start off by getting a glimpse of our data. Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand it better and glean insights from it. There are various steps involved in EDA, but the following are common ones a data analyst can take:
1. Import the data
2. Clean the data
3. Process the data
4. Visualize the data
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
glimpse(df_100_small)
## Rows: 100
## Columns: 14
## $ Region <chr> "Australia and Oceania", "Central America and the Carib~
## $ Country <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe",~
## $ Item.Type <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "Of~
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Online", "Offline", "O~
## $ Order.Priority <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H", ~
## $ Order.Date <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2/1~
## $ Order.ID <int> 669165933, 963881480, 341417157, 514321792, 115456712, ~
## $ Ship.Date <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/6/~
## $ Units.Sold <int> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070, 6~
## $ Unit.Price <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27, 1~
## $ Unit.Cost <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54, 9~
## $ Total.Revenue <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.02~
## $ Total.Cost <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.52,~
## $ Total.Profit <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50, 2~
colnames(df_100_small)
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit"
glimpse(df_1000_large)
## Rows: 10,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Europe", "Middle East and North ~
## $ Country <chr> "Chad", "Latvia", "Pakistan", "Democratic Republic of t~
## $ Item.Type <chr> "Office Supplies", "Beverages", "Vegetables", "Househol~
## $ Sales.Channel <chr> "Online", "Online", "Offline", "Online", "Online", "Off~
## $ Order.Priority <chr> "L", "C", "C", "C", "C", "H", "L", "C", "L", "C", "M", ~
## $ Order.Date <chr> "1/27/2011", "12/28/2015", "1/13/2011", "9/11/2012", "1~
## $ Order.ID <int> 292494523, 361825549, 141515767, 500364005, 127481591, ~
## $ Ship.Date <chr> "2/12/2011", "1/23/2016", "2/1/2011", "10/6/2012", "12/~
## $ Units.Sold <int> 4484, 1075, 6515, 7683, 3491, 9880, 4825, 3330, 2431, 6~
## $ Unit.Price <dbl> 651.21, 47.45, 154.06, 668.27, 47.45, 47.45, 154.06, 25~
## $ Unit.Cost <dbl> 524.96, 31.79, 90.93, 502.54, 31.79, 31.79, 90.93, 159.~
## $ Total.Revenue <dbl> 2920025.64, 51008.75, 1003700.90, 5134318.41, 165647.95~
## $ Total.Cost <dbl> 2353920.64, 34174.25, 592408.95, 3861014.82, 110978.89,~
## $ Total.Profit <dbl> 566105.00, 16834.50, 411291.95, 1273303.59, 54669.06, 1~
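Since EDA also means checking data quality, we can confirm there are no missing values before transforming anything. A minimal sketch (both data frames were loaded above):

# Count missing values per column in each table
colSums(is.na(df_100_small))   # small dataset
colSums(is.na(df_1000_large))  # large dataset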
# Conversions: Region is re-typed in place; the other columns get typed
# copies under new names (with spaces), keeping the dot-named originals.
convert_types <- function(df) {
  # as.character(), not toString(): toString() would collapse the whole
  # column into a single comma-separated string recycled across rows
  df[['Order ID']]       <- as.character(df[['Order.ID']])
  df[['Region']]         <- as.factor(df[['Region']])
  df[['Sales Channel']]  <- as.factor(df[['Sales.Channel']])
  df[['Order Priority']] <- as.factor(df[['Order.Priority']])
  df[['Item Type']]      <- as.factor(df[['Item.Type']])
  df[['Order Date']]     <- as.Date(df[['Order.Date']], "%m/%d/%Y")
  df[['Ship Date']]      <- as.Date(df[['Ship.Date']], "%m/%d/%Y")
  df[['Units Sold']]     <- as.numeric(df[['Units.Sold']])
  df[['Unit Price']]     <- as.numeric(df[['Unit.Price']])
  df[['Unit Cost']]      <- as.numeric(df[['Unit.Cost']])
  df[['Total Revenue']]  <- as.numeric(df[['Total.Revenue']])
  df[['Total Profit']]   <- as.numeric(df[['Total.Profit']])
  df[['Total Cost']]     <- as.numeric(df[['Total.Cost']])
  df
}
df_1000_large <- convert_types(df_1000_large)
df_100_small  <- convert_types(df_100_small)
Next, we get a summary of our data:
summary(df_100_small)
## Region Country Item.Type
## Asia :11 Length:100 Length:100
## Australia and Oceania :11 Class :character Class :character
## Central America and the Caribbean: 7 Mode :character Mode :character
## Europe :22
## Middle East and North Africa :10
## North America : 3
## Sub-Saharan Africa :36
## Sales.Channel Order.Priority Order.Date Order.ID
## Length:100 Length:100 Length:100 Min. :114606559
## Class :character Class :character Class :character 1st Qu.:338922488
## Mode :character Mode :character Mode :character Median :557708561
## Mean :555020412
## 3rd Qu.:790755081
## Max. :994022214
##
## Ship.Date Units.Sold Unit.Price Unit.Cost
## Length:100 Min. : 124 Min. : 9.33 Min. : 6.92
## Class :character 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84
## Mode :character Median :5382 Median :179.88 Median :107.28
## Mean :5129 Mean :276.76 Mean :191.05
## 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :9925 Max. :668.27 Max. :524.96
##
## Total.Revenue Total.Cost Total.Profit Order ID
## Min. : 4870 Min. : 3612 Min. : 1258 Length:100
## 1st Qu.: 268721 1st Qu.: 168868 1st Qu.: 121444 Class :character
## Median : 752314 Median : 363566 Median : 290768 Mode :character
## Mean :1373488 Mean : 931806 Mean : 441682
## 3rd Qu.:2212045 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :5997055 Max. :4509794 Max. :1719922
##
## Sales Channel Order Priority Item Type Order Date
## Offline:50 C:22 Clothes :13 Min. :2010-02-02
## Online :50 H:30 Cosmetics :13 1st Qu.:2012-02-14
## L:27 Office Supplies:12 Median :2013-07-12
## M:21 Fruits :10 Mean :2013-09-16
## Personal Care :10 3rd Qu.:2015-04-07
## Household : 9 Max. :2017-05-22
## (Other) :33
## Ship Date Units Sold Unit Price Unit Cost
## Min. :2010-02-25 Min. : 124 Min. : 9.33 Min. : 6.92
## 1st Qu.:2012-02-24 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84
## Median :2013-08-11 Median :5382 Median :179.88 Median :107.28
## Mean :2013-10-09 Mean :5129 Mean :276.76 Mean :191.05
## 3rd Qu.:2015-04-28 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :2017-06-17 Max. :9925 Max. :668.27 Max. :524.96
##
## Total Revenue Total Profit Total Cost
## Min. : 4870 Min. : 1258 Min. : 3612
## 1st Qu.: 268721 1st Qu.: 121444 1st Qu.: 168868
## Median : 752314 Median : 290768 Median : 363566
## Mean :1373488 Mean : 441682 Mean : 931806
## 3rd Qu.:2212045 3rd Qu.: 635829 3rd Qu.:1613870
## Max. :5997055 Max. :1719922 Max. :4509794
##
We can see that both the large dataset and the small dataset contain orders spanning 2010 (oldest) through 2017 (newest).
Data visualization is the technique of delivering insights from data using visual cues such as graphs, charts, and maps. It supports an intuitive and easy understanding of large quantities of data, and thereby better decisions about it. The various data visualization platforms have different capabilities, functionality, and use cases, and they require different skill sets; here we use R. R is a language designed for statistical computing, graphical data analysis, and scientific research. It is often preferred for data visualization because its packages offer flexibility with minimal coding.
Visualization for the large dataset:
hist(df_1000_large$`Total Profit`, col = 'green')
Visualization for the small dataset:
hist(df_100_small$`Total Profit`, col = 'green')
Splitting the data into training and testing sets is critical when using supervised learning algorithms such as linear regression, random forest, naive Bayes classification, logistic regression, and decision trees. We first train the model on the training set's observations and then use it to predict outcomes for the testing set. Splitting helps to avoid overfitting and gives an honest estimate of performance on unseen data.

Separating data into training and testing sets is an important part of evaluating data mining models. Typically, most of the data is used for training and a smaller portion for testing. Random sampling helps ensure that the testing and training sets are similar; by using similar data for training and testing, you minimize the effects of data discrepancies and better understand the characteristics of the model. After a model has been fit on the training set, you test it by making predictions against the test set. Because the test set already contains known values for the attribute you want to predict, it is easy to determine whether the model's guesses are correct. Finally, since we need a model that performs well on unknown data, we use the test data to evaluate the trained model at the end.
set.seed(555)
df_sample <- sample(nrow(df_100_small), round(nrow(df_100_small)*0.75), replace = FALSE)
df_100_small_train <- df_100_small[df_sample, ]
df_100_small_test <- df_100_small[-df_sample, ]
A big part of machine learning is classification — we want to know what class (a.k.a. group) an observation belongs to. The ability to precisely classify observations is extremely valuable for various business applications like predicting whether a particular user will buy a product or forecasting whether a given loan will default or not.
Data science provides a plethora of classification algorithms such as logistic regression, support vector machine, naive Bayes classifier, and decision trees. But near the top of the classifier hierarchy is the random forest classifier (there is also the random forest regressor but that is a topic for another day).
A random forest is a supervised machine learning technique, constructed from many decision trees, that is used to solve both regression and classification problems. It is applied in industries such as banking and e-commerce to predict behavior and outcomes, and it relies on ensemble learning, a technique that combines many classifiers to provide solutions to complex problems.

The 'forest' generated by the random forest algorithm is trained through bagging (bootstrap aggregating), an ensemble meta-algorithm that improves the accuracy of machine learning algorithms. The random forest establishes its outcome from the predictions of the individual decision trees: for regression it averages the tree outputs, and for classification it takes a majority vote. Increasing the number of trees tends to stabilize and improve the result.
# Splitting the data 80/20
set.seed(444)
df_100_small.partition <- df_100_small$`Sales Channel` %>%
createDataPartition(p = 0.8, list=FALSE)
df_100_small_train.data <- df_100_small[df_100_small.partition,]
df_100_small_test.data <- df_100_small[-df_100_small.partition,]
colnames(df_100_small)
## [1] "Region" "Country" "Item.Type" "Sales.Channel"
## [5] "Order.Priority" "Order.Date" "Order.ID" "Ship.Date"
## [9] "Units.Sold" "Unit.Price" "Unit.Cost" "Total.Revenue"
## [13] "Total.Cost" "Total.Profit" "Order ID" "Sales Channel"
## [17] "Order Priority" "Item Type" "Order Date" "Ship Date"
## [21] "Units Sold" "Unit Price" "Unit Cost" "Total Revenue"
## [25] "Total Profit" "Total Cost"
# na.action = na.omit is the documented argument; na.omit = T would be silently ignored
df_100_small_random <- randomForest(`Sales Channel` ~ Region+Item.Type+Order.ID+Units.Sold+Unit.Price+Unit.Cost+Total.Revenue+Total.Cost+Total.Profit, data = df_100_small, importance = TRUE, na.action = na.omit)
df_100_small_random
##
## Call:
## randomForest(formula = `Sales Channel` ~ Region + Item.Type + Order.ID + Units.Sold + Unit.Price + Unit.Cost + Total.Revenue + Total.Cost + Total.Profit, data = df_100_small, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 52%
## Confusion matrix:
## Offline Online class.error
## Offline 24 26 0.52
## Online 26 24 0.52
varImpPlot(df_100_small_random)
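Note that the forest above was fit on all 100 rows and relies on the out-of-bag (OOB) error, which at 52% is essentially no better than chance for this balanced two-class problem. To evaluate on truly held-out data, one could reuse the 80/20 partition created earlier. A minimal sketch (the variable names assume the chunks above have been run):

# Refit on the training partition only, then score the held-out rows
rf_holdout <- randomForest(`Sales Channel` ~ Region + Item.Type + Units.Sold +
                             Unit.Price + Unit.Cost + Total.Revenue +
                             Total.Cost + Total.Profit,
                           data = df_100_small_train.data, importance = TRUE)
rf_pred <- predict(rf_holdout, newdata = df_100_small_test.data)
table(predicted = rf_pred, actual = df_100_small_test.data$`Sales Channel`)
mean(rf_pred == df_100_small_test.data$`Sales Channel`)  # held-out accuracy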
#set.seed(111)
#df_sample <- sample(nrow(df_100_small), round(nrow(df_100_small)*0.75), replace = FALSE)
#small_train <- df_100_small[df_sample, ]
#small_test <- df_100_small[-df_sample, ]
#df_100_small_small_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID + Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit, method = "class", data = small_train)
#rpart.plot(df_100_small_small_model)
Some of these model fits were too heavy for this PC to handle, so the code above is commented out for knitting purposes.
Logistic regression is useful when you are predicting a binary outcome from a set of continuous predictor variables. It is frequently preferred over discriminant function analysis because its assumptions are less restrictive. It models the log-odds of the event as a linear function of the predictors: log(p/(1−p)) = β0 + β1x1 + … + βkxk.

In statistics, binomial regression is a regression analysis technique in which the response (often referred to as Y) has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, each with the same probability of success. The binomial regression model can be used to predict the odds of seeing an event given a vector of regression variables; for example, one could predict the odds of rain starting in the next 2 hours given the current temperature, humidity, barometric pressure, time of year, geo-location, altitude, etc. In a binomial regression model, the dependent variable y is a discrete random variable that takes on values such as 0, 1, 5, 67, etc., each representing the number of 'successes' observed in m trials; thus y follows the binomial distribution.
# Note: Country alone contributes dozens of dummy variables for only 75
# training rows, which invites perfect separation (see the degenerate fit below)
glm.df.small <- glm(`Sales Channel` ~ Region + Country + Item.Type + Order.Priority +
                      Units.Sold + Unit.Price + Unit.Cost +
                      Total.Cost + Total.Profit + Total.Revenue,
                    data = df_100_small_train, family = binomial)
summary(glm.df.small)
##
## Call:
## glm(formula = `Sales Channel` ~ Region + Country + Item.Type +
## Order.Priority + Units.Sold + Unit.Price + Unit.Cost + Total.Cost +
## Total.Profit + Total.Revenue, family = binomial, data = df_100_small_train)
##
## Deviance Residuals:
## 88 16 93 4 29
## -0.0000028555 0.0000032572 0.0000026763 0.0000024942 -0.0000028555
## 68 32 14 62 49
## -0.0000024086 -0.0000024086 -0.0000022448 0.0000024086 -0.0000024086
## 77 1 25 9 43
## 0.0000024086 -0.0000024086 0.0000024086 -0.0000024086 -0.0000016790
## 52 92 47 55 12
## 0.0000024086 -0.0000023685 0.0000024086 -0.0000033283 -0.0000024086
## 8 50 11 60 59
## 0.0000024086 -0.0000024086 0.0000024086 -0.0000018846 0.0000024086
## 91 51 79 94 30
## -0.0000017493 0.0000024086 -0.0000024086 0.0000024086 -0.0000025631
## 6 80 2 73 24
## 0.0000024086 0.0000024086 0.0000024086 0.0000024086 0.0000024086
## 76 35 90 70 63
## -0.0000024086 0.0000025631 -0.0000024086 -0.0000024086 0.0000029049
## 40 71 48 13 54
## 0.0000011101 0.0000024086 0.0000024086 0.0000024086 -0.0000029412
## 41 46 21 37 64
## 0.0000024086 -0.0000024086 0.0000028555 0.0000024086 -0.0000024086
## 18 82 27 39 7
## -0.0000029049 0.0000024086 0.0000024086 0.0000024086 -0.0000024086
## 28 86 44 99 69
## 0.0000032572 -0.0000035358 0.0000024086 -0.0000000211 -0.0000024086
## 15 3 17 38 36
## -0.0000024086 -0.0000024086 -0.0000024086 0.0000024086 -0.0000024086
## 96 23 42 5 66
## 0.0000011101 0.0000022448 0.0000021118 -0.0000022059 -0.0000026026
## 22 57 87 74 33
## 0.0000024086 -0.0000024086 -0.0000011101 0.0000016790 0.0000023685
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error
## (Intercept) -30.51692511 1959290.89595267
## RegionAustralia and Oceania -60.33031184 975061.88876943
## RegionCentral America and the Caribbean 27.88984319 1114629.11835897
## RegionEurope -41.46305524 1188888.88812300
## RegionMiddle East and North Africa 58.57376119 1089800.60598525
## RegionNorth America -27.38991360 1048454.85054840
## RegionSub-Saharan Africa -26.55271756 975862.36028832
## CountryAngola 9.27147033 706062.37685170
## CountryAustralia 21.56494139 760722.30510812
## CountryAzerbaijan -73.35763525 1288791.29765834
## CountryBangladesh -110.63723081 2196990.07253583
## CountryBelize -148.03320255 2148851.46414269
## CountryBrunei 11.04434847 1161525.20558613
## CountryBulgaria 38.02469164 1038863.58355830
## CountryBurkina Faso 0.83399324 1344975.73527355
## CountryCameroon 28.65481462 868849.40563290
## CountryCape Verde -58.00363263 2005661.76312319
## CountryComoros 15.45117022 748747.33717922
## CountryCosta Rica -55.79724147 789771.38590161
## CountryDemocratic Republic of the Congo 74.94246870 926477.18392942
## CountryDjibouti -28.81424188 1496411.50425276
## CountryFederated States of Micronesia 33.03815475 1025472.04863411
## CountryFiji -162.38290896 2508613.35391304
## CountryGrenada -51.25030535 1092645.77839474
## CountryHonduras NA NA
## CountryIceland -4.85606568 1495687.47190877
## CountryKiribati 79.64155236 1531085.39061856
## CountryKyrgyzstan 97.07027959 1953764.63402352
## CountryLebanon -162.85206941 2538638.24433091
## CountryLesotho -46.52472272 1803735.70492243
## CountryLibya -175.81396779 2310748.30263599
## CountryLithuania 17.58025827 1440096.10496470
## CountryMacedonia -146.63682070 1789145.62335812
## CountryMadagascar -121.76396047 2596411.09526176
## CountryMali 38.43817229 1033527.26553023
## CountryMauritania -73.46211006 1413458.32732949
## CountryMexico NA NA
## CountryMoldova 92.46548447 1700177.90002545
## CountryMonaco -4.84163928 1106540.72526390
## CountryMongolia -30.27033119 1373005.26835453
## CountryNew Zealand 165.52302440 1532240.69418134
## CountryNiger 127.62331630 1255160.51658938
## CountryNorway 49.89661203 1220903.92358574
## CountryPortugal 126.82753160 2144254.48648367
## CountryRepublic of the Congo 2.40595137 951018.64468596
## CountryRomania 38.59376096 1360701.40645861
## CountryRussia -24.65648075 1245390.43778733
## CountryRwanda -23.73615592 1156486.84422851
## CountrySamoa 54.63769366 1438771.58246108
## CountrySan Marino 79.12271283 1686115.78879184
## CountrySao Tome and Principe -43.97916932 1577276.66220425
## CountrySierra Leone -33.75166671 1086573.31526804
## CountrySlovakia 137.80814678 2358295.52030176
## CountrySlovenia 59.20889432 1324752.45245241
## CountrySolomon Islands 85.59623052 1648660.40575036
## CountrySouth Sudan 49.72440828 1278438.21618938
## CountrySri Lanka -67.72259728 1035421.30653760
## CountrySwitzerland 182.04907517 2875363.13650431
## CountrySyria NA NA
## CountryThe Gambia NA NA
## CountryTurkmenistan NA NA
## CountryTuvalu NA NA
## CountryUnited Kingdom 53.94294549 1927788.37471161
## Item.TypeBeverages -110.73772930 1672472.68037832
## Item.TypeCereal 44.68486362 856934.40463009
## Item.TypeClothes 60.47639886 1675636.21305181
## Item.TypeCosmetics 89.04751920 1418742.90620209
## Item.TypeFruits -55.66315396 1871351.66239503
## Item.TypeHousehold 77.88467451 1601556.10094457
## Item.TypeMeat 85.13977309 2394661.56775232
## Item.TypeOffice Supplies 110.13817191 2600892.36402120
## Item.TypePersonal Care -60.13016286 1654756.33681553
## Item.TypeSnacks 25.74133373 1536855.09892741
## Item.TypeVegetables NA NA
## Order.PriorityH -41.90056453 345134.86951404
## Order.PriorityL -31.39411771 538096.03794553
## Order.PriorityM -28.62850865 922422.27843734
## Units.Sold 0.02303059 269.47105130
## Unit.Price NA NA
## Unit.Cost NA NA
## Total.Cost -0.00003703 1.48719734
## Total.Profit -0.00006707 4.66913265
## Total.Revenue NA NA
## z value Pr(>|z|)
## (Intercept) 0 1
## RegionAustralia and Oceania 0 1
## RegionCentral America and the Caribbean 0 1
## RegionEurope 0 1
## RegionMiddle East and North Africa 0 1
## RegionNorth America 0 1
## RegionSub-Saharan Africa 0 1
## CountryAngola 0 1
## CountryAustralia 0 1
## CountryAzerbaijan 0 1
## CountryBangladesh 0 1
## CountryBelize 0 1
## CountryBrunei 0 1
## CountryBulgaria 0 1
## CountryBurkina Faso 0 1
## CountryCameroon 0 1
## CountryCape Verde 0 1
## CountryComoros 0 1
## CountryCosta Rica 0 1
## CountryDemocratic Republic of the Congo 0 1
## CountryDjibouti 0 1
## CountryFederated States of Micronesia 0 1
## CountryFiji 0 1
## CountryGrenada 0 1
## CountryHonduras NA NA
## CountryIceland 0 1
## CountryKiribati 0 1
## CountryKyrgyzstan 0 1
## CountryLebanon 0 1
## CountryLesotho 0 1
## CountryLibya 0 1
## CountryLithuania 0 1
## CountryMacedonia 0 1
## CountryMadagascar 0 1
## CountryMali 0 1
## CountryMauritania 0 1
## CountryMexico NA NA
## CountryMoldova 0 1
## CountryMonaco 0 1
## CountryMongolia 0 1
## CountryNew Zealand 0 1
## CountryNiger 0 1
## CountryNorway 0 1
## CountryPortugal 0 1
## CountryRepublic of the Congo 0 1
## CountryRomania 0 1
## CountryRussia 0 1
## CountryRwanda 0 1
## CountrySamoa 0 1
## CountrySan Marino 0 1
## CountrySao Tome and Principe 0 1
## CountrySierra Leone 0 1
## CountrySlovakia 0 1
## CountrySlovenia 0 1
## CountrySolomon Islands 0 1
## CountrySouth Sudan 0 1
## CountrySri Lanka 0 1
## CountrySwitzerland 0 1
## CountrySyria NA NA
## CountryThe Gambia NA NA
## CountryTurkmenistan NA NA
## CountryTuvalu NA NA
## CountryUnited Kingdom 0 1
## Item.TypeBeverages 0 1
## Item.TypeCereal 0 1
## Item.TypeClothes 0 1
## Item.TypeCosmetics 0 1
## Item.TypeFruits 0 1
## Item.TypeHousehold 0 1
## Item.TypeMeat 0 1
## Item.TypeOffice Supplies 0 1
## Item.TypePersonal Care 0 1
## Item.TypeSnacks 0 1
## Item.TypeVegetables NA NA
## Order.PriorityH 0 1
## Order.PriorityL 0 1
## Order.PriorityM 0 1
## Units.Sold 0 1
## Unit.Price NA NA
## Unit.Cost NA NA
## Total.Cost 0 1
## Total.Profit 0 1
## Total.Revenue NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 103.85204506349423 on 74 degrees of freedom
## Residual deviance: 0.00000000044153 on 2 degrees of freedom
## AIC: 146
##
## Number of Fisher Scoring iterations: 25
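The warning signs here are the near-zero residual deviance on only 2 degrees of freedom, the 10 coefficients not defined because of singularities, and the enormous standard errors: the dummy-coded predictors (mostly Country levels) leave almost no residual degrees of freedom from 75 training rows, so the model separates the training data perfectly and its apparent fit is meaningless. As a sketch of how one might still generate test-set predictions, here is a reduced model that drops Country (an illustrative simplification; if the test split contains factor levels unseen during training, predict() will error and the levels must be aligned first):

glm_reduced <- glm(`Sales Channel` ~ Region + Item.Type + Units.Sold + Unit.Price,
                   data = df_100_small_train, family = binomial)
probs <- predict(glm_reduced, newdata = df_100_small_test, type = "response")
# The second factor level ("Online") is the modeled event
pred <- ifelse(probs > 0.5, "Online", "Offline")
table(predicted = pred, actual = df_100_small_test$`Sales Channel`)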
LARGE DATASET
set.seed(999)
df_sample <- sample(nrow(df_1000_large), round(nrow(df_1000_large)*0.75), replace = FALSE)
df_1000_large_train <- df_1000_large[df_sample, ]
df_1000_large_test <- df_1000_large[-df_sample, ]
#df_1000_large_model <- rpart(Order.Priority ~ Region + Item.Type + Sales.Channel + Order.Date + Order.ID + Ship.Date + Units.Sold + Total.Revenue + Total.Cost + Total.Profit, method = "class", data = df_1000_large_train, control = rpart.control(minsplit = 2, minbucket = 3, cp = 0.001))
#rpart.plot(df_1000_large_model)
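The tree above is commented out because it was too heavy to knit. A lighter alternative (a sketch, assuming the rpart and rpart.plot packages are installed) is to fit on a modest random subsample of the training set:

library(rpart)
library(rpart.plot)
set.seed(123)
idx <- sample(nrow(df_1000_large_train), 2000)  # subsample to keep knitting light
tree_fit <- rpart(`Order Priority` ~ Region + `Item Type` + `Sales Channel` +
                    `Units Sold` + `Total Revenue` + `Total Cost` + `Total Profit`,
                  method = "class", data = df_1000_large_train[idx, ])
rpart.plot(tree_fit)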
The problem with metrics such as R² and RMSE is that they are sensitive to the inclusion of additional variables in the model, even if those variables don't contribute significantly to explaining the outcome. Put another way, including additional variables will always increase R² and reduce RMSE, so we need a more robust metric to guide the model choice.

Concerning R², there is an adjusted version, called adjusted R-squared, which penalizes R² for having too many variables in the model.

Additionally, four other important metrics (AIC, AICc, BIC, and Mallows' Cp) are commonly used for model evaluation and selection. These estimate the model prediction error (MSE): the lower these metrics, the better the model. A brief sketch of comparing candidate models this way follows.
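For instance, two nested candidate logistic models for the small dataset could be compared like this (the model formulas are illustrative, not the final models):

m1 <- glm(`Sales Channel` ~ Units.Sold + Unit.Price,
          data = df_100_small_train, family = binomial)
m2 <- update(m1, . ~ . + Region)  # add Region and see whether AIC/BIC improve
AIC(m1, m2)
BIC(m1, m2)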
# Accuracy from a confusion table, once the commented-out tree model and its
# test-set predictions exist:
#conf_mat <- table(predict(df_1000_large_model, df_1000_large_test, type = "class"),
#                  df_1000_large_test$Order.Priority)
#sum(diag(conf_mat)) / nrow(df_1000_large_test)
GLMs are useful when the range of your response variable is constrained and/or the variance is not constant or normally distributed. Rather than transforming the response itself, a GLM relates a function of the mean of the response, defined by the link function, to the linear predictor, and the fit is carried out by iteratively reweighted least squares; this transformation of the mean may constrain the range of predictions. The variance function specifies the relationship of the variance to the mean. In R, a family specifies the variance and link functions used in the model fit: for example, the "poisson" family uses the "log" link function and "μ" as the variance function. A GLM model is defined by both the formula and the family.

GLM models can also be used to fit data in which the variance is proportional to one of the defined variance functions. This is done with quasi families, where Pearson's χ² ("chi-squared") is used to scale the variance. An example would be data in which the variance is proportional to the mean; this would use the "quasipoisson" family, which gives a variance function of φμ (with the dispersion φ estimated from the data) instead of μ, as for Poisson-distributed data. The quasi families allow inference when your data is overdispersed or underdispersed, provided that the variance is proportional.

GLM models have a defined relationship between the expected variance and the mean. This relationship can be used to evaluate the model's goodness of fit to the data, and the deviance can be used for this check. Under asymptotic conditions the deviance is expected to be χ² distributed on the residual degrees of freedom. Pearson's χ² can also be used for this measure of goodness of fit, though technically it is the deviance which is minimized when fitting a GLM model. There are some limits to the goodness-of-fit evaluation: when the response data is binary, the deviance approximations are not even approximately correct; the approximations are also not useful when there are small group sizes; and goodness-of-fit tests using deviance or Pearson's χ² are not applicable with a quasi family model.

Residual plots are useful for some GLM models and much less useful for others. When residuals are useful in the evaluation of a GLM model, the plot of Pearson residuals versus the fitted link values is typically the most helpful: the Pearson residuals are normalized by the variance and are expected to be roughly constant across the prediction range. Pearson residuals and the fitted link values are obtained by the extractor functions residuals() and predict(), each of which has a type argument that determines what values are returned, as sketched below.
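For the logistic model fit earlier (glm.df.small), that diagnostic plot could be produced as follows (a sketch; with a fit as degenerate as the one above, the plot is of limited value):

res_p <- residuals(glm.df.small, type = "pearson")  # Pearson residuals
eta   <- predict(glm.df.small, type = "link")       # fitted values on the link scale
plot(eta, res_p, xlab = "Fitted link values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)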
Variable selection for a GLM model is similar to the process for an OLS model. Nested-model tests for the significance of a coefficient are preferred to Wald tests of coefficients, because the standard errors of GLM coefficients are sensitive to even small deviations from the model assumptions; it is also more accurate to obtain p-values for the GLM coefficients from nested-model tests.

The likelihood ratio test (LRT) is typically used to test nested models. For quasi family models (or when the fit is overdispersed or underdispersed) an F-test is used instead; this use of the F statistic is appropriate if the group sizes are approximately equal. Which variables to select for a model may depend on the family being used, so in these cases variable selection is connected with family selection. Variable selection criteria such as AIC and BIC are generally not applicable for selecting between families. A sketch of a nested-model test follows.
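For example, one could test whether Region adds anything beyond Units.Sold in the small-data logistic model (an illustrative comparison, reusing the training split from above; test = "F" would be the choice for a quasi family):

m_reduced <- glm(`Sales Channel` ~ Units.Sold,
                 data = df_100_small_train, family = binomial)
m_full <- update(m_reduced, . ~ . + Region)
anova(m_reduced, m_full, test = "LRT")  # likelihood ratio test of the nested models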