# Load the libraries needed.
library(rpart.plot)
library(rpart)
library(RColorBrewer)
library(corrplot)
library(GGally)
library(scales)
library(tidyr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(tabulate)
library(knitr)
library(kableExtra)
library(summarytools)
library(stats)
library(class)
# Import the data from GitHub.

urlfile1 <- "https://raw.githubusercontent.com/enidroman/Data-622-Machine-Learning-and-Big-Data/main/100%20Sales%20Records.csv"
urlfile2 <- "https://raw.githubusercontent.com/enidroman/Data-622-Machine-Learning-and-Big-Data/main/10000%20Sales%20Records.csv"
table_small_df <- read.csv(urlfile1)
table_large_df <- read.csv(urlfile2)

In the realm of business analytics, a thorough exploration of our dataset reveals intriguing patterns and variations in key numerical indicators. These metrics not only serve as numerical snapshots of our business operations but also offer valuable insights into the underlying dynamics.

Small Dataset

Content of the Tables:

Head of the Dataframe:

# head() displays the first few rows of a dataframe, giving you a quick look at the data.
head(table_small_df)
##                              Region               Country       Item.Type
## 1             Australia and Oceania                Tuvalu       Baby Food
## 2 Central America and the Caribbean               Grenada          Cereal
## 3                            Europe                Russia Office Supplies
## 4                Sub-Saharan Africa Sao Tome and Principe          Fruits
## 5                Sub-Saharan Africa                Rwanda Office Supplies
## 6             Australia and Oceania       Solomon Islands       Baby Food
##   Sales.Channel Order.Priority Order.Date  Order.ID Ship.Date Units.Sold
## 1       Offline              H  5/28/2010 669165933 6/27/2010       9925
## 2        Online              C  8/22/2012 963881480 9/15/2012       2804
## 3       Offline              L   5/2/2014 341417157  5/8/2014       1779
## 4        Online              C  6/20/2014 514321792  7/5/2014       8102
## 5       Offline              L   2/1/2013 115456712  2/6/2013       5062
## 6        Online              C   2/4/2015 547995746 2/21/2015       2974
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     255.28    159.42    2533654.00 1582243.50    951410.50
## 2     205.70    117.11     576782.80  328376.44    248406.36
## 3     651.21    524.96    1158502.59  933903.84    224598.75
## 4       9.33      6.92      75591.66   56065.84     19525.82
## 5     651.21    524.96    3296425.02 2657347.52    639077.50
## 6     255.28    159.42     759202.72  474115.08    285087.64

Summary Statistics:

# summary() reports the central tendency, dispersion, and distribution of the data: for numeric variables it gives the minimum, first quartile, median, mean, third quartile, and maximum; for character variables it reports the length and class.

summary(table_small_df)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:100         Length:100         Min.   :114606559   Length:100        
##  Class :character   Class :character   1st Qu.:338922488   Class :character  
##  Mode  :character   Mode  :character   Median :557708561   Mode  :character  
##                                        Mean   :555020412                     
##                                        3rd Qu.:790755081                     
##                                        Max.   :994022214                     
##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   : 124   Min.   :  9.33   Min.   :  6.92   Min.   :   4870  
##  1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 268721  
##  Median :5382   Median :179.88   Median :107.28   Median : 752314  
##  Mean   :5129   Mean   :276.76   Mean   :191.05   Mean   :1373488  
##  3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:2212045  
##  Max.   :9925   Max.   :668.27   Max.   :524.96   Max.   :5997055  
##    Total.Cost       Total.Profit    
##  Min.   :   3612   Min.   :   1258  
##  1st Qu.: 168868   1st Qu.: 121444  
##  Median : 363566   Median : 290768  
##  Mean   : 931806   Mean   : 441682  
##  3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :4509794   Max.   :1719922

The dataset contains 100 observations for each variable, and the variables have different types (character, integer, numeric). The descriptive statistics give insights into the distribution and central tendency of the numeric variables.

Structure of Tables:

# The str() function provides an overview of the structure of an R object, showing the data type and structure of each variable (column) in the dataset.
str(table_small_df)
## 'data.frame':    100 obs. of  14 variables:
##  $ Region        : chr  "Australia and Oceania" "Central America and the Caribbean" "Europe" "Sub-Saharan Africa" ...
##  $ Country       : chr  "Tuvalu" "Grenada" "Russia" "Sao Tome and Principe" ...
##  $ Item.Type     : chr  "Baby Food" "Cereal" "Office Supplies" "Fruits" ...
##  $ Sales.Channel : chr  "Offline" "Online" "Offline" "Online" ...
##  $ Order.Priority: chr  "H" "C" "L" "C" ...
##  $ Order.Date    : chr  "5/28/2010" "8/22/2012" "5/2/2014" "6/20/2014" ...
##  $ Order.ID      : int  669165933 963881480 341417157 514321792 115456712 547995746 135425221 871543967 770463311 616607081 ...
##  $ Ship.Date     : chr  "6/27/2010" "9/15/2012" "5/8/2014" "7/5/2014" ...
##  $ Units.Sold    : int  9925 2804 1779 8102 5062 2974 4187 8082 6070 6593 ...
##  $ Unit.Price    : num  255.28 205.7 651.21 9.33 651.21 ...
##  $ Unit.Cost     : num  159.42 117.11 524.96 6.92 524.96 ...
##  $ Total.Revenue : num  2533654 576783 1158503 75592 3296425 ...
##  $ Total.Cost    : num  1582244 328376 933904 56066 2657348 ...
##  $ Total.Profit  : num  951411 248406 224599 19526 639078 ...

Again, the dataset consists of 100 observations and 14 variables. The variables include categorical features such as Region, Country, Item Type, Sales Channel, Order Priority, Order Date, and Ship Date. Additionally, there are numerical features, including Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit.

The categorical features provide information about the geographic and logistical aspects of each transaction, while the numerical features quantify the financial details, such as sales quantities, pricing, and associated costs. Notably, the dataset covers a diverse range of regions, countries, and product types, making it suitable for exploring sales trends and financial performance.
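
One practical note: Order.Date and Ship.Date are stored as character strings. If time-based analysis were needed, they could be parsed into Date objects; a minimal sketch (the shipping_days quantity is illustrative and is kept outside the dataframe so the outputs below are unchanged):

# Parse the character dates (month/day/year format, e.g. "5/28/2010")
order_dates <- as.Date(table_small_df$Order.Date, format = "%m/%d/%Y")
ship_dates <- as.Date(table_small_df$Ship.Date, format = "%m/%d/%Y")

# Illustrative derived quantity: days between order and shipment
shipping_days <- as.numeric(ship_dates - order_dates)
summary(shipping_days)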

Dataset Characteristics (Structure, Size, Dependencies, Labels, etc.):

Number of Rows and Columns:

# nrow() and ncol() give the number of rows and columns, respectively.
number_of_rows <- nrow(table_small_df)
print(paste("Number of rows:", number_of_rows))
## [1] "Number of rows: 100"
number_of_columns <- ncol(table_small_df)
print(paste("Number of rows:", number_of_columns))
## [1] "Number of rows: 14"
# names() shows the variable (column) names.
names(table_small_df)
##  [1] "Region"         "Country"        "Item.Type"      "Sales.Channel" 
##  [5] "Order.Priority" "Order.Date"     "Order.ID"       "Ship.Date"     
##  [9] "Units.Sold"     "Unit.Price"     "Unit.Cost"      "Total.Revenue" 
## [13] "Total.Cost"     "Total.Profit"

Number of NA’s:

# Get the total number of NA's in the entire dataframe
total_na_count <- sum(is.na(table_small_df))
print("Total number of NA's in the entire dataframe:")
## [1] "Total number of NA's in the entire dataframe:"
print(total_na_count)
## [1] 0
# Get the number of NA's in each column
na_per_column <- colSums(is.na(table_small_df))
print("Number of NA's in each column:")
## [1] "Number of NA's in each column:"
print(na_per_column)
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##              0              0              0              0              0 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##              0              0              0              0              0 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##              0              0              0              0

There are no NA’s in this dataset.
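
No imputation is needed here, but for completeness, a hedged sketch of how missing numeric values could be filled with the column median if they did appear (dplyr is already loaded; imputed_df is an illustrative name):

# Hypothetical NA handling: replace missing numeric values with the column median
imputed_df <- table_small_df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))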

Standard Deviation and Variance:

# Exclude "Order.ID" and non-numeric variables
numeric_table_small_df <- table_small_df[sapply(table_small_df, is.numeric) & colnames(table_small_df) != "Order.ID"]

# Check if there are any missing values
if (any(is.na(numeric_table_small_df))) {
  cat("Warning: There are missing values in the numeric columns. Consider handling missing values before calculations.\n")
} else {
  # Calculate standard deviation and variance for numeric variables
  std_dev <- apply(numeric_table_small_df, 2, sd)
  variance <- apply(numeric_table_small_df, 2, var)

  # Combine results into a data frame
  result_table_small_df <- data.frame(Variable = names(numeric_table_small_df), Standard_Deviation = std_dev, Variance = variance)

 # Print the result as a nicely formatted table
  result_table <- kable(result_table_small_df, "html") %>%
    kable_styling()

  # Display the table
  print(result_table)
}
## <table class="table" style="margin-left: auto; margin-right: auto;">
##  <thead>
##   <tr>
##    <th style="text-align:left;">   </th>
##    <th style="text-align:left;"> Variable </th>
##    <th style="text-align:right;"> Standard_Deviation </th>
##    <th style="text-align:right;"> Variance </th>
##   </tr>
##  </thead>
## <tbody>
##   <tr>
##    <td style="text-align:left;"> Units.Sold </td>
##    <td style="text-align:left;"> Units.Sold </td>
##    <td style="text-align:right;"> 2794.4846 </td>
##    <td style="text-align:right;"> 7.809144e+06 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Unit.Price </td>
##    <td style="text-align:left;"> Unit.Price </td>
##    <td style="text-align:right;"> 235.5922 </td>
##    <td style="text-align:right;"> 5.550370e+04 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Unit.Cost </td>
##    <td style="text-align:left;"> Unit.Cost </td>
##    <td style="text-align:right;"> 188.2082 </td>
##    <td style="text-align:right;"> 3.542232e+04 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Total.Revenue </td>
##    <td style="text-align:left;"> Total.Revenue </td>
##    <td style="text-align:right;"> 1460028.7068 </td>
##    <td style="text-align:right;"> 2.131684e+12 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Total.Cost </td>
##    <td style="text-align:left;"> Total.Cost </td>
##    <td style="text-align:right;"> 1083938.2522 </td>
##    <td style="text-align:right;"> 1.174922e+12 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Total.Profit </td>
##    <td style="text-align:left;"> Total.Profit </td>
##    <td style="text-align:right;"> 438537.9071 </td>
##    <td style="text-align:right;"> 1.923155e+11 </td>
##   </tr>
## </tbody>
## </table>

Examining the ‘Units Sold’ metric, we observe a considerable range in the quantity of items sold. This variance suggests a diverse sales landscape, possibly influenced by factors such as product popularity, market demand, or seasonal fluctuations.

For ‘Unit Price,’ the relative stability in its values indicates a consistent pricing strategy across different products or time periods. This uniformity may be a deliberate choice or could signal a need to reassess pricing strategies.

‘Unit Cost’ displays a moderate level of variability. This metric, representing the cost incurred for each unit, could be influenced by factors like raw material prices, production efficiency, or supplier negotiations.

Shifting our focus to broader financial indicators, ‘Total Revenue’ showcases substantial fluctuations. This variability in overall income suggests that our business experiences diverse sales performances, potentially tied to external market dynamics or internal factors.

‘Total Cost’ also demonstrates notable variations, encompassing various expenses associated with our business operations. Understanding these fluctuations is vital for effective cost management and resource allocation.

The ‘Total Profit’ metric, reflecting the net financial outcome, displays variations that warrant further investigation. Unraveling the factors behind these profit swings will be pivotal in devising strategies to enhance overall business performance.
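
Because these variables sit on very different scales, their raw standard deviations are not directly comparable. The coefficient of variation (standard deviation divided by the mean) puts them on a common footing; a small sketch using the numeric_table_small_df built above:

# Coefficient of variation: dispersion relative to the mean
cv <- sapply(numeric_table_small_df, function(x) sd(x) / mean(x))
round(sort(cv, decreasing = TRUE), 2)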

Relationships between numeric variables using a correlation matrix.

See the visualization below for the correlation.
# Identify numeric columns
numeric_columns <- sapply(table_small_df, is.numeric)

# Correlation matrix for numeric columns
cor_matrix <- cor(table_small_df[, numeric_columns])
cor_matrix
##                 Order.ID  Units.Sold  Unit.Price   Unit.Cost Total.Revenue
## Order.ID       1.0000000 -0.22290682 -0.19094121 -0.21320058    -0.3146876
## Units.Sold    -0.2229068  1.00000000 -0.07048559 -0.09223245     0.4477845
## Unit.Price    -0.1909412 -0.07048559  1.00000000  0.98726981     0.7523596
## Unit.Cost     -0.2132006 -0.09223245  0.98726981  1.00000000     0.7156226
## Total.Revenue -0.3146876  0.44778449  0.75235960  0.71562263     1.0000000
## Total.Cost    -0.3289444  0.37474592  0.78790543  0.77489522     0.9839277
## Total.Profit  -0.2346378  0.56455046  0.55736525  0.46721391     0.8973269
##               Total.Cost Total.Profit
## Order.ID      -0.3289444   -0.2346378
## Units.Sold     0.3747459    0.5645505
## Unit.Price     0.7879054    0.5573652
## Unit.Cost      0.7748952    0.4672139
## Total.Revenue  0.9839277    0.8973269
## Total.Cost     1.0000000    0.8040911
## Total.Profit   0.8040911    1.0000000

The summarized correlation matrix indicates the relationships between various pairs of variables in the dataset. Here are some key observations:

“Units Sold” and “Total Profit” have a moderate positive correlation of 0.5645, suggesting that higher unit sales tend to be associated with increased total profits.

“Unit Price” and “Unit Cost” exhibit a strong positive correlation of 0.9873, indicating a nearly linear relationship between these two variables.

“Total Revenue” and “Total Cost” have a high positive correlation of 0.9839, implying that these variables are strongly related, as expected.

“Order ID” has a relatively weak negative correlation with most variables, indicating limited linear relationships with other features in the dataset.

The correlation matrix provides valuable insights into the linear associations between different pairs of variables. Positive correlations suggest a positive relationship, while negative correlations indicate an inverse relationship. The strength of correlation is indicated by the magnitude of the correlation coefficient, with values closer to 1 or -1 indicating stronger associations.
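
To pull the strongest associations out of the matrix programmatically, one can scan its upper triangle for large coefficients; a minimal sketch (the 0.7 cutoff is an arbitrary illustrative choice):

# List variable pairs whose absolute correlation exceeds a threshold
high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(Var1 = rownames(cor_matrix)[high_cor[, 1]],
           Var2 = colnames(cor_matrix)[high_cor[, 2]],
           Correlation = round(cor_matrix[high_cor], 3))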

Visualization

Histograms to visualize the distribution of numeric variables.

# Identify numeric columns
numeric_columns <- sapply(table_small_df, is.numeric)

# Create histograms for numeric columns with blue color
for (col in names(table_small_df)[numeric_columns]) {
  hist(table_small_df[[col]], main = col, xlab = col, col = "blue")
}

For both “Order ID” and “Units Sold,” the histograms show a roughly uniform distribution: values are spread relatively evenly across the range, with no clear tendency to cluster around specific points. The histogram for “Unit Price” is right-skewed, with a tail extending to the right: most values are lower, while a few very high values pull the distribution rightward. The histograms for “Total Revenue,” “Total Cost,” and “Total Profit” are likewise right-skewed, indicating that these variables have relatively few high values alongside a concentration of lower ones.
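
The visual impression of skewness can be checked numerically. Pearson's second skewness coefficient, 3 * (mean - median) / sd, is a quick rule of thumb, with positive values indicating a right skew; a minimal sketch using the numeric_columns vector defined above:

# Rough skewness check: positive values suggest right-skewed distributions
skew <- sapply(table_small_df[numeric_columns],
               function(x) 3 * (mean(x) - median(x)) / sd(x))
round(skew, 2)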

Correlation between variables.

# Create a full correlation plot with a lighter color palette
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.col = "black", tl.srt = 45, col = colorRampPalette(brewer.pal(9, "YlGn"))(100))

The correlation coefficients in the provided dataset offer valuable insights into the relationships between various key variables. Notably, a correlation coefficient of 0.99 between Unit Cost and Unit Price indicates an exceptionally strong positive correlation. This implies that as the cost of producing a unit increases, there is a proportional increase in its selling price. This tight relationship suggests a direct link between production costs and pricing.

Moving on to Total Cost and Total Revenue, a correlation coefficient of 0.98 suggests a robust positive correlation. This implies that an increase in the total costs incurred in production corresponds to a proportional increase in the total revenue generated. This strong positive correlation underscores the interconnectedness of production costs and overall revenue.

The correlation coefficient of 0.90 between Total Profit and Total Revenue indicates a strong positive correlation, albeit slightly less intense than the previous examples. This suggests that as total profits increase, there is a tendency for total revenue to increase as well. While not as extreme, this correlation emphasizes a meaningful relationship between overall profitability and revenue.

Individual Scatter Plots to explore relationships for each pair of variables.

# Assuming your dataframe is named 'table_small_df'
# Identify numeric columns
numeric_columns <- sapply(table_small_df, is.numeric)

# Create individual scatterplots for pairs of numeric columns using plot
for (col1 in names(table_small_df)[numeric_columns]) {
  for (col2 in names(table_small_df)[numeric_columns]) {
    if (col1 != col2) {
      plot(table_small_df[[col1]], table_small_df[[col2]], main = paste("Scatterplot of", col1, "vs", col2), col = "blue")
    }
  }
}

A scatter plot with widely scattered points suggests no strong linear relationship between the variables. The randomness of the distribution implies no structured or predictable connection. In statistical terms, an even scatter across the range of one variable can also indicate homoscedasticity, that is, consistent variability across levels. While the lack of a clear pattern hints at independence or an absence of linear correlation, further analysis is needed for a comprehensive understanding of the data dynamics.

A scatter plot showing points clustered at the top and bottom with a straight line in the middle suggests a potential linear relationship between the variables. This pattern indicates a positive correlation, implying that as one variable increases, the other tends to increase, and vice versa. Further statistical analysis, such as calculating correlation coefficients, is needed to quantify and confirm the strength of this relationship.

When scatter plots display a pattern with outliers concentrated at the top and the majority of data points clustered towards the bottom, it implies potential non-linear relationships or the presence of influential data points. These outliers can strongly affect correlation or regression analyses, necessitating careful investigation. Understanding the nature of these outliers is crucial, as they might indicate unique characteristics or anomalies in the dataset. Further analysis, such as examining residuals and exploring alternative modeling techniques, may be needed to accurately capture underlying patterns in the data.

A diagonal line in a scatter plot from the bottom-left to the top-right indicates a positive linear relationship between the variables being plotted. This suggests that as one variable increases, the other also tends to increase. The steeper the slope, the stronger the positive correlation. Further analysis, such as calculating correlation coefficients and conducting regression analysis, can provide a more quantitative understanding of the relationship.

A scatter plot displaying a straight vertical line indicates that one variable stays essentially constant while the other varies. In that case the correlation coefficient is undefined rather than +1, because correlation requires variation in both variables; in real-world data such a pattern usually signals a degenerate or near-constant variable rather than a meaningful relationship.

Likewise, a scatter plot with points aligned horizontally means the vertical-axis variable is essentially constant regardless of changes in the other variable. This is not a correlation of -1 but another case where the correlation coefficient is undefined (or effectively zero), since one of the variables does not vary. As with other idealized scenarios, real-world data will typically show at least some variation due to external factors or measurement error.
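
To make one of these patterns concrete, a single scatter plot with a fitted least-squares line shows the trend directly; Total.Revenue versus Total.Profit is chosen here purely for illustration:

# One focused scatter plot with a fitted regression line
plot(table_small_df$Total.Revenue, table_small_df$Total.Profit,
     xlab = "Total Revenue", ylab = "Total Profit",
     main = "Total Profit vs Total Revenue", col = "blue")
abline(lm(Total.Profit ~ Total.Revenue, data = table_small_df), lwd = 2)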

Frequency of each categorical variable:

for (column_name in names(table_small_df)) {
  if (!is.numeric(table_small_df[[column_name]])) {
    barplot(table(table_small_df[[column_name]]), main = paste("Bar Plot of", column_name))
  }
}

Boxplots for all numeric columns.

From the above summary of your dataset, it appears that the numeric variables (Units.Sold, Unit.Price, Unit.Cost, Total.Revenue, Total.Cost, Total.Profit) have a wide range of values with varying scales. The Units.Sold variable, for instance, has a minimum value of 124 and a maximum of 9,925, while the Total.Revenue variable ranges from 4,870 to 5,997,055.

Given the diverse scales of these variables, a boxplot of all of them at once would let some dominate the plot, making it hard to see the details of the others. I therefore scale the variables to bring them to a similar range for better visualization.

# Select the numeric variables from table_small_df
scaled_df <- table_small_df[, c("Units.Sold", "Unit.Price", "Unit.Cost", "Total.Revenue", "Total.Cost", "Total.Profit")]

# Scale the numeric variables
scaled_df <- scale(scaled_df)

# Create a boxplot
boxplot(scaled_df, col = "blue", main = "Boxplot of Scaled Numeric Variables", las = 2)

The boxplot analysis reveals right-skewed distributions: most observations are concentrated towards the lower end of each scaled axis, with long upper tails. Additionally, the presence of outliers highlights extreme values that deviate significantly from the general trend, adding complexity to the dataset.
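
The outliers visible in the boxplot can be counted with the standard 1.5 * IQR rule; a small sketch applied to the unscaled numeric columns:

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per numeric column
outlier_counts <- sapply(numeric_table_small_df, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})
outlier_counts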

Large Dataset

Content of the Tables:

Head of the Dataframe:

# head() displays the first few rows of a dataframe, giving you a quick look at the data.
head(table_large_df)
##                         Region                          Country       Item.Type
## 1           Sub-Saharan Africa                             Chad Office Supplies
## 2                       Europe                           Latvia       Beverages
## 3 Middle East and North Africa                         Pakistan      Vegetables
## 4           Sub-Saharan Africa Democratic Republic of the Congo       Household
## 5                       Europe                   Czech Republic       Beverages
## 6           Sub-Saharan Africa                     South Africa       Beverages
##   Sales.Channel Order.Priority Order.Date  Order.ID Ship.Date Units.Sold
## 1        Online              L  1/27/2011 292494523 2/12/2011       4484
## 2        Online              C 12/28/2015 361825549 1/23/2016       1075
## 3       Offline              C  1/13/2011 141515767  2/1/2011       6515
## 4        Online              C  9/11/2012 500364005 10/6/2012       7683
## 5        Online              C 10/27/2015 127481591 12/5/2015       3491
## 6       Offline              H  7/10/2012 482292354 8/21/2012       9880
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     651.21    524.96    2920025.64 2353920.64    566105.00
## 2      47.45     31.79      51008.75   34174.25     16834.50
## 3     154.06     90.93    1003700.90  592408.95    411291.95
## 4     668.27    502.54    5134318.41 3861014.82   1273303.59
## 5      47.45     31.79     165647.95  110978.89     54669.06
## 6      47.45     31.79     468806.00  314085.20    154720.80

Summary Statistics:

# summary() reports the central tendency, dispersion, and distribution of the data: for numeric variables it gives the minimum, first quartile, median, mean, third quartile, and maximum; for character variables it reports the length and class.

summary(table_large_df)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:10000       Length:10000       Min.   :100089156   Length:10000      
##  Class :character   Class :character   1st Qu.:321806669   Class :character  
##  Mode  :character   Mode  :character   Median :548566305   Mode  :character  
##                                        Mean   :549871874                     
##                                        3rd Qu.:775998104                     
##                                        Max.   :999934232                     
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    2   Min.   :  9.33   Min.   :  6.92   Min.   :    168  
##  1st Qu.: 2531   1st Qu.:109.28   1st Qu.: 56.67   1st Qu.: 288551  
##  Median : 4962   Median :205.70   Median :117.11   Median : 800051  
##  Mean   : 5003   Mean   :268.14   Mean   :188.81   Mean   :1333355  
##  3rd Qu.: 7472   3rd Qu.:437.20   3rd Qu.:364.69   3rd Qu.:1819143  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6680027  
##    Total.Cost       Total.Profit      
##  Min.   :    125   Min.   :     43.4  
##  1st Qu.: 164786   1st Qu.:  98329.1  
##  Median : 481606   Median : 289099.0  
##  Mean   : 938266   Mean   : 395089.3  
##  3rd Qu.:1183822   3rd Qu.: 566422.7  
##  Max.   :5241726   Max.   :1738178.4

The dataset contains 10,000 observations for each variable, and the variables have different types (character, integer, numeric). The descriptive statistics give insights into the distribution and central tendency of the numeric variables.

Structure of Tables:

# The str() function provides an overview of the structure of an R object, showing the data type and structure of each variable (column) in the dataset.
str(table_large_df)
## 'data.frame':    10000 obs. of  14 variables:
##  $ Region        : chr  "Sub-Saharan Africa" "Europe" "Middle East and North Africa" "Sub-Saharan Africa" ...
##  $ Country       : chr  "Chad" "Latvia" "Pakistan" "Democratic Republic of the Congo" ...
##  $ Item.Type     : chr  "Office Supplies" "Beverages" "Vegetables" "Household" ...
##  $ Sales.Channel : chr  "Online" "Online" "Offline" "Online" ...
##  $ Order.Priority: chr  "L" "C" "C" "C" ...
##  $ Order.Date    : chr  "1/27/2011" "12/28/2015" "1/13/2011" "9/11/2012" ...
##  $ Order.ID      : int  292494523 361825549 141515767 500364005 127481591 482292354 844532620 564251220 411809480 327881228 ...
##  $ Ship.Date     : chr  "2/12/2011" "1/23/2016" "2/1/2011" "10/6/2012" ...
##  $ Units.Sold    : int  4484 1075 6515 7683 3491 9880 4825 3330 2431 6197 ...
##  $ Unit.Price    : num  651.2 47.5 154.1 668.3 47.5 ...
##  $ Unit.Cost     : num  525 31.8 90.9 502.5 31.8 ...
##  $ Total.Revenue : num  2920026 51009 1003701 5134318 165648 ...
##  $ Total.Cost    : num  2353921 34174 592409 3861015 110979 ...
##  $ Total.Profit  : num  566105 16835 411292 1273304 54669 ...

Again, the dataset consists of 10,000 observations and 14 variables. The variables include categorical features such as Region, Country, Item Type, Sales Channel, Order Priority, Order Date, and Ship Date. Additionally, there are numerical features, including Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit.

The categorical features provide information about the geographic and logistical aspects of each transaction, while the numerical features quantify the financial details, such as sales quantities, pricing, and associated costs. Notably, the dataset covers a diverse range of regions, countries, and product types, making it suitable for exploring sales trends and financial performance.

Dataset Characteristics (Structure, Size, Dependencies, Labels, etc.):

Number of Rows and Columns:

# nrow() and ncol() give the number of rows and columns, respectively.
number_of_rows <- nrow(table_large_df)
print(paste("Number of rows:", number_of_rows))
## [1] "Number of rows: 10000"
number_of_columns <- ncol(table_large_df)
print(paste("Number of rows:", number_of_columns))
## [1] "Number of rows: 14"
# names() shows the variable (column) names.
names(table_large_df)
##  [1] "Region"         "Country"        "Item.Type"      "Sales.Channel" 
##  [5] "Order.Priority" "Order.Date"     "Order.ID"       "Ship.Date"     
##  [9] "Units.Sold"     "Unit.Price"     "Unit.Cost"      "Total.Revenue" 
## [13] "Total.Cost"     "Total.Profit"

Number of NA’s:

# Get the total number of NA's in the entire dataframe
total_na_count <- sum(is.na(table_large_df))
print("Total number of NA's in the entire dataframe:")
## [1] "Total number of NA's in the entire dataframe:"
print(total_na_count)
## [1] 0
# Get the number of NA's in each column
na_per_column <- colSums(is.na(table_large_df))
print("Number of NA's in each column:")
## [1] "Number of NA's in each column:"
print(na_per_column)
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##              0              0              0              0              0 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##              0              0              0              0              0 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##              0              0              0              0

There are no NA’s in this dataset.

Standard Deviation and Variance:

# Exclude "Order.ID" and non-numeric variables
numeric_table_large_df <- table_large_df[sapply(table_large_df, is.numeric) & colnames(table_large_df) != "Order.ID"]

# Check if there are any missing values
if (any(is.na(numeric_table_large_df))) {
  cat("Warning: There are missing values in the numeric columns. Consider handling missing values before calculations.\n")
} else {
  # Calculate standard deviation and variance for numeric variables
  std_dev <- apply(numeric_table_large_df, 2, sd)
  variance <- apply(numeric_table_large_df, 2, var)

  # Combine results into a data frame
  result_table_large_df <- data.frame(Variable = names(numeric_table_large_df), Standard_Deviation = std_dev, Variance = variance)

 # Print the result as a nicely formatted table
  result_table <- kable(result_table_large_df, "html") %>%
    kable_styling()

  # Display the table
  print(result_table)
}
## <table class="table" style="margin-left: auto; margin-right: auto;">
##  <thead>
##   <tr>
##    <th style="text-align:left;">   </th>
##    <th style="text-align:left;"> Variable </th>
##    <th style="text-align:right;"> Standard_Deviation </th>
##    <th style="text-align:right;"> Variance </th>
##   </tr>
##  </thead>
## <tbody>
##   <tr>
##    <td style="text-align:left;"> Units.Sold </td>
##    <td style="text-align:left;"> Units.Sold </td>
##    <td style="text-align:right;"> 2873.2465 </td>
##    <td style="text-align:right;"> 8.255545e+06 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Unit.Price </td>
##    <td style="text-align:left;"> Unit.Price </td>
##    <td style="text-align:right;"> 217.9441 </td>
##    <td style="text-align:right;"> 4.749963e+04 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Unit.Cost </td>
##    <td style="text-align:left;"> Unit.Cost </td>
##    <td style="text-align:right;"> 176.4459 </td>
##    <td style="text-align:right;"> 3.113316e+04 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Total.Revenue </td>
##    <td style="text-align:left;"> Total.Revenue </td>
##    <td style="text-align:right;"> 1465026.1739 </td>
##    <td style="text-align:right;"> 2.146302e+12 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Total.Cost </td>
##    <td style="text-align:left;"> Total.Cost </td>
##    <td style="text-align:right;"> 1145914.0694 </td>
##    <td style="text-align:right;"> 1.313119e+12 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> Total.Profit </td>
##    <td style="text-align:left;"> Total.Profit </td>
##    <td style="text-align:right;"> 377554.9607 </td>
##    <td style="text-align:right;"> 1.425477e+11 </td>
##   </tr>
## </tbody>
## </table>

The dataset reveals interesting insights into key variables, each with its unique characteristics. ‘Units Sold’ displays considerable variability, with a standard deviation of 2873.25 and a variance of 8.26 million, suggesting diverse sales patterns. ‘Unit Price’ exhibits lower variability, indicated by a standard deviation of 217.94 and a variance of 47,499.63, reflecting a more stable pricing strategy.

Similarly, ‘Unit Cost’ showcases moderate variability, with a standard deviation of 176.45 and a variance of 31,133.16, hinting at factors impacting production costs. The financial indicators ‘Total Revenue’ and ‘Total Cost’ demonstrate significant fluctuations, with standard deviations of 1,465,026.17 and 1,145,914.07, respectively, emphasizing the complexity of overall sales and operational expenses.

Lastly, ‘Total Profit’ displays notable variability, indicated by a standard deviation of 377,554.96 and a variance of roughly 1.43e+11, suggesting potential areas for further investigation to optimize overall financial outcomes. This summary provides a snapshot of the dataset’s numerical features, paving the way for deeper analysis and strategic decision-making.

Relationships between numeric variables using a correlation matrix.

See the visualization below for the correlation.
# Identify numeric columns (recomputed for the large dataset)
numeric_columns <- sapply(table_large_df, is.numeric)

# Correlation matrix for numeric columns
cor_matrix <- cor(table_large_df[, numeric_columns])
cor_matrix
##                   Order.ID  Units.Sold  Unit.Price   Unit.Cost Total.Revenue
## Order.ID       1.000000000 -0.01973240  0.01217491  0.01044871  -0.009288123
## Units.Sold    -0.019732401  1.00000000 -0.01297840 -0.01244102   0.518615102
## Unit.Price     0.012174909 -0.01297840  1.00000000  0.98632386   0.733225977
## Unit.Cost      0.010448711 -0.01244102  0.98632386  1.00000000   0.723267379
## Total.Revenue -0.009288123  0.51861510  0.73322598  0.72326738   1.000000000
## Total.Cost    -0.009530296  0.46617752  0.74924307  0.75983568   0.987873673
## Total.Profit  -0.007115369  0.59749001  0.57111439  0.50032251   0.882011543
##                 Total.Cost Total.Profit
## Order.ID      -0.009530296 -0.007115369
## Units.Sold     0.466177522  0.597490009
## Unit.Price     0.749243073  0.571114386
## Unit.Cost      0.759835678  0.500322512
## Total.Revenue  0.987873673  0.882011543
## Total.Cost     1.000000000  0.798153247
## Total.Profit   0.798153247  1.000000000

Visualization

Histograms to visualize the distribution of numeric variables.

# Identify numeric columns
numeric_columns <- sapply(table_large_df, is.numeric)

# Create histograms for numeric columns with blue color
for (col in names(table_large_df)[numeric_columns]) {
  hist(table_large_df[[col]], main = col, xlab = col, col = "blue")
}

The histograms of the numeric variables show distributions similar to those observed for the small dataset.

Correlation between variables.

# Create a full correlation plot with a lighter color palette
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.col = "black", tl.srt = 45, col = colorRampPalette(brewer.pal(9, "YlGn"))(100))

Similar to the small data set the correlation coefficients in the provided dataset offer valuable insights into the relationships between various key variables. Notably, a correlation coefficient of 0.99 between Unit Cost and Unit Price indicates an exceptionally strong positive correlation. This implies that as the cost of producing a unit increases, there is a proportional increase in its selling price. This tight relationship suggests a direct link between production costs and pricing.

Moving on to Total Cost and Total Revenue, the correlation coefficient here is marginally higher than in the small dataset (0.988 versus 0.984) and again indicates a robust positive correlation. This implies that an increase in the total costs incurred in production corresponds to a proportional increase in the total revenue generated, underscoring the interconnectedness of production costs and overall revenue.

The correlation coefficient of 0.88 between Total Profit and Total Revenue is slightly weaker than the 0.90 observed in the small dataset, though still a strong positive correlation. This suggests that as total profits increase, total revenue tends to increase as well. While not as extreme as the previous pairs, this correlation emphasizes a meaningful relationship between overall profitability and revenue.
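
Since cor_matrix was overwritten when the large dataset was processed, comparing the two datasets' correlations side by side requires recomputing both matrices; a minimal sketch (cor_small and cor_large are illustrative names):

# Recompute both correlation matrices and inspect how much they differ
cor_small <- cor(table_small_df[sapply(table_small_df, is.numeric)])
cor_large <- cor(table_large_df[sapply(table_large_df, is.numeric)])
round(cor_large - cor_small, 3)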

Individual Scatter Plots to explore relationships for each pair of variables.

# Assuming your dataframe is named 'table_large_df'
# Identify numeric columns
numeric_columns <- sapply(table_large_df, is.numeric)

# Create individual scatterplots for pairs of numeric columns using plot
for (col1 in names(table_large_df)[numeric_columns]) {
  for (col2 in names(table_large_df)[numeric_columns]) {
    if (col1 != col2) {
      plot(table_large_df[[col1]], table_large_df[[col2]], main = paste("Scatterplot of", col1, "vs", col2), col = "blue")
    }
  }
}

The scatter plots for the large dataset show the same broad patterns as those for the small dataset, and the interpretations given there (widely scattered points, clustered bands, outlier-dominated panels, and diagonal trends) carry over directly.

Frequency of each categorical variable:

for (column_name in names(table_large_df)) {
  if (!is.numeric(table_large_df[[column_name]])) {
    barplot(table(table_large_df[[column_name]]), main = paste("Bar Plot of", column_name))
  }
}

Boxplots for all numeric columns.

From the above summary of your dataset, it appears that the numeric variables (Units.Sold, Unit.Price, Unit.Cost, Total.Revenue, Total.Cost, Total.Profit) have a wide range of values with varying scales. The Units.Sold variable, for instance, has a minimum value of 2 and a maximum of 10,000, while the Total.Revenue variable ranges from 168 to 6,680,027.

Given the diverse scales of these variables, a boxplot of all of them at once would let some dominate the plot, making it hard to see the details of the others. I therefore scale the variables to bring them to a similar range for better visualization.

# Select the numeric variables from table_large_df
scaled_df <- table_large_df[, c("Units.Sold", "Unit.Price", "Unit.Cost", "Total.Revenue", "Total.Cost", "Total.Profit")]

# Scale the numeric variables
scaled_df <- scale(scaled_df)

# Create a boxplot
boxplot(scaled_df, col = "blue", main = "Boxplot of Scaled Numeric Variables", las = 2)

As with the small dataset, the boxplots reveal right-skewed distributions, with most observations concentrated towards the lower end and long upper tails. Additionally, the large number of outliers in Total Revenue, Total Cost, and Total Profit highlights extreme values that deviate markedly from the general trend, adding considerable complexity to the dataset.

Machine Learning Algorithms

Large Dataset

Decision Tree

set.seed(123)
# rpart.control() on its own only returns (and prints) the control parameters;
# to actually apply them, pass the result to rpart() via its 'control' argument.
rpart.control(maxdepth = 30)
## $minsplit
## [1] 20
## 
## $minbucket
## [1] 7
## 
## $cp
## [1] 0.01
## 
## $maxcompete
## [1] 4
## 
## $maxsurrogate
## [1] 5
## 
## $usesurrogate
## [1] 2
## 
## $surrogatestyle
## [1] 0
## 
## $maxdepth
## [1] 30
## 
## $xval
## [1] 10
set.seed(42)
# Assuming your large dataset is named "table_large_df"
train_indices <- sample(1:nrow(table_large_df), 0.8 * nrow(table_large_df))
train_data <- table_large_df[train_indices, ]
test_data <- table_large_df[-train_indices, ]

# Remove unnecessary variables
train_data <- train_data[, !(names(train_data) %in% c("Order.ID", "Order.Date", "Ship.Date"))]
test_data <- test_data[, !(names(test_data) %in% c("Order.ID", "Order.Date", "Ship.Date"))]

fit <- rpart( Total.Profit ~ Item.Type + Region + Sales.Channel + Order.Priority + Country + Units.Sold, method = "anova", data = train_data)

# detailed summary of splits
#summary(fit) 
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # visualize cross-validation results 
## 
## Regression tree:
## rpart(formula = Total.Profit ~ Item.Type + Region + Sales.Channel + 
##     Order.Priority + Country + Units.Sold, data = train_data, 
##     method = "anova")
## 
## Variables actually used in tree construction:
## [1] Item.Type  Units.Sold
## 
## Root node error: 1.1362e+15/8000 = 1.4202e+11
## 
## n= 8000 
## 
##         CP nsplit rel error   xerror      xstd
## 1 0.331177      0  1.000000 1.000433 0.0202950
## 2 0.271669      1  0.668823 0.669623 0.0112266
## 3 0.104528      2  0.397154 0.398184 0.0055617
## 4 0.034684      4  0.188099 0.189715 0.0035696
## 5 0.033076      5  0.153415 0.154213 0.0031046
## 6 0.018992      6  0.120339 0.122592 0.0020515
## 7 0.014891      7  0.101347 0.103165 0.0016952
## 8 0.013673      8  0.086456 0.093172 0.0016603
## 9 0.010000      9  0.072783 0.074309 0.0011999

The regression tree used just two variables, Item.Type and Units.Sold; the tree-building algorithm found these the most relevant predictors of Total.Profit. The root node error measures the error before any splits are made: the sum of squared errors divided by the number of observations (8000) gives approximately 1.42e+11.

The columns of the CP table read as follows. CP (complexity parameter) controls tree complexity: smaller values allow more splits and hence more complex trees. nsplit is the number of splits. rel error is the relative training error of the tree after that many splits. xerror is the cross-validated error rate, an estimate of the model's prediction error on unseen data, and xstd is the standard deviation of that cross-validated error.

These values show how the cross-validated error changes as the tree grows. The goal is usually to identify a level of complexity (number of splits) that minimizes the cross-validated error, ensuring good generalization to new data. In summary, the tree was built with Item.Type and Units.Sold as key predictors, and the cross-validated error rates describe its performance at different levels of complexity, helping balance simplicity against predictive accuracy.
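
A common follow-up is to prune the tree at the complexity parameter with the lowest cross-validated error, using the cptable stored in the fitted object; a minimal sketch:

# Prune at the CP value that minimizes the cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned_fit <- prune(fit, cp = best_cp)
prp(pruned_fit, main = "Pruned Regression Tree for Total Profit", extra = 101)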

# plot tree
plot(fit, uniform=TRUE,
   main="Regression Tree for Total profit ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# Plot the tree in a more readable form with rpart.plot's prp()
prp(fit, main = "Regression Tree for Total profit", extra = 101)

# Make predictions on the test set
predictions <- predict(fit, test_data)

# Extract the actual Total.Profit values from the test set
actual_values <- test_data$Total.Profit

# Calculate Mean Squared Error
mse <- mean((predictions - actual_values)^2)

# Print the MSE
cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 10369206128

The Mean Squared Error (MSE) of a decision tree regression model is the average squared difference between the actual and predicted values. The raw figure of 10,369,206,128 looks enormous, but it is driven by the scale of the target: taking the square root gives a root mean squared error (RMSE) of roughly 101,830, which should be judged against Total.Profit values that average about 395,089. On average, then, the tree's predictions deviate from the actual profits by on the order of a hundred thousand.
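
A useful yardstick is the MSE of a naive model that always predicts the training-set mean profit; the tree is only worth its complexity if it beats this baseline by a wide margin. A small sketch reusing the objects above:

# Baseline: always predict the mean Total.Profit from the training data
baseline_pred <- mean(train_data$Total.Profit)
baseline_mse <- mean((baseline_pred - actual_values)^2)
cat("Baseline MSE:", baseline_mse, "\n")
cat("Tree MSE as a fraction of baseline:", mse / baseline_mse, "\n")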

KNN Model

set.seed(42)

# Split the dataset into training and testing sets
train_indices <- sample(1:nrow(table_large_df), 0.8 * nrow(table_large_df))
train_data <- table_large_df[train_indices, ]
test_data <- table_large_df[-train_indices, ]

# Select relevant predictor variables (assuming all numeric)
predictors <- train_data[, c("Units.Sold", "Unit.Price", "Unit.Cost", "Total.Revenue", "Total.Cost")]

# Standardize the predictors (optional but often recommended)
scaled_predictors <- scale(predictors)

# Select the target variable
target <- train_data$Total.Profit

# Build the KNN "model" (class::knn is a classifier; the continuous
# Total.Profit values are used here as class labels)
knn_model <- knn(train = scaled_predictors, test = scaled_predictors, cl = target, k = 2)

# Note: these predictions are made on the training predictors themselves,
# not on the held-out test_data
test_predictions <- knn(train = scaled_predictors, test = scaled_predictors, cl = target, k = 2)

# Proportion of training observations whose label is reproduced
accuracy <- sum(test_predictions == target) / length(target)

# Print or visualize the results as needed
print(accuracy)
## [1] 0.5125

The output 0.5125 is the proportion of training observations whose label was reproduced by the 2-nearest-neighbour vote. Because class::knn is a classifier and the continuous Total.Profit values were used as class labels, and because the predictions were made on the training predictors rather than the held-out test set, this figure should be read as a rough sanity check rather than a genuine test-set accuracy.
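
Since Total.Profit is continuous, a regression variant of kNN scored on the held-out test set would be the more natural formulation. A hedged sketch, assuming the FNN package (not loaded above) is installed; note the test predictors are scaled with the training-set centering and scaling parameters:

# Regression kNN evaluated on the held-out test set
library(FNN)
pred_cols <- c("Units.Sold", "Unit.Price", "Unit.Cost", "Total.Revenue", "Total.Cost")
train_x <- scale(train_data[, pred_cols])
test_x <- scale(test_data[, pred_cols],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
knn_reg <- knn.reg(train = train_x, test = test_x, y = train_data$Total.Profit, k = 5)

# Test-set MSE, directly comparable with the decision tree's MSE above
knn_mse <- mean((knn_reg$pred - test_data$Total.Profit)^2)
cat("kNN regression MSE:", knn_mse, "\n")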

The analysis of the two models, Decision Tree and KNN, reveals differences in their performance metrics. The Decision Tree model, evaluated with Mean Squared Error (MSE), yielded 10,369,206,128 on the held-out test set, the average squared difference between predicted and actual profits. The KNN model, as run above, produced a training-set label-agreement rate of 0.5125. These two numbers are not directly comparable: MSE is tailored to regression tasks, while the KNN figure is a classification-style accuracy computed on the training data. Re-fitting the KNN as a regression and scoring it with test-set MSE, as sketched above, puts the models on a common footing; further exploration of the models' strengths and weaknesses is still needed to determine their suitability for this dataset and these analysis goals.

In conclusion, this exploratory analysis offers a nuanced perspective on our business metrics. Beyond the numerical values, it provides a narrative of our business journey, highlighting areas of strength, potential concerns, and opportunities for refinement. Armed with these insights, we are better equipped to make informed decisions that will steer our business towards sustained growth and success.

Reference: https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/