Essay

Exploratory Data Analysis and Model Building for Profit Prediction

Exploratory data analysis (EDA) is a crucial step in understanding and preparing data for predictive modeling. In this essay, we explore two datasets: one containing 100 sales records (small data) and the other 100,000 sales records (big data). Our objective is to predict future profits using linear regression, employing techniques such as One-Hot encoding, the Box-Cox transformation, data standardization, and correlation analysis.

Initially, we assessed the summary statistics of numerical data and the frequency distribution of categorical data. Fortunately, no missing information was detected, eliminating the need for imputation. Subsequently, we examined the histograms of both datasets to check for normality and standardized the data where necessary. Outliers were identified using box plots, with a decision made to ignore outliers in the predicting variable.

To ensure the accuracy of profit calculations, we validated that Total Revenue minus Total Cost equaled Total Profit. Surprisingly, discrepancies arose when comparing the values, prompting an investigation into floating-point precision issues. Implementing a tolerance check resolved the inconsistencies, highlighting the importance of meticulous data validation.

Next, we analyzed the correlation between each variable and the predicting variable (Total Profit). Numeric and categorical data were analyzed separately, with a focus on identifying highly correlated variables for model inclusion. Unit Price and Unit Cost exhibited strong correlations with Total Profit, leading to the creation of a composite variable, Unit Comb. Order Date and Ship Date were set aside; they are better suited to time series modeling, which is outside the scope of this analysis.

One-Hot encoding was performed on categorical variables such as Order Priority, Sales Channel, and Region. Despite this transformation, no significant correlation with Total Profit was observed. Consequently, we opted for a Linear Regression model due to the high correlation between Total Profit and numeric variables.

Now, addressing the questions:

  1. The columns of our data exhibited correlation, particularly Total Revenue and Total Cost, as well as Unit Price and Unit Cost.

  2. Yes, the presence of labels (categorical data) influenced our choice of algorithm. A Random Forest model would have been chosen if the categorical variables had shown high correlation with the predicting variable. However, since the informative variables were numeric, a multiple linear regression model was selected.

  3. The small dataset made initial EDA and modeling tests quick and easy, while the larger dataset provided more reliable correlation estimates. Dataset size also affected the choice of algorithm, with the larger dataset considered less prone to overfitting.

  4. The choice of algorithm was directly influenced by the datasets. For instance, the selection of a Linear Regression model was based on the numeric nature of the variables and their correlation with Total Profit.

  5. In making a business decision, the results from the bigger dataset would be trusted due to its higher accuracy and reduced risk of overfitting.

  6. Analyzing too much data can increase computational cost and the likelihood of errors, particularly in complex models prone to overfitting.

  7. The analysis of the two datasets revealed comparable results, with R-squared values close to 82%. However, the bigger dataset was favored for its potential for greater accuracy and reduced overfitting.

In conclusion, thorough EDA and model selection are crucial steps in leveraging data for predictive analytics. By understanding the characteristics of the dataset and employing appropriate techniques, we can extract valuable insights to inform decisions effectively.

Exploratory Data Analysis (EDA)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

We used the Excel sample CSV files below to explore a range of dataset sizes. The small_data file contains 100 sales records; the big_data file contains 100,000 sales records.

https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

small_data = read.csv('https://raw.githubusercontent.com/melbow2424/Data-622-HW1/main/100%20Sales%20Records.csv')
big_data = read.csv('https://raw.githubusercontent.com/melbow2424/Data-622-HW1/main/100000%20Sales%20Records.csv')

Summaries of the two files.

small_data_factor <- small_data %>%
  mutate_if(is.integer, as.factor)%>% 
  mutate_if(is.character, as.factor)
big_data_factor <- big_data %>%
  mutate_if(is.integer, as.factor)%>% 
  mutate_if(is.character, as.factor)
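
As an aside, mutate_if() is superseded in current dplyr releases; a minimal equivalent of the conversion above using across() (assuming dplyr >= 1.0):

# Same integer/character-to-factor conversion with the newer across() verb
small_data_factor <- small_data %>%
  mutate(across(where(is.integer), as.factor),
         across(where(is.character), as.factor))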

Checking for the frequency of the categorical variables.

summary(small_data_factor)
##                                Region                    Country  
##  Asia                             :11   The Gambia           : 4  
##  Australia and Oceania            :11   Australia            : 3  
##  Central America and the Caribbean: 7   Djibouti             : 3  
##  Europe                           :22   Mexico               : 3  
##  Middle East and North Africa     :10   Sao Tome and Principe: 3  
##  North America                    : 3   Sierra Leone         : 3  
##  Sub-Saharan Africa               :36   (Other)              :81  
##            Item.Type  Sales.Channel Order.Priority     Order.Date
##  Clothes        :13   Offline:50    C:22           1/11/2012: 1  
##  Cosmetics      :13   Online :50    H:30           1/13/2017: 1  
##  Office Supplies:12                 L:27           1/14/2017: 1  
##  Fruits         :10                 M:21           1/16/2011: 1  
##  Personal Care  :10                                1/16/2015: 1  
##  Household      : 9                                1/4/2011 : 1  
##  (Other)        :33                                (Other)  :94  
##       Order.ID       Ship.Date    Units.Sold   Unit.Price       Unit.Cost     
##  114606559: 1   11/17/2010: 2   8656   : 2   Min.   :  9.33   Min.   :  6.92  
##  115456712: 1   1/13/2012 : 1   124    : 1   1st Qu.: 81.73   1st Qu.: 35.84  
##  122583663: 1   1/20/2011 : 1   171    : 1   Median :179.88   Median :107.28  
##  135425221: 1   1/21/2011 : 1   273    : 1   Mean   :276.76   Mean   :191.05  
##  142278373: 1   1/23/2017 : 1   282    : 1   3rd Qu.:437.20   3rd Qu.:263.33  
##  158535134: 1   1/28/2014 : 1   522    : 1   Max.   :668.27   Max.   :524.96  
##  (Other)  :94   (Other)   :93   (Other):93                                    
##  Total.Revenue       Total.Cost       Total.Profit    
##  Min.   :   4870   Min.   :   3612   Min.   :   1258  
##  1st Qu.: 268721   1st Qu.: 168868   1st Qu.: 121444  
##  Median : 752314   Median : 363566   Median : 290768  
##  Mean   :1373488   Mean   : 931806   Mean   : 441682  
##  3rd Qu.:2212045   3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :5997055   Max.   :4509794   Max.   :1719922  
## 
summary(big_data_factor)
##                                Region              Country     
##  Asia                             :14547   Sudan       :  623  
##  Australia and Oceania            : 8113   New Zealand :  593  
##  Central America and the Caribbean:10731   Vatican City:  590  
##  Europe                           :25877   Malta       :  589  
##  Middle East and North Africa     :12580   Mozambique  :  589  
##  North America                    : 2133   Cambodia    :  584  
##  Sub-Saharan Africa               :26019   (Other)     :96432  
##            Item.Type     Sales.Channel   Order.Priority      Order.Date   
##  Office Supplies: 8426   Offline:49946   C:24951        11/27/2010:   57  
##  Cereal         : 8421   Online :50054   H:24945        10/3/2010 :   56  
##  Baby Food      : 8407                   L:25016        3/23/2011 :   56  
##  Cosmetics      : 8370                   M:25088        5/22/2017 :   56  
##  Personal Care  : 8364                                  10/22/2016:   55  
##  Meat           : 8320                                  12/6/2016 :   55  
##  (Other)        :49692                                  (Other)   :99665  
##       Order.ID          Ship.Date       Units.Sold      Unit.Price    
##  100008904:    1   10/4/2015 :   61   172    :   23   Min.   :  9.33  
##  100009763:    1   8/4/2013  :   60   1679   :   22   1st Qu.:109.28  
##  100035941:    1   10/26/2010:   59   5955   :   22   Median :205.70  
##  100043666:    1   11/6/2014 :   57   1222   :   21   Mean   :266.70  
##  100050961:    1   12/20/2011:   56   1409   :   21   3rd Qu.:437.20  
##  100051820:    1   12/13/2013:   55   2655   :   21   Max.   :668.27  
##  (Other)  :99994   (Other)   :99652   (Other):99870                   
##    Unit.Cost      Total.Revenue       Total.Cost       Total.Profit      
##  Min.   :  6.92   Min.   :     19   Min.   :     14   Min.   :      4.8  
##  1st Qu.: 56.67   1st Qu.: 279753   1st Qu.: 162928   1st Qu.:  95900.0  
##  Median :117.11   Median : 789892   Median : 467937   Median : 283657.5  
##  Mean   :188.02   Mean   :1336067   Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:364.69   3rd Qu.:1836490   3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :524.96   Max.   :6682700   Max.   :5249075   Max.   :1738700.0  
## 
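
The summaries show no missing values in either file; a quick explicit check (one of several equivalent ways) confirms that no imputation is needed:

# Count NAs per column; all zeros means nothing is missing
colSums(is.na(small_data))
colSums(is.na(big_data))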

Histograms to check the numeric variables for normality, and boxplots to check them for outliers.

# Separate the numeric variables
small_numeric_vars <- select_if(small_data, is.numeric)
big_numeric_vars <- select_if(big_data, is.numeric)
# Convert numeric variables to long format for both datasets
small_numeric_long <- small_numeric_vars %>%
  gather(key = "variable", value = "value")

big_numeric_long <- big_numeric_vars %>%
  gather(key = "variable", value = "value")
# Plot histograms for numeric variables in both datasets
ggplot(small_numeric_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histogram of Numeric Columns - Small Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(small_numeric_long, aes(x = variable, y = value)) +
  geom_boxplot(outlier.color = "red") +
  facet_wrap(~ variable, scales = 'free')+
  labs(title = "Boxplot of Numeric Columns - Small Data")

ggplot(big_numeric_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histogram of Numeric Columns - Big Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(big_numeric_long, aes(x = variable, y = value)) +
  geom_boxplot(outlier.color = "red") +
  facet_wrap(~ variable, scales = 'free')+
  labs(title = "Boxplot of Numeric Columns - Big Data")
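
Note that gather() still works but has been superseded in tidyr; the same reshape with the newer verb would be:

# Equivalent long-format reshape with pivot_longer()
small_numeric_long <- small_numeric_vars %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")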



To forecast the overall profit, it’s essential to ensure that the figures in the Total.Revenue column minus those in the Total.Cost column match the values in the Total.Profit column.

# Add a new column to check if Total.Revenue - Total.Cost equals Total.Profit
small_data_total_profit_check <- small_data %>%
  mutate(matches_total = Total.Revenue - Total.Cost == Total.Profit) %>%
  mutate(initial_value = Total.Revenue - Total.Cost) %>%
  mutate(matches_total_difference = abs(initial_value - Total.Profit))

# Check if all values in matches_total are TRUE
if (all(small_data_total_profit_check$matches_total)) {
  cat("All columns are true.\n")
} else {
  count_data <- small_data_total_profit_check %>%
  group_by(matches_total) %>%
  summarize(count = n())
  
  ggplot(count_data, aes(x = matches_total, y = count, fill = matches_total)) +
  geom_bar(stat = "identity") +
  labs(x = "Matches Total", y = "Count", title = "Counts of Matches Total")
}

# Add a new column to check if Total.Revenue - Total.Cost equals Total.Profit
big_data_total_profit_check <- big_data %>%
  mutate(matches_total = Total.Revenue - Total.Cost == Total.Profit)

# Check if all values in matches_total are TRUE
if (all(big_data_total_profit_check$matches_total)) {
  cat("All columns are true.\n")
} else {
  count_data <- big_data_total_profit_check %>%
  group_by(matches_total) %>%
  summarize(count = n())
  
  ggplot(count_data, aes(x = matches_total, y = count, fill = matches_total)) +
  geom_bar(stat = "identity") +
  labs(x = "Matches Total", y = "Count", title = "Counts of Matches Total")
}



There seems to be an issue here. The Total.Revenue column minus the Total.Cost column should align with the values in the Total.Profit column. However, upon examining the graphs of both the small and large datasets, they aren’t matching perfectly. What could be causing this inconsistency?

In many programming languages, directly comparing floating-point numbers for equality can lead to unexpected outcomes due to precision issues inherent in representing these numbers. This is because floating-point arithmetic can introduce small rounding errors.

To accurately compare floating-point numbers, it’s often advisable to assess if the absolute difference between the two numbers falls within a certain tolerance range.
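
A quick illustration of the issue in R before applying the fix:

# Exact comparison of floating-point results can fail...
0.1 + 0.2 == 0.3
## [1] FALSE
# ...while a tolerance-based comparison behaves as expected
abs((0.1 + 0.2) - 0.3) < 1e-9
## [1] TRUE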

Let’s apply this method to both datasets and investigate further.

# Define a tolerance level
tolerance <- 1e-9

# Add a new column to check if Total.Revenue - Total.Cost is approximately equal to Total.Profit
small_data_check_tolerance <- small_data %>%
  mutate(matches_total = abs(Total.Revenue - Total.Cost - Total.Profit) < tolerance)

# Check if all values in matches_total are TRUE
if (all(small_data_check_tolerance$matches_total)) {
  cat("All rows satisfy the condition.\n")
} else {
  print(small_data_check_tolerance$matches_total)
}
## All rows satisfy the condition.
# Add a new column to check if Total.Revenue - Total.Cost is approximately equal to Total.Profit
big_data_check_tolerance <- big_data %>%
  mutate(matches_total = abs(Total.Revenue - Total.Cost - Total.Profit) < tolerance)

# Check if all values in matches_total are TRUE
if (all(big_data_check_tolerance$matches_total)) {
  cat("All rows satisfy the condition.\n")
} else {
  print(big_data_check_tolerance$matches_total)
}
## All rows satisfy the condition.

Now that we’ve established that Total.Revenue - Total.Cost = Total.Profit holds (within tolerance), let’s examine the correlations between the numeric variables and Total.Profit. We’ll address the categorical variables later.

small_data_cor <- small_data%>%
  select(Total.Revenue, Total.Cost, Total.Profit, Units.Sold,Unit.Price, Unit.Cost)

cor_matrix <- cor(small_data_cor)

corrplot(cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")

big_data_cor <- big_data%>%
  select(Total.Revenue, Total.Cost, Total.Profit, Units.Sold,Unit.Price, Unit.Cost)

big_cor_matrix <- cor(big_data_cor)

corrplot(big_cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")



Given that we’ve confirmed Total.Revenue - Total.Cost = Total.Profit, we can eliminate Total.Revenue and Total.Cost, since together they determine Total.Profit exactly. Additionally, we merge the Unit.Price and Unit.Cost variables, as they are highly correlated with each other.

small_data_total <- small_data %>%
  select(-Total.Revenue, -Total.Cost)%>%
  mutate(Unit.Comb = Unit.Price + Unit.Cost)%>%
  select(-Unit.Price, -Unit.Cost)

small_data_total_cor_matrix<- small_data_total%>%
  select(Total.Profit, Units.Sold,Unit.Comb, Order.ID)

small_data_total_cor_matrix <- cor(small_data_total_cor_matrix)

corrplot(small_data_total_cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")

big_data_total <- big_data %>%
  select(-Total.Revenue, -Total.Cost)%>%
  mutate(Unit.Comb = Unit.Price + Unit.Cost)%>%
  select(-Unit.Price, -Unit.Cost)

big_data_total_cor_matrix<- big_data_total%>%
  select(Total.Profit, Units.Sold,Unit.Comb, Order.ID)

big_data_total_cor_matrix <- cor(big_data_total_cor_matrix)

corrplot(big_data_total_cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")



For now, let’s exclude Order.Date and Ship.Date from the analysis. They could support a time series analysis of Total.Profit, but since time series modeling wasn’t covered in class, I’ll refrain from using that kind of algorithm. Order.ID is merely an identifier; it is kept in the correlation matrices only as a sanity check and, as expected, shows essentially no correlation with Total.Profit.
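
Should we return to the dates later for time series work, lubridate (loaded with the tidyverse) can parse the m/d/y strings; a minimal sketch, not used in the modeling below:

# Hypothetical date parsing for a future time series analysis
small_dates <- small_data %>%
  mutate(Order.Date = mdy(Order.Date),
         Ship.Date  = mdy(Ship.Date))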

Next, we perform One-Hot encoding on Order.Priority, Sales.Channel, and Region. I’ve chosen to exclude Country and Item.Type because of the large number of levels in those columns; I initially attempted to encode them with the small dataset, but it consistently crashed my RStudio session. We also split the correlation matrix so we can investigate whether any of the encoded variables are correlated with Total.Profit.

# Separate categorical and non-categorical variables
small_categorical_data <- small_data_total[, c( "Order.Priority", "Sales.Channel")]

small_continuous_data <- small_data_total[, c("Units.Sold", "Unit.Comb", "Total.Profit", "Order.ID")]

# Perform one-hot encoding for categorical data
small_encoded_categorical_data <- model.matrix(~ . - 1, data = small_categorical_data)
# Combine encoded categorical data with continuous data
small_combined_data <- cbind(small_encoded_categorical_data , small_continuous_data)

small_combined_data_cor_matrix <- cor(small_combined_data)

corrplot(small_combined_data_cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")

# Separate categorical and non-categorical variables
big_categorical_data <- big_data_total[, c( "Order.Priority", "Sales.Channel")]

big_continuous_data <- big_data_total[, c("Units.Sold", "Unit.Comb", "Total.Profit", "Order.ID")]

# Perform one-hot encoding for categorical data
big_encoded_categorical_data <- model.matrix(~ . - 1, data = big_categorical_data)
# Combine encoded categorical data with continuous data
big_combined_data <- cbind(big_encoded_categorical_data , big_continuous_data)

big_combined_data_cor_matrix <- cor(big_combined_data)

corrplot(big_combined_data_cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")

# Perform one-hot encoding for categorical data
small_encoded_categorical_data2 <- model.matrix(~ Region - 1, data = small_data_total)

df <- as.data.frame(small_encoded_categorical_data2)%>%
  rename(R1 = RegionAsia,
         R2 ='RegionAustralia and Oceania',
         R3 = "RegionCentral America and the Caribbean",
         R4 = RegionEurope,
         R5 = 'RegionMiddle East and North Africa',
         R6 = "RegionNorth America",
         R7 = 'RegionSub-Saharan Africa')
# Combine encoded categorical data with continuous data
small_combined_data2 <- cbind(df , small_continuous_data)

small_combined_data_cor_matrix2 <- cor(small_combined_data2)

corrplot(small_combined_data_cor_matrix2, 
         method="color",
         addCoef.col = "black", 
         type="upper")

# Perform one-hot encoding for categorical data
big_encoded_categorical_data2 <- model.matrix(~ Region - 1, data = big_data_total)

df2 <- as.data.frame(big_encoded_categorical_data2)%>%
  rename(R1 = RegionAsia,
         R2 ='RegionAustralia and Oceania',
         R3 = "RegionCentral America and the Caribbean",
         R4 = RegionEurope,
         R5 = 'RegionMiddle East and North Africa',
         R6 = "RegionNorth America",
         R7 = 'RegionSub-Saharan Africa')
# Combine encoded categorical data with continuous data
big_combined_data2 <- cbind(df2 , big_continuous_data)

big_combined_data_cor_matrix2 <- cor(big_combined_data2)

corrplot(big_combined_data_cor_matrix2, 
         method="color",
         addCoef.col = "black", 
         type="upper")



Despite One-Hot encoding, no significant correlation was observed between Total.Profit and any of the encoded variables from Region, Order.Priority, or Sales.Channel. Consequently, I’ve decided to opt for a Linear Regression model. This choice is informed by the fact that only two variables show a high correlation with Total.Profit: Units.Sold and the composite variable Unit.Comb (Unit.Price + Unit.Cost), both of which are numeric.

The variables are Box-Cox transformed, centered, and scaled before modeling. Note that caret applies the Box-Cox step only to variables whose values are strictly positive, so the 0/1 dummy columns are simply centered and scaled.

small_trans <- preProcess(small_combined_data, method = c("BoxCox", "center", "scale"))
small_preprocessed_data <- predict(small_trans, newdata = small_combined_data)
# Gather the data into a long format
small_gather <- small_preprocessed_data%>%
  gather()

ggplot(small_gather, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free')+
  labs(title = "Histogram of Preprocessed Variables - Small Data")

big_trans <- preProcess(big_combined_data, method = c("BoxCox", "center", "scale"))
big_preprocessed_data <- predict(big_trans, newdata = big_combined_data)
# Gather the data into a long format
big_gather <- big_preprocessed_data%>%
  gather()

ggplot(big_gather, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free')+
  labs(title = "Histogram of Preprocessed Variables - Big Data")
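
Printing a preProcess object summarizes which steps were applied to how many variables; the Box-Cox count should cover only the strictly positive columns:

# Inspect the preprocessing recipes
print(small_trans)
print(big_trans)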

Linear model

# Fit linear regression model on preprocessed data
small_lm_model <- lm(Total.Profit~ ., data = small_preprocessed_data )
summary(small_lm_model)
## 
## Call:
## lm(formula = Total.Profit ~ ., data = small_preprocessed_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03035 -0.29278  0.00376  0.32335  0.76548 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.032e-16  4.252e-02   0.000    1.000    
## Order.PriorityC     -1.734e-02  5.589e-02  -0.310    0.757    
## Order.PriorityH      2.025e-02  5.678e-02   0.357    0.722    
## Order.PriorityL     -6.908e-02  5.578e-02  -1.238    0.219    
## Order.PriorityM             NA         NA      NA       NA    
## Sales.ChannelOnline  1.153e-02  4.447e-02   0.259    0.796    
## Units.Sold           6.085e-01  4.496e-02  13.533   <2e-16 ***
## Unit.Comb            7.280e-01  4.501e-02  16.174   <2e-16 ***
## Order.ID             2.413e-02  4.477e-02   0.539    0.591    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4252 on 92 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.8192 
## F-statistic: 65.09 on 7 and 92 DF,  p-value: < 2.2e-16
# Fit linear regression model on preprocessed data
small_lm_model2 <- lm(Total.Profit~ Units.Sold + Unit.Comb, data = small_preprocessed_data)
summary(small_lm_model2)
## 
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Comb, data = small_preprocessed_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08354 -0.28245  0.00939  0.32461  0.72572 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.027e-16  4.224e-02    0.00        1    
## Units.Sold  5.962e-01  4.256e-02   14.01   <2e-16 ***
## Unit.Comb   7.291e-01  4.256e-02   17.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4224 on 97 degrees of freedom
## Multiple R-squared:  0.8252, Adjusted R-squared:  0.8216 
## F-statistic:   229 on 2 and 97 DF,  p-value: < 2.2e-16
# Fit linear regression model on preprocessed data
big_lm_model <- lm(Total.Profit~ ., data = big_preprocessed_data )
summary(big_lm_model)
## 
## Call:
## lm(formula = Total.Profit ~ ., data = big_preprocessed_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97058 -0.26157  0.07529  0.30352  0.70693 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.031e-16  1.343e-03   0.000    1.000    
## Order.PriorityC     -1.034e-03  1.643e-03  -0.629    0.529    
## Order.PriorityH     -1.518e-03  1.643e-03  -0.924    0.356    
## Order.PriorityL     -3.477e-04  1.643e-03  -0.212    0.832    
## Order.PriorityM             NA         NA      NA       NA    
## Sales.ChannelOnline  2.284e-04  1.343e-03   0.170    0.865    
## Units.Sold           6.329e-01  1.343e-03 471.368   <2e-16 ***
## Unit.Comb            6.450e-01  1.343e-03 480.404   <2e-16 ***
## Order.ID            -1.792e-03  1.343e-03  -1.335    0.182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99992 degrees of freedom
## Multiple R-squared:  0.8198, Adjusted R-squared:  0.8197 
## F-statistic: 6.497e+04 on 7 and 99992 DF,  p-value: < 2.2e-16
# Fit linear regression model on preprocessed data
big_lm_model2 <- lm(Total.Profit~ Units.Sold + Unit.Comb, data = big_preprocessed_data)
summary(big_lm_model2)
## 
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Comb, data = big_preprocessed_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97408 -0.26173  0.07523  0.30373  0.70222 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.164e-16  1.343e-03     0.0        1    
## Units.Sold  6.329e-01  1.343e-03   471.4   <2e-16 ***
## Unit.Comb   6.450e-01  1.343e-03   480.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99997 degrees of freedom
## Multiple R-squared:  0.8198, Adjusted R-squared:  0.8198 
## F-statistic: 2.274e+05 on 2 and 99997 DF,  p-value: < 2.2e-16
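
As a final sanity check, we can sketch in-sample predictions and an RMSE for the reduced models; these are on the transformed (standardized) scale, so they serve only as a relative comparison between the two datasets:

# In-sample RMSE on the transformed scale (relative comparison only)
pred_small <- predict(small_lm_model2, newdata = small_preprocessed_data)
sqrt(mean((pred_small - small_preprocessed_data$Total.Profit)^2))

pred_big <- predict(big_lm_model2, newdata = big_preprocessed_data)
sqrt(mean((pred_big - big_preprocessed_data$Total.Profit)^2))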