Exploratory Data Analysis and Model Building for Profit Prediction
Exploratory data analysis (EDA) is a crucial step in understanding and preparing data for predictive modeling. In this essay, we explore two datasets: one containing 100 sales records (small data) and one containing 100,000 sales records (big data). Our objective is to predict future profits using linear regression, employing techniques such as One-Hot encoding, the Box-Cox transformation, data standardization, and correlation analysis.
Initially, we assessed summary statistics for the numerical variables and frequency distributions for the categorical variables. No missing values were detected, so no imputation was needed. We then examined histograms of both datasets to check for normality and standardized the data where necessary. Outliers were identified with box plots, and we decided to leave the outliers in the predicting variable untouched.
To verify the profit figures, we checked that Total Revenue minus Total Cost equals Total Profit. Surprisingly, an exact equality comparison flagged discrepancies, prompting an investigation into floating-point precision. Switching to a tolerance-based check resolved the inconsistencies, highlighting the importance of meticulous data validation.
Next, we analyzed the correlation between the candidate variables and the predicting variable (Total Profit), treating numeric and categorical data separately and focusing on highly correlated variables for inclusion in the model. Unit Price and Unit Cost were strongly correlated with each other (and each correlated with Total Profit), so they were combined into a single composite variable, Unit.Comb. Order Date and Ship Date were set aside; they are better suited to time series modeling, which is outside the scope of this analysis.
One-Hot encoding was performed on categorical variables such as Order Priority, Sales Channel, and Region. Despite this transformation, no significant correlation with Total Profit was observed. Consequently, we opted for a Linear Regression model due to the high correlation between Total Profit and numeric variables.
Now, addressing the questions:
The columns of our data exhibited correlation, particularly Total Revenue and Total Cost, as well as Unit Price and Unit Cost.
Yes, the presence of labels (categorical data) influenced our choice of algorithm. A Random Forest model would have been chosen had the categorical variables shown a high correlation with the predicting variable. However, since the informative variables were numeric, a multiple linear regression model was selected.
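Purely as an illustration of that alternative, here is a hedged sketch of how a Random Forest could be fit with caret's train() interface. The data frame small_combined_data is the one-hot encoded data assembled later in this analysis, and the seed and cross-validation settings are placeholder assumptions; this model was not actually run.
# Hypothetical sketch only -- this model was not fit in this analysis.
# caret's train() with method = "rf" wraps the randomForest package.
set.seed(622)
rf_fit <- train(Total.Profit ~ .,
data = small_combined_data, # one-hot encoded data built later in this document
method = "rf",
trControl = trainControl(method = "cv", number = 5))
rf_fit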
The small dataset made the initial EDA and modeling tests quick, while the larger dataset provided more reliable estimates of the correlations. Dataset size also affected the choice of algorithm, since models fit on the larger dataset are less prone to overfitting.
The choice of algorithm was directly influenced by the datasets. For instance, the selection of a Linear Regression model was based on the numeric nature of the variables and their correlation with Total Profit.
In making a business decision, the results from the bigger dataset would be trusted due to its higher accuracy and reduced risk of overfitting.
Analyzing too much data can increase the likelihood of errors, particularly in complex models prone to overfitting.
The analysis of the two datasets produced comparable results, with R-squared values of roughly 82% in both. However, the bigger dataset was favored due to its greater accuracy and reduced risk of overfitting.
In conclusion, thorough EDA and model selection are crucial steps in leveraging data for predictive analytics. By understanding the characteristics of the dataset and employing appropriate techniques, we can extract valuable insights to inform decisions effectively.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
We used the Excel BI Analytics sample CSV files to explore a range of dataset sizes. small_data contains the 100 Sales Records file; big_data contains the 100,000 Sales Records file.
https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
small_data = read.csv('https://raw.githubusercontent.com/melbow2424/Data-622-HW1/main/100%20Sales%20Records.csv')
big_data = read.csv('https://raw.githubusercontent.com/melbow2424/Data-622-HW1/main/100000%20Sales%20Records.csv')
Summaries of the two files.
# Convert integer and character columns to factors so summary() reports frequency counts
small_data_factor <- small_data %>%
mutate_if(is.integer, as.factor) %>%
mutate_if(is.character, as.factor)
big_data_factor <- big_data %>%
mutate_if(is.integer, as.factor) %>%
mutate_if(is.character, as.factor)
Checking the frequencies of the categorical variables.
summary(small_data_factor)
## Region Country
## Asia :11 The Gambia : 4
## Australia and Oceania :11 Australia : 3
## Central America and the Caribbean: 7 Djibouti : 3
## Europe :22 Mexico : 3
## Middle East and North Africa :10 Sao Tome and Principe: 3
## North America : 3 Sierra Leone : 3
## Sub-Saharan Africa :36 (Other) :81
## Item.Type Sales.Channel Order.Priority Order.Date
## Clothes :13 Offline:50 C:22 1/11/2012: 1
## Cosmetics :13 Online :50 H:30 1/13/2017: 1
## Office Supplies:12 L:27 1/14/2017: 1
## Fruits :10 M:21 1/16/2011: 1
## Personal Care :10 1/16/2015: 1
## Household : 9 1/4/2011 : 1
## (Other) :33 (Other) :94
## Order.ID Ship.Date Units.Sold Unit.Price Unit.Cost
## 114606559: 1 11/17/2010: 2 8656 : 2 Min. : 9.33 Min. : 6.92
## 115456712: 1 1/13/2012 : 1 124 : 1 1st Qu.: 81.73 1st Qu.: 35.84
## 122583663: 1 1/20/2011 : 1 171 : 1 Median :179.88 Median :107.28
## 135425221: 1 1/21/2011 : 1 273 : 1 Mean :276.76 Mean :191.05
## 142278373: 1 1/23/2017 : 1 282 : 1 3rd Qu.:437.20 3rd Qu.:263.33
## 158535134: 1 1/28/2014 : 1 522 : 1 Max. :668.27 Max. :524.96
## (Other) :94 (Other) :93 (Other):93
## Total.Revenue Total.Cost Total.Profit
## Min. : 4870 Min. : 3612 Min. : 1258
## 1st Qu.: 268721 1st Qu.: 168868 1st Qu.: 121444
## Median : 752314 Median : 363566 Median : 290768
## Mean :1373488 Mean : 931806 Mean : 441682
## 3rd Qu.:2212045 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :5997055 Max. :4509794 Max. :1719922
##
summary(big_data_factor)
## Region Country
## Asia :14547 Sudan : 623
## Australia and Oceania : 8113 New Zealand : 593
## Central America and the Caribbean:10731 Vatican City: 590
## Europe :25877 Malta : 589
## Middle East and North Africa :12580 Mozambique : 589
## North America : 2133 Cambodia : 584
## Sub-Saharan Africa :26019 (Other) :96432
## Item.Type Sales.Channel Order.Priority Order.Date
## Office Supplies: 8426 Offline:49946 C:24951 11/27/2010: 57
## Cereal : 8421 Online :50054 H:24945 10/3/2010 : 56
## Baby Food : 8407 L:25016 3/23/2011 : 56
## Cosmetics : 8370 M:25088 5/22/2017 : 56
## Personal Care : 8364 10/22/2016: 55
## Meat : 8320 12/6/2016 : 55
## (Other) :49692 (Other) :99665
## Order.ID Ship.Date Units.Sold Unit.Price
## 100008904: 1 10/4/2015 : 61 172 : 23 Min. : 9.33
## 100009763: 1 8/4/2013 : 60 1679 : 22 1st Qu.:109.28
## 100035941: 1 10/26/2010: 59 5955 : 22 Median :205.70
## 100043666: 1 11/6/2014 : 57 1222 : 21 Mean :266.70
## 100050961: 1 12/20/2011: 56 1409 : 21 3rd Qu.:437.20
## 100051820: 1 12/13/2013: 55 2655 : 21 Max. :668.27
## (Other) :99994 (Other) :99652 (Other):99870
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## Min. : 6.92 Min. : 19 Min. : 14 Min. : 4.8
## 1st Qu.: 56.67 1st Qu.: 279753 1st Qu.: 162928 1st Qu.: 95900.0
## Median :117.11 Median : 789892 Median : 467937 Median : 283657.5
## Mean :188.02 Mean :1336067 Mean : 941975 Mean : 394091.2
## 3rd Qu.:364.69 3rd Qu.:1836490 3rd Qu.:1209475 3rd Qu.: 568384.1
## Max. :524.96 Max. :6682700 Max. :5249075 Max. :1738700.0
##
Histograms are used to check the numeric variables for normality, and boxplots to check them for outliers.
# Separate the numeric variables in each dataset
small_numeric_vars <- select_if(small_data, is.numeric)
big_numeric_vars <- select_if(big_data, is.numeric)
# Convert numeric variables to long format for both datasets
small_numeric_long <- small_numeric_vars %>%
gather(key = "variable", value = "value")
big_numeric_long <- big_numeric_vars %>%
gather(key = "variable", value = "value")
# Plot histograms for numeric variables in both datasets
ggplot(small_numeric_long, aes(x = value)) +
geom_histogram() +
facet_wrap(~variable, scales = "free") +
labs(title = "Histogram of Numeric Columns - Small Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(small_numeric_long, aes(x = variable, y = value)) +
geom_boxplot(outlier.color = "red") +
facet_wrap(~ variable, scales = 'free')+
labs(title = "Boxplot of Numeric Columns - Small Data")
ggplot(big_numeric_long, aes(x = value)) +
geom_histogram() +
facet_wrap(~variable, scales = "free") +
labs(title = "Histogram of Numeric Columns - Big Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(big_numeric_long, aes(x = variable, y = value)) +
geom_boxplot(outlier.color = "red") +
facet_wrap(~ variable, scales = 'free')+
labs(title = "Boxplot of Numeric Columns - Big Data")
To forecast the overall profit, it’s essential to ensure that
the figures in the Total.Revenue column minus those in the Total.Cost
column match the values in the Total.Profit column.
# Add a new column to check whether Total.Revenue - Total.Cost exactly equals Total.Profit
small_data_total_profit_check <- small_data %>%
mutate(matches_total = Total.Revenue - Total.Cost == Total.Profit) %>%
mutate(initial_value = Total.Revenue - Total.Cost) %>%
mutate(matches_total_difference = abs(initial_value - Total.Profit))
# Check if all values in matches_total are TRUE
if (all(small_data_total_profit_check$matches_total)) {
cat("All columns are true.\n")
} else {
count_data <- small_data_total_profit_check %>%
group_by(matches_total) %>%
summarize(count = n())
ggplot(count_data, aes(x = matches_total, y = count, fill = matches_total)) +
geom_bar(stat = "identity") +
labs(x = "Matches Total", y = "Count", title = "Counts of Matches Total")
}
# Add a new column to check whether Total.Revenue - Total.Cost exactly equals Total.Profit
big_data_total_profit_check <- big_data %>%
mutate(matches_total = Total.Revenue - Total.Cost == Total.Profit)
# Check if all values in matches_total are TRUE
if (all(big_data_total_profit_check$matches_total)) {
cat("All columns are true.\n")
} else {
count_data <- big_data_total_profit_check %>%
group_by(matches_total) %>%
summarize(count = n())
ggplot(count_data, aes(x = matches_total, y = count, fill = matches_total)) +
geom_bar(stat = "identity") +
labs(x = "Matches Total", y = "Count", title = "Counts of Matches Total")
}
There seems to be an issue here. The Total.Revenue column
minus the Total.Cost column should align with the values in the
Total.Profit column. However, upon examining the graphs of both the
small and large datasets, they aren’t matching perfectly. What could be
causing this inconsistency?
In many programming languages, directly comparing floating-point numbers for equality can lead to unexpected outcomes due to precision issues inherent in representing these numbers. This is because floating-point arithmetic can introduce small rounding errors.
To accurately compare floating-point numbers, it’s often advisable to assess if the absolute difference between the two numbers falls within a certain tolerance range.
Let’s apply this method to both datasets and investigate further.
# Define a tolerance level
tolerance <- 1e-9
# Add a new column to check if Total.Revenue - Total.Cost is approximately equal to Total.Profit
small_data_check_tolerance <- small_data %>%
mutate(matches_total = abs(Total.Revenue - Total.Cost - Total.Profit) < tolerance)
# Check if all values in matches_total are TRUE
if (all(small_data_check_tolerance$matches_total)) {
cat("All rows satisfy the condition.\n")
} else {
print(small_data_check_tolerance$matches_total)
}
## All rows satisfy the condition.
# Add a new column to check if Total.Revenue - Total.Cost is approximately equal to Total.Profit
big_data_check_tolerance <- big_data %>%
mutate(matches_total = abs(Total.Revenue - Total.Cost - Total.Profit) < tolerance)
# Check if all values in matches_total are TRUE
if (all(big_data_check_tolerance$matches_total)) {
cat("All rows satisfy the condition.\n")
} else {
print(big_data_check_tolerance$matches_total)
}
## All rows satisfy the condition.
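For reference, dplyr provides a helper, near(), that performs the same tolerance-based comparison; a minimal equivalent sketch on the small dataset (not an additional step in the analysis) looks like this:
# Equivalent check using dplyr::near(), which compares two numeric vectors within a tolerance
small_near_check <- small_data %>%
mutate(matches_total = near(Total.Revenue - Total.Cost, Total.Profit, tol = tolerance))
all(small_near_check$matches_total)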
Now that we’ve established the relationship Total.Revenue - Total.Cost = Total.Profit, let’s examine the correlations between the numeric variables and Total.Profit. We’ll address the categorical variables later.
small_data_cor <- small_data%>%
select(Total.Revenue, Total.Cost, Total.Profit, Units.Sold,Unit.Price, Unit.Cost)
cor_matrix <- cor(small_data_cor)
corrplot(cor_matrix,
method="color",
addCoef.col = "black",
type="upper")
big_data_cor <- big_data%>%
select(Total.Revenue, Total.Cost, Total.Profit, Units.Sold,Unit.Price, Unit.Cost)
big_cor_matrix <- cor(big_data_cor)
corrplot(big_cor_matrix,
method="color",
addCoef.col = "black",
type="upper")
Given that we’ve confirmed Total.Revenue - Total.Cost = Total.Profit, we can eliminate Total.Revenue and Total.Cost since they are duplicative. Additionally, it’s necessary to merge the Unit.Price and Unit.Cost variables as they exhibit high correlation with each other.
small_data_total <- small_data %>%
select(-Total.Revenue, -Total.Cost)%>%
mutate(Unit.Comb = Unit.Price + Unit.Cost)%>%
select(-Unit.Price, -Unit.Cost)
small_data_total_cor_matrix<- small_data_total%>%
select(Total.Profit, Units.Sold,Unit.Comb, Order.ID)
small_data_total_cor_matrix <- cor(small_data_total_cor_matrix)
corrplot(small_data_total_cor_matrix,
method="color",
addCoef.col = "black",
type="upper")
big_data_total <- big_data %>%
select(-Total.Revenue, -Total.Cost)%>%
mutate(Unit.Comb = Unit.Price + Unit.Cost)%>%
select(-Unit.Price, -Unit.Cost)
big_data_total_cor_matrix<- big_data_total%>%
select(Total.Profit, Units.Sold,Unit.Comb, Order.ID)
big_data_total_cor_matrix <- cor(big_data_total_cor_matrix)
corrplot(big_data_total_cor_matrix,
method="color",
addCoef.col = "black",
type="upper")
For now, let’s exclude Order.Date and Ship.Date from the analysis. They could be used for a time series analysis of Total.Profit, but since time series modeling wasn’t covered in class, I’ll refrain from building that kind of model here. Additionally, Order.ID is merely an identifier; it is kept in the correlation checks below only to confirm that it carries no useful signal.
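Although the dates are dropped from the model, a minimal sketch of how Order.Date could be prepared for a later time series view is shown here (lubridate loads with the tidyverse; the monthly aggregation is illustrative only and is not part of the modeling below):
# Illustrative only -- dates are not used in the regression models below.
# Parse the month/day/year strings and aggregate Total.Profit by month.
small_monthly_profit <- small_data %>%
mutate(Order.Date = mdy(Order.Date)) %>%
group_by(order_month = floor_date(Order.Date, unit = "month")) %>%
summarize(monthly_profit = sum(Total.Profit))
ggplot(small_monthly_profit, aes(x = order_month, y = monthly_profit)) +
geom_line() +
labs(title = "Monthly Total Profit - Small Data (illustrative)")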
One-Hot encoding is performed on Order.Priority, Sales.Channel, and Region. I’ve chosen to exclude Country and Item.Type because of the large number of levels in those columns; when I first attempted to encode them on the small dataset, it consistently crashed my RStudio session. The correlation matrices are also split (Order.Priority and Sales.Channel in one, Region in another) so we can investigate whether any of the encoded variables are correlated with Total.Profit.
# Separate categorical and non-categorical variables
small_categorical_data <- small_data_total[, c( "Order.Priority", "Sales.Channel")]
small_continuous_data <- small_data_total[, c("Units.Sold", "Unit.Comb", "Total.Profit", "Order.ID")]
# Perform one-hot encoding for categorical data
small_encoded_categorical_data <- model.matrix(~ . - 1, data = small_categorical_data)
# Combine encoded categorical data with continuous data
small_combined_data <- cbind(small_encoded_categorical_data , small_continuous_data)
small_combined_data_cor_matrix <- cor(small_combined_data)
corrplot(small_combined_data_cor_matrix,
method="color",
addCoef.col = "black",
type="upper")
# Separate categorical and non-categorical variables
big_categorical_data <- big_data_total[, c( "Order.Priority", "Sales.Channel")]
big_continuous_data <- big_data_total[, c("Units.Sold", "Unit.Comb", "Total.Profit", "Order.ID")]
# Perform one-hot encoding for categorical data
big_encoded_categorical_data <- model.matrix(~ . - 1, data = big_categorical_data)
# Combine encoded categorical data with continuous data
big_combined_data <- cbind(big_encoded_categorical_data , big_continuous_data)
big_combined_data_cor_matrix <- cor(big_combined_data)
corrplot(big_combined_data_cor_matrix,
method="color",
addCoef.col = "black",
type="upper")
# Perform one-hot encoding for categorical data
small_encoded_categorical_data2 <- model.matrix(~ Region - 1, data = small_data_total)
df <- as.data.frame(small_encoded_categorical_data2)%>%
rename(R1 = RegionAsia,
R2 ='RegionAustralia and Oceania',
R3 = "RegionCentral America and the Caribbean",
R4 = RegionEurope,
R5 = 'RegionMiddle East and North Africa',
R6 = "RegionNorth America",
R7 = 'RegionSub-Saharan Africa')
# Combine encoded categorical data with continuous data
small_combined_data2 <- cbind(df , small_continuous_data)
small_combined_data_cor_matrix2 <- cor(small_combined_data2)
corrplot(small_combined_data_cor_matrix2,
method="color",
addCoef.col = "black",
type="upper")
# Perform one-hot encoding for categorical data
big_encoded_categorical_data2 <- model.matrix(~ Region - 1, data = big_data_total)
df2 <- as.data.frame(big_encoded_categorical_data2)%>%
rename(R1 = RegionAsia,
R2 ='RegionAustralia and Oceania',
R3 = "RegionCentral America and the Caribbean",
R4 = RegionEurope,
R5 = 'RegionMiddle East and North Africa',
R6 = "RegionNorth America",
R7 = 'RegionSub-Saharan Africa')
# Combine encoded categorical data with continuous data
big_combined_data2 <- cbind(df2 , big_continuous_data)
big_combined_data_cor_matrix2 <- cor(big_combined_data2)
corrplot(big_combined_data_cor_matrix2,
method="color",
addCoef.col = "black",
type="upper")
Despite One-Hot encoding, no significant correlation was observed between Total.Profit and any of the categorical variables from Region, Order.Priority, or Sales.Channel. Consequently, I’ve decided to opt for a Linear Regression model. This choice is informed by the fact that only two variables show a high correlation with Total.Profit: Units.Sold and Unit.Comb, the composite of Unit.Price and Unit.Cost, both of which are numeric.
The variables are normalized (Box-Cox transformed), centered, and scaled before modeling.
small_trans <- preProcess(small_combined_data, method = c("BoxCox", "center", "scale"))
small_preprocessed_data <- predict(small_trans, newdata = small_combined_data)
# Gather the data into a long format
small_gather <- small_preprocessed_data%>%
gather()
ggplot(small_gather, aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~key, scales = 'free')+
labs(title = "Histogram of Variables - Small Data")
big_trans <- preProcess(big_combined_data, method = c("BoxCox", "center", "scale"))
big_preprocessed_data <- predict(big_trans, newdata = big_combined_data)
# Gather the data into a long format
big_gather <- big_preprocessed_data%>%
gather()
ggplot(big_gather, aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~key, scales = 'free')+
labs(title = "Histogram of Variables - Big Data")
Linear model
# Fit linear regression model on preprocessed data
small_lm_model <- lm(Total.Profit~ ., data = small_preprocessed_data )
summary(small_lm_model)
##
## Call:
## lm(formula = Total.Profit ~ ., data = small_preprocessed_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03035 -0.29278 0.00376 0.32335 0.76548
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.032e-16 4.252e-02 0.000 1.000
## Order.PriorityC -1.734e-02 5.589e-02 -0.310 0.757
## Order.PriorityH 2.025e-02 5.678e-02 0.357 0.722
## Order.PriorityL -6.908e-02 5.578e-02 -1.238 0.219
## Order.PriorityM NA NA NA NA
## Sales.ChannelOnline 1.153e-02 4.447e-02 0.259 0.796
## Units.Sold 6.085e-01 4.496e-02 13.533 <2e-16 ***
## Unit.Comb 7.280e-01 4.501e-02 16.174 <2e-16 ***
## Order.ID 2.413e-02 4.477e-02 0.539 0.591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4252 on 92 degrees of freedom
## Multiple R-squared: 0.832, Adjusted R-squared: 0.8192
## F-statistic: 65.09 on 7 and 92 DF, p-value: < 2.2e-16
# Fit linear regression model on preprocessed data
small_lm_model2 <- lm(Total.Profit~ Units.Sold + Unit.Comb, data = small_preprocessed_data)
summary(small_lm_model2)
##
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Comb, data = small_preprocessed_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08354 -0.28245 0.00939 0.32461 0.72572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.027e-16 4.224e-02 0.00 1
## Units.Sold 5.962e-01 4.256e-02 14.01 <2e-16 ***
## Unit.Comb 7.291e-01 4.256e-02 17.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4224 on 97 degrees of freedom
## Multiple R-squared: 0.8252, Adjusted R-squared: 0.8216
## F-statistic: 229 on 2 and 97 DF, p-value: < 2.2e-16
# Fit linear regression model on preprocessed data
big_lm_model <- lm(Total.Profit~ ., data = big_preprocessed_data )
summary(big_lm_model)
##
## Call:
## lm(formula = Total.Profit ~ ., data = big_preprocessed_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97058 -0.26157 0.07529 0.30352 0.70693
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.031e-16 1.343e-03 0.000 1.000
## Order.PriorityC -1.034e-03 1.643e-03 -0.629 0.529
## Order.PriorityH -1.518e-03 1.643e-03 -0.924 0.356
## Order.PriorityL -3.477e-04 1.643e-03 -0.212 0.832
## Order.PriorityM NA NA NA NA
## Sales.ChannelOnline 2.284e-04 1.343e-03 0.170 0.865
## Units.Sold 6.329e-01 1.343e-03 471.368 <2e-16 ***
## Unit.Comb 6.450e-01 1.343e-03 480.404 <2e-16 ***
## Order.ID -1.792e-03 1.343e-03 -1.335 0.182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99992 degrees of freedom
## Multiple R-squared: 0.8198, Adjusted R-squared: 0.8197
## F-statistic: 6.497e+04 on 7 and 99992 DF, p-value: < 2.2e-16
# Fit linear regression model on preprocessed data
big_lm_model2 <- lm(Total.Profit~ Units.Sold + Unit.Comb, data = big_preprocessed_data)
summary(big_lm_model2)
##
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Comb, data = big_preprocessed_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97408 -0.26173 0.07523 0.30373 0.70222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.164e-16 1.343e-03 0.0 1
## Units.Sold 6.329e-01 1.343e-03 471.4 <2e-16 ***
## Unit.Comb 6.450e-01 1.343e-03 480.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99997 degrees of freedom
## Multiple R-squared: 0.8198, Adjusted R-squared: 0.8198
## F-statistic: 2.274e+05 on 2 and 99997 DF, p-value: < 2.2e-16
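To make the comparison between the two datasets explicit, a short sketch that pulls the R-squared values of the two reduced models side by side:
# Compare fit quality of the reduced models on the small and big datasets
data.frame(
dataset = c("small (100 rows)", "big (100,000 rows)"),
r_squared = c(summary(small_lm_model2)$r.squared,
summary(big_lm_model2)$r.squared),
adj_r_squared = c(summary(small_lm_model2)$adj.r.squared,
summary(big_lm_model2)$adj.r.squared)
)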