Data Dive — Confidence Intervals

Loading the “Supermart” CSV file

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Three pairs of variables:
1: Sales and Profit
2: Calculated Sales (Sales + (Sales * Discount)) and Region
3: Calculated Revenue (Sales - Profit) and Region
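
As a quick worked example of the two derived variables defined above, the sketch below applies both formulas to a single made-up order. The data frame `toy_order` and its values are hypothetical and are not taken from the Supermart file.

```r
# Hypothetical single order; the values are made up for illustration only
toy_order <- data.frame(Sales = 1000, Discount = 0.20, Profit = 300)

# Calculated Sales = Sales + (Sales * Discount)
toy_order$Calculated_Sales <- toy_order$Sales + (toy_order$Sales * toy_order$Discount)

# Calculated Revenue = Sales - Profit
toy_order$Calculated_Revenue <- toy_order$Sales - toy_order$Profit

print(toy_order)
```

For this order, Calculated Sales comes to 1200 and Calculated Revenue to 700.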

Plotting a visualization for each relationship

library(ggplot2)

# For Sales and Profit
ggplot(data, aes(x = Sales, y = Profit)) +
  geom_point() +
  labs(title = "Relationship between Sales and Profit",
       x = "Sales",
       y = "Profit")

# For Calculated Sales and Region
data$Calculated_Sales <- data$Sales + (data$Sales * data$Discount)
ggplot(data, aes(x = Calculated_Sales, y = Region)) +
  geom_boxplot() +
  labs(title = "Distribution of Calculated Sales by Region",
       x = "Calculated Sales",
       y = "Region")

# For Calculated Revenue and Region
data$Calculated_Revenue <- data$Sales - data$Profit
ggplot(data, aes(x = Calculated_Revenue, y = Region)) +
  geom_boxplot() +
  labs(title = "Distribution of Calculated Revenue by Region",
       x = "Calculated Revenue",
       y = "Region")

Conclusions based on the plots

1. Relationship between Sales and Profit:
Profit rises as Sales rise.
Conclusion: The scatter plot suggests a positive relationship between Sales and Profit: orders with higher Sales tend to have higher Profit. The plot alone does not show how strong the relationship is; the correlation coefficient calculated later in this report quantifies it.

2. Distribution of Calculated Sales by Region: All regions except North have nearly the same interquartile range and median. The North box collapses to a single line, with its median between 1000 and 1500, whereas the medians of the other regions lie between 1500 and 2000.
Conclusion: Calculated Sales are distributed consistently across all regions except North. A box with no visible interquartile range means the middle half of the North values are identical or nearly so, which typically happens when a group contains only one or a handful of observations. The lower North median may therefore reflect a very small sample rather than genuinely lower sales, and it should be interpreted with caution.

3. Distribution of Calculated Revenue by Region:
All regions except North have nearly the same interquartile range and median. The North box again collapses to a single line, with its median between 750 and 1000, whereas the medians of the other regions lie between 1000 and 1250.
Conclusion: As with Calculated Sales, Calculated Revenue is distributed consistently across all regions except North. The missing interquartile range in North indicates that its values are concentrated at essentially one point, which is what a group with very few observations looks like in a boxplot. The lower North median is therefore weak evidence of lower revenue in that region until the number of North orders has been checked.

Possible Reasons for North region’s Distinct Behavior:
- The North region may simply contain very few orders, in which case the collapsed box reflects sample size rather than behavior.
- The North region might have unique market dynamics, customer preferences, or economic conditions that influence both sales and revenue.
- Customer behavior, purchasing power, or demand in the North region could be distinct, leading to variations in sales and revenue.

Recommendations for Further Investigation:
- Check how many orders the North region contains before drawing conclusions from its boxplot.
- Conduct a more detailed market analysis specific to the North region to identify factors contributing to the observed patterns.
- Gather customer feedback through surveys to understand preferences, expectations, and the reasons behind purchasing decisions in the North region.
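
One quick check before interpreting the North box is the number of observations per region, because a group with a single observation has an interquartile range of zero and its box collapses to a line. The sketch below illustrates this on hypothetical toy data; the `toy` data frame and all of its values are made up and do not come from the Supermart file.

```r
# Hypothetical toy data: four regions with five orders each, "North" with one
toy <- data.frame(
  Region = c(rep(c("Central", "East", "South", "West"), each = 5), "North"),
  Calculated_Sales = c(
    900, 1400, 1700, 2100, 2600,   # Central
    800, 1300, 1800, 2200, 2500,   # East
    950, 1500, 1750, 2000, 2700,   # South
    850, 1450, 1650, 2150, 2550,   # West
    1250                           # North (single order)
  )
)

# Number of observations per region
n_by_region <- table(toy$Region)
print(n_by_region)

# IQR per region; a one-observation group has an IQR of exactly 0
iqr_by_region <- tapply(toy$Calculated_Sales, toy$Region, IQR)
print(iqr_by_region)
```

On the real data, the same counts could be obtained with table(data$Region) or dplyr::count(data, Region).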

Calculating the appropriate correlation coefficient for each of the combinations

cor(data$Sales, data$Profit)

## [1] 0.6053486

The correlation coefficient between Sales and Profit is approximately 0.605, indicating a moderate positive correlation. As Sales increase, Profit tends to increase as well.
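
Since this report is about confidence intervals, it may help to attach one to the correlation as well. The sketch below uses the Fisher z-transformation with only the two numbers reported above (r and the row count), and it assumes that all 9994 rows entered the calculation. On the real data, cor.test(data$Sales, data$Profit) reports an interval of the same kind.

```r
# Values reported above: Pearson r and the number of rows in the data set
r <- 0.6053486
n <- 9994

# Fisher z-transformation; the standard error of z is 1 / sqrt(n - 3)
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale
z_crit <- qnorm(0.975)
ci <- tanh(z + c(-1, 1) * z_crit * se)
print(round(ci, 3))
```

With almost ten thousand rows the interval is narrow, so the estimate of a moderate positive correlation is fairly precise.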

cor(data$Calculated_Sales, as.numeric(factor(data$Region)))

## [1] 0.003277388

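The correlation coefficient between Calculated_Sales and the numeric encoding of Region is approximately 0.0033, which is effectively zero. Region is a nominal variable, however, and as.numeric(factor(Region)) assigns codes in alphabetical order, so the sign and size of this coefficient depend on an arbitrary ordering and should not be read as a measure of association. The ANOVA later in this report is the more appropriate comparison for this pair.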

cor(data$Calculated_Revenue, as.numeric(factor(data$Region)))

## [1] 0.004943766

The correlation coefficient between Calculated_Revenue and the numeric encoding of Region is approximately 0.0049, again effectively zero. The same caveat applies: the region codes follow alphabetical order, so this value says little about whether Calculated_Revenue differs between regions.
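
To show why a Pearson correlation against a numerically encoded region is hard to interpret, the sketch below computes it twice on the same hypothetical toy data, changing only the order of the factor levels. The `toy` data frame and its values are made up for illustration.

```r
# Hypothetical toy data: three regions with clearly different sales levels
toy <- data.frame(
  Region = rep(c("East", "North", "West"), each = 4),
  Sales  = c(10, 12, 11, 13,   # East: low
             50, 52, 51, 53,   # North: high
             30, 32, 31, 33)   # West: middle
)

# Default factor levels are alphabetical: East = 1, North = 2, West = 3
r_alpha <- cor(toy$Sales, as.numeric(factor(toy$Region)))

# Same data, levels reordered by sales level: East = 1, West = 2, North = 3
r_sorted <- cor(toy$Sales,
                as.numeric(factor(toy$Region,
                                  levels = c("East", "West", "North"))))

print(c(alphabetical = r_alpha, reordered = r_sorted))
```

The data are identical in both calls, yet the coefficient changes substantially with the level order, which is why a comparison of group means is usually preferred for a nominal variable.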

Using ANOVA for comparison of means across different regions

# Calculated Sales and Region 
anova(lm(Calculated_Sales ~ Region, data = data))

## Analysis of Variance Table
## 
## Response: Calculated_Sales
##             Df     Sum Sq Mean Sq F value Pr(>F)
## Region       4     535630  133908  0.2598 0.9038
## Residuals 9989 5148803142  515447

# Calculated Revenue and Region 
anova(lm(Calculated_Revenue ~ Region, data = data))

## Analysis of Variance Table
## 
## Response: Calculated_Revenue
##             Df     Sum Sq Mean Sq F value Pr(>F)
## Region       4     295803   73951   0.331 0.8573
## Residuals 9989 2231834473  223429

- The p-value of the ANOVA for Calculated Sales and Region is 0.9038, which is greater than the significance level of 0.05. We therefore fail to reject the null hypothesis: the data provide no statistically significant evidence that mean Calculated Sales differ across regions.
- The p-value of the ANOVA for Calculated Revenue and Region is 0.8573, which is also greater than 0.05. Again, there is no statistically significant evidence that mean Calculated Revenue differs across regions.
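
The F value and p-value in the Calculated_Sales table can be reproduced from its sums of squares and degrees of freedom. The sketch below does that arithmetic and also computes eta squared, an effect size that the original analysis does not report and that is added here only as an illustration.

```r
# Sums of squares and degrees of freedom copied from the Calculated_Sales table
ss_region <- 535630
ss_resid  <- 5148803142
df_region <- 4
df_resid  <- 9989

# Mean squares and the F statistic
ms_region <- ss_region / df_region
ms_resid  <- ss_resid / df_resid
f_value   <- ms_region / ms_resid

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_region, df_resid, lower.tail = FALSE)

# Eta squared: share of total variation attributable to Region
eta_sq <- ss_region / (ss_region + ss_resid)

print(c(F = f_value, p = p_value, eta_squared = eta_sq))
```

Eta squared is far below 1%, so Region accounts for almost none of the variation in Calculated Sales, in line with the large p-value.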

Explanations on why the values make sense based on the visualization(s)

- For Pair 1 (Sales and Profit), the positive correlation coefficient of about 0.605 agrees with the scatter plot, where Profit tends to rise with Sales. This is what one would expect in a typical retail setting, where larger orders usually yield larger profits.
- For Pair 2 (Calculated Sales and Region), the ANOVA gives no evidence that mean Calculated Sales differ across regions. This agrees with the boxplots, where every region except North shows nearly the same median and interquartile range. North is the one visual exception, yet it does not produce a significant F statistic.
- For Pair 3 (Calculated Revenue and Region), the ANOVA likewise gives no evidence that mean Calculated Revenue differs across regions. This agrees with the boxplots, where every region except North shows nearly the same median and interquartile range.

Building a confidence interval for the response variable, Profit

# Confidence interval for Profit
t.test(data$Profit)$conf.int

## [1] 370.2325 379.6417
## attr(,"conf.level")
## [1] 0.95

The 95% confidence interval for mean Profit is [370.2325, 379.6417].
- The point estimate of mean Profit is the sample mean. Because a t-interval is symmetric about the sample mean, it equals the midpoint of the interval: (370.2325 + 379.6417) / 2 = 374.9371.
- We are 95% confident that the true population mean Profit lies between $370.23 and $379.64.
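
The interval returned by t.test() can also be built by hand from the sample mean, the standard error, and a t critical value. The sketch below does this for a simulated sample; the vector `x` is hypothetical and merely stands in for a numeric column such as Profit.

```r
# Hypothetical sample standing in for a numeric column such as Profit
set.seed(42)
x <- rnorm(200, mean = 375, sd = 240)

n      <- length(x)
x_bar  <- mean(x)                 # point estimate of the population mean
se     <- sd(x) / sqrt(n)         # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # critical value for a 95% interval

manual_ci <- c(x_bar - t_crit * se, x_bar + t_crit * se)
print(manual_ci)

# The interval is symmetric, so its midpoint is the sample mean
print(mean(manual_ci))
```

The manual interval should match as.numeric(t.test(x)$conf.int), which uses the same formula.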

Conclusion:
With 95% confidence, we estimate that the average Profit in the population lies between $370.23 and $379.64. This interval describes the mean, not the profit of any individual order, and it assumes the data are a representative sample. Biases in data collection or other external factors could affect the interpretation, so the context and limitations of the analysis should be kept in mind.