library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Three pairs of variables:
1: Sales and Profit
2: Calculated Sales (Sales + (Sales * Discount)) and Region
3: Calculated Revenue (Sales - Profit) and Region
library(ggplot2)
# For Sales and Profit
ggplot(data, aes(x = Sales, y = Profit)) +
  geom_point() +
  labs(title = "Relationship between Sales and Profit",
       x = "Sales",
       y = "Profit")
# For Calculated Sales and Region
data$Calculated_Sales <- data$Sales + (data$Sales * data$Discount)
ggplot(data, aes(x = Calculated_Sales, y = Region)) +
  geom_boxplot() +
  labs(title = "Distribution of Calculated Sales by Region",
       x = "Calculated Sales",
       y = "Region")
# For Calculated Revenue and Region
data$Calculated_Revenue <- data$Sales - data$Profit
ggplot(data, aes(x = Calculated_Revenue, y = Region)) +
  geom_boxplot() +
  labs(title = "Distribution of Calculated Revenue by Region",
       x = "Calculated Revenue",
       y = "Region")
1. Relationship between Sales and Profit:
In the scatter plot, Profit generally rises as Sales rise.
Conclusion: There appears to be a positive association between Sales
and Profit: orders with higher Sales tend to have higher Profit. The
plot shows the direction of the relationship, but its strength needs a
numerical measure such as the correlation coefficient.
2. Distribution of Calculated Sales by Region: All
the regions except North have very similar interquartile ranges and
medians. The box for North collapses to a single line, so no
interquartile range is visible, and its median lies between 1000 and
1500, whereas the medians of the other regions lie between 1500 and
2000.
Conclusion: The distribution of Calculated Sales is consistent across
regions in both interquartile range and median, with North as the only
exception. A collapsed box means the interquartile range in North is
zero or close to it. This can happen when the values are tightly
concentrated, but also when the region contains only one or a handful
of orders, so the number of North observations should be checked
before reading much into it. With that caveat, the lower median
suggests that typical Calculated Sales are lower in North than in the
other regions.
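Whether the collapsed North box reflects tightly clustered values or simply very few orders can be checked with a per-region summary. The sketch below is illustrative only: `region_summary` is a hypothetical helper that is not part of the original analysis, and the `toy` data frame contains made-up values so the block runs on its own.

```r
# Hypothetical helper: group size, median, and IQR of one numeric column per group.
region_summary <- function(df, value_col, group_col = "Region") {
  groups <- split(df[[value_col]], df[[group_col]])
  data.frame(
    group  = names(groups),
    n      = vapply(groups, length, integer(1)),
    median = vapply(groups, median, numeric(1)),
    iqr    = vapply(groups, IQR, numeric(1)),
    row.names = NULL
  )
}

# Made-up toy data (NOT the Supermart values): one region has a single row.
toy <- data.frame(
  Region = c("East", "East", "East", "North", "West", "West", "West"),
  Calculated_Sales = c(1000, 1500, 2000, 1200, 900, 1600, 2300)
)

summary_tbl <- region_summary(toy, "Calculated_Sales")
print(summary_tbl)
```

For the report itself, the equivalent call would be `region_summary(data, "Calculated_Sales")`. A group with `n` equal to 1 always has an IQR of 0, which would be enough to produce a box drawn as a single line.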
3. Distribution of Calculated Revenue by Region:
All the regions except North have very similar interquartile ranges
and medians. The box for North again collapses to a single line, and
its median lies between 750 and 1000, whereas the medians of the other
regions lie between 1000 and 1250.
Conclusion: As with Calculated Sales, the distribution of Calculated
Revenue is consistent across regions in interquartile range and
median, except for North. The collapsed box for North indicates an
interquartile range at or near zero, which may reflect either tightly
concentrated values or a very small number of orders in that region.
The lower median suggests that typical Calculated Revenue is lower in
North than in the other regions.
Possible Reasons for North region’s Distinct Behavior:
- The North region might have unique market dynamics or economic
conditions that influence both sales and revenue.
- Customer preferences, purchasing power, or demand in the North region
could differ from other regions, leading to variations in sales and
revenue.
Recommendations for Further Investigation:
- Conduct a more detailed market analysis specific to the North region
to identify factors contributing to the observed patterns.
- Gather customer feedback through surveys to understand preferences,
expectations, and the reasons behind purchasing decisions in the North
region.
cor(data$Sales, data$Profit)
## [1] 0.6053486
The correlation coefficient between Sales and Profit is approximately 0.605, indicating a moderate positive correlation. As Sales increase, Profit tends to increase as well.
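To make "moderate" more concrete, the printed coefficient can be converted into a share of explained variance and a test statistic. This is a sketch based only on the values shown above. It assumes all 9994 rows reported by read_csv have non-missing Sales and Profit, and it uses the standard t-test for a Pearson correlation.

```r
r <- 0.6053486  # cor(data$Sales, data$Profit), as printed above
n <- 9994       # row count reported by read_csv; assumes no missing values

# Share of variance in Profit linearly associated with Sales.
r_squared <- r^2

# t statistic for H0: true correlation is zero, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(c(r_squared = r_squared, t_stat = t_stat), 4)
```

Under these assumptions, roughly 37% of the variation in Profit is linearly associated with Sales. The association is clearly non-zero but leaves most of the variation unexplained, which is consistent with calling it moderate.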
cor(data$Calculated_Sales, as.numeric(factor(data$Region)))
## [1] 0.003277388
The correlation coefficient between Calculated_Sales and the numeric encoding of Region is approximately 0.0033, which is effectively zero. Region is a nominal variable, and as.numeric(factor(Region)) assigns codes in alphabetical order of the region names, so the sign and size of this coefficient depend on an arbitrary ordering and should not be read as a trend.
cor(data$Calculated_Revenue, as.numeric(factor(data$Region)))
## [1] 0.004943766
The correlation coefficient between Calculated_Revenue and the numeric encoding of Region is approximately 0.0049, which is again effectively zero. Because the numeric codes for Region are arbitrary labels rather than an ordered scale, this value gives no evidence of an association and should not be interpreted as a direction of change.
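The dependence on level ordering can be shown with a small example. The sketch below uses made-up values (not the Supermart data) and computes the same kind of correlation under two equally arbitrary orderings of the region labels.

```r
# Made-up toy data for illustration only.
toy <- data.frame(
  Region = c("East", "East", "North", "South", "South", "West", "West"),
  Calculated_Sales = c(1200, 1500, 900, 1800, 1600, 1400, 2000)
)

# Default encoding: factor levels sorted alphabetically
# (East = 1, North = 2, South = 3, West = 4).
r_alpha <- cor(toy$Calculated_Sales, as.numeric(factor(toy$Region)))

# Same data, different level order (South = 1, West = 2, East = 3, North = 4).
r_other <- cor(
  toy$Calculated_Sales,
  as.numeric(factor(toy$Region, levels = c("South", "West", "East", "North")))
)

c(alphabetical = r_alpha, reordered = r_other)
```

In this toy case the coefficient even changes sign when the levels are reordered, although the data are identical. For a nominal grouping variable, a comparison of group means (as in the ANOVA that follows) is the more appropriate tool.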
# Calculated Sales and Region
anova(lm(Calculated_Sales ~ Region, data = data))
## Analysis of Variance Table
##
## Response: Calculated_Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Region 4 535630 133908 0.2598 0.9038
## Residuals 9989 5148803142 515447
# Calculated Revenue and Region
anova(lm(Calculated_Revenue ~ Region, data = data))
## Analysis of Variance Table
##
## Response: Calculated_Revenue
## Df Sum Sq Mean Sq F value Pr(>F)
## Region 4 295803 73951 0.331 0.8573
## Residuals 9989 2231834473 223429
- The p-value for the ANOVA of Calculated Sales on Region is 0.9038,
which is greater than the significance level of 0.05. We therefore
fail to reject the null hypothesis that mean Calculated Sales are
equal across regions.
- The p-value for the ANOVA of Calculated Revenue on Region is 0.8573,
which is also greater than 0.05. We likewise fail to reject the null
hypothesis that mean Calculated Revenue is equal across regions.
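The F values and p-values in the two ANOVA tables can be reproduced from the printed sums of squares and degrees of freedom, which also gives an effect size. The sketch below uses only numbers copied from the tables above. Eta-squared (SS for Region divided by total SS) is added here as a common effect-size measure and was not part of the original output.

```r
# Values copied from the two ANOVA tables printed above.
anova_tbl <- data.frame(
  response  = c("Calculated_Sales", "Calculated_Revenue"),
  ss_region = c(535630, 295803),
  ss_resid  = c(5148803142, 2231834473),
  df_region = c(4, 4),
  df_resid  = c(9989, 9989)
)

# F = (SS_region / df_region) / (SS_resid / df_resid)
anova_tbl$f_value <- with(anova_tbl,
                          (ss_region / df_region) / (ss_resid / df_resid))

# Upper-tail probability of the F distribution gives Pr(>F).
anova_tbl$p_value <- with(anova_tbl,
                          pf(f_value, df_region, df_resid, lower.tail = FALSE))

# Eta-squared: share of total variation attributable to Region.
anova_tbl$eta_sq <- with(anova_tbl, ss_region / (ss_region + ss_resid))

anova_tbl[, c("response", "f_value", "p_value", "eta_sq")]
```

Both eta-squared values are on the order of 0.0001, meaning Region accounts for a negligible share of the variation in either variable. This matches the large p-values.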
- For Pair 1 (Sales and Profit), the positive correlation coefficient
agrees with the visual observation that Profit tends to increase as
Sales increase. This is what one would expect in a typical business
setting.
- For Pair 2 (Calculated Sales and Region), the non-significant ANOVA
means there is not enough evidence to conclude that mean Calculated
Sales differ across regions. This is consistent with the boxplots, in
which most regions have similar medians and interquartile ranges.
- For Pair 3 (Calculated Revenue and Region), the non-significant ANOVA
means there is not enough evidence to conclude that mean Calculated
Revenue differs across regions. This is again consistent with the
boxplots, in which most regions have similar medians and interquartile
ranges.
# Confidence interval for Profit
t.test(data$Profit)$conf.int
## [1] 370.2325 379.6417
## attr(,"conf.level")
## [1] 0.95
The 95% confidence interval for mean Profit is [370.2325, 379.6417].
- The point estimate for the mean Profit is the sample mean, which for
a t-based interval is the midpoint of the confidence interval:
(370.2325 + 379.6417) / 2 = 374.9371.
- We are 95% confident that the true population mean Profit lies between
$370.2325 and $379.6417.
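The arithmetic behind the interval can be unpacked from the two printed endpoints. This is a sketch based only on the output above. It assumes all 9994 rows have a non-missing Profit value, so the t-interval has 9993 degrees of freedom.

```r
ci <- c(370.2325, 379.6417)  # printed by t.test(data$Profit)$conf.int
n  <- 9994                   # rows reported by read_csv; assumes no missing Profit

point_estimate <- mean(ci)      # midpoint of the interval = sample mean
margin <- diff(ci) / 2          # half-width of the interval (margin of error)

# For a t-interval, margin = t_crit * standard error of the mean.
t_crit <- qt(0.975, df = n - 1)
std_error <- margin / t_crit

# Implied sample standard deviation, since SE = sd / sqrt(n).
sample_sd <- std_error * sqrt(n)

round(c(point_estimate = point_estimate, margin = margin,
        std_error = std_error, sample_sd = sample_sd), 4)
```

The interval is narrow (a margin of about 4.70 around 374.94) mainly because the sample is large. The implied spread of individual Profit values is much wider than the interval for the mean.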
Conclusion:
With 95% confidence, we estimate that the average Profit for the
entire population lies between $370.2325 and $379.6417. This interval
describes the mean, not individual orders, and it assumes the data are
a representative sample. External factors or biases in data collection
could affect the interpretation, so the result should be read in light
of the context and limitations of the analysis.