Part 1: Build at least three sets of variable combinations
Part 2: Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot
Part 3: Calculate the appropriate correlation coefficient for each of these combinations
Part 4: Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.
library(tidyverse)
library(ggthemes)  # provides theme_hc(), used in the plots below
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Superstore_data=read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
Build at least three sets of variable combinations
superstore_set_1 <- Superstore_data |>
select(Sales, Quantity, Discount, Profit)
head(superstore_set_1)
## Sales Quantity Discount Profit
## 1 261.9600 2 0.00 41.9136
## 2 731.9400 3 0.00 219.5820
## 3 14.6200 2 0.00 6.8714
## 4 957.5775 5 0.45 -383.0310
## 5 22.3680 2 0.20 2.5164
## 6 48.8600 7 0.00 14.1694
Set 1: a visualization of each response-explanatory relationship, followed by the correlation coefficient and a confidence interval for the response variable (Profit):
ggplot(data = superstore_set_1, aes(x = Sales, y = Profit))+
geom_point() +
labs(x = "Sales", y = "Profit") +
ggtitle("Profit vs. Sales") +
theme_hc()
There is no clean linear relationship between Profit and Sales, but most points follow an upward trend: as sales increase, profit tends to increase as well.
Calculate the correlation coefficient between Profit and Sales:
correlation_coefficient <- round(cor(superstore_set_1$Profit, superstore_set_1$Sales),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.48
A correlation of 0.48 indicates a moderate positive linear relationship between Profit and Sales.
The visualization and the correlation agree: Profit tends to rise with Sales, but the relationship is far from perfect.
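A single coefficient says nothing about its own uncertainty. As a minimal sketch, `cor.test()` from base R reports a confidence interval for Pearson's r, and `method = "spearman"` gives a rank-based coefficient that is less sensitive to extreme values. The data frame `toy` below is simulated only so that the block runs on its own; it is not the Superstore data, and its numbers should not be read as results.

```r
# Simulated stand-in for superstore_set_1 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(Sales = rexp(200, rate = 1 / 200))
toy$Profit <- 0.15 * toy$Sales + rnorm(200, sd = 40)

# Pearson correlation together with a 95% confidence interval
pearson <- cor.test(toy$Profit, toy$Sales, method = "pearson")
pearson$estimate   # sample correlation r
pearson$conf.int   # 95% confidence interval for r

# Rank-based alternative, less influenced by outliers
spearman <- cor(toy$Profit, toy$Sales, method = "spearman")
spearman
```

With the real data, `toy` would be replaced by `superstore_set_1`.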
ggplot(data = superstore_set_1, aes(x = Quantity, y = Profit))+
geom_point() +
labs(x = "Quantity", y = "Profit") +
ggtitle("Profit vs. Quantity") +
theme_hc()
There is no clear relationship between Profit and Quantity; the points are scattered with no visible trend.
Calculate the correlation coefficient between Profit and Quantity:
correlation_coefficient <- round(cor(superstore_set_1$Profit, superstore_set_1$Quantity),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.07
The correlation between Profit and Quantity is 0.07, which is close to zero, so there is essentially no linear relationship.
The visualization and the correlation agree: Profit cannot be predicted from Quantity in any useful linear way.
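Quantity takes only whole-number values, so many points in a plain scatterplot sit on top of each other. A possible refinement, sketched here on simulated data (the `toy` data frame and its value ranges are assumptions, not the Superstore data), is to jitter the points horizontally and make them semi-transparent so that dense regions become visible.

```r
library(ggplot2)

# Simulated stand-in for superstore_set_1 (hypothetical values)
set.seed(2)
toy <- data.frame(
  Quantity = sample(1:14, 500, replace = TRUE),
  Profit   = rnorm(500, mean = 30, sd = 200)
)

# Jitter horizontally only, and use transparency so stacked points show up as darker areas
p <- ggplot(toy, aes(x = Quantity, y = Profit)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  labs(x = "Quantity", y = "Profit") +
  ggtitle("Profit vs. Quantity (jittered)")
p
```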
ggplot(data = superstore_set_1, aes(x = Discount, y = Profit))+
geom_point() +
labs(x = "Discount", y = "Profit") +
ggtitle("Profit vs. Discount") +
theme_hc()
The plot suggests a negative trend: as the discount increases, profit tends to decrease.
Calculate the correlation coefficient between Profit and Discount:
correlation_coefficient <- round(cor(superstore_set_1$Profit, superstore_set_1$Discount),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] -0.22
The correlation between Profit and Discount is -0.22: negative, but weak rather than strong.
The visualization and the correlation agree on the direction: higher discounts are associated with lower profit. Discounts should therefore be used cautiously if the superstore is aiming for profit, although a correlation alone does not show that discounts cause the losses.
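Because Discount takes only a handful of distinct values, a per-level summary can show the pattern more directly than one coefficient. The sketch below uses simulated data (the `toy` data frame, its discount levels, and the name `profit_by_discount` are illustrative assumptions) to show the shape of such a summary with dplyr.

```r
library(dplyr)

# Simulated stand-in for superstore_set_1 (hypothetical values)
set.seed(3)
toy <- data.frame(Discount = sample(c(0, 0.2, 0.4, 0.8), 400, replace = TRUE))
toy$Profit <- 50 - 150 * toy$Discount + rnorm(400, sd = 60)

# Number of orders and typical profit at each discount level
profit_by_discount <- toy |>
  group_by(Discount) |>
  summarise(
    n             = n(),
    mean_profit   = mean(Profit),
    median_profit = median(Profit)
  )
profit_by_discount
```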
To calculate a 95% confidence interval for the mean of the “Profit” variable:
print(paste("Min of Profit",min(superstore_set_1$Profit)))
## [1] "Min of Profit -6599.978"
print(paste("Max of Profit",max(superstore_set_1$Profit)))
## [1] "Max of Profit 8399.976"
# Calculate the sample mean and standard error of the mean for the Profit variable
profit_mean <- mean(superstore_set_1$Profit)
profit_se <- sd(superstore_set_1$Profit)/sqrt(length(superstore_set_1$Profit))
# Calculate the t-critical value for the 95% confidence interval with 10 degrees of freedom
t_critical <- qt(0.975, df = 10)
# Calculate the margin of error
margin_of_error <- t_critical * profit_se
# Construct the 95% confidence interval for the Profit variable
profit_ci_upper <- profit_mean + margin_of_error
profit_ci_lower <- profit_mean - margin_of_error
print(paste("95% confidence interval for mean Profit: ",profit_ci_lower," to ",profit_ci_upper))
## [1] "95% confidence interval for mean Profit: 23.4356892364713 to 33.878103379098"
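The block above uses `df = 10` for the critical value. The usual t interval for a mean uses n - 1 degrees of freedom; for a sample with thousands of rows this gives a critical value close to 1.96 and a somewhat narrower interval. A minimal sketch follows, with a hypothetical helper `mean_ci()` and simulated data standing in for the Profit column. It also checks the manual calculation against base R's `t.test()`.

```r
# Hypothetical helper: t-based confidence interval for a mean, using n - 1 degrees of freedom
mean_ci <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                               # standard error of the mean
  t_critical <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean(x) - t_critical * se,
    upper = mean(x) + t_critical * se)
}

# Simulated stand-in for the Profit column (hypothetical values)
set.seed(4)
profit <- rnorm(1000, mean = 30, sd = 230)

ci_manual <- mean_ci(profit)
ci_ttest  <- t.test(profit, conf.level = 0.95)$conf.int  # built-in equivalent
ci_manual
ci_ttest
```

With the real data, `mean_ci(superstore_set_1$Profit)` would give the interval directly.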
# Convert Category to numbers
unique(Superstore_data$Category)
## [1] "Furniture" "Office Supplies" "Technology"
# Create a new column for Category_numeric
Superstore_data <- Superstore_data %>%
mutate(Category_numeric = recode(Category, "Furniture" = 1, "Office Supplies" = 2, "Technology" = 3))
Superstore_data |> select(Category,Category_numeric) |> head()
## Category Category_numeric
## 1 Furniture 1
## 2 Furniture 1
## 3 Office Supplies 2
## 4 Furniture 1
## 5 Office Supplies 2
## 6 Furniture 1
superstore_set_2 <- Superstore_data |>
select(Quantity,Discount,Category,Category_numeric, Sales)
head(superstore_set_2)
## Quantity Discount Category Category_numeric Sales
## 1 2 0.00 Furniture 1 261.9600
## 2 3 0.00 Furniture 1 731.9400
## 3 2 0.00 Office Supplies 2 14.6200
## 4 5 0.45 Furniture 1 957.5775
## 5 2 0.20 Office Supplies 2 22.3680
## 6 7 0.00 Furniture 1 48.8600
In this set, Sales is the response variable, and Quantity, Discount, and Category are the explanatory variables.
Set 2: a visualization of each response-explanatory relationship, followed by the correlation coefficient and a confidence interval for the response variable (Sales):
ggplot(data = superstore_set_2, aes(x = Quantity, y = Sales))+
geom_point() +
labs(x = "Quantity", y = "Sales") +
ggtitle("Sales vs. Quantity") +
theme_hc()
There is no clear relationship between Sales and Quantity. Most points sit in the low-sales region, where quantity varies across its whole range.
Calculate the correlation coefficient between Quantity and Sales:
correlation_coefficient <- round(cor(superstore_set_2$Quantity, superstore_set_2$Sales),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.2
The correlation of 0.2 is positive but small, so the linear relationship between Sales and Quantity is weak.
The visualization and the correlation agree: Quantity on its own says little about Sales.
ggplot(data = superstore_set_2, aes(x = Discount, y = Sales))+
geom_point() +
labs(x = "Discount", y = "Sales") +
ggtitle("Sales vs. Discount") +
theme_hc()
There is no clear relationship between Sales and Discount. The points are scattered, although on close inspection sales appear to decrease slightly as the discount increases.
Calculate the correlation coefficient between Discount and Sales:
correlation_coefficient <- round(cor(superstore_set_2$Discount, superstore_set_2$Sales),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] -0.03
The correlation is -0.03: negative in sign, but so close to zero that it indicates no meaningful linear relationship between Sales and Discount.
The visualization and the correlation agree: Discount on its own says little about Sales.
ggplot(data = superstore_set_2, aes(x = Category_numeric, y = Sales))+
geom_point() +
labs(x = "Category_numeric (Furniture = 1, Office Supplies = 2, Technology = 3 )", y = "Sales") +
ggtitle("Sales vs. Category_numeric") +
theme_hc()
Here 1, 2, and 3 in Category_numeric stand for the categories “Furniture” = 1, “Office Supplies” = 2, and “Technology” = 3.
There is no clear relationship between Sales and Category. The points are scattered, but on close inspection sales tend to be higher toward the Technology category, which hints at a positive relationship under the ordering Technology > Office Supplies > Furniture.
Calculate the correlation coefficient between Category and Sales:
correlation_coefficient <- round(cor(superstore_set_2$Category_numeric, superstore_set_2$Sales),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.04
The correlation of 0.04 is positive but very close to zero, so there is no meaningful linear relationship between Sales and Category_numeric. Because the numeric codes are arbitrary labels for a nominal variable, the sign and size of this coefficient depend on the coding chosen.
The visualization and the correlation agree: Category on its own says little about Sales. Sales may be higher for certain categories of products, but a coefficient this small cannot confirm it.
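To illustrate why a Pearson coefficient on a recoded nominal variable is fragile, the sketch below applies two equally arbitrary codings to the same simulated data and then computes eta squared from a one-way ANOVA, a summary that does not depend on any coding. The `toy` data frame, both coding vectors, and the size of the Technology effect are assumptions made for illustration only.

```r
# Simulated stand-in for superstore_set_2 (hypothetical values)
set.seed(5)
toy <- data.frame(
  Category = sample(c("Furniture", "Office Supplies", "Technology"), 300, replace = TRUE)
)
toy$Sales <- rexp(300, rate = 1 / 200) + ifelse(toy$Category == "Technology", 150, 0)

# Two equally arbitrary numeric codings of the same categories
coding_a <- c("Furniture" = 1, "Office Supplies" = 2, "Technology" = 3)
coding_b <- c("Technology" = 1, "Furniture" = 2, "Office Supplies" = 3)
r_a <- cor(coding_a[toy$Category], toy$Sales)
r_b <- cor(coding_b[toy$Category], toy$Sales)
c(r_a = r_a, r_b = r_b)  # the two values differ although the data are identical

# Coding-free summary: share of the variance in Sales explained by Category (eta squared)
fit <- aov(Sales ~ Category, data = toy)
ss  <- summary(fit)[[1]][["Sum Sq"]]   # sums of squares: Category, then residuals
eta_squared <- ss[1] / sum(ss)
eta_squared
```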
To calculate a 95% confidence interval for the mean of the “Sales” variable:
print(paste("Min of Sales",min(superstore_set_2$Sales)))
## [1] "Min of Sales 0.444"
print(paste("Max of Sales",max(superstore_set_2$Sales)))
## [1] "Max of Sales 22638.48"
# Calculate the sample mean and standard error of the mean for the Sales variable
sales_mean <- mean(superstore_set_2$Sales)
sales_se <- sd(superstore_set_2$Sales)/sqrt(length(superstore_set_2$Sales))
# Calculate the t-critical value for the 95% confidence interval with 10 degrees of freedom
t_critical <- qt(0.975, df = 10)
# Calculate the margin of error
margin_of_error <- t_critical * sales_se
# Construct the 95% confidence interval for the Sales variable
sales_ci_upper <- sales_mean + margin_of_error
sales_ci_lower <- sales_mean - margin_of_error
print(paste("95% confidence interval for mean Sales: ",sales_ci_lower," to ",sales_ci_upper))
## [1] "95% confidence interval for mean Sales: 215.967066697444 to 243.748934963553"
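Sales ranges from 0.444 to 22638.48 with a mean near 230, so the distribution is strongly right-skewed. One way to check that a t-based interval is reasonable for such data is a percentile bootstrap of the mean. The sketch below runs on simulated log-normal values (an assumption standing in for the Sales column) and compares the bootstrap interval with the t interval.

```r
# Simulated right-skewed stand-in for the Sales column (hypothetical values)
set.seed(6)
sales <- rlnorm(2000, meanlog = 4, sdlog = 1.5)

# Percentile bootstrap: resample with replacement and recompute the mean each time
boot_means <- replicate(2000, mean(sample(sales, replace = TRUE)))
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci

# t-based interval (n - 1 degrees of freedom) for comparison
t_ci <- t.test(sales)$conf.int
t_ci
```

If the two intervals are close, the sample is large enough for the t interval to be trusted despite the skew.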
# Convert Region to numbers for correlation purposes
unique(Superstore_data$Region)
## [1] "South" "West" "Central" "East"
# Create a new column for region_numeric
Superstore_data <- Superstore_data %>%
mutate(Region_numeric = recode(Region, "South" = 1, "West" = 2, "East" = 3, "Central" = 4))
Superstore_data |> select(Region,Region_numeric) |> head()
## Region Region_numeric
## 1 South 1
## 2 South 1
## 3 West 2
## 4 South 1
## 5 South 1
## 6 West 2
# Convert Ship.Mode to numbers for correlation purposes
unique(Superstore_data$Ship.Mode)
## [1] "Second Class" "Standard Class" "First Class" "Same Day"
Superstore_data <- Superstore_data %>%
mutate(ShipMode_numeric = recode(Ship.Mode, "Same Day"= 1, "First Class" = 2, "Second Class" = 3, "Standard Class" = 4))
Superstore_data |> select(Ship.Mode,ShipMode_numeric) |> head()
## Ship.Mode ShipMode_numeric
## 1 Second Class 3
## 2 Second Class 3
## 3 Second Class 3
## 4 Standard Class 4
## 5 Standard Class 4
## 6 Standard Class 4
superstore_set_3 <- Superstore_data |>
select(Quantity,Region,Region_numeric,Ship.Mode,ShipMode_numeric,Discount)
head(superstore_set_3)
## Quantity Region Region_numeric Ship.Mode ShipMode_numeric Discount
## 1 2 South 1 Second Class 3 0.00
## 2 3 South 1 Second Class 3 0.00
## 3 2 West 2 Second Class 3 0.00
## 4 5 South 1 Standard Class 4 0.45
## 5 2 South 1 Standard Class 4 0.20
## 6 7 West 2 Standard Class 4 0.00
In this set, Discount is the response variable, and Quantity, Region, and Ship Mode are the explanatory variables.
Set 3: a visualization of each response-explanatory relationship, followed by the correlation coefficient and a confidence interval for the response variable (Discount):
ggplot(data = superstore_set_3, aes(x = Quantity, y = Discount))+
geom_point() +
labs(x = "Quantity", y = "Discount") +
ggtitle("Discount vs. Quantity") +
theme_hc()
There is no clear relationship between Discount and Quantity; the points are scattered across the plot.
Calculate the correlation coefficient between Quantity and Discount:
correlation_coefficient <- round(cor(superstore_set_3$Quantity, superstore_set_3$Discount),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.01
The correlation of 0.01 is effectively zero, so there is no linear relationship between Discount and Quantity.
The visualization and the correlation agree: Quantity says essentially nothing about Discount.
ggplot(data = superstore_set_3, aes(x =Region_numeric , y = Discount))+
geom_point() +
labs(x = "Region_numeric (South = 1, West = 2, East = 3, Central = 4)", y = "Discount") +
ggtitle("Discount vs. Region_numeric ") +
theme_hc()
There is no clear relationship between Region and Discount; the points are scattered across the plot.
Calculate the correlation coefficient between Discount and Region:
correlation_coefficient <- round(cor(superstore_set_3$Discount, superstore_set_3$Region_numeric),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.18
The correlation of 0.18 between Discount and Region_numeric is positive but small, so any linear relationship is weak. The region codes are arbitrary labels, so the value also depends on the order in which the regions were numbered.
The visualization and the correlation agree: Region on its own says little about Discount.
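For a categorical explanatory variable such as Region, a box plot per group and a table of group means are often easier to read than a scatterplot against numeric codes. The following is a sketch on simulated data; the `toy` data frame and `region_summary` are illustrative names, and the values are not taken from the Superstore file.

```r
library(ggplot2)

# Simulated stand-in for superstore_set_3 (hypothetical values)
set.seed(8)
toy <- data.frame(
  Region   = sample(c("South", "West", "East", "Central"), 400, replace = TRUE),
  Discount = sample(c(0, 0.2, 0.4, 0.8), 400, replace = TRUE)
)

# One box per region, with the category names on the x axis instead of numeric codes
p <- ggplot(toy, aes(x = Region, y = Discount)) +
  geom_boxplot() +
  labs(x = "Region", y = "Discount") +
  ggtitle("Discount by Region")
p

# Mean discount in each region
region_summary <- aggregate(Discount ~ Region, data = toy, FUN = mean)
region_summary
```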
ggplot(data = superstore_set_3, aes(x = ShipMode_numeric, y = Discount))+
geom_point() +
labs(x = "ShipMode_numeric (Same Day= 1, First Class = 2, Second Class = 3, Standard Class = 4)", y = "Discount") +
ggtitle("Discount vs. ShipMode_numeric") +
theme_hc()
Here 1, 2, 3, and 4 in ShipMode_numeric stand for the categories Same Day = 1, First Class = 2, Second Class = 3, and Standard Class = 4.
There is no clear relationship between ShipMode_numeric and Discount; the points are scattered across the plot.
Calculate the correlation coefficient between ShipMode and Discount:
correlation_coefficient <- round(cor(superstore_set_3$ShipMode_numeric, superstore_set_3$Discount),2)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.01
The correlation of 0.01 is effectively zero, so there is no linear relationship between Discount and ShipMode.
The visualization and the correlation agree: ShipMode says essentially nothing about Discount.
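Ship mode can reasonably be read as ordered from fastest to slowest, which makes a rank-based measure a natural alternative to Pearson's r on the numeric codes. As a sketch on simulated data (the `toy` data frame and the fastest-to-slowest ordering are assumptions), Spearman's rho uses only the ordering, and a Kruskal-Wallis test compares Discount across the four groups without relying on any numeric coding.

```r
# Simulated stand-in for superstore_set_3 (hypothetical values)
set.seed(7)
modes <- c("Same Day", "First Class", "Second Class", "Standard Class")
toy <- data.frame(
  Ship.Mode = sample(modes, 400, replace = TRUE),
  Discount  = sample(c(0, 0.2, 0.4, 0.8), 400, replace = TRUE)
)

# Treat ship mode as an ordered factor, fastest to slowest (an assumption about the ordering)
toy$Ship.Mode <- factor(toy$Ship.Mode, levels = modes, ordered = TRUE)

# Spearman's rho depends only on the ranks, not on the spacing of the codes
rho <- cor(as.integer(toy$Ship.Mode), toy$Discount, method = "spearman")
rho

# Kruskal-Wallis test: do the discount distributions differ between ship modes?
kw <- kruskal.test(Discount ~ Ship.Mode, data = toy)
kw$p.value
```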
To calculate a 95% confidence interval for the mean of the “Discount” variable:
print(paste("Min of Discount",min(superstore_set_3$Discount)))
## [1] "Min of Discount 0"
print(paste("Max of Discount",max(superstore_set_3$Discount)))
## [1] "Max of Discount 0.8"
# Calculate the sample mean and standard error of the mean for the Discount variable
dist_mean <- mean(superstore_set_3$Discount)
dist_se <- sd(superstore_set_3$Discount)/sqrt(length(superstore_set_3$Discount))
# Calculate the t-critical value for the 95% confidence interval with 10 degrees of freedom
t_critical <- qt(0.975, df = 10)
# Calculate the margin of error
margin_of_error <- t_critical * dist_se
# Construct the 95% confidence interval for the Discount variable
Discount_ci_upper <- dist_mean + margin_of_error
Discount_ci_lower <- dist_mean - margin_of_error
print(paste("95% confidence interval for mean Discount: ",Discount_ci_lower," to ",Discount_ci_upper))
## [1] "95% confidence interval for mean Discount: 0.151601304494897 to 0.160804138771062"