ASSIGNMENT 6

Task(s)

  • Part 1: Build at least three sets of variable combinations

    • For each set of variables, include at least one column that you created (i.e., calculated based on others)
    • All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., [‘small’, ‘medium’, ‘large’] is okay, but [“apples”, “oranges”, “bananas”] is not)
    • For each set, there should be one response variable with the others as explanatory variables
  • Part 2: Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot

    • Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?)
  • Part 3: Calculate the appropriate correlation coefficient for each of these combinations

    • Explain why the value makes sense (or doesn’t) based on the visualization(s)
  • Part 4: Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.


Read the Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Superstore_data <- read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
##        Ship.Mode   Segment       Country            City      State Postal.Code
## 1   Second Class  Consumer United States       Henderson   Kentucky       42420
## 2   Second Class  Consumer United States       Henderson   Kentucky       42420
## 3   Second Class Corporate United States     Los Angeles California       90036
## 4 Standard Class  Consumer United States Fort Lauderdale    Florida       33311
## 5 Standard Class  Consumer United States Fort Lauderdale    Florida       33311
## 6 Standard Class  Consumer United States     Los Angeles California       90032
##   Region        Category Sub.Category    Sales Quantity Discount    Profit
## 1  South       Furniture    Bookcases 261.9600        2     0.00   41.9136
## 2  South       Furniture       Chairs 731.9400        3     0.00  219.5820
## 3   West Office Supplies       Labels  14.6200        2     0.00    6.8714
## 4  South       Furniture       Tables 957.5775        5     0.45 -383.0310
## 5  South Office Supplies      Storage  22.3680        2     0.20    2.5164
## 6   West       Furniture  Furnishings  48.8600        7     0.00   14.1694

1. Set 1: - Predicting Profit


superstore_set_1 <- Superstore_data |>
  select(Sales, Quantity, Discount, Profit)
head(superstore_set_1)
##      Sales Quantity Discount    Profit
## 1 261.9600        2     0.00   41.9136
## 2 731.9400        3     0.00  219.5820
## 3  14.6200        2     0.00    6.8714
## 4 957.5775        5     0.45 -383.0310
## 5  22.3680        2     0.20    2.5164
## 6  48.8600        7     0.00   14.1694


Suppose we want to calculate a 95% confidence interval for the mean of the “Profit” variable:

print(paste("Min of Profit",min(superstore_set_1$Profit)))
## [1] "Min of Profit -6599.978"
print(paste("Max of Profit",max(superstore_set_1$Profit)))
## [1] "Max of Profit 8399.976"
# Calculate the sample mean and standard error of the mean for the Profit variable
profit_mean <- mean(superstore_set_1$Profit)
profit_se <- sd(superstore_set_1$Profit)/sqrt(length(superstore_set_1$Profit))

# Calculate the t-critical value for the 95% confidence interval
# (df is hard-coded to 10 here; the conventional choice is n - 1)
t_critical <- qt(0.975, df = 10)

# Calculate the margin of error
margin_of_error <- t_critical * profit_se

# Construct the 95% confidence interval for the Profit variable
profit_ci_upper <- profit_mean + margin_of_error
profit_ci_lower <- profit_mean - margin_of_error

print(paste("Confidence interval of Profit: ",profit_ci_lower," to ",profit_ci_upper))
## [1] "Confidence interval of Profit:  23.4356892364713  to  33.878103379098"
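The interval above uses a fixed 10 degrees of freedom. As a minimal sketch, assuming the usual one-sample t-interval with df = n - 1, the same steps can be wrapped in a small helper. The function name `mean_ci` and the toy vector `x` are hypothetical and are not part of the original analysis; base R's `t.test()` is used only as a cross-check.

```r
# Hypothetical helper: t-based confidence interval for a mean,
# using df = n - 1 (an assumption; the code above uses df = 10).
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                              # drop missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Toy data for illustration only (not from the Superstore file)
x <- c(41.9, 219.6, 6.9, -383.0, 2.5, 14.2)
ci <- mean_ci(x)
print(ci)

# Cross-check against base R's t.test()
print(t.test(x, conf.level = 0.95)$conf.int)
```

On the full data this would be called as `mean_ci(superstore_set_1$Profit)`. When n is large, the critical value approaches 1.96 rather than the roughly 2.23 given by df = 10, so the resulting interval would be somewhat narrower.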

2. Set 2: - Predicting Sales


# Convert Category to numbers
unique(Superstore_data$Category)
## [1] "Furniture"       "Office Supplies" "Technology"
# Create a new column, Category_numeric
Superstore_data <- Superstore_data %>%
  mutate(Category_numeric = recode(Category, "Furniture" = 1, "Office Supplies" = 2, "Technology" = 3))


Superstore_data |> select(Category,Category_numeric) |> head()
##          Category Category_numeric
## 1       Furniture                1
## 2       Furniture                1
## 3 Office Supplies                2
## 4       Furniture                1
## 5 Office Supplies                2
## 6       Furniture                1
superstore_set_2 <- Superstore_data |>
  select(Quantity,Discount,Category,Category_numeric, Sales)
head(superstore_set_2)
##   Quantity Discount        Category Category_numeric    Sales
## 1        2     0.00       Furniture                1 261.9600
## 2        3     0.00       Furniture                1 731.9400
## 3        2     0.00 Office Supplies                2  14.6200
## 4        5     0.45       Furniture                1 957.5775
## 5        2     0.20 Office Supplies                2  22.3680
## 6        7     0.00       Furniture                1  48.8600

In the above set, we will try to predict Sales (the response variable) from Quantity, Discount, and Category.


Here 1, 2, and 3 in Category_numeric indicate the following categories: “Furniture” = 1, “Office Supplies” = 2, “Technology” = 3.

correlation_coefficient <- round(cor(superstore_set_2$Category_numeric, superstore_set_2$Sales),2)

# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.04
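A coefficient this close to zero is worth checking against sampling noise. As a hedged sketch, base R's `cor.test()` reports a confidence interval and p-value for a Pearson correlation. The toy vectors below are invented for illustration and are not taken from the Superstore file. The last step also illustrates that the 1/2/3 coding of Category is a labeling choice, so the coefficient depends on which category receives which number.

```r
# Toy data for illustration only (not from the Superstore file)
category_numeric <- c(1, 1, 2, 1, 2, 1, 3, 3, 2, 3)
sales <- c(262.0, 731.9, 14.6, 957.6, 22.4, 48.9, 907.2, 911.4, 18.5, 114.9)

# Pearson correlation with a 95% confidence interval and p-value
result <- cor.test(category_numeric, sales, method = "pearson", conf.level = 0.95)
print(round(unname(result$estimate), 2))
print(result$conf.int)

# Relabel the categories (1 -> 3, 2 -> 1, 3 -> 2) and recompute;
# a different labeling gives a different coefficient
recoded <- c(`1` = 3, `2` = 1, `3` = 2)[as.character(category_numeric)]
print(round(cor(recoded, sales), 2))
```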

Suppose we want to calculate a 95% confidence interval for the mean of the “Sales” variable:

print(paste("Min of Sales",min(superstore_set_2$Sales)))
## [1] "Min of Sales 0.444"
print(paste("Max of Sales",max(superstore_set_2$Sales)))
## [1] "Max of Sales 22638.48"
# Calculate the sample mean and standard error of the mean for the Sales variable
sales_mean <- mean(superstore_set_2$Sales)
sales_se <- sd(superstore_set_2$Sales)/sqrt(length(superstore_set_2$Sales))

# Calculate the t-critical value for the 95% confidence interval
# (df is hard-coded to 10 here; the conventional choice is n - 1)
t_critical <- qt(0.975, df = 10)

# Calculate the margin of error
margin_of_error <- t_critical * sales_se

# Construct the 95% confidence interval for the Sales variable
sales_ci_upper <- sales_mean + margin_of_error
sales_ci_lower <- sales_mean - margin_of_error

print(paste("Confidence interval of Sales: ",sales_ci_lower," to ",sales_ci_upper))
## [1] "Confidence interval of Sales:  215.967066697444  to  243.748934963553"

3. Set 3: - Predicting Discount at which products are sold

# Convert Region to numbers for correlation purposes
unique(Superstore_data$Region)
## [1] "South"   "West"    "Central" "East"
# Create a new column, Region_numeric
Superstore_data <- Superstore_data %>%
  mutate(Region_numeric = recode(Region, "South" = 1, "West" = 2, "East" = 3, "Central" = 4))


Superstore_data |> select(Region,Region_numeric) |> head()
##   Region Region_numeric
## 1  South              1
## 2  South              1
## 3   West              2
## 4  South              1
## 5  South              1
## 6   West              2
# Convert Ship.Mode to numbers for correlation purposes
unique(Superstore_data$Ship.Mode)
## [1] "Second Class"   "Standard Class" "First Class"    "Same Day"
Superstore_data <- Superstore_data %>%
  mutate(ShipMode_numeric = recode(Ship.Mode, "Same Day"= 1, "First Class" = 2, "Second Class" = 3, "Standard Class" = 4))


Superstore_data |> select(Ship.Mode,ShipMode_numeric) |> head()
##        Ship.Mode ShipMode_numeric
## 1   Second Class                3
## 2   Second Class                3
## 3   Second Class                3
## 4 Standard Class                4
## 5 Standard Class                4
## 6 Standard Class                4
superstore_set_3 <- Superstore_data |>
  select(Quantity,Region,Region_numeric,Ship.Mode,ShipMode_numeric,Discount)
head(superstore_set_3)
##   Quantity Region Region_numeric      Ship.Mode ShipMode_numeric Discount
## 1        2  South              1   Second Class                3     0.00
## 2        3  South              1   Second Class                3     0.00
## 3        2   West              2   Second Class                3     0.00
## 4        5  South              1 Standard Class                4     0.45
## 5        2  South              1 Standard Class                4     0.20
## 6        7   West              2 Standard Class                4     0.00

In the above set, we will try to find a relationship between the response variable, Discount, and each explanatory variable, i.e., Quantity, Region, and ShipMode.


Here 1, 2, 3, and 4 in ShipMode_numeric indicate the following categories: Same Day = 1, First Class = 2, Second Class = 3, Standard Class = 4.

correlation_coefficient <- round(cor(superstore_set_3$ShipMode_numeric, superstore_set_3$Discount),2)

# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.01
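ShipMode_numeric is an ordered code (here assumed to run from fastest to slowest delivery) rather than a measured quantity, so a rank-based coefficient may be a better fit than Pearson's. A minimal sketch, assuming Spearman's rank correlation is acceptable for this assignment; the toy vectors are invented for illustration and are not taken from the Superstore file:

```r
# Toy data for illustration only (not from the Superstore file)
# Codes follow the mapping above: Same Day = 1 ... Standard Class = 4
ship_mode_numeric <- c(3, 3, 3, 4, 4, 4, 1, 2, 2, 4)
discount <- c(0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.20, 0.00, 0.30, 0.10)

# Pearson (the default method of cor) versus Spearman (rank-based)
pearson_r <- cor(ship_mode_numeric, discount, method = "pearson")
spearman_rho <- cor(ship_mode_numeric, discount, method = "spearman")

print(round(c(pearson = pearson_r, spearman = spearman_rho), 2))
```

Spearman's coefficient is the Pearson correlation of the ranks, so it depends only on the ordering of the codes and not on the spacing between 1, 2, 3, and 4.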

Suppose we want to calculate a 95% confidence interval for the mean of the “Discount” variable:

print(paste("Min of Discount",min(superstore_set_3$Discount)))
## [1] "Min of Discount 0"
print(paste("Max of Discount",max(superstore_set_3$Discount)))
## [1] "Max of Discount 0.8"
# Calculate the sample mean and standard error of the mean for the Discount variable
dist_mean <- mean(superstore_set_3$Discount)
dist_se <- sd(superstore_set_3$Discount)/sqrt(length(superstore_set_3$Discount))

# Calculate the t-critical value for the 95% confidence interval
# (df is hard-coded to 10 here; the conventional choice is n - 1)
t_critical <- qt(0.975, df = 10)

# Calculate the margin of error
margin_of_error <- t_critical * dist_se

# Construct the 95% confidence interval for the Discount variable
Discount_ci_upper <- dist_mean + margin_of_error
Discount_ci_lower <- dist_mean - margin_of_error

print(paste("Confidence interval of Discount: ",Discount_ci_lower," to ",Discount_ci_upper))
## [1] "Confidence interval of Discount:  0.151601304494897  to  0.160804138771062"