Project 1

2023-02-06

Data Set Background

Instacart is a grocery shopping and delivery service that aims to make it simple to stock your cupboard and fridge with your preferred foods and essentials whenever you need them. Once you place an order through the Instacart app, personal shoppers review it, shop for you in-store, and deliver your items. The data set for this competition is a relational collection of files describing customers' orders over time. The competition's objective is to predict which products will be in a user's next order. The data set is an anonymized sample of more than 3 million grocery orders from more than 200,000 Instacart users.

The data set used here, titled “Instacart Market Basket Orders”, has 5 columns and 500 rows.
We will try to model this data set through regression and machine learning.
We want to see whether reorders are affected by the purchase frequency and the number of products ordered.
The data set can be found here

Problem Definition

The problem at hand is that the data set does not by itself show the relationship between reorders, the number of products in an order, and the number of days between orders. We also do not know whether the target mean of 6 reorders is met.
We want to model the data using regression and statistical interpretation in order to better understand it.
We want to determine whether the sample data indicate that purchase frequency and the number of products ordered have an impact on reorders.
We will also use hypothesis testing to determine whether the target mean is met.

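# Packages used by the code in this report
library(dplyr)       # %>%, group_by(), summarize()
library(kableExtra)  # kbl(), kable_styling(), column_spec(), scroll_box()
library(plotly)      # plot_ly(), layout()
library(ggplot2)     # ggplot() and geoms

# Read the order data directly from its URL; read.csv() already returns a data frame
df <- read.csv(url(
  "https://raw.githubusercontent.com/ykaih/AGEC317/master/Random_500users_train.csv"))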

Preparing & Visualizing The Data

Here we modify, clean, and visualize the data set. This prepares the data for the later stages, so that the regression, graphical representations, statistical procedures, and hypothesis testing run without trouble. Summarizing the data also helps us understand the relationship between purchase frequency, products per order, and number of reorders. The table header defined below is used to style the summary table.

# Clean the data by removing the unnecessary columns.
cleandata <- df[, -c(1, 2)]  # Drop the first two columns (including the order ID) for simplicity
header <- list(enabled = TRUE, background = "orange")  # Styling for the fixed table header
# Total reorders and products for each value of days_since_prior_order
Tgraphic <- cleandata %>% group_by(days_since_prior_order) %>%
  summarize(reorders = sum(reorders), products = sum(products))

Now we are ready to display the data set:

Tgraphic %>% kbl(col.names = c("Days Since Order","Reorders","# Products"))  %>% 
  kable_styling(fixed_thead = header, font_size = 15)  %>%
  column_spec(1, width = "9em", bold = T, border_right = T, border_left = T, 
              underline = T, color = "black", background = "yellow") %>%
  column_spec(2, width = "3em", bold = T, border_right = T, border_left = T, 
              underline = T, color = "black", background = "yellow") %>% 
  column_spec(3, width = "6em", bold = T, border_right = T, border_left = T, 
              underline = T, color = "black", background = "yellow") %>%
  kable_styling(full_width = F) %>% scroll_box(width = "450px", height = "200px")

Days Since Order	Reorders	# Products
0	44	48
1	92	132
2	61	99
3	52	72
4	112	165
5	122	204
6	117	178
7	342	457
8	142	212
9	146	263
10	119	167
11	56	103
12	78	128
13	117	187
14	95	133
15	60	102
16	60	92
17	56	91
18	36	47
19	50	101
20	39	57
21	34	50
22	47	73
23	33	57
24	15	40
25	31	71
26	44	71
27	89	150
28	77	134
29	33	55
30	767	1516
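
The grouping step behind this table can be illustrated on a small scale. The sketch below uses base R's `aggregate()` in place of `group_by()` and `summarize()`; the data frame `toy_orders` and its values are made up for illustration and are not taken from the Instacart file.

```r
# Hypothetical toy orders (not real Instacart data)
toy_orders <- data.frame(
  days_since_prior_order = c(7, 7, 30, 30, 30, 1),
  reorders               = c(3, 5, 10, 2, 4, 1),
  products               = c(5, 8, 15, 4, 6, 2)
)

# Sum reorders and products within each value of days_since_prior_order
toy_summary <- aggregate(cbind(reorders, products) ~ days_since_prior_order,
                         data = toy_orders, FUN = sum)
toy_summary
```

Each row of `toy_summary` is the total for one value of `days_since_prior_order`, which is how the rows in the table above should be read: they are sums over all orders with that gap, not per-order averages.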

Preparing & Visualizing The Data

Here we look at the data set as a 3D scatter plot and a density histogram in order to better understand its distribution. This will help in conducting the statistical analysis later on. The plots below show that the data are heavily right skewed, with the mean and most observations concentrated at the low end of the range.

#3D Plot
PPP = plot_ly(x=cleandata$reorders, y=cleandata$products, 
        z=cleandata$days_since_prior_order, type="scatter3d", 
        mode="markers", marker = list(size = 1)) %>%
        layout(title = '3D Data Distribution', plot_bgcolor = "pink") 

PPP %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Reorders'),
                     yaxis = list(title = 'Products'),
                     zaxis = list(title = 'Days Since Prior Order')))

Preparing & Visualizing The Data

We can now look at the density histogram, which shows that the distribution of reorders is clearly right skewed.

# Density Histogram
# No bin settings are supplied, so geom_histogram() uses its default of 30 bins.
# The y axis shows density (not counts) so the density curve can be overlaid.
ggplot(cleandata, aes(x = reorders)) +
  geom_histogram(aes(y = after_stat(density)), colour = "grey", fill = "pink") +
  labs(x = "# Reordered Items", y = "Density", title = "Product Reorders / Order") +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5),
                     text = element_text(lineheight = 15)) +
  geom_vline(aes(xintercept = mean(reorders)),   # dashed red line at the mean
             color = "red", linetype = "dashed", linewidth = 1.1) +
  geom_density(alpha = .2, fill = "green")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
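
Right skew can also be checked numerically rather than by eye. The sketch below is one possible check, assuming a moment-based skewness coefficient; the vector `toy_reorders` is made up for illustration, and `sample_skewness()` is a hypothetical helper, not a function from the report.

```r
# Hypothetical right-skewed counts (not real Instacart data)
toy_reorders <- c(0, 1, 1, 2, 2, 3, 3, 4, 5, 8, 12, 20)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
sample_skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_value  <- sample_skewness(toy_reorders)
mean_value  <- mean(toy_reorders)
median_value <- median(toy_reorders)
c(skewness = skew_value, mean = mean_value, median = median_value)
```

A positive skewness together with a mean above the median is the usual numeric signature of a right-skewed distribution. The same function could be applied to `cleandata$reorders`.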

Data Analysis

Here we build a 2D scatter plot with a regression line to understand the relationship and correlation between the variables. Having cleaned and inspected the data set, we can now regress the number of reorders on the number of products in each order.

# Here we can view the regression results
RL <- lm(reorders ~ products, data = cleandata)
summary(RL)

## 
## Call:
## lm(formula = reorders ~ products, data = cleandata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4887  -1.7826   0.3717   1.7656  12.2934 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.80192    0.23646  -3.391 0.000751 ***
## products     0.67877    0.01794  37.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.191 on 498 degrees of freedom
## Multiple R-squared:  0.7419, Adjusted R-squared:  0.7414 
## F-statistic:  1432 on 1 and 498 DF,  p-value: < 2.2e-16
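
To see what the fitted coefficients imply, they can be plugged into the regression equation by hand. The sketch below copies the two estimates from the summary output above; `predict_reorders()` is a hypothetical helper introduced only for this illustration.

```r
# Coefficients copied from the summary(RL) output above
intercept <- -0.80192
slope     <-  0.67877

# Predicted number of reorders for a given number of products in an order
predict_reorders <- function(products) {
  intercept + slope * products
}

# Example: predicted reorders for an order containing 10 products
predicted_10 <- predict_reorders(10)
predicted_10
```

With the fitted model object available, `predict(RL, newdata = data.frame(products = 10))` should give the same value up to the rounding of the printed coefficients.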

Data Analysis

The fitted regression equation is y = 0.68x - 0.80, where y is the number of reorders and x is the number of products, with R^2 = 0.742. We can take a look at the regression line below:

RGplot

## `geom_smooth()` using formula = 'y ~ x'
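
The code that builds `RGplot` is not shown in this report. A minimal sketch of how such a plot could be constructed is given below, assuming a ggplot2 scatter plot with a `geom_smooth()` linear fit (the console message above suggests `geom_smooth()` was used). The data frame `toy_orders`, its values, and the name `RGplot_sketch` are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data (not real Instacart data)
toy_orders <- data.frame(
  products = c(2, 4, 5, 8, 10, 12, 15, 20),
  reorders = c(1, 2, 3, 5, 6, 7, 9, 13)
)

# Scatter plot of reorders against products with a least-squares line
RGplot_sketch <- ggplot(toy_orders, aes(x = products, y = reorders)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "# Products", y = "# Reorders", title = "Reorders vs. Products")

RGplot_sketch
```

Replacing `toy_orders` with `cleandata` would plot the report's own data with the same layers.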

Data Analysis

Here we conduct a hypothesis test to determine whether the average number of reorders in the Instacart sample is consistent with the company's projected goal of an average of 6 reorders. The null hypothesis is that the population mean equals the benchmark of 6; the alternative is that it differs from 6. We use a significance level of 0.05, which for a two-sided test corresponds to critical z values of ±1.96.

The formula for conducting the hypothesis test is \[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]
We can get the standard deviation, mean, and sample size by doing the following:

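x = mean(df$reorders)    # sample mean of reorders
u = 6                    # hypothesized mean (target of 6 reorders)
p = sd(df$reorders)      # sample standard deviation, used as the estimate of sigma
n = length(df$reorders)  # sample size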

Data Analysis

Given the data, we can now compute the hypothesis test:

teststatistic <- (x - u) / (p * 1/sqrt(n))
print(paste('The test statistic equal to ',(teststatistic)))

## [1] "The test statistic equal to  1.1831136744869"

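The result is not statistically significant: the test statistic of 1.18 is smaller than the critical value of 1.96, so it falls inside the non-rejection region and we fail to reject the null hypothesis. The sample mean is therefore not significantly different from the goal of 6 reorders, and the data are consistent with the goal being met. A 95% confidence interval for the mean supports the same conclusion.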

confidenceinterval1 <- x - ((1.96 * p) / sqrt(n) )
confidenceinterval2 <- x + ((1.96 * p) / sqrt(n) )
interval <- c(confidenceinterval1,confidenceinterval2)
interval

## [1] 5.781994 6.882006

The value of 6 lies inside the confidence interval calculated above, which agrees with the test result: the sample is consistent with the Instacart objective of an average of 6 reorders.
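
The decision rule can be cross-checked from the reported numbers alone. The sketch below copies the test statistic and the interval endpoints from the output above and converts the statistic into a two-sided p-value; the variable names are introduced here for illustration.

```r
# Values copied from the output above
z_stat   <- 1.1831136744869
ci_lower <- 5.781994
ci_upper <- 6.882006
target   <- 6
alpha    <- 0.05

# Two-sided critical value and p-value under the standard normal
z_crit  <- qnorm(1 - alpha / 2)
p_value <- 2 * pnorm(-abs(z_stat))

# Reject the null hypothesis only if |z| exceeds the critical value
reject_null <- abs(z_stat) > z_crit

# The equivalent check using the 95% confidence interval
target_in_interval <- ci_lower <= target && target <= ci_upper

c(z_crit = z_crit, p_value = p_value)
reject_null
target_in_interval
```

Both checks point the same way: the p-value is above 0.05, the null hypothesis is not rejected, and the target of 6 sits inside the interval.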

Data Analysis

We plot the standard normal bell curve to visualize the test. The z-score boundaries of -1.96 and 1.96 are shown as green dotted lines, and the test statistic of 1.18, shown as the solid red line, falls between them.

    a = seq(-4, 4, .01)                # grid of z values
    dense = dnorm(a, 0, 1)             # standard normal density at each z value
    curve_df = data.frame(a, dense)    # separate name so the order data in df is not overwritten
    ggplot(data = curve_df, aes(x = a, y = dense)) + geom_line() +
      geom_vline(xintercept = c(-1.96, 1.96),              # critical values at the 0.05 level
                 linetype = "dotted", linewidth = 3, colour = "green") +
      geom_vline(xintercept = teststatistic, linewidth = 3, colour = "red") +  # observed z
      labs(title = "Instacart Reorder Significance", y = "", x = "Z Value")

Conclusion

The full Instacart data set is an anonymized sample of more than 3 million grocery orders from more than 200,000 users; this analysis used a sample of 500 orders from it. We explored the relationship between reorders, the number of products in an order, and the number of days between orders. Linear regression showed a strong linear relationship between the number of products in an order and the number of reorders (R^2 = 0.742). The 3D plot and the histogram showed that the data are heavily right skewed. The hypothesis test found no statistically significant difference between the sample mean and the target of 6 reorders (z = 1.18, inside the ±1.96 boundaries), and the 95% confidence interval contains 6, so the sample is consistent with the target being met. Together, the regression, the hypothesis test, and the supporting graphics gave us a better understanding of the relationships in the data.