The late_shipments dataset contains supply chain data on
the delivery of medical supplies. Each row represents one delivery of a
part. The late columns denotes whether or not the part was
delivered late. A value of "Yes" means that the part was
delivered late, and a value of "No" means the part was
delivered on time.
Let’s begin our analysis by calculating a point estimate (sample statistic), namely the proportion of late shipments.
# Load libraries
library(tidyr)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
library(fst)
fst package v0.9.8
library(tibble)
library(readr)
path_ltshp <- "C:/Users/JuanFer Mosquera/Documents/datasets/late_shipments.fst"
late_shipments <- read_fst(path_ltshp)
fstcore package v0.9.18
(OpenMP detected, using 20 threads)
glimpse(late_shipments)
Rows: 1,000
Columns: 26
$ id <dbl> 73003, 41222, 52354, 28471, 16901, 27238, 9100, 7183, 16784, 75595, 78576, 23957, 47937, 16232, 32402, 19263, 1618, 20264, 4123…
$ country <chr> "Vietnam", "Kenya", "Zambia", "Nigeria", "Vietnam", "Sudan", "Nigeria", "Rwanda", "Vietnam", "Haiti", "Côte d'Ivoire", "Viet…
$ managed_by <chr> "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - US", "PMO - …
$ fulfill_via <chr> "Direct Drop", "Direct Drop", "Direct Drop", "Direct Drop", "Direct Drop", "Direct Drop", "Direct Drop", "Direct Drop", "Direct…
$ vendor_inco_term <chr> "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "EXW", "…
$ shipment_mode <chr> "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "Air", "…
$ late_delivery <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ late <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
$ product_group <chr> "ARV", "HRDT", "HRDT", "HRDT", "ARV", "HRDT", "HRDT", "HRDT", "ARV", "MRDT", "HRDT", "ARV", "ARV", "ARV", "ARV", "HRDT", "HRDT"…
$ sub_classification <chr> "Adult", "HIV test", "HIV test", "HIV test", "Adult", "HIV test", "HIV test", "HIV test", "Adult", "Malaria", "HIV test", "Adul…
$ vendor <chr> "HETERO LABS LIMITED", "Orgenics, Ltd", "Orgenics, Ltd", "Orgenics, Ltd", "Aurobindo Pharma Limited", "Trinity Biotech, Plc", "…
$ item_description <chr> "Efavirenz/Lamivudine/Tenofovir Disoproxil Fumarate 600/300/300mg, tablets, 30 Tabs", "HIV 1/2, Determine Complete HIV Kit, 100…
$ molecule_test_type <chr> "Efavirenz/Lamivudine/Tenofovir Disoproxil Fumarate", "HIV 1/2, Determine Complete HIV Kit", "HIV 1/2, Determine Complete HIV K…
$ brand <chr> "Generic", "Determine", "Determine", "Determine", "Generic", "Uni-Gold", "Determine", "First Response", "Generic", "Generic", "…
$ dosage <chr> "600/300/300mg", "N/A", "N/A", "N/A", "300mg", "N/A", "N/A", "N/A", "300mg", "N/A", "N/A", "600/300/300mg", "600/300/300mg", "5…
$ dosage_form <chr> "Tablet - FDC", "Test kit", "Test kit", "Test kit", "Tablet", "Test kit", "Test kit", "Test kit", "Tablet", "Test kit", "Test k…
$ unit_of_measure_per_pack <dbl> 30, 100, 100, 100, 60, 20, 100, 30, 30, 25, 100, 30, 30, 30, 240, 100, 20, 1, 100, 1, 25, 30, 60, 180, 30, 20, 30, 60, 30, 100,…
$ line_item_quantity <dbl> 19200, 6100, 1364, 2835, 112, 53, 1500, 6180, 34640, 4, 3500, 19200, 24192, 2894, 1905, 1500, 9000, 2988, 32, 1600, 15304, 3715…
$ line_item_value <dbl> 201600.00, 542900.00, 109120.00, 252315.00, 1618.40, 1696.00, 120000.00, 101970.00, 122972.00, 45.00, 245000.00, 201600.00, 254…
$ pack_price <dbl> 10.50, 89.00, 80.00, 89.00, 14.45, 32.00, 80.00, 16.50, 3.55, 11.25, 70.00, 10.50, 10.50, 1.25, 1.43, 80.00, 32.00, 24.50, 25.0…
$ unit_price <dbl> 0.35, 0.89, 0.80, 0.89, 0.24, 1.60, 0.80, 0.55, 0.12, 0.45, 0.70, 0.35, 0.35, 0.04, 0.01, 0.80, 1.60, 24.50, 0.25, 24.50, 0.80,…
$ manufacturing_site <chr> "Hetero Unit III Hyderabad IN", "Alere Medical Co., Ltd.", "Alere Medical Co., Ltd.", "Alere Medical Co., Ltd.", "Aurobindo Uni…
$ first_line_designation <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ weight_kilograms <dbl> 2719, 3497, 553, 1352, 1701, 18, 445, 2130, 2804, 2, 794, 2877, 3385, 1486, 786, 714, 9000, 1992, 68, 1111, 5102, 385, 4007, 14…
$ freight_cost_usd <dbl> 4084.56, 40917.19, 7845.09, 31284.02, 4288.83, 569.11, 7120.00, 17331.66, 9759.55, 702.42, 12506.12, 4312.92, 8764.53, 12749.39…
$ line_item_insurance_usd <dbl> 207.24, 895.78, 112.18, 353.75, 2.67, 2.80, 235.20, 163.15, 152.12, 0.06, 528.22, 207.24, 314.22, 5.07, 2.80, 198.00, 460.80, 9…
late_shipments
colnames(late_shipments)
[1] "id" "country" "managed_by" "fulfill_via" "vendor_inco_term"
[6] "shipment_mode" "late_delivery" "late" "product_group" "sub_classification"
[11] "vendor" "item_description" "molecule_test_type" "brand" "dosage"
[16] "dosage_form" "unit_of_measure_per_pack" "line_item_quantity" "line_item_value" "pack_price"
[21] "unit_price" "manufacturing_site" "first_line_designation" "weight_kilograms" "freight_cost_usd"
[26] "line_item_insurance_usd"
summary(late_shipments)
id country managed_by fulfill_via vendor_inco_term shipment_mode late_delivery late
Min. : 92 Length:1000 Length:1000 Length:1000 Length:1000 Length:1000 Min. :0.000 Length:1000
1st Qu.:19492 Class :character Class :character Class :character Class :character Class :character 1st Qu.:0.000 Class :character
Median :39631 Mode :character Mode :character Mode :character Mode :character Mode :character Median :0.000 Mode :character
Mean :40308 Mean :0.067
3rd Qu.:63148 3rd Qu.:0.000
Max. :82005 Max. :1.000
product_group sub_classification vendor item_description molecule_test_type brand dosage dosage_form
Length:1000 Length:1000 Length:1000 Length:1000 Length:1000 Length:1000 Length:1000 Length:1000
Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
unit_of_measure_per_pack line_item_quantity line_item_value pack_price unit_price manufacturing_site first_line_designation weight_kilograms
Min. : 1.00 Min. : 1 Min. : 0 Min. : 0.00 Min. : 0.000 Length:1000 Length:1000 Min. : 1.0
1st Qu.: 30.00 1st Qu.: 450 1st Qu.: 10067 1st Qu.: 6.85 1st Qu.: 0.110 Class :character Class :character 1st Qu.: 136.2
Median : 60.00 Median : 2744 Median : 62318 Median : 24.50 Median : 0.485 Mode :character Mode :character Median : 844.0
Mean : 82.06 Mean : 14291 Mean : 151129 Mean : 39.93 Mean : 1.366 Mean : 2102.0
3rd Qu.: 100.00 3rd Qu.: 10000 3rd Qu.: 219520 3rd Qu.: 70.05 3rd Qu.: 0.890 3rd Qu.: 2403.0
Max. :1000.00 Max. :333334 Max. :2801262 Max. :1242.53 Max. :24.850 Max. :154780.0
freight_cost_usd line_item_insurance_usd
Min. : 14.36 Min. : 0.00
1st Qu.: 1899.72 1st Qu.: 14.58
Median : 5886.93 Median : 95.48
Mean : 11342.15 Mean : 232.86
3rd Qu.: 15533.10 3rd Qu.: 329.22
Max. :289653.20 Max. :4939.20
late_shipments <- late_shipments %>%
mutate(shipment_mode = na_if(shipment_mode, "N/A"))
# Count NA values in each column
na_counts <- colSums(is.na(late_shipments))
# Sort the results in descending order
sorted_na_counts <- sort(na_counts, decreasing = TRUE)
# See the result
sorted_na_counts
shipment_mode id country managed_by fulfill_via vendor_inco_term
1 0 0 0 0 0
late_delivery late product_group sub_classification vendor item_description
0 0 0 0 0 0
molecule_test_type brand dosage dosage_form unit_of_measure_per_pack line_item_quantity
0 0 0 0 0 0
line_item_value pack_price unit_price manufacturing_site first_line_designation weight_kilograms
0 0 0 0 0 0
freight_cost_usd line_item_insurance_usd
0 0
# Inspect the levels of vendor_inco_term (this prints the factor but doesn't modify the data)
as.factor(late_shipments$vendor_inco_term)
# Calculate the proportion of late shipments
late_prop_samp <- late_shipments %>%
summarise(prop_late_shipments = mean(late == "Yes")) %>%
pull(prop_late_shipments)
# see the results
late_prop_samp
[1] 0.067
Since variables have arbitrary ranges and units, we need to standardize them. For example, it would be silly if a hypothesis test gave a different answer if your variables were in Euros instead of US dollars. Standardization avoids that.
One standardized value of interest in a hypothesis test is called a z-score. To calculate it, we need three numbers: the sample statistic (point estimate), the hypothesized statistic, and the standard error of the statistic (which we estimate from the bootstrap distribution).
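In symbols, writing \(\hat{p}\) for the sample proportion, \(p_{0}\) for the hypothesized proportion, and \(SE(\hat{p})\) for the bootstrap estimate of the standard error, the z-score is
\[ z = \frac{\hat{p} - p_{0}}{SE(\hat{p})} \]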
late_shipments_boot_sample <- replicate(
n = 5000,
expr = {
late_shipments %>%
slice_sample(prop = 1, replace = TRUE) %>%
summarise(prop_late_shpmt = mean(late == "Yes")) %>%
pull(prop_late_shpmt)
}
)
head(late_shipments_boot_sample)
[1] 0.072 0.081 0.064 0.065 0.062 0.075
late_shipments_boot_distn <- tibble(late_prop = late_shipments_boot_sample) %>%
mutate(replicate = row_number())
late_shipments_boot_distn
nrow(late_shipments_boot_distn)
[1] 5000
# Hypothesize that the proportion is 6%
late_prop_hyp <- 0.06
# Calculate the standard error
std_error <- late_shipments_boot_distn %>%
summarise(sd = sd(late_prop)) %>%
pull(sd)
# Find z-score of late_prop_samp
z_score <- (late_prop_samp - late_prop_hyp) / std_error
# See the results
z_score
[1] 0.8872453
In order to determine whether to choose the null hypothesis or the alternative hypothesis, you need to calculate a p-value from the z-score.
Let’s return to the late shipments dataset and the proportion of late shipments.
The null hypothesis, H0, is that the proportion of late shipments is 6%
The alternative hypothesis, Ha, is that the proportion of late shipments is greater than 6%
# Calculate the z-score of late_prop_samp
z_score <- (late_prop_samp - late_prop_hyp) / std_error
# Calculate the p-value
p_value <- pnorm(z_score, lower.tail = FALSE)
# See the result
p_value
[1] 0.1874734
If you give a single estimate of a sample statistic, you are bound to be wrong by some amount. For example, the hypothesized proportion of late shipments was 6%. Even if the evidence is consistent with the null hypothesis that the proportion of late shipments equals this value, for any new sample of shipments the proportion is likely to be a little different. Consequently, it’s a good idea to state a confidence interval. That is, you say “we are 95% ‘confident’ that the proportion of late shipments is between A and B” (for some values of A and B).
Here, you’ll use quantiles of the bootstrap distribution to calculate the confidence interval.
conf_int_quantile <- late_shipments_boot_distn %>%
summarise(lower = quantile(late_prop, 0.025),
upper = quantile(late_prop, 0.975))
# See the result
conf_int_quantile
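One quick way to read the interval (a sketch, not part of the original exercise): check whether the hypothesized proportion of 6% falls inside it; if it does, the interval is consistent with the null hypothesis.
# Does the hypothesized proportion (late_prop_hyp = 0.06) fall inside the interval?
between(late_prop_hyp, conf_int_quantile$lower, conf_int_quantile$upper)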
The hypothesis test for determining if there is a difference between the means of two populations uses a different type of test statistic from the z-score. It’s called “t”, and it can be calculated from three values from each sample using this equation:
\[ t = \frac{\bar{x}_{No} - \bar{x}_{Yes}}{\sqrt{\frac{s^2_{No}}{n_{No}} + \frac{s^2_{Yes}}{n_{Yes}}}} \]
While trying to determine why some shipments are late, you may wonder
if the weight of the shipments that were late is different from the
weight of the shipments that were on time.
The late_shipments dataset has been split into a “yes”
group, where late == "Yes" and a “no” group
where late == "No". The weight of the shipment is given in
the weight_kilograms variable.
grp_tests <- late_shipments %>%
group_by(late) %>%
summarise(x_bar = mean(weight_kilograms), s = sd(weight_kilograms), n = n())
# See the results
grp_tests
# Calculate the numerator of the test statistic
numerator <- grp_tests$x_bar[1] - grp_tests$x_bar[2]
# Calculate the denominator of the test statistic
denominator <- sqrt((grp_tests$s[1]^2 / grp_tests$n[1]) + (grp_tests$s[2]^2 / grp_tests$n[2]))
# Calculate the test statistic
tstat <- numerator / denominator
# See the result
tstat
[1] -0.872321
Previously, you calculated the test statistic for the two-sample
problem of whether the mean weight of shipments is lower for shipments
that weren’t late (late == "No") compared to shipments that
were late (late == "Yes"). In order to make decisions about
it, you need to transform the test statistic with a cumulative
distribution function to get a p-value.
Recall the hypotheses:
\(H_{0}\): The mean weight of shipments that weren’t late is the same as the mean weight of shipments that were late.
\(H_{A}\): The mean weight of shipments that weren’t late is less than the mean weight of shipments that were late.
# Calculate the degrees of freedom
degreesof_freedom <- (grp_tests$n[1] + grp_tests$n[2]) - 2
# Calculate the p-value from the test statistic
# (the alternative hypothesis is "less than", so we need the lower tail)
pvalue <- pt(tstat, df = degreesof_freedom, lower.tail = TRUE)
# See the result
pvalue
[1] 0.1916215
So far, we’ve only considered the case of differences in a numeric variable between two categories. Of course, many datasets contain more categories. Before you get to conducting tests on many categories, it’s often helpful to perform exploratory data analysis. That is, calculating summary statistics for each group and visualizing the distributions of the numeric variable for each category using box plots.
Here, we’ll see how the price of each package
(pack_price) varies between the three shipment modes
(shipment_mode): "Air",
"Air Charter", and "Ocean".
# Using late_shipments, group by shipment_mode, and calculate the mean and std dev of pack price
late_shipments %>%
filter(!is.na(shipment_mode)) %>%
group_by(shipment_mode) %>%
summarise(xbar_pack_price = mean(pack_price, na.rm = TRUE), s_pack_price = sd(pack_price, na.rm = TRUE))
unique(late_shipments$shipment_mode)
[1] "Air" "Ocean" "Air Charter" NA
# Using late shipments, plot pack_price vs. shipment_mode as a box plot with flipped x and y coordinates
late_shipments %>%
filter(!is.na(shipment_mode)) %>%
ggplot(aes(shipment_mode, pack_price)) +
geom_boxplot() +
coord_flip()
The box plots made it look like the distribution of pack price was different for each of the three shipment modes. However, it didn’t tell us whether the mean pack price was different in each category. To determine that, we can use an ANOVA test. The null and alternative hypotheses can be written as follows.
\(H_{0}\) : Pack prices for every category of shipment mode are the same.
\(H_{A}\) : Pack prices for some categories of shipment mode are different.
We’ll set a significance level of 0.1.
# Run a linear regression of pack_price vs. shipment_mode
mdl_pack_price_vs_shipment_mode <- lm(pack_price ~ shipment_mode, data = late_shipments)
# See the results
summary(mdl_pack_price_vs_shipment_mode)
Call:
lm(formula = pack_price ~ shipment_mode, data = late_shipments)
Residuals:
Min 1Q Median 3Q Max
-43.15 -34.56 -12.33 26.95 1199.38
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.145 2.086 20.678 < 2e-16 ***
shipment_modeAir Charter -39.752 25.766 -1.543 0.123
shipment_modeOcean -35.330 7.174 -4.925 9.88e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 62.91 on 996 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.02572, Adjusted R-squared: 0.02376
F-statistic: 13.15 on 2 and 996 DF, p-value: 2.317e-06
# Perform ANOVA on the regression model
anova(mdl_pack_price_vs_shipment_mode)
Analysis of Variance Table
Response: pack_price
Df Sum Sq Mean Sq F value Pr(>F)
shipment_mode 2 104042 52021 13.146 2.317e-06 ***
Residuals 996 3941466 3957
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA test didn’t tell us which categories of shipment mode had significant differences in pack price. To pinpoint which categories differed, we can instead use pairwise t-tests.
# Perform pairwise t-tests on pack price, grouped by shipment_mode, no p-value adjustment.
test_results <- pairwise.t.test(late_shipments$pack_price, late_shipments$shipment_mode, p.adjust.method = "none")
# See the results
test_results
Pairwise comparisons using t tests with pooled SD
data: late_shipments$pack_price and late_shipments$shipment_mode
Air Air Charter
Air Charter 0.12 -
Ocean 9.9e-07 0.87
P value adjustment method: none
# Modify the pairwise t-tests to use Bonferroni p-value adjustment
test_results <- pairwise.t.test(late_shipments$pack_price, late_shipments$shipment_mode, p.adjust.method = "bonferroni")
# See the results
test_results
Pairwise comparisons using t tests with pooled SD
data: late_shipments$pack_price and late_shipments$shipment_mode
Air Air Charter
Air Charter 0.37 -
Ocean 3e-06 1.00
P value adjustment method: bonferroni
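The Bonferroni adjustment is simple: each unadjusted p-value is multiplied by the number of comparisons (three here) and capped at 1. The same correction is available outside pairwise.t.test() via stats::p.adjust(); here is a minimal sketch using illustrative p-values (the rounded values from the unadjusted table above).
# Multiply each p-value by the number of comparisons, capping at 1
p_unadjusted <- c(0.12, 9.9e-07, 0.87)
p.adjust(p_unadjusted, method = "bonferroni")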
You may wonder if the amount paid for freight affects whether or not
the shipment was late. Recall that
in the late_shipments dataset, whether or not the shipment was
late is stored in the late column. Freight costs are stored
in the freight_cost_group column, and the categories
are "expensive" and "reasonable".
We can form the hypotheses to test:
\[ H_{0}: late_{expensive} - late_{reasonable} = 0 \\ H_{A}: late_{expensive} - late_{reasonable} > 0 \]
where \(late_{expensive}\) and \(late_{reasonable}\) are the population proportions of late shipments in the "expensive" and "reasonable" freight cost groups. p_hats contains the estimates of these proportions (the sample proportions within each group), and ns contains the sample sizes for the two groups.
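The manual calculation that follows is the standard two-sample proportion z-test, which pools the two sample proportions to estimate the standard error:
\[ \hat{p} = \frac{n_{expensive}\,\hat{p}_{expensive} + n_{reasonable}\,\hat{p}_{reasonable}}{n_{expensive} + n_{reasonable}}, \qquad SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_{expensive}} + \frac{\hat{p}(1-\hat{p})}{n_{reasonable}}}, \qquad z = \frac{\hat{p}_{expensive} - \hat{p}_{reasonable}}{SE} \]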
late_shipments <- late_shipments %>%
mutate(freight_cost_group = ifelse(freight_cost_usd > 5000, "expensive", "reasonable"))
# Calculate the proportion of late shipments within each freight cost group
prop_late_by_group <- late_shipments %>%
  group_by(freight_cost_group) %>%
  summarise(prop_late = mean(late == "Yes"))
prop_late_by_group
p_hats <- prop_late_by_group$prop_late
names(p_hats) <- c("expensive", "reasonable")
p_hats
expensive reasonable
0.09611830 0.03267974
smpl_size <- late_shipments %>%
count(freight_cost_group) %>%
mutate(count = n) %>%
select(count)
smpl_size
ns <- as.vector(smpl_size$count)
names(ns) <- c("expensive", "reasonable")
ns
expensive reasonable
541 459
# Calculate the pooled estimate of the population proportion
p_hat <- weighted.mean(p_hats, ns)
# See the result
p_hat
[1] 0.067
# Calculate the pooled proportion times one minus the pooled proportion
p_hat_times_not_p_hat <- p_hat * (1 - p_hat)
# Divide this by the sample sizes
p_hat_times_not_p_hat_over_ns <- p_hat_times_not_p_hat / ns
# Calculate std. error
std_error <- sqrt(sum(p_hat_times_not_p_hat_over_ns))
# See the result
std_error
[1] 0.01586621
# Calculate the z-score
z_score <- (p_hats["expensive"] - p_hats["reasonable"]) / std_error
# See the result
z_score
expensive
3.998343
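As with the one-proportion test earlier, the right-tailed alternative means the z-score converts to a p-value via the upper tail of the standard normal distribution; with a z-score this large, the p-value lands far below any conventional significance level.
# Calculate the p-value from the z-score (right-tailed test)
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value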
That took a lot of effort to calculate the p-value, so while it is
useful to see how the calculations work, it isn’t practical to do in
real-world analyses. For daily usage, it’s better to use
the infer package.
Recall the hypotheses.
\[ H_{0}: late_{expensive} - late_{reasonable} = 0 \\ H_{A}: late_{expensive} - late_{reasonable} > 0 \]
Using the late_shipments dataset,
use prop_test() to perform a proportion test appropriate to
the hypotheses.
library(infer)
# Perform a proportion test appropriate to the hypotheses
test_results <- late_shipments %>%
prop_test(late ~ freight_cost_group, order = c("expensive", "reasonable"), success = "Yes",
alternative = "greater", correct = FALSE)
# See the results
test_results
The chi-square independence test compares proportions of successes of a categorical variable across categories of another categorical variable.
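Concretely, the test compares the observed count in each cell of the two-way table of the variables, \(O_{ij}\), with the count expected if the two variables were independent, \(E_{ij}\):
\[ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]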
Trade deals often use a form of business shorthand in order to specify the exact details of their contract. These are International Chamber of Commerce (ICC) international commercial terms, or incoterms for short.
The late_shipments dataset includes a
vendor_inco_term that describes the incoterms that applied
to a given shipment. The choices are:
EXW: “Ex works”. The buyer pays for transportation of the goods.
CIP: “Carriage and insurance paid to”. The seller pays for freight and insurance until the goods board a ship.
DDP: “Delivered duty paid”. The seller pays for transportation of the goods until they reach a destination port.
FCA: “Free carrier”. The seller pays for transportation of the goods.
Perhaps the incoterms affect whether or not the
freight costs are expensive. Test these hypotheses with a significance
level of 0.01.
\(H_{0}\): vendor_inco_term and freight_cost_group are independent.
\(H_{A}\): vendor_inco_term and freight_cost_group are associated.
# Plot vendor_inco_term filled by freight_cost_group
# Make it a proportional stacked bar plot
ggplot(late_shipments, aes(vendor_inco_term, fill = freight_cost_group)) +
geom_bar(position = "fill") +
ylab("proportion")
# Perform a chi-square test of independence on freight_cost_group and vendor_inco_term
test_results_so <- late_shipments %>%
chisq_test(vendor_inco_term ~ freight_cost_group)
Warning: Chi-squared approximation may be incorrect
# See results
test_results_so
The chi-square goodness of fit test compares proportions of each level of a categorical variable to hypothesized values. Before running such a test, it can be helpful to visually compare the distribution in the sample to the hypothesized distribution.
Recall the vendor incoterms in the late_shipments
dataset. Let’s hypothesize that the incoterms occur with these
frequencies in the population of shipments.
EXW: 0.75
CIP: 0.05
CIF: 0.00
DAP: 0.00
DDU: 0.00
DDP: 0.10
FCA: 0.10
# Using late_shipments, count the vendor incoterms
vendor_inco_term_counts <- late_shipments %>%
count(vendor_inco_term)
vendor_inco_term_counts
# Get the number of rows in the whole sample
n_total <- nrow(late_shipments)
n_total
[1] 1000
hypothesized_incot <- tribble(
  ~vendor_inco_term, ~prop,
  "EXW", 0.75,
  "CIP", 0.05,
  "CIF", 0.00,
  "DAP", 0.00,
  "DDU", 0.00,
  "DDP", 0.10,
  "FCA", 0.10
) %>%
  # Add a column of hypothesized counts for the incoterms
  mutate(n = prop * n_total)
hypothesized_incot
# Using vendor_inco_term_counts, plot n vs. vendor_inco_term
ggplot(vendor_inco_term_counts, aes(vendor_inco_term, n)) +
# Make it a (precalculated) bar plot
geom_col() +
# Add points from hypothesized
geom_point(data=hypothesized_incot, color = "green")
The bar plot of vendor_inco_term suggested that its
distribution across the categories was quite close to the
hypothesized distribution. You’ll need to perform a chi-square
goodness of fit test to see whether the differences are
statistically significant.
To decide which hypothesis to choose, we’ll set a significance level
of 0.1.
hypothesized_props <- c("EXW" = 0.75, "CIP" = 0.05, "CIF" = 0.0, "DAP" = 0.0, "DDU" = 0.0, "DDP" = 0.1, "FCA" = 0.1)
hypothesized_props
EXW CIP CIF DAP DDU DDP FCA
0.75 0.05 0.00 0.00 0.00 0.10 0.10
# Run chi-square goodness of fit test on vendor_inco_term
test_results <- late_shipments %>%
chisq_test(response = vendor_inco_term, p = hypothesized_props)
Warning: Chi-squared approximation may be incorrect
test_results
Each hypothesis test makes assumptions about the data. It’s only when these assumptions are met that it is appropriate to use that hypothesis test.
Whether it uses one or two samples, every hypothesis test assumes that each sample is randomly sourced from its population. If you don’t have a random sample, then it won’t be representative of the population. To check this assumption, you need to know where your data came from. There are no statistical or coding tests you can perform to check this. If in doubt, ask the people involved in collecting the data, or a domain expert that understands the population being sampled.
Tests also assume that each observation is independent. There are some special cases like paired t-tests where dependencies between two samples are allowed, but these change the calculations so you need to understand where such dependencies occur. As you saw with the paired t-test, not accounting for dependencies results in an increased chance of false negative and false positive errors. This is also a difficult problem to diagnose after you have the data. It needs to be discussed before data collection.
Hypothesis tests also assume that your sample is big enough. Smaller samples incur greater uncertainty, and mean that the Central Limit Theorem doesn’t apply, which in turn means that the sampling distribution might not be normally distributed. The increased uncertainty means you get wider confidence intervals on the parameter you are trying to estimate. The Central Limit Theorem not applying means the calculations on the sample could be nonsense, which increases the chance of false negative and positive errors. The check for “big enough” depends on the test and that’s where we’ll head next.
For one sample t-tests, a popular heuristic is that you need at least thirty observations in your sample. For the two sample case or ANOVA, you need thirty observations in each. That means you can’t compensate for one small group sample by making the other one bigger. In the paired case, you need thirty pairs of observations.
Sometimes you can get away with fewer than 30; the important thing is that the null distribution appears normal.
One sample: \(n \geq 30\), that is, the number of rows in your data is at least 30.
Two samples: \(n_{1} \geq 30\) and \(n_{2} \geq 30\).
ANOVA: \(n_{i} \geq 30\) for all values of \(i\).
For one sample proportion tests, the sample is considered big enough if it contains at least ten successes and ten failures. Notice that if the probability of success is close to zero or close to one, then you need a bigger sample. In the two sample case the size requirements apply to each sample separately.
\(n \times \hat{p} \geq 10\)
\(n \times (1 - \hat{p}) \geq 10\)
\(n:\) Sample size
\(\hat{p}:\) Proportion of successes in the sample
\(n_{1} \times \hat{p}_{1} \geq 10\)
\(n_{2} \times \hat{p}_{2} \geq 10\)
\(n_{1} \times (1 - \hat{p}_{1}) \geq 10\)
\(n_{2} \times (1 - \hat{p}_{2}) \geq 10\)
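As a quick sanity check on the one-sample case, we can plug in the numbers already computed for late_shipments (n = 1000 rows, \(\hat{p}\) = 0.067):
# Check the "at least ten successes and ten failures" rule
n_shipments <- nrow(late_shipments)
p_hat_late <- mean(late_shipments$late == "Yes")
n_shipments * p_hat_late >= 10        # 67 successes
n_shipments * (1 - p_hat_late) >= 10  # 933 failures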
The chi-square test is slightly more forgiving and only requires five successes and failures in each group rather than ten.
\(n_{i} \times \hat{p}_{i} \geq 5\) for all values of \(i\)
\(n_{i} \times (1 - \hat{p}_{i}) \geq 5\) for all values of \(i\)
\(n_{i}:\) sample size for group \(i\)
\(\hat{p}_{i}:\) proportion of successes in sample group \(i\)
One more check you can perform is to calculate a bootstrap distribution and visualize it with a histogram. If you don’t see a bell-shaped normal curve, then one of the assumptions hasn’t been met. In that case, you should revisit the data collection process, and see if any of the three assumptions of randomness, independence, and sample size do not hold.
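For instance, the bootstrap distribution of the late-shipment proportion calculated earlier can be inspected this way (a quick sketch; the binwidth is an arbitrary choice):
# Histogram of the bootstrap distribution: look for a bell shape
ggplot(late_shipments_boot_distn, aes(late_prop)) +
  geom_histogram(binwidth = 0.002)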
Recall the hypotheses.
\[ H_{0}: late_{expensive} \ - late_{reasonable} \ = 0 \\ H_{A}: late_{expensive} \ - late_{reasonable} \ > 0 \]
Let’s compare that traditional approach
using prop_test() with a simulation-based infer
pipeline.
# Specify that we are interested in late proportions across freight_cost_group, where "Yes" denotes success
specified <- late_shipments %>%
specify(late ~ freight_cost_group, success = "Yes")
# See the result
specified
Response: late (factor)
Explanatory: freight_cost_group (factor)
# Extend the pipeline to declare the null hypothesis that the variables are independent
hypothesized <- late_shipments %>%
specify(late ~ freight_cost_group, success = "Yes") %>%
hypothesize(null = "independence")
# See the result
hypothesized
Response: late (factor)
Explanatory: freight_cost_group (factor)
Null Hypothesis: independence
The infer pipeline for hypothesis testing requires four steps to calculate the null distribution: specify, hypothesize, generate, and calculate.
Let’s continue the pipeline you began in the previous coding exercise. We’ll get a set of differences in proportions that are distributed as though the null hypothesis, that the proportion of late shipments is the same across freight cost groups, is true.
# Extend the pipeline to generate 2000 permutations
generated <- late_shipments %>%
specify(late ~ freight_cost_group, success = "Yes") %>%
hypothesize(null = "independence") %>%
generate(reps = 2000, type = "permute")
# See the results
generated
Response: late (factor)
Explanatory: freight_cost_group (factor)
Null Hypothesis: independence
# Extend the pipeline to calculate the difference in proportions (expensive minus reasonable)
null_distn <- late_shipments %>%
specify(late ~ freight_cost_group, success = "Yes") %>%
hypothesize(null = "independence") %>%
generate(reps = 2000, type = "permute") %>%
calculate(stat = "diff in props", order = c("expensive", "reasonable"))
null_distn
Response: late (factor)
Explanatory: freight_cost_group (factor)
Null Hypothesis: independence
visualize(null_distn)
You now have a null distribution. In order to get a p-value and weigh
up the evidence against the null hypothesis, you need to calculate the
difference in proportions that is observed in the
late_shipments sample.
# Copy, paste & modify the pipeline to get the observed statistic
obs_stat <- late_shipments %>%
specify(late ~ freight_cost_group, success = "Yes") %>%
hypothesize(null = "independence") %>%
calculate(stat = "diff in props", order = c("expensive", "reasonable"))
Message: The independence null hypothesis does not inform calculation of the observed statistic (a difference in proportions) and will be ignored.
# See the results
obs_stat
Response: late (factor)
Explanatory: freight_cost_group (factor)
Null Hypothesis: independence
# Visualize the null distribution, adding a vertical line at the observed statistic
visualize(null_distn) +
  geom_vline(aes(xintercept = stat), data = obs_stat, color = "red")
# Get the p-value (the alternative hypothesis is one-sided: "greater")
p_value <- get_p_value(null_distn, obs_stat, direction = "greater")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` for more information.
# See the result
p_value
We manually performed the steps for a t-test to explore these hypotheses.
\(H_{0}\) : The mean weight of shipments that weren’t late is the same as the mean weight of shipments that were late.
\(H_{A}\) : The mean weight of shipments that weren’t late is less than the mean weight of shipments that were late.
You can run the test more concisely using infer’s
t_test(). However, t_test() assumes that the null distribution
is normal. We can avoid that distributional assumption by using a
simulation-based, non-parametric equivalent.
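For reference, the theory-based shortcut would look something like this (a sketch, not run as part of this analysis):
# infer's t_test() for the same one-sided hypotheses
t_test(late_shipments,
       weight_kilograms ~ late,
       order = c("No", "Yes"),
       alternative = "less")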
# Fill out the null distribution pipeline
null_distn <- late_shipments %>%
# Specify weight_kilograms vs. late
specify(weight_kilograms ~ late) %>%
# Declare a null hypothesis of independence
hypothesize(null = "independence") %>%
# Generate 1000 permutation replicates
generate(reps = 1000, type = "permute") %>%
# Calculate the difference in means ("No" minus "Yes")
calculate(stat = "diff in means", order = c("No", "Yes"))
# See the results
null_distn
Response: weight_kilograms (numeric)
Explanatory: late (factor)
Null Hypothesis: independence
# Calculate the observed difference in means
obs_stat <- late_shipments %>%
specify(weight_kilograms ~ late) %>%
calculate(stat = "diff in means", order = c("No", "Yes"))
# See the results
obs_stat
Response: weight_kilograms (numeric)
Explanatory: late (factor)
# Get the p-value
p_value <- get_p_value(null_distn, obs_stat, direction = "less")
# See the result
p_value
Another class of non-parametric hypothesis tests are called rank sum tests. Ranks are the positions of numeric values from smallest to largest. Think of them as positions in running events: whoever has the fastest (smallest) time is rank 1, second fastest is rank 2, and so on.
By calculating on the ranks of the data instead of the actual values, you can avoid making assumptions about the distribution of the test statistic. Rank-based methods are more robust in the same way that a median is more robust than a mean.
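For example, base R’s rank() returns these positions directly:
# Ranks of a small numeric vector, from smallest to largest
x <- c(1, 15, 3, 10, 6)
rank(x)
[1] 1 5 2 4 3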
Two commonly used rank-based tests are the Wilcoxon-Mann-Whitney test, which is like a non-parametric t-test, and the Kruskal-Wallis test, which is like a non-parametric ANOVA.
# Run a Wilcoxon-Mann-Whitney test on weight_kilograms ~ late
test_results <- wilcox.test(weight_kilograms ~ late, data = late_shipments)
# See the results
test_results
Wilcoxon rank sum test with continuity correction
data: weight_kilograms by late
W = 21480, p-value = 1.861e-05
alternative hypothesis: true location shift is not equal to 0
# Run a Kruskal-Wallis test on weight_kilograms vs. shipment_mode
test_results <- kruskal.test(weight_kilograms ~ shipment_mode, data = late_shipments)
# See the results
test_results
Kruskal-Wallis rank sum test
data: weight_kilograms by shipment_mode
Kruskal-Wallis chi-squared = 159.09, df = 2, p-value < 2.2e-16