ls()
## [1] "Experts"    "Forecast"   "Newspapers"

Let’s Start


  1. Find the plus/minus z [the values of x below] such that the area in red is exactly equal to 0.6827, as shown below. Equivalently, the central probability in a standard normal distribution between -1 and 1 is 0.6827.
result <- prob_norm(mean = 0, stdev = 1, plb = 0.1587, pub = 0.8413)
summary(result, type = "probs")
## Probability calculator
## Distribution: Normal
## Mean        : 0 
## St. dev     : 1 
## Lower bound : 0.1587 
## Upper bound : 0.8413 
## 
## P(X < -1) = 0.1587
## P(X > -1) = 0.841
## P(X < 1) = 0.8413
## P(X > 1) = 0.159
## P(-1 < X < 1)     = 0.6826
## 1 - P(-1 < X < 1) = 0.317
plot(result, type = "probs")

The Case

A chemical manufacturer will build a new epoxy plant in one of two locations – one adjacent to an extant facility in Tianjin, China, and one near the company’s headquarters in the United States near Henry Hub in Louisiana. The plant is a first-of-its-kind, entirely automated plant that hosts only a tiny number of supervisors, and labor costs across the two locations do not differ. The only input that might reasonably differ in cost is energy. Where to locate the factory and potential differences in energy cost will play a crucial role in our decision.

An Energy Cost Comparison

An intern has collected two random samples of energy prices from local newspapers for comparison. Over the last six months, the intern has a random sample of size 25 from Tianjin, and a random sample of size 36 from Henry Hub reported in Newspapers – a sheet in the Excel Workbook on WISE referenced above. You are to undertake an initial summary of these data. You should note that both prices are reported in United States dollars per Million BTU (million British thermal units). The dates that the prices (spot) were recorded are listed in the adjacent column for each series. The worksheet contains the data in both a stacked and unstacked format.
  1. Provide a single plot combining the two samples of data. Which of the two series is more symmetric? What information in the plot led you to this choice?
ggplot(Newspapers) +
 aes(x = Price, fill = Location) +
 geom_density(adjust = 2) +
 scale_fill_manual(values = alpha(c("blue", "red"), .3)) +
 labs(title = "Price/Location") +
 theme_gray()

HH’s data is more symmetric than Tianjin. This last one is skewed to the left, which is a result of a higher boundary.

  1. What is the sample mean and sample standard deviation of natural gas prices in Tianjin?

Mean = 3.994 dollars, Standard Deviation = 0.051 dollars

result <- explore(
  Newspapers, 
  vars = c("HHPrice", "TianjinPrice"), 
  fun = c("n_obs", "mean", "min", "max", "sd", "p25", "p75"), 
  data_filter = "vars = 'TianjinPrice'", 
  nr = Inf
)
summary(result)
## Explore
## Data        : Newspapers 
## Filter      : vars = 'TianjinPrice' 
## Functions   : n_obs, mean, min, max, sd, p25, p75 
## Top         : Function 
## 
##      variable n_obs  mean   min   max    sd   p25   p75
##       HHPrice    61 3.796 3.520 4.070 0.140 3.690 3.902
##  TianjinPrice    61 3.994 3.880 4.050 0.051 3.980 4.040
# dtab(result) %>% render()
  1. What is a 95% confidence interval for Tianjin’s average price, given this precise sample? Provide both boundaries.
result <- single_mean(
  Newspapers, 
  var = "TianjinPrice", 
  data_filter = "vars = c('HHPrice', 'TianjinPrice')"
)
summary(result)
## Single mean test
## Data      : Newspapers 
## Filter    : vars = c('HHPrice', 'TianjinPrice') 
## Variable  : TianjinPrice 
## Confidence: 0.95 
## Null hyp. : the mean of TianjinPrice = 0 
## Alt. hyp. : the mean of TianjinPrice is not equal to 0 
## 
##   mean  n n_missing    sd    se    me
##  3.994 61        36 0.051 0.010 0.020
## 
##   diff   se t.value p.value df  2.5% 97.5%    
##  3.994 0.01 394.167  < .001 24 3.973 4.015 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Assume that the underlying population is normal with the population mean, and population standard deviation exactly equal to the sample mean and standard deviation that you have estimated from the sample from Tianjin. What is the probability of a daily price of $4.00 or higher at Tianjin?

P(X > 4) = 0.453

result <- prob_norm(mean = 3.994, stdev = 0.051, ub = 4)
summary(result)
## Probability calculator
## Distribution: Normal
## Mean        : 3.994 
## St. dev     : 0.051 
## Lower bound : -Inf 
## Upper bound : 4 
## 
## P(X < 4) = 0.547
## P(X > 4) = 0.453
plot(result)

  1. What is the probability of a mean price of $4.00 or lower, given the Tianjin data?
result <- single_mean(
  Newspapers, 
  var = "TianjinPrice", 
  comp_value = 4, 
  alternative = "greater"
)
summary(result)
## Single mean test
## Data      : Newspapers 
## Variable  : TianjinPrice 
## Confidence: 0.95 
## Null hyp. : the mean of TianjinPrice = 4 
## Alt. hyp. : the mean of TianjinPrice is > 4 
## 
##   mean  n n_missing    sd    se    me
##  3.994 61        36 0.051 0.010 0.020
## 
##    diff   se t.value p.value df    5% 100%  
##  -0.006 0.01  -0.553   0.707 24 3.977  Inf  
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "hist", custom = FALSE)

The Probability of an average price is 0.707

  1. Speculative: What is a 95% confidence interval of the difference in average prices. This calculation requires an assumption about the variances/standard deviations of the two samples that you have. What have you assumed? Why? Is there any information in the plot that you provided that might inform this decision?

HH is cheaper 0.147 to 0.25 dollars

result <- compare_means(
  Newspapers, 
  var1 = "Location", 
  var2 = "Price", 
)
summary(result, show = TRUE)
## Pairwise mean comparisons (t-test)
## Data      : Newspapers 
## Variables : Location, Price 
## Samples   : independent 
## Confidence: 0.95 
## Adjustment: None 
## 
##  Location  mean  n n_missing    sd    se    me
##  HenryHub 3.796 36         0 0.140 0.023 0.047
##   Tianjin 3.994 25         0 0.051 0.010 0.021
## 
##  Null hyp.            Alt. hyp.                       diff   p.value se   
##  HenryHub = Tianjin   HenryHub not equal to Tianjin   -0.199 < .001  0.025
##  t.value df     2.5%  97.5%     
##  -7.814  47.036 -0.25 -0.147 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "box", custom = FALSE)

Profit Forecasting by Location

The division that is principally entrusted with forecasting and the supply chain has calculated a series of profit forecasts in dollars for the first 52 weeks after completing the Tianjin factory and the factory in Henry Hub. Because the production process is one-of-a-kind, it is known that the first 52 weeks will involve planned calibration of the factory; this is accounted for in the weekly forecasts and is the same pattern across both locations. These forecasts appear in the worksheet Forecast of the workbook on WISE.
  1. Summarize profits by location numerically and graphically.
visualize(
  Forecast, 
  xvar = "Period", 
  yvar = "Forecast", 
  type = "line", 
  color = "Location", 
  custom = FALSE
)

result <- explore(
  Forecast, 
  vars = "Forecast", 
  byvar = "Location", 
  fun = c("n_obs", "mean", "min", "max", "sd", "p25", "p75"), 
  data_filter = "vars == Forecast", 
  nr = Inf
)
summary(result)
## Explore
## Data        : Forecast 
## Filter      : vars == Forecast 
## Grouped by  : Location 
## Functions   : n_obs, mean, min, max, sd, p25, p75 
## Top         : Function 
## 
##  Location variable n_obs          mean       min       max            sd
##  HenryHub Forecast    52 4,204,474.865 1,712,033 7,166,874 1,590,364.990
##   Tianjin Forecast    52 3,860,590.442 1,525,784 6,761,850 1,561,618.097
##            p25           p75
##  2,726,455.750 5,520,862.250
##  2,336,259.000 5,197,883.000
# dtab(result) %>% render()
  1. If profits in Henry Hub are normally distributed with mean and standard deviation as calculated above, what is the probability of profits between 4 million and 4.5 million?

The probability is 0.125

result <- prob_norm(
  mean = 4204474.865, 
  stdev = 1590364.99, 
  lb = 4000000, 
  ub = 4500000
)
summary(result)
## Probability calculator
## Distribution: Normal
## Mean        : 4204475 
## St. dev     : 1590365 
## Lower bound : 4e+06 
## Upper bound : 4500000 
## 
## P(X < 4e+06) = 0.449
## P(X > 4e+06) = 0.551
## P(X < 4500000) = 0.574
## P(X > 4500000) = 0.426
## P(4e+06 < X < 4500000)     = 0.125
## 1 - P(4e+06 < X < 4500000) = 0.875
plot(result)

  1. If profits in Tianjin are normally distributed with mean and standard deviation as calculated above, what is the range of profits to be obtained with central probability 0.5 [the interquartile range]?

P(2807295.04 < X < 4913885.84) = 0.5

result <- prob_norm(
  mean = 3860590.442, 
  stdev = 1561618.097, 
  plb = 0.25, 
  pub = 0.75, 
  dec = 2
)
summary(result, type = "probs")
## Probability calculator
## Distribution: Normal
## Mean        : 3860590 
## St. dev     : 1561618 
## Lower bound : 0.25 
## Upper bound : 0.75 
## 
## P(X < 2807295.04) = 0.25
## P(X > 2807295.04) = 0.75
## P(X < 4913885.84) = 0.75
## P(X > 4913885.84) = 0.25
## P(2807295.04 < X < 4913885.84)     = 0.5
## 1 - P(2807295.04 < X < 4913885.84) = 0.5
plot(result, type = "probs")

  1. Provide a two-sided 95% confidence interval for average profits per week over the 52 week period in Tianjin, China.

From 3425833 dollars to 4295348 dollars

result <- single_mean(
  Forecast, 
  var = "TianjinForecast", 
  comp_value = 2, 
  data_filter = "vars == Forecast"
)
summary(result)
## Single mean test
## Data      : Forecast 
## Filter    : vars == Forecast 
## Variable  : TianjinForecast 
## Confidence: 0.95 
## Null hyp. : the mean of TianjinForecast = 2 
## Alt. hyp. : the mean of TianjinForecast is not equal to 2 
## 
##           mean   n n_missing            sd          se          me
##  3,860,590.442 104        52 1,561,618.097 216,557.466 429,490.658
## 
##     diff       se t.value p.value df    2.5%   97.5%    
##  3860588 216557.5  17.827  < .001 51 3425833 4295348 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "hist", custom = FALSE)

  1. Provide a two-sided 95% confidence interval for average profits per week over the 52 week period in Henry Hub.

From 3761714 dollars to 4647235 dollars

result <- single_mean(
  Forecast, 
  var = "HenryHubForecast", 
  comp_value = 2, 
  data_filter = "vars == Forecast"
)
summary(result)
## Single mean test
## Data      : Forecast 
## Filter    : vars == Forecast 
## Variable  : HenryHubForecast 
## Confidence: 0.95 
## Null hyp. : the mean of HenryHubForecast = 2 
## Alt. hyp. : the mean of HenryHubForecast is not equal to 2 
## 
##           mean   n n_missing            sd          se          me
##  4,204,474.865 104        52 1,590,364.990 220,543.943 437,396.894
## 
##     diff       se t.value p.value df    2.5%   97.5%    
##  4204473 220543.9  19.064  < .001 51 3761714 4647235 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "hist", custom = FALSE)

  1. Speculative: Provide a two-sided 95% confidence interval for the average difference in profits per week over the 52 week period.

HH’s averages is 316072.2 to 371696.7 dollars per period.

result <- compare_means(
  Forecast, 
  var1 = "Location", 
  var2 = "Forecast", 
  samples = "paired", 
  data_filter = "vars == Forecast"
)
summary(result, show = TRUE)
## Pairwise mean comparisons (t-test)
## Data      : Forecast 
## Filter    : vars == Forecast 
## Variables : Location, Forecast 
## Samples   : paired 
## Confidence: 0.95 
## Adjustment: None 
## 
##  Location          mean  n n_missing            sd          se          me
##  HenryHub 4,204,474.865 52         0 1,590,364.990 220,543.943 442,760.441
##   Tianjin 3,860,590.442 52         0 1,561,618.097 216,557.466 434,757.255
## 
##  Null hyp.            Alt. hyp.                       diff     p.value se      
##  HenryHub = Tianjin   HenryHub not equal to Tianjin   343884.4 < .001  13853.59
##  t.value df 2.5%     97.5%       
##  24.823  51 316072.2 371696.7 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "bar", custom = FALSE)

  1. Speculative: Is either of the two locations forecast to have higher profits with at least 95% confidence? Which one?

HenryHub

result <- compare_means(
  Forecast, 
  var1 = "TianjinForecast", 
  var2 = "HenryHubForecast", 
  alternative = "greater", 
  comb = "TianjinForecast:TianjinForecast"
)
summary(result, show = TRUE)
## Pairwise mean comparisons (t-test)
## Data      : Forecast 
## Variables : TianjinForecast, HenryHubForecast 
## Samples   : independent 
## Confidence: 0.95 
## Adjustment: None 
## 
##                            mean   n n_missing            sd          se
##   TianjinForecast 3,860,590.442 104        52 1,561,618.097 216,557.466
##  HenryHubForecast 4,204,474.865 104        52 1,590,364.990 220,543.943
##           me
##  429,490.658
##  437,396.894
## 
##  Null hyp.                            Alt. hyp.                           
##  TianjinForecast = HenryHubForecast   TianjinForecast > HenryHubForecast  
##  diff      p.value se       t.value df      5%        100%  
##  -343884.4 0.866   309090.2 -1.113  101.966 -856954.1 Inf   
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "bar", custom = FALSE)

  1. If the building cost were $12 million more for the more profitable location, provide a 95% confidence interval for the average recovery time of cost difference expressed in weeks.
12000000 / c(316072.2, 371696.7)
## [1] 37.96601 32.28439

It will take us between 32.2 to 38 weeks

Expert Assessment of Location

As management closes on a decision, word has leaked about the factory and the location decision. Bloomberg Television and BCNC have reported on the plans, and industry experts now well understand the decision facing management. The Sall Tweet Kernel conducted a survey of independent samples of industry experts regarding the decision, and their responses are recorded as worksheet Experts in the workbook on WISE. A Build represents a belief that the facility should be built in the given location (Tianjin/Henry Hub), while a No represents an opinion that the facility should not be built in the given location. Concerning these data, you are to answer the following questions.

  1. Tabulate opinions by location. What is the proportion of analysts favoring each location?
result <- cross_tabs(
  Experts, 
  var1 = "Location", 
  var2 = "Build", 
 
)
summary(result, check = c("observed", "row_perc"))
## Cross-tabs
## Data     : Experts 
## Variables: Location, Build 
## Null hyp.: there is no association between Location and Build 
## Alt. hyp.: there is an association between Location and Build 
## 
## Observed:
##           Build
## Location   Build No  Total
##   HenryHub  66    15  81  
##   Tianjin   31    33  64  
##   Total     97    48 145  
## 
## Row percentages:
##           Build
## Location   Build   No Total
##   HenryHub  0.81 0.19     1
##   Tianjin   0.48 0.52     1
##   Total     0.67 0.33     1
## 
## Chi-squared: 17.628 df(1), p.value < .001 
## 
## 0.0 % of cells have expected values below 5
plot(result, check = c("observed", "row_perc"), custom = FALSE) 

  1. The true proportion of analysts believing that the facility should be built in Henry Hub, Louisiana, is 0.5 with 95% confidence. True or False, when compared to the alternative that it is less than 0.5.

95% Of the time, it ranges from 0.713 to 0.892. Both are clearly greater than 0.5

  1. Prospective: How large a sample would be required, with 95% confidence, to estimate the true proportion to within � 0.15 assuming, for planning purposes, that the true proportion is 0.5?

43 is enough; we have 64 and 81.

result <- sample_size(
  type = "proportion", 
  err_prop = 0.15, 
  conf_lev = 95
)
summary(result)
## Sample size calculation
## Calculation type     : Proportion
## Acceptable Error     : 0.15 
## Proportion           : 0.5 
## Confidence level     : 0.95 
## Incidence rate       : 1 
## Response rate        : 1 
## Population correction: None
## 
## Required sample size     : 43
## Required contact attempts: 43
  1. The survey was, in both cases, sent to samples of size 100. As is clear, not everyone that was sent the survey actually responded. First, suppose that the probability of responding is 0.8. 95% of the time, we should receive 73 or more responses to this survey.
result <- prob_binom(n = 100, p = 0.8, plb = 0.05)
summary(result, type = "probs")
## Probability calculator
## Distribution: Binomial
## n           : 100 
## p           : 0.8 
## Mean        : 80 
## St. dev     : 4 
## Lower bound : 0.05 (73)
## Upper bound : 
## 
## P(X  = 73) = 0.022
## P(X  < 73) = 0.034
## P(X <= 73) = 0.056
## P(X  > 73) = 0.944
## P(X >= 73) = 0.966
plot(result, type = "probs")

  1. Assume that responding to the survey or not is a binomial random variable. What is the 95% lower bound for the probability of responding to Tianjin? What is the 95% lower bound for the probability of responding to Henry Hub?
binom.test(64,100, alt="g")
## 
##  Exact binomial test
## 
## data:  64 and 100
## number of successes = 64, number of trials = 100, p-value = 0.003319
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.5536572 1.0000000
## sample estimates:
## probability of success 
##                   0.64
binom.test(81,100, alt="g")
## 
##  Exact binomial test
## 
## data:  81 and 100
## number of successes = 81, number of trials = 100, p-value = 1.351e-10
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.7337953 1.0000000
## sample estimates:
## probability of success 
##                   0.81
  1. Do analysts prefer either of the locations?

Yes. Henry Hub by 0.182 to 0.479 difference in probability of Build.

result <- compare_props(
  Experts, 
  var1 = "Location", 
  var2 = "Build", 
  levs = "Build"
)
summary(result, show = TRUE)
## Pairwise proportion comparisons
## Data      : Experts 
## Variables : Location, Build 
## Level     : Build in Build 
## Confidence: 0.95 
## Adjustment: None 
## 
##  Location Build No     p  n n_missing    sd    se    me
##  HenryHub    66 15 0.815 81         0 0.388 0.043 0.085
##   Tianjin    31 33 0.484 64         0 0.500 0.062 0.122
## 
##  Null hyp.            Alt. hyp.                       diff p.value chisq.value
##  HenryHub = Tianjin   HenryHub not equal to Tianjin   0.33 < .001  17.628     
##  df 2.5%  97.5%    
##  1  0.182 0.479 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "dodge", custom = FALSE)

Weather Concerns

Both locations [Tianjin and Henry Hub] are adjacent to large bodies of water. If the probability of more than 3 hurricanes in a year exceeds 0.05 in either location, this is a game-changing fact. Management is unwilling to build in a location with such a high probability of a disastrous loss of investment and possible collateral damage.

  1. If the arrival rate of hurricanes at Henry Hub is 1 per year (as it is known to be), what is the probability of more than 3 hurricanes in a given year at Henry Hub?

The result is less than 0.02. P(X > 3) = 0.019

result <- prob_pois(lambda = 1, ub = 3, dec = 3)
summary(result)
## Probability calculator
## Distribution: Poisson
## Lambda      : 1 
## Mean        : 1 
## Variance    : 1 
## Lower bound :  
## Upper bound : 3 
## 
## P(X  = 3) = 0.061
## P(X  < 3) = 0.92
## P(X <= 3) = 0.981
## P(X  > 3) = 0.019
## P(X >= 3) = 0.08
plot(result)

  1. If the arrival rate of hurricanes at Tianjin is 0.6 (as it is known to be), what is the probability of more than 3 hurricanes in a given year at Tianjin?

P(X > 3) = 0.003

result <- prob_pois(lambda = 0.6, ub = 3, dec = 3)
summary(result)
## Probability calculator
## Distribution: Poisson
## Lambda      : 0.6 
## Mean        : 0.6 
## Variance    : 0.6 
## Lower bound :  
## Upper bound : 3 
## 
## P(X  = 3) = 0.02
## P(X  < 3) = 0.977
## P(X <= 3) = 0.997
## P(X  > 3) = 0.003
## P(X >= 3) = 0.023
plot(result)

The result is less than 0.05.

Final Problem

Defend a decision to build the factory in one of the two locations and relate each previous section and your analysis of the relevant data to your chosen alternative. If necessary, weigh the value of conflicting evidence in light of the source. You are only to faithfully report your analysis results and bring no external considerations to bear beyond the information in the data and the issues raised in this investigation.

The data supports Henry Hub.

Henry Hub Natural Gas /Photo from The Wall Street Journal

The End