## [1] "Experts" "Forecast" "Newspapers"
Let’s Start
- Find the plus/minus z [the values of x below] such that the area in red is exactly equal to 0.6827, as shown below. Equivalently, the central probability in a standard normal distribution between -1 and 1 is 0.6827.
result <- prob_norm(mean = 0, stdev = 1, plb = 0.1587, pub = 0.8413)
summary(result, type = "probs")## Probability calculator
## Distribution: Normal
## Mean : 0
## St. dev : 1
## Lower bound : 0.1587
## Upper bound : 0.8413
##
## P(X < -1) = 0.1587
## P(X > -1) = 0.841
## P(X < 1) = 0.8413
## P(X > 1) = 0.159
## P(-1 < X < 1) = 0.6826
## 1 - P(-1 < X < 1) = 0.317
The Case
A chemical manufacturer will build a new epoxy plant in one of two locations – one adjacent to an extant facility in Tianjin, China, and one near the company’s headquarters in the United States near Henry Hub in Louisiana. The plant is a first-of-its-kind, entirely automated plant that hosts only a tiny number of supervisors, and labor costs across the two locations do not differ. The only input that might reasonably differ in cost is energy. Where to locate the factory and potential differences in energy cost will play a crucial role in our decision.
An Energy Cost Comparison
An intern has collected two random samples of energy prices from local newspapers for comparison. Over the last six months, the intern has a random sample of size 25 from Tianjin, and a random sample of size 36 from Henry Hub reported in Newspapers – a sheet in the Excel Workbook on WISE referenced above. You are to undertake an initial summary of these data. You should note that both prices are reported in United States dollars per Million BTU (million British thermal units). The dates that the prices (spot) were recorded are listed in the adjacent column for each series. The worksheet contains the data in both a stacked and unstacked format.
- Provide a single plot combining the two samples of data. Which of the two series is more symmetric? What information in the plot led you to this choice?
ggplot(Newspapers) +
aes(x = Price, fill = Location) +
geom_density(adjust = 2) +
scale_fill_manual(values = alpha(c("blue", "red"), .3)) +
labs(title = "Price/Location") +
theme_gray()HH’s data is more symmetric than Tianjin. This last one is skewed to the left, which is a result of a higher boundary.
- What is the sample mean and sample standard deviation of natural gas prices in Tianjin?
Mean = 3.994 dollars, Standard Deviation = 0.051 dollars
result <- explore(
Newspapers,
vars = c("HHPrice", "TianjinPrice"),
fun = c("n_obs", "mean", "min", "max", "sd", "p25", "p75"),
data_filter = "vars = 'TianjinPrice'",
nr = Inf
)
summary(result)## Explore
## Data : Newspapers
## Filter : vars = 'TianjinPrice'
## Functions : n_obs, mean, min, max, sd, p25, p75
## Top : Function
##
## variable n_obs mean min max sd p25 p75
## HHPrice 61 3.796 3.520 4.070 0.140 3.690 3.902
## TianjinPrice 61 3.994 3.880 4.050 0.051 3.980 4.040
- What is a 95% confidence interval for Tianjin’s average price, given this precise sample? Provide both boundaries.
result <- single_mean(
Newspapers,
var = "TianjinPrice",
data_filter = "vars = c('HHPrice', 'TianjinPrice')"
)
summary(result)## Single mean test
## Data : Newspapers
## Filter : vars = c('HHPrice', 'TianjinPrice')
## Variable : TianjinPrice
## Confidence: 0.95
## Null hyp. : the mean of TianjinPrice = 0
## Alt. hyp. : the mean of TianjinPrice is not equal to 0
##
## mean n n_missing sd se me
## 3.994 61 36 0.051 0.010 0.020
##
## diff se t.value p.value df 2.5% 97.5%
## 3.994 0.01 394.167 < .001 24 3.973 4.015 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Assume that the underlying population is normal with the population mean, and population standard deviation exactly equal to the sample mean and standard deviation that you have estimated from the sample from Tianjin. What is the probability of a daily price of $4.00 or higher at Tianjin?
P(X > 4) = 0.453
## Probability calculator
## Distribution: Normal
## Mean : 3.994
## St. dev : 0.051
## Lower bound : -Inf
## Upper bound : 4
##
## P(X < 4) = 0.547
## P(X > 4) = 0.453
- What is the probability of a mean price of $4.00 or lower, given the Tianjin data?
result <- single_mean(
Newspapers,
var = "TianjinPrice",
comp_value = 4,
alternative = "greater"
)
summary(result)## Single mean test
## Data : Newspapers
## Variable : TianjinPrice
## Confidence: 0.95
## Null hyp. : the mean of TianjinPrice = 4
## Alt. hyp. : the mean of TianjinPrice is > 4
##
## mean n n_missing sd se me
## 3.994 61 36 0.051 0.010 0.020
##
## diff se t.value p.value df 5% 100%
## -0.006 0.01 -0.553 0.707 24 3.977 Inf
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Probability of an average price is 0.707
- Speculative: What is a 95% confidence interval of the difference in average prices. This calculation requires an assumption about the variances/standard deviations of the two samples that you have. What have you assumed? Why? Is there any information in the plot that you provided that might inform this decision?
HH is cheaper 0.147 to 0.25 dollars
result <- compare_means(
Newspapers,
var1 = "Location",
var2 = "Price",
)
summary(result, show = TRUE)## Pairwise mean comparisons (t-test)
## Data : Newspapers
## Variables : Location, Price
## Samples : independent
## Confidence: 0.95
## Adjustment: None
##
## Location mean n n_missing sd se me
## HenryHub 3.796 36 0 0.140 0.023 0.047
## Tianjin 3.994 25 0 0.051 0.010 0.021
##
## Null hyp. Alt. hyp. diff p.value se
## HenryHub = Tianjin HenryHub not equal to Tianjin -0.199 < .001 0.025
## t.value df 2.5% 97.5%
## -7.814 47.036 -0.25 -0.147 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Profit Forecasting by Location
The division that is principally entrusted with forecasting and the supply chain has calculated a series of profit forecasts in dollars for the first 52 weeks after completing the Tianjin factory and the factory in Henry Hub. Because the production process is one-of-a-kind, it is known that the first 52 weeks will involve planned calibration of the factory; this is accounted for in the weekly forecasts and is the same pattern across both locations. These forecasts appear in the worksheet Forecast of the workbook on WISE.
- Summarize profits by location numerically and graphically.
visualize(
Forecast,
xvar = "Period",
yvar = "Forecast",
type = "line",
color = "Location",
custom = FALSE
)result <- explore(
Forecast,
vars = "Forecast",
byvar = "Location",
fun = c("n_obs", "mean", "min", "max", "sd", "p25", "p75"),
data_filter = "vars == Forecast",
nr = Inf
)
summary(result)## Explore
## Data : Forecast
## Filter : vars == Forecast
## Grouped by : Location
## Functions : n_obs, mean, min, max, sd, p25, p75
## Top : Function
##
## Location variable n_obs mean min max sd
## HenryHub Forecast 52 4,204,474.865 1,712,033 7,166,874 1,590,364.990
## Tianjin Forecast 52 3,860,590.442 1,525,784 6,761,850 1,561,618.097
## p25 p75
## 2,726,455.750 5,520,862.250
## 2,336,259.000 5,197,883.000
- If profits in Henry Hub are normally distributed with mean and standard deviation as calculated above, what is the probability of profits between 4 million and 4.5 million?
The probability is 0.125
result <- prob_norm(
mean = 4204474.865,
stdev = 1590364.99,
lb = 4000000,
ub = 4500000
)
summary(result)## Probability calculator
## Distribution: Normal
## Mean : 4204475
## St. dev : 1590365
## Lower bound : 4e+06
## Upper bound : 4500000
##
## P(X < 4e+06) = 0.449
## P(X > 4e+06) = 0.551
## P(X < 4500000) = 0.574
## P(X > 4500000) = 0.426
## P(4e+06 < X < 4500000) = 0.125
## 1 - P(4e+06 < X < 4500000) = 0.875
- If profits in Tianjin are normally distributed with mean and standard deviation as calculated above, what is the range of profits to be obtained with central probability 0.5 [the interquartile range]?
P(2807295.04 < X < 4913885.84) = 0.5
result <- prob_norm(
mean = 3860590.442,
stdev = 1561618.097,
plb = 0.25,
pub = 0.75,
dec = 2
)
summary(result, type = "probs")## Probability calculator
## Distribution: Normal
## Mean : 3860590
## St. dev : 1561618
## Lower bound : 0.25
## Upper bound : 0.75
##
## P(X < 2807295.04) = 0.25
## P(X > 2807295.04) = 0.75
## P(X < 4913885.84) = 0.75
## P(X > 4913885.84) = 0.25
## P(2807295.04 < X < 4913885.84) = 0.5
## 1 - P(2807295.04 < X < 4913885.84) = 0.5
- Provide a two-sided 95% confidence interval for average profits per week over the 52 week period in Tianjin, China.
From 3425833 dollars to 4295348 dollars
result <- single_mean(
Forecast,
var = "TianjinForecast",
comp_value = 2,
data_filter = "vars == Forecast"
)
summary(result)## Single mean test
## Data : Forecast
## Filter : vars == Forecast
## Variable : TianjinForecast
## Confidence: 0.95
## Null hyp. : the mean of TianjinForecast = 2
## Alt. hyp. : the mean of TianjinForecast is not equal to 2
##
## mean n n_missing sd se me
## 3,860,590.442 104 52 1,561,618.097 216,557.466 429,490.658
##
## diff se t.value p.value df 2.5% 97.5%
## 3860588 216557.5 17.827 < .001 51 3425833 4295348 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Provide a two-sided 95% confidence interval for average profits per week over the 52 week period in Henry Hub.
From 3761714 dollars to 4647235 dollars
result <- single_mean(
Forecast,
var = "HenryHubForecast",
comp_value = 2,
data_filter = "vars == Forecast"
)
summary(result)## Single mean test
## Data : Forecast
## Filter : vars == Forecast
## Variable : HenryHubForecast
## Confidence: 0.95
## Null hyp. : the mean of HenryHubForecast = 2
## Alt. hyp. : the mean of HenryHubForecast is not equal to 2
##
## mean n n_missing sd se me
## 4,204,474.865 104 52 1,590,364.990 220,543.943 437,396.894
##
## diff se t.value p.value df 2.5% 97.5%
## 4204473 220543.9 19.064 < .001 51 3761714 4647235 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Speculative: Provide a two-sided 95% confidence interval for the average difference in profits per week over the 52 week period.
HH’s averages is 316072.2 to 371696.7 dollars per period.
result <- compare_means(
Forecast,
var1 = "Location",
var2 = "Forecast",
samples = "paired",
data_filter = "vars == Forecast"
)
summary(result, show = TRUE)## Pairwise mean comparisons (t-test)
## Data : Forecast
## Filter : vars == Forecast
## Variables : Location, Forecast
## Samples : paired
## Confidence: 0.95
## Adjustment: None
##
## Location mean n n_missing sd se me
## HenryHub 4,204,474.865 52 0 1,590,364.990 220,543.943 442,760.441
## Tianjin 3,860,590.442 52 0 1,561,618.097 216,557.466 434,757.255
##
## Null hyp. Alt. hyp. diff p.value se
## HenryHub = Tianjin HenryHub not equal to Tianjin 343884.4 < .001 13853.59
## t.value df 2.5% 97.5%
## 24.823 51 316072.2 371696.7 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Speculative: Is either of the two locations forecast to have higher profits with at least 95% confidence? Which one?
HenryHub
result <- compare_means(
Forecast,
var1 = "TianjinForecast",
var2 = "HenryHubForecast",
alternative = "greater",
comb = "TianjinForecast:TianjinForecast"
)
summary(result, show = TRUE)## Pairwise mean comparisons (t-test)
## Data : Forecast
## Variables : TianjinForecast, HenryHubForecast
## Samples : independent
## Confidence: 0.95
## Adjustment: None
##
## mean n n_missing sd se
## TianjinForecast 3,860,590.442 104 52 1,561,618.097 216,557.466
## HenryHubForecast 4,204,474.865 104 52 1,590,364.990 220,543.943
## me
## 429,490.658
## 437,396.894
##
## Null hyp. Alt. hyp.
## TianjinForecast = HenryHubForecast TianjinForecast > HenryHubForecast
## diff p.value se t.value df 5% 100%
## -343884.4 0.866 309090.2 -1.113 101.966 -856954.1 Inf
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- If the building cost were $12 million more for the more profitable location, provide a 95% confidence interval for the average recovery time of cost difference expressed in weeks.
## [1] 37.96601 32.28439
It will take us between 32.2 to 38 weeks
Expert Assessment of Location
As management closes on a decision, word has leaked about the factory and the location decision. Bloomberg Television and BCNC have reported on the plans, and industry experts now well understand the decision facing management. The Sall Tweet Kernel conducted a survey of independent samples of industry experts regarding the decision, and their responses are recorded as worksheet Experts in the workbook on WISE. A Build represents a belief that the facility should be built in the given location (Tianjin/Henry Hub), while a No represents an opinion that the facility should not be built in the given location. Concerning these data, you are to answer the following questions.
- Tabulate opinions by location. What is the proportion of analysts favoring each location?
result <- cross_tabs(
Experts,
var1 = "Location",
var2 = "Build",
)
summary(result, check = c("observed", "row_perc"))## Cross-tabs
## Data : Experts
## Variables: Location, Build
## Null hyp.: there is no association between Location and Build
## Alt. hyp.: there is an association between Location and Build
##
## Observed:
## Build
## Location Build No Total
## HenryHub 66 15 81
## Tianjin 31 33 64
## Total 97 48 145
##
## Row percentages:
## Build
## Location Build No Total
## HenryHub 0.81 0.19 1
## Tianjin 0.48 0.52 1
## Total 0.67 0.33 1
##
## Chi-squared: 17.628 df(1), p.value < .001
##
## 0.0 % of cells have expected values below 5
- The true proportion of analysts believing that the facility should be built in Henry Hub, Louisiana, is 0.5 with 95% confidence. True or False, when compared to the alternative that it is less than 0.5.
95% Of the time, it ranges from 0.713 to 0.892. Both are clearly greater than 0.5
- Prospective: How large a sample would be required, with 95% confidence, to estimate the true proportion to within � 0.15 assuming, for planning purposes, that the true proportion is 0.5?
43 is enough; we have 64 and 81.
## Sample size calculation
## Calculation type : Proportion
## Acceptable Error : 0.15
## Proportion : 0.5
## Confidence level : 0.95
## Incidence rate : 1
## Response rate : 1
## Population correction: None
##
## Required sample size : 43
## Required contact attempts: 43
- The survey was, in both cases, sent to samples of size 100. As is clear, not everyone that was sent the survey actually responded. First, suppose that the probability of responding is 0.8. 95% of the time, we should receive 73 or more responses to this survey.
## Probability calculator
## Distribution: Binomial
## n : 100
## p : 0.8
## Mean : 80
## St. dev : 4
## Lower bound : 0.05 (73)
## Upper bound :
##
## P(X = 73) = 0.022
## P(X < 73) = 0.034
## P(X <= 73) = 0.056
## P(X > 73) = 0.944
## P(X >= 73) = 0.966
- Assume that responding to the survey or not is a binomial random variable. What is the 95% lower bound for the probability of responding to Tianjin? What is the 95% lower bound for the probability of responding to Henry Hub?
##
## Exact binomial test
##
## data: 64 and 100
## number of successes = 64, number of trials = 100, p-value = 0.003319
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.5536572 1.0000000
## sample estimates:
## probability of success
## 0.64
##
## Exact binomial test
##
## data: 81 and 100
## number of successes = 81, number of trials = 100, p-value = 1.351e-10
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.7337953 1.0000000
## sample estimates:
## probability of success
## 0.81
- Do analysts prefer either of the locations?
Yes. Henry Hub by 0.182 to 0.479 difference in probability of Build.
result <- compare_props(
Experts,
var1 = "Location",
var2 = "Build",
levs = "Build"
)
summary(result, show = TRUE)## Pairwise proportion comparisons
## Data : Experts
## Variables : Location, Build
## Level : Build in Build
## Confidence: 0.95
## Adjustment: None
##
## Location Build No p n n_missing sd se me
## HenryHub 66 15 0.815 81 0 0.388 0.043 0.085
## Tianjin 31 33 0.484 64 0 0.500 0.062 0.122
##
## Null hyp. Alt. hyp. diff p.value chisq.value
## HenryHub = Tianjin HenryHub not equal to Tianjin 0.33 < .001 17.628
## df 2.5% 97.5%
## 1 0.182 0.479 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Weather Concerns
Both locations [Tianjin and Henry Hub] are adjacent to large bodies of water. If the probability of more than 3 hurricanes in a year exceeds 0.05 in either location, this is a game-changing fact. Management is unwilling to build in a location with such a high probability of a disastrous loss of investment and possible collateral damage.
- If the arrival rate of hurricanes at Henry Hub is 1 per year (as it is known to be), what is the probability of more than 3 hurricanes in a given year at Henry Hub?
The result is less than 0.02. P(X > 3) = 0.019
## Probability calculator
## Distribution: Poisson
## Lambda : 1
## Mean : 1
## Variance : 1
## Lower bound :
## Upper bound : 3
##
## P(X = 3) = 0.061
## P(X < 3) = 0.92
## P(X <= 3) = 0.981
## P(X > 3) = 0.019
## P(X >= 3) = 0.08
- If the arrival rate of hurricanes at Tianjin is 0.6 (as it is known to be), what is the probability of more than 3 hurricanes in a given year at Tianjin?
P(X > 3) = 0.003
## Probability calculator
## Distribution: Poisson
## Lambda : 0.6
## Mean : 0.6
## Variance : 0.6
## Lower bound :
## Upper bound : 3
##
## P(X = 3) = 0.02
## P(X < 3) = 0.977
## P(X <= 3) = 0.997
## P(X > 3) = 0.003
## P(X >= 3) = 0.023
The result is less than 0.05.
Final Problem
Defend a decision to build the factory in one of the two locations and relate each previous section and your analysis of the relevant data to your chosen alternative. If necessary, weigh the value of conflicting evidence in light of the source. You are only to faithfully report your analysis results and bring no external considerations to bear beyond the information in the data and the issues raised in this investigation.
The data supports Henry Hub.
- Prices are lower.
- Forecast profits are higher, and
- Experts agree.
Henry Hub Natural Gas /Photo from The Wall Street Journal