3.43 To control the risk of severe core damage during a commercial nuclear power station blackout accident, the reliability of the emergency diesel generators in starting on demand must be maintained at a high level. The paper “Empirical Bayes Estimation of the Reliability of Nuclear-Power Emergency Diesel Generators” [Technometrics (1996) 38:11-23] contains data on the failure history of seven nuclear power plants. The following data are the number of successful demands between failures for the diesel generators at one of these plants from 1982 to 1988.

28 50 193 55 4 7 147 76 10 0 10 84 0 9 1 0 62 26 15 226 54 46 128 4 105 40 4 273 164 7 55 41 26 6

powerdata <- read.csv(file="3.43.csv", header = TRUE)
attach(powerdata)   # the attached column of successful demands between failures is referred to below as pd

(Note: The failure of the diesel generator does not necessarily result in damage to the nuclear core because all nuclear power plants have several emergency diesel generators.)

a. Calculate the mean and median of the successful demands between failures.

The mean is

mean(pd)
## [1] 57.52941

The median is

median(pd)
## [1] 34
  1. Which measure appears to best represent the center of the data?

Sorting the data shows several very large values that pull the mean upward, so the median (34) represents the center better than the mean (57.53); the histogram sketched after the sorted values shows the same right skew.
sort(pd)
##  [1]   0   0   0   1   4   4   4   6   7   7   9  10  10  15  26  26  28
## [18]  40  41  46  50  54  55  55  62  76  84 105 128 147 164 193 226 273
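A quick histogram (a minimal sketch, using the pd column attached above) makes this visible: the long right tail pulls the mean well above the median.

hist(pd, main = "Successful demands between failures",
     xlab = "Demands between failures")   # long right tail: mean > median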
  1. Calculate the range and standard deviation, s.

The range is
range(pd)
## [1]   0 273

The standard deviation is

sd(pd)
## [1] 70.1955
  1. Use the range approximation to estimate s. How close is the approximation to the true value?

The range is 273 and the standard deviation is s = 70.1955. A common rule of thumb is that range/4 approximates s:
 (max(pd)-min(pd))/4
## [1] 68.25

The range approximation is

70.1955 - 68.25
## [1] 1.9455

away from the true standard deviation, so the range/4 rule of thumb comes reasonably close here.

  1. Construct the intervals y+/-s, y+/-2s, and y+/-3s.

y+/-s
slower = mean(pd) - sd(pd)
shigher = mean(pd) + sd(pd)
slower
## [1] -12.66609
shigher
## [1] 127.7249

y+/-2s

s2lower = mean(pd) - 2 * sd(pd)
s2higher = mean(pd) + 2 * sd(pd)
s2lower
## [1] -82.86159
s2higher
## [1] 197.9204

y+/-3s

s3lower = mean(pd) - 3 * sd(pd)
s3higher = mean(pd) + 3 * sd(pd)
s3lower
## [1] -153.0571
s3higher
## [1] 268.1159

Count the number of demands between failures falling in each of the three intervals.

y+/-s

 sum((mean(pd) - sd(pd)) < pd & pd < (mean(pd) + sd(pd)))
## [1] 28

y+/-2s

 sum((mean(pd) - 2*sd(pd)) < pd & pd < (mean(pd) + 2*sd(pd)))
## [1] 32

y+/-3s

 sum((mean(pd) - 3*sd(pd)) < pd & pd < (mean(pd) + 3*sd(pd)))
## [1] 33

Convert these numbers to percentages and compare your results to the Empirical Rule.

y+/-s

sum((mean(pd) - sd(pd)) < pd & pd < (mean(pd) + sd(pd)))/length(pd)
## [1] 0.8235294

The Empirical Rule predicts about 68% within 1 standard deviation; here about 82% of the observations fall in this interval.

y+/-2s

sum((mean(pd) - 2*sd(pd)) < pd & pd < (mean(pd) + 2*sd(pd)))/length(pd)
## [1] 0.9411765

The Empirical Rule predicts about 95% within 2 standard deviations; here about 94% fall in this interval.

y+/-3s

sum((mean(pd) - 3*sd(pd)) < pd & pd < (mean(pd) + 3*sd(pd)))/length(pd)
## [1] 0.9705882

The Empirical Rule predicts about 99.7% within 3 standard deviations; here about 97% fall in this interval.

  1. Why do you think the Empirical Rule and your percentages do not match well?

The percentages do not match the Empirical Rule well because the rule assumes a roughly mound-shaped (bell-shaped) distribution. These data are strongly right-skewed, with several very large values, so the rule's percentages do not apply.
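Because the same within-k-standard-deviations count is needed repeatedly (here and again in Exercise 3.48), a small helper function can compute the coverage for any multiple of s. This is a sketch under the assumptions above; the name emp_rule_coverage is mine, not from the text.

emp_rule_coverage <- function(x, k) {
  # proportion of observations strictly within k standard deviations of the mean
  mean(abs(x - mean(x)) < k * sd(x))
}
sapply(1:3, function(k) emp_rule_coverage(pd, k))   # compare with 0.68, 0.95, 0.997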

3.48 The paper “Conditional Simulation of Waste-Site Performance” [ Technometrics (1994) 36: 129-161] discusses the evaluation of a pilot facility for demonstrating the safe management, storage, and disposal of defense-generated, radioactive, transuranic waste. Researchers have determined that one potential pathway for release of radionuclides is through contaminant transport in groundwater. Recent focus has been on the analysis of transmissivity, a function of the properties and the thickness of an aquifer that reflects the rate at which water is transmitted through the aquifer. The following table contains 41 measurements of transmissivity, T, made at the pilot facility.

9.354 6.302 24.609 10.093 0.939 354.81 15399.27 88.17 1253.43 0.75 312.10 1.94 3.28 1.32 7.68 2.31 16.69 2772.68 0.92 10.75 0.000753 1.08 741.99 3.23 6.45 2.69 3.98 2876.07 12201.13 4273.66 207.06 2.50 2.80 5.05 3.01 462.38 5515.69 118.28 10752.27 956.97 20.43

measurements <- read.csv(file="3.48.csv", header = TRUE)
attach(measurements)   # the attached transmissivity column is referred to below as t
  1. Draw a relative frequency histogram for the 41 values of T.
h <- hist(t, plot = FALSE)                 # bin the data without drawing
h$counts <- h$counts / sum(h$counts)       # convert counts to relative frequencies
plot(h, freq = TRUE, ylab = "Relative Frequency")

  1. Describe the shape of the histogram.

The right tail is much longer than the left tail, so the distribution is strongly skewed to the right.

  2. When the relative frequency histogram is highly skewed to the right, the Empirical Rule may not yield very accurate results. Verify this statement for the data given.

The Empirical Rule assumes a roughly mound-shaped (bell-shaped) distribution, which these data clearly do not have. Constructing the intervals:

y+/-s

slower = mean(t) - sd(t)
shigher = mean(t) + sd(t)
slower
## [1] -2062.335
shigher
## [1] 4912.78

y+/-2s

s2lower = mean(t) - 2 * sd(t)
s2higher = mean(t) + 2 * sd(t)
s2lower
## [1] -5549.893
s2higher
## [1] 8400.338

y+/-3s

s3lower = mean(t) - 3 * sd(t)
s3higher = mean(t) + 3 * sd(t)
s3lower
## [1] -9037.451
s3higher
## [1] 11887.9

Let’s count the number of measurements falling in each of the three intervals:

y+/-s

 sum((mean(t) - sd(t)) < t & t < (mean(t) + sd(t)))
## [1] 37

y+/-2s

 sum((mean(t) - 2*sd(t)) < t & t < (mean(t) + 2*sd(t)))
## [1] 38

y+/-3s

 sum((mean(t) - 3*sd(t)) < t & t < (mean(t) + 3*sd(t)))
## [1] 39

Let’s convert these numbers to percentages and compare the results to the Empirical Rule.

y+/-s

sum((mean(t) - sd(t)) < t & t < (mean(t) + sd(t)))/length(t)
## [1] 0.902439

The Empirical Rule predicts about 68% within 1 standard deviation; here about 90% of the measurements fall in this interval.

y+/-2s

sum((mean(t) - 2*sd(t)) < t & t < (mean(t) + 2*sd(t)))/length(t)
## [1] 0.9268293

The Empirical Rule predicts about 95% within 2 standard deviations; here about 93% fall in this interval.

y+/-3s

sum((mean(t) - 3*sd(t)) < t & t < (mean(t) + 3*sd(t)))/length(t)
## [1] 0.9512195

The Empirical Rule predicts about 99.7% within 3 standard deviations; here only about 95% fall in this interval.

The percentages clearly do not match the Empirical Rule, which confirms that the rule is unreliable for highly right-skewed data.

  1. Data analysts often find it easier to work with mound-shaped relative frequency histograms. A transformation of the data will sometimes achieve this shape. Replace the given 41 T values with the logarithm base 10 of the values and reconstruct the relative frequency histogram. Is the shape more mound-shaped than the original data? Apply the Empirical Rule to the transformed data, and verify that it yields more accurate results than it did with the original data.
h <- hist(log10(t), plot = FALSE)          # bin the log10-transformed data without drawing
h$counts <- h$counts / sum(h$counts)       # convert counts to relative frequencies
plot(h, freq = TRUE, ylab = "Relative Frequency")

The new histogram is much more mound-shaped than that of the original data. Applying the Empirical Rule to the transformed data:

y+/-s

sum((mean(log10(t)) - sd(log10(t))) < log10(t) & log10(t) < (mean(log10(t)) + sd(log10(t))))/length(log10(t))
## [1] 0.7560976

The Empirical Rule predicts about 68% within 1 standard deviation; here about 76% of the log-transformed values fall in this interval.

y+/-2s

sum((mean(log10(t)) - 2*sd(log10(t))) < log10(t) & log10(t) < (mean(log10(t)) + 2*sd(log10(t))))/length(log10(t))
## [1] 0.9756098

The Empirical Rule predicts about 95% within 2 standard deviations; here about 98% fall in this interval.

y+/-3s

sum((mean(log10(t)) - 3*sd(log10(t))) < log10(t) & log10(t) < (mean(log10(t)) + 3*sd(log10(t))))/length(log10(t))
## [1] 1

The Empirical Rule predicts about 99.7% within 3 standard deviations; here 100% fall in this interval.

The percentages still do not match the Empirical Rule exactly, but they are much closer than they were for the original data.
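Using the emp_rule_coverage helper sketched at the end of Exercise 3.43 (my own addition, not part of the text), the raw and log-transformed coverages can be tabulated side by side:

sapply(1:3, function(k)
  c(raw = emp_rule_coverage(t, k), log10 = emp_rule_coverage(log10(t), k)))
# columns correspond to k = 1, 2, 3; the log10 row tracks 0.68 / 0.95 / 0.997 much more closely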

3.55 Data are collected on the weekly expenditures of a sample of urban households on food (including restaurant expenditures). The data, obtained from diaries kept by each household, are grouped by number of members of the household. The expenditures are as follows:

1 member: 67 62 168 128 131 118 80 53 99 68 76 55 84 77 70 140 84 65 67 183

2 members: 129 116 122 70 141 102 120 75 114 81 106 95 94 98 85 81 67 69 119 105 94 94 92

3 members: 79 99 171 145 86 100 116 125 82 142 82 94 85 191 100 116

4 members: 139 251 93 155 158 114 108 111 106 99 132 62 129 91

5 members: 121 128 129 140 206 111 104 109 135 136

mem1 <- read.csv(file="mem1.csv", header = T)
attach(mem1)
mem2 <- read.csv(file="mem2.csv", header = T)
attach(mem2)
mem3 <- read.csv(file="mem3.csv", header = T)
attach(mem3)
mem4 <- read.csv(file="mem4.csv", header = T)
attach(mem4)
mem51 <- read.csv(file="mem51.csv", header = T)
attach(mem51)   # note: each attach() masks the mems column attached before it; the data frames are referenced directly below, so these attach() calls are not strictly needed
  1. Compute the mean expenditure separately for each of the five groups.
summary(mem1)
##       mems       
##  Min.   : 53.00  
##  1st Qu.: 67.00  
##  Median : 78.50  
##  Mean   : 93.75  
##  3rd Qu.:120.50  
##  Max.   :183.00
summary(mem2)
##       mems       
##  Min.   : 67.00  
##  1st Qu.: 83.00  
##  Median : 95.00  
##  Mean   : 98.65  
##  3rd Qu.:115.00  
##  Max.   :141.00
summary(mem3)
##       mems       
##  Min.   : 79.00  
##  1st Qu.: 85.75  
##  Median :100.00  
##  Mean   :113.31  
##  3rd Qu.:129.25  
##  Max.   :191.00
summary(mem4)
##       mems      
##  Min.   : 62.0  
##  1st Qu.:100.8  
##  Median :112.5  
##  Mean   :124.9  
##  3rd Qu.:137.2  
##  Max.   :251.0
summary(mem51)
##       mems      
##  Min.   :104.0  
##  1st Qu.:113.5  
##  Median :128.5  
##  Mean   :131.9  
##  3rd Qu.:135.8  
##  Max.   :206.0
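The five means can also be extracted directly rather than read off the summary() output. A minimal sketch, assuming the expenditure column is named mems as shown in the summaries above:

groups <- list(mem1 = mem1, mem2 = mem2, mem3 = mem3, mem4 = mem4, mem51 = mem51)
sapply(groups, function(g) mean(g$mems))   # one mean weekly expenditure per household-size group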
  1. Combine the five data sets into a single data set and then compute the mean expenditure.
total <- rbind(mem1, mem2, mem3, mem4, mem51)
summary(total)
##       mems      
##  Min.   : 53.0  
##  1st Qu.: 83.0  
##  Median :104.0  
##  Mean   :108.7  
##  3rd Qu.:128.5  
##  Max.   :251.0
  1. Describe a method by which the mean for the combined data set could be obtained from the five individual means.

A simple average of the five group means, (93.75 + 98.65 + 113.31 + 124.9 + 131.9)/5 = 562.51/5 ≈ 112.5, does not reproduce the combined mean of 108.7, because the five groups have different sample sizes. The combined mean can be obtained exactly from the individual means by weighting each group mean by the number of households in that group (a weighted average), as sketched below.
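A minimal sketch of that weighted average, again assuming the mems column name from the summaries above:

ns    <- c(nrow(mem1), nrow(mem2), nrow(mem3), nrow(mem4), nrow(mem51))     # group sizes
means <- c(mean(mem1$mems), mean(mem2$mems), mean(mem3$mems),
           mean(mem4$mems), mean(mem51$mems))                               # group means
sum(ns * means) / sum(ns)   # weighted average; equals the combined mean, about 108.7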

  1. Describe the relation (if any) among the mean expenditures for the five groups.

The mean expenditure increases steadily with the number of household members: larger households spend more per week on food, on average, as the plot sketched below shows.
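A quick plot of the group means against household-size group (a sketch, reusing the means vector from the weighted-average sketch above) makes the increasing trend easy to see:

plot(seq_along(means), means, type = "b",
     xlab = "Household-size group (smallest to largest)",
     ylab = "Mean weekly food expenditure")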

3.69 Certain types of diseases tend to occur in clusters. In particular, persons affected with AIDS, syphilis, and tuberculosis may have some common characteristics and associations that increase their chances of contracting these diseases. The following table lists the number of reported cases by state in 2001.

State  AIDS   Syphilis  Tuber.     State       AIDS    Syphilis  Tuber.
AL     438    720       265        MT          15      0         20
AK     18     9         54         NE          74      16        40
AZ     540    1,147     289        NV          252     62        96
AR     199    239       162        NH          40      20        20
CA     4,315  3,050     3,332      NJ          1,756   1,040     530
CO     288    149       138        NM          143     73        54
CT     584    165       121        NY          7,476   3,604     1,676
DE     248    79        33         NC          942     1,422     398
DC     870    459       74         ND          3       2         6
FL     5,138  2,914     1,145      OH          581     297       306
GA     1,745  1,985     575        OK          243     288       194
HI     124    41        151        OR          259     48        123
ID     19     11        9          PA          1,840   726       350
IL     1,323  1,541     707        RI          103     39        60
IN     378    529       115        SC          729     913       263
IA     90     44        43         SD          25      1         13
KS     98     88        63         TN          602     1,478     313
KY     333    191       152        TX          2,892   3,660     1,643
LA     861    793       294        UT          124     25        35
ME     48     16        20         VT          25      8         7
MD     1,860  937       262        VA          951     524       306
MA     765    446       270        WA          532     174       261
MI     548    1,147     330        WV          100     7         32
MN     157    132       239        WI          193     131       86
MS     418    653       154        WY          5       4         3
MO     445    174       157        All States  41,868  32,221    15,989

diseases <- read.csv(file="diseases.csv", header = TRUE)
attach(diseases)
diseases
  1. Construct a scatterplot of the number of AIDS cases versus the number of syphilis cases.
plot(AIDS, Syphilis, main="Scatterplot AIDS vs Syphilis", 
    xlab="AIDS", ylab="Syphilis")
# Add fit lines
abline(lm(Syphilis~AIDS), col="red") # regression line (y~x) 
lines(lowess(AIDS,Syphilis), col="blue") # lowess line (x,y)

  1. Compute the correlation between the number of AIDS cases and the number of syphilis cases.
cor(AIDS,Syphilis)
## [1] 0.886156
  1. Does the value of the correlation coefficient reflect the degree of association shown in the scatterplot?

Yes. The scatterplot shows a strong positive association between AIDS and syphilis cases: states with more AIDS cases also tend to have more syphilis cases, and the correlation coefficient of about 0.89 is close to 1, consistent with what the plot shows.

  1. Why do you think there may be a correlation between these two diseases?

Looking at the data, the association between AIDS and syphilis could be due to the way in which the diseases are transmitted: both are sexually transmitted, so they share risk factors and tend to occur in the same populations.

3.70 Refer to the data in Exercise 3.69. a. Construct a scatterplot of the number of AIDS cases versus the number of tuberculosis cases.

plot(AIDS, Tuber, main="Scatterplot AIDS vs Tuberculosis", 
    xlab="AIDS", ylab="Tuberculosis")
# Add fit lines
abline(lm(Tuber~AIDS), col="red") # regression line (y~x) 
lines(lowess(AIDS,Tuber), col="blue") # lowess line (x,y)

  1. Compute the correlation between the number of AIDS cases and the number of tuberculosis cases.
cor(AIDS,Tuber)
## [1] 0.8100791
  1. Why do you think there may be a correlation between these two diseases?

Since AIDS weakens the immune system, it leaves the body more vulnerable to other infections such as tuberculosis. One way to investigate whether there is a true relationship would be to examine dates of infection and see whether people who contracted AIDS subsequently contracted tuberculosis.

3.71 Refer to the data in Exercise 3.69. a. Construct a scatterplot of the number of syphilis cases versus the number of tuberculosis cases.

plot(Syphilis,Tuber, main="Scatterplot Syphilis vs Tuberculosis", 
    xlab="Syphilis", ylab="Tuberculosis")
# Add fit lines
abline(lm(Tuber~Syphilis), col="red") # regression line (y~x) 
lines(lowess(Syphilis,Tuber), col="blue") # lowess line (x,y)

  1. Compute the correlation between the number of syphilis cases and the number of tuberculosis cases.
cor(Syphilis,Tuber)
## [1] 0.8495313
  1. Why do you think there may be a correlation between these two diseases?

Syphilis may leave the body more vulnerable to other infections such as tuberculosis, and the two diseases also tend to occur in populations with common risk factors. One way to investigate whether there is a true relationship would be to examine dates of infection and see whether people who contracted syphilis subsequently contracted tuberculosis.
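As a cross-check on Exercises 3.69-3.71, all three pairwise correlations can be computed in one call. A sketch, assuming the column names AIDS, Syphilis, and Tuber used above and that the file contains only the state rows (a totals row such as "All States", if present, should be dropped first):

cor(diseases[, c("AIDS", "Syphilis", "Tuber")])   # 3 x 3 matrix of pairwise correlations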

3.72 Refer to the data in Exercise 3.69. a. Construct a quantile plot of the number of syphilis cases.

qqnorm(Syphilis, main = "Syphilis Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(Syphilis)
abline(h = quantile(Syphilis, 0.90), lty = 2)   # horizontal line marking the 90th percentile
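Strictly speaking, the textbook's "quantile plot" is an empirical quantile plot (sorted values plotted against their cumulative proportions) rather than a normal Q-Q plot. A minimal sketch for the same Syphilis column:

n <- length(Syphilis)
plot(ppoints(n), sort(Syphilis),
     xlab = "Proportion of data", ylab = "Number of syphilis cases",
     main = "Quantile Plot of Syphilis Cases")
abline(v = 0.90, lty = 2)   # the height of the curve at this line is the 90th percentile

The same sketch works for the tuberculosis and AIDS counts in Exercises 3.73 and 3.74 by swapping in the corresponding column.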

  1. From the quantile plot, determine the 90th percentile for the number of syphilis cases.

quantile(Syphilis, c(.90))
##  90% 
## 1541
  1. Identify the states in which the number of syphilis cases is above the 90th percentile.
Sabove90 = subset(diseases,Syphilis > (quantile(Syphilis, c(.90))))
Sabove90

From the table, the states above the 90th percentile (1,541 cases) are CA, FL, GA, NY, and TX.

3.73 Refer to the data in Exercise 3.69. a. Construct a quantile plot of the number of tuberculosis cases.

qqnorm(Tuber, main = "Tuberculosis Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(Tuber)
abline(h = quantile(Tuber, 0.90), lty = 2)   # horizontal line marking the 90th percentile

  1. From the quantile plot, determine the 90th percentile for the number of tuberculosis cases.
quantile(Tuber, c(.90))
## 90% 
## 575
  1. Identify the states in which the number of tuberculosis cases is above the 90th percentile.
Tabove90 = subset(diseases,Tuber > (quantile(Tuber, c(.90))))
Tabove90

From the table, the states above the 90th percentile (575 cases) are CA, FL, IL, NY, and TX.

3.74 Refer to the data in Exercise 3.69. a. Construct a quantile plot of the number of AIDS cases.

qqnorm(AIDS, main = "AIDS Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(AIDS)
abline(h = quantile(AIDS, 0.90), lty = 2)   # horizontal line marking the 90th percentile

  1. From the quantile plot, determine the 90th percentile for the number of AIDS cases.
quantile(AIDS, c(.90))
##  90% 
## 1756
  1. Identify the states in which the number of AIDS cases is above the 90th percentile.
Aabove90 = subset(diseases,AIDS > (quantile(AIDS, c(.90))))
Aabove90

From the table, the states above the 90th percentile (1,756 cases) are CA, FL, MD, NY, PA, and TX.

3.75 Refer to the results from Exercises 3.72-3.74. a. How many states had numbers of AIDS, tuberculosis, and syphilis cases that were all above the 90th percentiles?

Four states, as the intersection sketched below confirms.
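A quick way to recover them from the three subsets computed above (a sketch; it assumes the state abbreviations are stored in a column named State and that no "All States" totals row is included):

Reduce(intersect, list(Sabove90$State, Tabove90$State, Aabove90$State))   # states above all three 90th percentiles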

  1. Identify these states and comment on any common elements among the states.

The states are CA, FL, NY, and TX. These four are among the most populous states in the U.S.

  2. How could the U.S. government apply the results from Exercises 3.69-3.75 in making public health policy?

Knowing which states have the highest numbers of cases (above the 90th percentile), the U.S. government can concentrate its public health efforts in those four states, for example by creating programs to increase awareness.