Hypothesis Testing

• Hypothesis Testing: One Sample Test

The principal of a charter school in Kolkata believes that the IQs of its students are above the national average of 100. From past experience, IQ is normally distributed with a standard deviation of 10. A random sample of 20 students is selected from this school and their IQs are observed. The following are the observed values.

95 91 110 93 133 119 113 107 110 89 113 100 100 124 116 113 110 106 115 113

Do the IQs of students at the school run above the national average at alpha = 0.01?
Comment on the normality assumption on the data.

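x=c(95,91,110,93,133,119,113,107,110,89,113,100,100,124,116,113,110,106,115,113)
# Caution: mu is not supplied, so t.test() tests H0: mu = 0 (see the output below).
# The one-sided 99% confidence bound does not depend on mu and can be compared with 100.
t.test(x,alternative = 'greater',conf.level = 0.99)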

## 
##  One Sample t-test
## 
## data:  x
## t = 43.182, df = 19, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0
## 99 percent confidence interval:
##  102.1193      Inf
## sample estimates:
## mean of x 
##     108.5

qqnorm(x, pch = 1, frame = FALSE)

print("Since the lower limit of the 99% confidence interval (102.12) is above the national average of 100, we reject the null hypothesis at alpha = 0.01 and conclude that the mean IQ of students at this school is above the national average.")

## [1] "Since the lower limit of the 99% confidence interval (102.12) is above the national average of 100, we reject the null hypothesis at alpha = 0.01 and conclude that the mean IQ of students at this school is above the national average."

print("Since the points in the Q-Q plot fall roughly along a straight line, the normality assumption appears reasonable for these data.")

## [1] "Since the points in the Q-Q plot fall roughly along a straight line, the normality assumption appears reasonable for these data."
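Because the problem states that IQ has a known standard deviation of 10, a z-test against the hypothesized mean of 100 is a natural alternative to the t-test. The following is a minimal base-R sketch under that assumption; the names `mu0` and `sigma` are introduced here for illustration only.

```r
x <- c(95, 91, 110, 93, 133, 119, 113, 107, 110, 89,
       113, 100, 100, 124, 116, 113, 110, 106, 115, 113)
mu0   <- 100   # hypothesized (national) mean
sigma <- 10    # population standard deviation, taken as known from the problem

# z statistic for H0: mu = 100 against H1: mu > 100
z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
p_value <- pnorm(z, lower.tail = FALSE)   # upper-tail p-value

z
p_value
p_value < 0.01   # TRUE means H0 is rejected at alpha = 0.01
```

Computing the statistic by hand keeps the sketch free of extra packages and makes the role of the known standard deviation explicit.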

In order to find out whether children with chronic diarrhea have the same average haemoglobin level (Hb) that is normally seen in healthy children in the same area, a random sample of 10 children with chronic diarrhea is selected and their Hb levels (g/dL) are obtained as follows.

12.3 11.4 14.2 15.3 14.8 13.8 11.1 15.1 15.8 13.2

Do the data provide sufficient evidence to indicate that the mean Hb level for children with chronic diarrhea is less than the normal value of 14.6 g/dL? Test the appropriate hypothesis using alpha = 0.01.
Draw a box plot and Q-Q plot for this data, and comment.

library(TeachingDemos)

## Warning: package 'TeachingDemos' was built under R version 4.0.3

x <- c(12.3,11.4,14.2,15.3,14.8,13.8,11.1,15.1,15.8,13.2)
xbar = mean(x)
stdev =sd(x)
var(x)

## [1] 2.74

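# Caution: mu is set to the sample mean, so z = 0 and the p-value is 0.5 by construction;
# the hypothesized value is 14.6. The upper confidence bound below does not depend on mu.
z.test(x,mu=xbar,sd=stdev,alternative ='less',conf.level = 0.99)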

## 
##  One Sample z-test
## 
## data:  x
## z = 0, n = 10.00000, Std. Dev. = 1.65529, Std. Dev. of the sample mean
## = 0.52345, p-value = 0.5
## alternative hypothesis: true mean is less than 13.7
## 99 percent confidence interval:
##      -Inf 14.91773
## sample estimates:
## mean of x 
##      13.7

boxplot(x)

qqnorm(x, pch = 1, frame = FALSE)

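print("The 99% upper confidence bound (14.92) exceeds 14.6, so we fail to reject the null hypothesis at alpha = 0.01: the data do not provide sufficient evidence that the mean Hb level for children with chronic diarrhea is below the normal value of 14.6 g/dL.")

## [1] "The 99% upper confidence bound (14.92) exceeds 14.6, so we fail to reject the null hypothesis at alpha = 0.01: the data do not provide sufficient evidence that the mean Hb level for children with chronic diarrhea is below the normal value of 14.6 g/dL."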
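Since the population standard deviation is not given and the sample is small, a one-sample t-test against the hypothesized value 14.6 is one reasonable way to set this problem up. The sketch below uses only `t.test()` from base R; the object name `res` is illustrative.

```r
x <- c(12.3, 11.4, 14.2, 15.3, 14.8, 13.8, 11.1, 15.1, 15.8, 13.2)

# H0: mu = 14.6 against H1: mu < 14.6; sigma is unknown, so the t distribution is used
res <- t.test(x, mu = 14.6, alternative = "less", conf.level = 0.99)

res$statistic        # (mean(x) - 14.6) / (sd(x) / sqrt(10))
res$p.value
res$p.value < 0.01   # FALSE means H0 is not rejected at alpha = 0.01
```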

A manufacturer of washers provides a particular model in one of three colors, white, black, or ivory. Of the first 1500 washers sold, it is noticed that 550 were of ivory color. Would you conclude that customers have a preference for the ivory color? Justify your answer. Use alpha = 0.01.

prop.test(550,1500,p=1/3,alternative = 'greater',conf.level=0.99)

## 
##  1-sample proportions test with continuity correction
## 
## data:  550 out of 1500, null probability 1/3
## X-squared = 7.3508, df = 1, p-value = 0.003352
## alternative hypothesis: true p is greater than 0.3333333
## 99 percent confidence interval:
##  0.337922 1.000000
## sample estimates:
##         p 
## 0.3666667

print("We reject the null hypothesis and conclude that the customers have a preference for the colour ivory.")

## [1] "We reject the null hypothesis and conclude that the customers have a preference for the colour ivory."

A machine in a certain factory must be repaired if it produces more than 12% defectives among the large lot of items it produces in a week. A random sample of 175 items from a week’s production contains 45 defectives, and it is decided that the machine must be repaired. Does the sample evidence support this decision? Use alpha = 0.02.

prop.test(45,175,p=0.12,alternative = 'greater',conf.level=0.98)

## 
##  1-sample proportions test with continuity correction
## 
## data:  45 out of 175, null probability 0.12
## X-squared = 29.884, df = 1, p-value = 2.294e-08
## alternative hypothesis: true p is greater than 0.12
## 98 percent confidence interval:
##  0.1930145 1.0000000
## sample estimates:
##         p 
## 0.2571429

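print("We reject the null hypothesis at alpha = 0.02: the sample evidence supports the decision to repair the machine.")

## [1] "We reject the null hypothesis at alpha = 0.02: the sample evidence supports the decision to repair the machine."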

It is claimed that two of three Americans say that the chances of world peace are seriously threatened by the nuclear capabilities of other countries. If in a random sample of 400 Americans, it is found that only 252 hold this view, do you think the claim is correct? Use alpha = 0.05. State any assumptions you make in solving this problem.

prop.test(252,400,p=2/3)

## 
##  1-sample proportions test with continuity correction
## 
## data:  252 out of 400, null probability 2/3
## X-squared = 2.2578, df = 1, p-value = 0.1329
## alternative hypothesis: true p is not equal to 0.6666667
## 95 percent confidence interval:
##  0.5803883 0.6770735
## sample estimates:
##    p 
## 0.63

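print("We fail to reject the null hypothesis (p-value = 0.1329 > 0.05), so the data are consistent with the claim. This assumes a random sample of independent responses, large enough for the normal approximation to the binomial.")

## [1] "We fail to reject the null hypothesis (p-value = 0.1329 > 0.05), so the data are consistent with the claim. This assumes a random sample of independent responses, large enough for the normal approximation to the binomial."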

A physician claims that the variance in cholesterol levels of adult men in a certain laboratory is at least 100. A random sample of 25 adult males from this laboratory produced a sample standard deviation of cholesterol levels of 12. Test the physician's claim at the 5% level of significance.

ssquare = 12^2
# H0: sigma^2 >= 100 vs H1: sigma^2 < 100, so the p-value is the lower-tail probability
pchisq((24*ssquare)/100,24)

## [1] 0.9248029

print("We fail to reject the null hypothesis and hence conclude that the physician may be correct.")

## [1] "We fail to reject the null hypothesis and hence conclude that the physician may be correct."

A company that manufactures precision special-alloy steel shafts claims that the variance in the diameters of shafts is no more than 0.0003. A random sample of 10 shafts gave a sample variance of 0.00027. At the 5% level of significance, test whether the company’s claim can be substantiated.

1-pchisq((9)*(0.00027)/(0.0003),9)

## [1] 0.5241009

print("We fail to reject the null hypothesis and conclude that the company's claim may be correct.")

## [1] "We fail to reject the null hypothesis and conclude that the company's claim may be correct."
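Both variance problems above use the statistic (n - 1)s^2 / sigma0^2 with n - 1 degrees of freedom; only the tail changes with the direction of the alternative. The sketch below wraps this in a small helper. `var_chisq_test` is a hypothetical function written for illustration, not part of any package.

```r
# Hypothetical helper: chi-square test for a single variance from summary statistics
var_chisq_test <- function(s2, n, sigma0_sq, alternative = c("less", "greater")) {
  alternative <- match.arg(alternative)
  stat <- (n - 1) * s2 / sigma0_sq
  # lower tail for H1: sigma^2 < sigma0^2, upper tail for H1: sigma^2 > sigma0^2
  p <- pchisq(stat, df = n - 1, lower.tail = (alternative == "less"))
  c(statistic = stat, df = n - 1, p.value = p)
}

# Cholesterol: the claim "at least 100" is H0, so H1 is sigma^2 < 100
chol <- var_chisq_test(s2 = 12^2, n = 25, sigma0_sq = 100, alternative = "less")

# Shafts: the claim "no more than 0.0003" is H0, so H1 is sigma^2 > 0.0003
shaft <- var_chisq_test(s2 = 0.00027, n = 10, sigma0_sq = 0.0003, alternative = "greater")

chol
shaft
```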

• Hypothesis Testing: Two Samples Test

In the academic year 1997–1998, two random samples of 25 male professors and 23 female professors from a large university produced a mean salary for male professors of $58,550 with a standard deviation of $4000 and an average for female professors of $53,700 with a standard deviation of $3200.

(a) At the 5% significance level, can you conclude that the mean salary of all male professors for 1997–1998 was higher than that of all female professors?
(b) Assume that the salaries of male and female professors are both normally distributed with equal standard deviations.

(58550-53700)/sqrt((4000)^2/25+(3200)^2/23)

## [1] 4.655683

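print("The test statistic 4.66 exceeds the one-sided 5% critical value (about 1.645 under the normal approximation), so we reject the null hypothesis and conclude that the mean salary of male professors was higher than that of female professors.")

## [1] "The test statistic 4.66 exceeds the one-sided 5% critical value (about 1.645 under the normal approximation), so we reject the null hypothesis and conclude that the mean salary of male professors was higher than that of female professors."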
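Part (b) assumes equal standard deviations, which points to the pooled two-sample t statistic rather than the unpooled one computed above. A minimal sketch from the summary statistics follows; all variable names are illustrative.

```r
n1 <- 25; xbar1 <- 58550; s1 <- 4000   # male professors
n2 <- 23; xbar2 <- 53700; s2 <- 3200   # female professors

# Pooled variance under the equal-standard-deviation assumption
sp2    <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
t_stat <- (xbar1 - xbar2) / sqrt(sp2 * (1 / n1 + 1 / n2))
df     <- n1 + n2 - 2

t_crit  <- qt(0.95, df)                        # one-sided 5% critical value
p_value <- pt(t_stat, df, lower.tail = FALSE)  # H1: mu_male > mu_female

c(t = t_stat, df = df, critical = t_crit, p.value = p_value)
```

The pooled statistic is close to the unpooled one here because the two sample sizes and standard deviations are similar.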

The following information was obtained from two independent samples selected from two normally distributed populations with unknown but equal variances.

Sample 1    14 15 11 14 10 8 13 10 12 16 15
Sample 2    17 16 21 12 20 18 16 14 21 20 13 20 13

Test at the 2% significance level whether µ1 is lower than µ2.

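Sample1=c(14,15,11,14,10,8,13,10,12,16,15)
sample2=c(17,16,21,12,20,18,16,14,21,20,13,20,13)
# Note: var.equal is not set, so this runs the Welch test rather than the pooled (equal-variance) test
t.test(Sample1,sample2,alternative ='less',conf.level = 0.98)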

## 
##  Welch Two Sample t-test
## 
## data:  Sample1 and sample2
## t = -3.7528, df = 21.88, p-value = 0.0005541
## alternative hypothesis: true difference in means is less than 0
## 98 percent confidence interval:
##       -Inf -1.862584
## sample estimates:
## mean of x mean of y 
##  12.54545  17.00000
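The problem states that the population variances are equal, so the pooled test is the version that matches the stated assumption. The sketch below passes `var.equal = TRUE` to `t.test()`; the names `Sample2` and `res` are illustrative.

```r
Sample1 <- c(14, 15, 11, 14, 10, 8, 13, 10, 12, 16, 15)
Sample2 <- c(17, 16, 21, 12, 20, 18, 16, 14, 21, 20, 13, 20, 13)

# H0: mu1 = mu2 against H1: mu1 < mu2, pooled variance, df = 11 + 13 - 2 = 22
res <- t.test(Sample1, Sample2, alternative = "less",
              var.equal = TRUE, conf.level = 0.98)

res$parameter        # degrees of freedom
res$p.value
res$p.value < 0.02   # TRUE means H0 is rejected at the 2% level
```

A p-value below 0.02 leads to rejecting the null hypothesis in favour of µ1 being lower than µ2, the same decision that the Welch output above supports.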

It is believed that the effects of smoking differ depending on race. The following table gives the results of a statistical study for this question.

            Number in    Average number of     Number of lung
            the study    cigarettes per day    cancer cases
Chinese     400          15                    78
American    280          15                    70

Do the data indicate that Americans are more likely to develop lung cancer due to smoking? Use alpha = 0.05.

# Rows: American, Chinese; columns: lung cancer cases, no lung cancer
smoke <- matrix(c(70,280-70,78,400-78),ncol = 2,byrow = TRUE)
smoke = as.table(smoke)
smoke

##     A   B
## A  70 210
## B  78 322

prop.test(smoke,alternative = "greater")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  smoke
## X-squared = 2.6119, df = 1, p-value = 0.05303
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.001640794  1.000000000
## sample estimates:
## prop 1 prop 2 
##  0.250  0.195

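print("We fail to reject the null hypothesis of equal proportions (p-value = 0.05303 > 0.05): the data do not provide sufficient evidence that Americans are more likely to develop lung cancer due to smoking.")

## [1] "We fail to reject the null hypothesis of equal proportions (p-value = 0.05303 > 0.05): the data do not provide sufficient evidence that Americans are more likely to develop lung cancer due to smoking."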

A supermarket chain is considering two sources A and B for the purchase of 50–pound bags of onions. The following table gives the results of a study.

                          Source A    Source B
Number of bags weighed    80          100
Mean weight               105.9       100.5
Sample variance           0.21        0.19

Test at alpha = 0.05 whether there is a difference in the mean weights.

tstat = (105.9-100.5)/(sqrt((0.21/80)+(0.19/100)))
degf = (((0.21/80)+(0.19/100))^2)/((((0.21/80)^2)/79)+(((0.19/100)^2)/99))
# Two-sided alternative: double the tail probability
2*pt(-abs(tstat),df = degf)

## [1] 1.809344e-134

print("We reject the null hypothesis and conclude that there is in fact a difference between the mean weights.")

## [1] "We reject the null hypothesis and conclude that there is in fact a difference between the mean weights."

In order to compare the mean Hemoglobin (Hb) levels of well-nourished and undernourished groups of children, random samples from each of these groups yielded the following summary.

                  Number of    Sample    Sample standard
                  children     mean      deviation
Well nourished    95           11.2      0.9
Undernourished    75           9.8       1.2

Test at alpha = 0.01 whether the mean Hb levels of well-nourished children were higher than those of undernourished children.

var1 = 0.9^2
var2 = 1.2^2
tstat = (11.2-9.8)/(sqrt((var1/95)+(var2/75)))
degf = (((var1/95)+(var2/75))^2)/((((var1/95)^2)/94)+(((var2/75)^2)/74))
pt(-abs(tstat),degf)

## [1] 2.762958e-14

print("We reject the null hypothesis and conclude that the mean Hb levels of well-nourished children were higher than those of undernourished children.")

## [1] "We reject the null hypothesis and conclude that the mean Hb levels of well-nourished children were higher than those of undernourished children."

The IQs of 17 students from one area of a city showed a mean of 106 with a standard deviation of 10, whereas the IQs of 14 students from another area showed a mean of 109 with a standard deviation of 7. Test for equality of variances between the IQs of the two groups at alpha = 0.02.

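Fstat1 = 100/49
# Caution: the degrees of freedom should be n1 - 1 = 16 and n2 - 1 = 13, and a test
# of equality is two-sided; this is a one-tailed probability with df 17 and 14.
1-pf(Fstat1,17,14)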

## [1] 0.09178643

print("We fail to reject the null hypothesis and conclude that the variances of the IQs of the students in the two different areas of the city may be the same.")

## [1] "We fail to reject the null hypothesis and conclude that the variances of the IQs of the students in the two different areas of the city may be the same."
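For a two-sided F test of equal variances, the degrees of freedom are the sample sizes minus one and the p-value is twice the smaller tail area. A minimal sketch from the summary statistics, with illustrative variable names:

```r
n1 <- 17; s1 <- 10   # first area
n2 <- 14; s2 <- 7    # second area

F_stat <- s1^2 / s2^2                                   # ratio of sample variances
upper  <- pf(F_stat, n1 - 1, n2 - 1, lower.tail = FALSE)
p_value <- 2 * min(upper, 1 - upper)                    # two-sided p-value

c(F = F_stat, num_df = n1 - 1, denom_df = n2 - 1, p.value = p_value)
p_value < 0.02   # FALSE means equality of variances is not rejected
```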

Because of the impact of the global economy on a high-wage country such as the United States, it is claimed that the domestic content in manufacturing industries fell between 1977 and 1997. A survey of 36 randomly picked U.S. companies gave the proportion of domestic content in total manufacturing in 1977 as 0.37 and in 1997 as 0.36. At the 1% level of significance, test the claim that the domestic content really fell during the period 1977–1997.

eco <- matrix(c(36*0.37,36*(1-0.37),36*0.36,36*(1-0.36)),ncol = 2,byrow = TRUE)
eco = as.table(eco)
eco

##       A     B
## A 13.32 22.68
## B 12.96 23.04

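# Caution: "less" tests whether the 1977 proportion is below the 1997 proportion; the claim
# that content fell corresponds to "greater", and alpha = 0.01 calls for conf.level = 0.99.
prop.test(eco,alternative = "less")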

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  eco
## X-squared = 1.6734e-31, df = 1, p-value = 0.5
## alternative hypothesis: less
## 95 percent confidence interval:
##  -1.0000000  0.2066383
## sample estimates:
## prop 1 prop 2 
##   0.37   0.36

print("We fail to reject the null hypothesis: at the 1% level the data do not provide sufficient evidence that the domestic content fell during the period 1977–1997.")

## [1] "We fail to reject the null hypothesis: at the 1% level the data do not provide sufficient evidence that the domestic content fell during the period 1977–1997."

The following data give SAT mean scores for math by state for 1989 and 1999 for 16 randomly selected states in USA.

State             1989    1999
Arizona           523     525
Connecticut       498     509
Alabama           539     555
Indiana           487     498
Kansas            561     576
Oregon            509     525
Nebraska          560     571
New York          496     502
Virginia          507     499
Washington        515     526
Illinois          539     585
North Carolina    469     493
Georgia           475     482
Nevada            512     517
Ohio              520     568
New Hampshire     510     518

Assuming that the samples come from a normal distribution:

Test that the mean SAT score for math in 1999 is greater than that in 1989 at alpha = 0.05. Assume the variances are equal.
Test for the equality of the variances at alpha = 0.05.

year_1989 <- c(523,498,539,487,561,509,560,496,507,515,539,469,475,512,520,510)
year_1999 <- c(525,509,555,498,576,525,571,502,499,526,585,493,482,517,568,518)
# Caution: with year_1989 first, 'greater' tests whether the 1989 mean exceeds the 1999 mean
t.test(year_1989,year_1999,alternative = 'greater',conf.level=0.95,var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  year_1989 and year_1999
## t = -1.3561, df = 30, p-value = 0.9074
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -32.22575       Inf
## sample estimates:
## mean of x mean of y 
##  513.7500  528.0625

var.test(year_1989,year_1999,alternative = "two.sided")

## 
##  F test to compare two variances
## 
## data:  year_1989 and year_1999
## F = 0.66122, num df = 15, denom df = 15, p-value = 0.4324
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2310274 1.8924778
## sample estimates:
## ratio of variances 
##          0.6612217

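print("We fail to reject the null hypothesis in both cases. For the means, the p-value for the intended direction (1999 greater than 1989) is 1 - 0.9074, about 0.093, which is above 0.05; for the variances the p-value is 0.4324. The data do not provide sufficient evidence that the mean increased or that the variances differ.")

## [1] "We fail to reject the null hypothesis in both cases. For the means, the p-value for the intended direction (1999 greater than 1989) is 1 - 0.9074, about 0.093, which is above 0.05; for the variances the p-value is 0.4324. The data do not provide sufficient evidence that the mean increased or that the variances differ."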
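To test that the 1999 mean exceeds the 1989 mean, the 1999 scores can be passed as the first argument so that `alternative = "greater"` points in the intended direction. The following sketch keeps the equal-variance assumption from the problem; `res` is an illustrative name.

```r
year_1989 <- c(523, 498, 539, 487, 561, 509, 560, 496,
               507, 515, 539, 469, 475, 512, 520, 510)
year_1999 <- c(525, 509, 555, 498, 576, 525, 571, 502,
               499, 526, 585, 493, 482, 517, 568, 518)

# H0: mu_1999 = mu_1989 against H1: mu_1999 > mu_1989, pooled variance
res <- t.test(year_1999, year_1989, alternative = "greater", var.equal = TRUE)

res$statistic
res$p.value
res$p.value < 0.05   # FALSE means H0 is not rejected at alpha = 0.05
```

Because the same states are measured in both years, a paired analysis (`paired = TRUE`) may also be worth considering, although the problem as stated asks for the equal-variance two-sample test.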

A new diet and exercise program has been advertised as a remarkable way to reduce blood glucose levels in diabetic patients. Ten randomly selected diabetic patients are put on the program, and the results after 1 month are given by the following table:

Before    268 225 252 192 307 228 246 298 231 185
After     106 186 223 110 203 101 211 176 194 203

Do the data provide sufficient evidence to support the claim that the new program reduces blood glucose level in diabetic patients? Use alpha = 0.05.

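Before <- c(268,225,252,192,307,228,246,298,231,185)
After <- c(106,186,223,110,203,101,211,176,194,203)
# Caution: the same patients are measured before and after, so a paired test (paired = TRUE)
# matches the design; this call treats the two samples as independent.
t.test(Before,After,alternative = 'greater')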

## 
##  Welch Two Sample t-test
## 
## data:  Before and After
## t = 3.6739, df = 17.556, p-value = 0.000899
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  37.91728      Inf
## sample estimates:
## mean of x mean of y 
##     243.2     171.3

print("We reject the null hypothesis and conclude that the new program reduces blood glucose level in diabetic patients")

## [1] "We reject the null hypothesis and conclude that the new program reduces blood glucose level in diabetic patients"
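Each patient contributes a before and an after measurement, so the design is paired. A minimal sketch of the paired version of the test follows; `res` is an illustrative name.

```r
Before <- c(268, 225, 252, 192, 307, 228, 246, 298, 231, 185)
After  <- c(106, 186, 223, 110, 203, 101, 211, 176, 194, 203)

# H0: mean(Before - After) = 0 against H1: mean(Before - After) > 0
res <- t.test(Before, After, alternative = "greater", paired = TRUE)

res$estimate         # mean of the within-patient differences
res$statistic
res$p.value < 0.05   # TRUE means H0 is rejected at alpha = 0.05
```

The paired test works on the ten within-patient differences, so it has 9 degrees of freedom and removes between-patient variability from the comparison.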

Suppose we want to know the effect on driving of a drug for cold and allergy, in a study in which the same people were tested twice, once after 1 hour of taking the drug and once when no drug is taken. Suppose we obtain the following data, which represent the number of cones (placed in a certain pattern) knocked down by each of the nine individuals before taking the drug and after an hour of taking the drug.

No drug       0 0 3 2 0 0 3 3 1
After drug    1 5 6 5 5 5 6 1 6

Assuming that the difference of each pair comes from an approximately normal distribution, test whether there is any difference in the individuals' driving ability under the two conditions. Use alpha = 0.05.

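No_Drug <- c(0,0,3,2,0,0,3,3,1)
After_Drug <- c(1,5,6,5,5,5,6,1,6)
# Caution: the same nine people are tested twice, so a paired test (paired = TRUE) matches
# the design; this call runs the Welch test for independent samples.
t.test(No_Drug,After_Drug)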

## 
##  Welch Two Sample t-test
## 
## data:  No_Drug and After_Drug
## t = -3.8015, df = 14.373, p-value = 0.001863
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.862103 -1.360119
## sample estimates:
## mean of x mean of y 
##  1.333333  4.444444

print("We reject the null hypothesis and conclude that there is a difference in the individuals' driving ability under the two conditions.")

## [1] "We reject the null hypothesis and conclude that there is a difference in the individuals' driving ability under the two conditions."
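The problem statement works with the difference of each pair, so the test statistic can be built directly from those differences. The sketch below computes the paired t statistic by hand; the names `d`, `t_stat` and `p_value` are illustrative.

```r
No_Drug    <- c(0, 0, 3, 2, 0, 0, 3, 3, 1)
After_Drug <- c(1, 5, 6, 5, 5, 5, 6, 1, 6)

d <- No_Drug - After_Drug     # within-person differences
n <- length(d)

# Paired t statistic: mean difference over its standard error, df = n - 1
t_stat  <- mean(d) / (sd(d) / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided alternative

c(t = t_stat, df = n - 1, p.value = p_value)
p_value < 0.05   # TRUE means H0 (no difference) is rejected
```

The same numbers should be returned by `t.test(No_Drug, After_Drug, paired = TRUE)`.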

An aquaculture farm takes water from a stream and returns it after it has circulated through the fish tanks. In order to find out how much organic matter is left in the waste water after the circulation, some samples of the water are taken at the intake and other samples are taken at the downstream outlet and tested for biochemical oxygen demand (BOD). BOD is a common environmental measure of the quantity of oxygen consumed by microorganisms during the decomposition of organic matter. If BOD increases, it can be said that the waste matter contains more organic matter than the stream can handle. The following table gives data for this problem.

Upstream      9.0 6.8 6.5 8.0 7.7 8.6 6.8 8.9 7.2 7.0
Downstream    10.2 10.2 9.9 11.1 9.6 8.7 9.6 9.7 10.4 8.1

Assuming that the samples come from a normal distribution,

(a) Test that the mean BOD for the downstream samples is less than for the samples upstream at alpha = 0.05. Assume that the variances are equal.
(b) Test for the equality of the variances at alpha = 0.05.
(c) In parts (a) and (b), we assumed samples are independent. Now, we feel this assumption is not reasonable. Assuming that the difference of each pair is approximately normal, test that the mean BOD for the downstream samples is less than for the upstream samples at alpha = 0.05.

Upstream=c(9.0,6.8,6.5,8.0,7.7,8.6,6.8,8.9,7.2,7.0)
Downstream=c(10.2,10.2,9.9,11.1,9.6,8.7,9.6,9.7,10.4,8.1)
t.test(Upstream,Downstream,alternative='greater',var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  Upstream and Downstream
## t = -5.2591, df = 18, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -2.79242      Inf
## sample estimates:
## mean of x mean of y 
##      7.65      9.75

var.test(Upstream,Downstream,alternative = "two.sided")

## 
##  F test to compare two variances
## 
## data:  Upstream and Downstream
## F = 1.1925, num df = 9, denom df = 9, p-value = 0.7974
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2962035 4.8010519
## sample estimates:
## ratio of variances 
##           1.192513

t.test(Upstream,Downstream,alternative='greater',paired = TRUE)

## 
##  Paired t-test
## 
## data:  Upstream and Downstream
## t = -5.3982, df = 9, p-value = 0.9998
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -2.81311      Inf
## sample estimates:
## mean of the differences 
##                    -2.1

cat("We fail to reject the null hypothesis in the first case: there is no evidence that the mean BOD for the downstream samples is less than that of the upstream samples. \n")

## We fail to reject the null hypothesis in the first case: there is no evidence that the mean BOD for the downstream samples is less than that of the upstream samples.

cat("We fail to reject the null hypothesis in the second case and conclude that the variance of BOD for the downstream samples may be equal to that of the upstream samples. \n")

## We fail to reject the null hypothesis in the second case and conclude that the variance of BOD for the downstream samples may be equal to that of the upstream samples.

cat("We fail to reject the null hypothesis in the third case: even when the two samples are treated as dependent, there is no evidence that the mean BOD for the downstream samples is less than that of the upstream samples. \n")

## We fail to reject the null hypothesis in the third case: even when the two samples are treated as dependent, there is no evidence that the mean BOD for the downstream samples is less than that of the upstream samples.

• Hypothesis Testing: Goodness-of-fit test

A die is rolled 60 times and the face values are recorded. The results are as follows.

Up face      1    2     3    4     5     6
Frequency    8    11    5    12    15    9

Is the die balanced? Test using alpha = 0.05.

Frequency=c(8,11,5,12,15,9)
# Goodness-of-fit test against equal probabilities (chisq.test defaults to p = 1/6 per face)
chisq.test(Frequency)

## 
##  Chi-squared test for given probabilities
## 
## data:  Frequency
## X-squared = 6, df = 5, p-value = 0.3062

print("We fail to reject the null hypothesis and hence conclude that the die may be balanced.")

## [1] "We fail to reject the null hypothesis and hence conclude that the die may be balanced."

Someone claims to have found a long lost work by Jane Austen. She asks you to decide whether or not the book was actually written by Austen. You buy a copy of Sense and Sensibility and count the frequencies of certain common words on some randomly selected pages. You do the same thing for the ‘long lost work’. You get the following table of counts.

Word                     a      an    this    that
Sense and Sensibility    150    30    30      90
Long lost work           90     20    10      80

Using this data, set up and evaluate a significance test of the claim that the long lost book is by Jane Austen. Use a significance level of 0.1.

Sense_and_Sensibility=c(150,30,30,90)
Long_lost_work=c(90,20,10,80)
# Test of homogeneity on the 2 x 4 table of word counts (rows: books, columns: words)
austen=rbind(Sense_and_Sensibility,Long_lost_work)
chisq.test(austen)

## 
##  Pearson's Chi-squared test
## 
## data:  austen
## X-squared = 7.9044, df = 3, p-value = 0.04803

print("We reject the null hypothesis at the 0.1 level: the word frequencies differ between the two books, which is evidence against the claim that the long lost book is by Jane Austen.")

## [1] "We reject the null hypothesis at the 0.1 level: the word frequencies differ between the two books, which is evidence against the claim that the long lost book is by Jane Austen."

The grades of students in a class of 200 are given in the following table. Test the hypothesis that the grades are normally distributed with a mean of 75 and a standard deviation of 8. Use alpha = 0.05.

Range                 0-59    60-69    70-79    80-89    90-100
Number of Students    12      36       90       44       18

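Number_of_Students=c(12,36,90,44,18)
# Caution: this evaluates the N(75, 8) CDF at the class counts, not at the class
# boundaries, so the values below are not goodness-of-fit probabilities.
pnorm(Number_of_Students,mean = 75,sd=8,lower.tail = TRUE)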

## [1] 1.703714e-15 5.440423e-07 9.696036e-01 5.331235e-05 5.204034e-13
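A chi-square goodness-of-fit test for a fully specified normal distribution compares the observed class counts with the counts expected under N(75, 8). The sketch below is one way to set this up. The class boundaries (59.5, 69.5, 79.5, 89.5, with open-ended tails) are an assumption made here, and the names `breaks`, `probs` and `res` are illustrative.

```r
observed <- c(12, 36, 90, 44, 18)

# Assumed class boundaries: halfway between adjacent classes, open-ended at both tails
breaks <- c(-Inf, 59.5, 69.5, 79.5, 89.5, Inf)

# Probability of each class under N(mean = 75, sd = 8)
probs    <- diff(pnorm(breaks, mean = 75, sd = 8))
expected <- sum(observed) * probs

# No parameters are estimated from the data, so df = 5 - 1 = 4
res <- chisq.test(observed, p = probs)

round(expected, 2)
res
```

Under these assumed boundaries the statistic is large relative to a chi-square distribution with 4 degrees of freedom, which would lead to rejecting the hypothesized normal distribution at alpha = 0.05; a different boundary convention would shift the numbers somewhat.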

Based on the sample data of 50 days contained in the following table, test the hypothesis that the daily mean temperatures in the city are normally distributed with mean 77 and variance 6. Use alpha = 0.05.

Temperature       46-55    56-65    66-75    76-85    86-95
Number of days    4        6        13       23       14

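Number_of_days=c(4,6,13,23,14)
# Caution: the CDF is evaluated at the day counts rather than the class boundaries,
# and mean = 75 differs from the hypothesized mean of 77.
pnorm(Number_of_days,mean = 75,sd=sqrt(6),lower.tail = TRUE)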

## [1] 4.992733e-185 6.986430e-175 1.196519e-141 2.582264e-100 3.439361e-137
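Before running a chi-square goodness-of-fit test here, it helps to look at the expected counts under the hypothesized N(77, variance 6) distribution. The sketch below computes them under assumed class boundaries (55.5, 65.5, 75.5, 85.5, open-ended tails). Note that the listed counts sum to 60 rather than 50; the sketch uses the counts as listed.

```r
observed <- c(4, 6, 13, 23, 14)

# Assumed class boundaries, open-ended at both tails
breaks <- c(-Inf, 55.5, 65.5, 75.5, 85.5, Inf)

# Class probabilities under N(mean = 77, sd = sqrt(6))
probs    <- diff(pnorm(breaks, mean = 77, sd = sqrt(6)))
expected <- sum(observed) * probs

round(expected, 3)
any(expected < 5)   # TRUE signals that the chi-square approximation is doubtful
```

With a standard deviation of only about 2.45, almost all of the hypothesized probability falls in the 66-75 and 76-85 classes, so the outer classes have expected counts near zero while the observed counts there are not small. Pooling classes or an exact method may be needed before a formal test.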

Define X as the number of under-filled bottles from a filling operation in a carton of 24 bottles. Under the inspection of 75 cartons, the following observations on X were recorded.

Value        0     1     2     3
Frequency    39    23    12    1

Based on these 75 observations, is a binomial distribution an appropriate model? Perform a goodness-of-fit procedure with alpha = 0.05.

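freq <- c(39,23,12,1)
exp <- c(0,1,2,3)
# Caution: this cross-tabulates the values against the frequencies as two factors;
# it does not compare the observed counts with binomial expected counts.
chisq.test(exp,freq)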

## Warning in chisq.test(exp, freq): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  exp and freq
## X-squared = 12, df = 9, p-value = 0.2133
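A binomial goodness-of-fit procedure usually estimates the per-bottle probability from the data, computes expected counts with `dbinom()`, pools sparse classes, and loses one extra degree of freedom for the estimated parameter. The following is a sketch under those assumptions. Pooling the values 2 and above into one class is a choice made here, and all variable names are illustrative.

```r
value     <- 0:3
observed  <- c(39, 23, 12, 1)
n_bottles <- 24

# Estimate p as total under-filled bottles over total bottles inspected
p_hat <- sum(value * observed) / (sum(observed) * n_bottles)

# Class probabilities for X = 0, X = 1, and X >= 2 (pooled)
probs <- c(dbinom(0:1, n_bottles, p_hat),
           pbinom(1, n_bottles, p_hat, lower.tail = FALSE))
obs_pooled <- c(observed[1:2], sum(observed[3:4]))
expected   <- sum(observed) * probs

stat    <- sum((obs_pooled - expected)^2 / expected)
df      <- length(obs_pooled) - 1 - 1        # one parameter (p) estimated
p_value <- pchisq(stat, df, lower.tail = FALSE)

round(expected, 2)
c(statistic = stat, df = df, p.value = p_value)
```

A p-value above 0.05 would mean the binomial model is not rejected for these data.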

• Hypothesis Testing: Testing of Independence

A random sample was taken of 300 undergraduate students from a university. The students in the sample were classified according to their gender and according to the choice of their major. The result is given in the following table.

Gender    Arts    Science    Engineering    Business
Female    23      45         12             15
Male      66      75         40             24

Test the hypothesis that the choice of the major by undergraduate students in this university is independent of their gender. Use a significance level of 0.1.

Female=c(23,45,12,15)
Male=c(66,75,40,24)
x=matrix(c(23,45,12,15,66,75,40,24),nrow=2,byrow=T)
subject=c('Arts','Science','Engineering','Business')
k=data.frame(Female,Male,subject)
y=rowSums(x)%*%t(colSums(x))/sum(x)                         # Expected: E         
testStatistic=sum((x-y)^2/y)
pValue=pchisq(testStatistic,prod(dim(x)-1),lower.tail = FALSE)
chisq.test(x)

## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 5.8873, df = 3, p-value = 0.1172

print("We fail to reject the null hypothesis and hence conclude that the choice of the major by undergraduate students in this university may be independent of their gender.")

## [1] "We fail to reject the null hypothesis and hence conclude that the choice of the major by undergraduate students in this university may be independent of their gender."

The following table gives a classification according to religious affiliation and marital status for 500 randomly selected individuals.

                  Religious affiliation
Marital status    A      B     C     D     N
Single            39     19    12    28    18
Married           172    61    44    70    37

For alpha = 0.05, test the null hypothesis that marital status and religious affiliation are independent.

x=matrix(c(39,19,12,28,18,172,61,44,70,37),nrow=2,byrow=T)  # Observed: O
colnames(x)=c("Religion A","Religion B","Religion C","Religion D","No Religion")
rownames(x)=c("Single","Married")
print(x)

##         Religion A Religion B Religion C Religion D No Religion
## Single          39         19         12         28          18
## Married        172         61         44         70          37

y=rowSums(x)%*%t(colSums(x))/sum(x)                         # Expected: E         
testStatistic=sum((x-y)^2/y)
pValue=pchisq(testStatistic,prod(dim(x)-1),lower.tail = FALSE)
chisq.test(x)

## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 7.1355, df = 4, p-value = 0.1289

print("We fail to reject the null hypothesis that marital status and religious affiliation are independent.")

## [1] "We fail to reject the null hypothesis that marital status and religious affiliation are independent."

The following table gives the opinion on collective bargaining by a random sample of 200 employees of a school system, belonging to a teachers union.

                    Opinion on collective bargaining
Type of employee    For    Against    Undecided
Staff               30     15         15
Faculty             50     10         40
Administration      10     25         5

Test the hypothesis that opinion on collective bargaining is independent of employee classification, using alpha = .05.

x=matrix(c(30,15,15,50,10,40,10,25,5),nrow=3,byrow=T)
Type_of_Employee=c('Staff','Faculty','Administration')
Opinion_on_Collective_Bargaining=c('For','Against','Undecided')
y=rowSums(x)%*%t(colSums(x))/sum(x)
testStatistic=sum((x-y)^2/y)
pValue=pchisq(testStatistic,prod(dim(x)-1),lower.tail = FALSE)
chisq.test(x)

## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 43.861, df = 4, p-value = 6.856e-09

print("We reject the null hypothesis that opinion on collective bargaining is independent of employee classification.")

## [1] "We reject the null hypothesis that opinion on collective bargaining is independent of employee classification."

A survey of footwear preferences of a random sample of 100 undergraduate students (50 females and 50 males) from a large university resulted in the following data.

Gender    Boots    Leather shoes    Sneakers    Sandals    Others
Female    12       9                12          10         7
Male      10       12               17          7          4

Test the hypothesis that the choice of footwear by undergraduate students in this university is independent of their gender, using alpha = .05.

x=matrix(c(12,9,12,10,7,10,12,17,7,4),nrow=2,byrow=T)

accessories=c('Boots','Leather shoes','Sneakers','Sandals','Others')
y=rowSums(x)%*%t(colSums(x))/sum(x)
testStatistic=sum((x-y)^2/y)
pValue=pchisq(testStatistic,prod(dim(x)-1),lower.tail = FALSE)
chisq.test(x)

## 
##  Pearson's Chi-squared test
## 
## data:  x
## X-squared = 2.8201, df = 4, p-value = 0.5884

print("We fail to reject the null hypothesis that the choice of footwear by undergraduate students in this university is independent of their gender.")

## [1] "We fail to reject the null hypothesis that the choice of footwear by undergraduate students in this university is independent of their gender."

Aritra Halder

2/19/2021