You can get more detailed information about a command by typing it in the Console with a question mark in front. For example, ?prop.test.

In general, each command produces an object or a list of objects related to the output for future access. You can save the output of a command by giving it a name. See the first examples for prop.test under the heading Inference for Proportions.

1 Managing Data

1.1 Getting a dataset into R

1.1.1 By entering the data values manually

Sex Height Weight
M 66 153
F 62 124
M 73 210
M 69 160

A dataset in R is call a data frame. We create a data frame df for the above table using the following command.

df = data.frame(Sex = c("M","F","M","M"),
                Height = c(66,62,73,69),
                Weight = c(153,124,210,160))

Show df.

df
##   Sex Height Weight
## 1   M     66    153
## 2   F     62    124
## 3   M     73    210
## 4   M     69    160

1.1.2 By importing .csv file using its URL

read.csv(url(“URL address”))

The dataset ICU below is created by reading the a .csv file when it has an URL. The code book for the dataset can be found HERE

ICU = read.csv(url("http://www.lock5stat.com/datasets/ICUAdmissions.csv"))
head(ICU)
##   ID Status Age Sex Race Service Cancer Renal Infection CPR Systolic
## 1  8      0  27   1    1       0      0     0         1   0      142
## 2 12      0  59   0    1       0      0     0         0   0      112
## 3 14      0  77   0    1       1      0     0         0   0      100
## 4 28      0  54   0    1       0      0     0         1   0      142
## 5 32      0  87   1    1       1      0     0         1   0      110
## 6 38      0  69   0    1       0      0     0         1   0      110
##   HeartRate Previous Type Fracture PO2 PH PCO2 Bicarbonate Creatinine
## 1        88        0    1        0   0  0    0           0          0
## 2        80        1    1        0   0  0    0           0          0
## 3        70        0    0        0   0  0    0           0          0
## 4       103        0    1        1   0  0    0           0          0
## 5       154        1    1        0   0  0    0           0          0
## 6       132        0    1        0   1  0    0           1          0
##   Consciousness
## 1             1
## 2             1
## 3             1
## 4             1
## 5             1
## 6             1

1.1.3 By reading a .csv in the working directory

.csv file extension stands for “comma separated values.” You can create such a file easily by creating a spreadsheet in a program like Excel or Google Sheets, and save it in .csv format.

First, locate your working directory.

getwd()
## [1] "/cloud/project"

After putting your file into the working directory, use read.csv to read it into the dataset ICU.

ICU = read.csv("ICUAdmissions.csv")
head(ICU)
##   ID Status Age Sex Race Service Cancer Renal Infection CPR Systolic
## 1  8      0  27   1    1       0      0     0         1   0      142
## 2 12      0  59   0    1       0      0     0         0   0      112
## 3 14      0  77   0    1       1      0     0         0   0      100
## 4 28      0  54   0    1       0      0     0         1   0      142
## 5 32      0  87   1    1       1      0     0         1   0      110
## 6 38      0  69   0    1       0      0     0         1   0      110
##   HeartRate Previous Type Fracture PO2 PH PCO2 Bicarbonate Creatinine
## 1        88        0    1        0   0  0    0           0          0
## 2        80        1    1        0   0  0    0           0          0
## 3        70        0    0        0   0  0    0           0          0
## 4       103        0    1        1   0  0    0           0          0
## 5       154        1    1        0   0  0    0           0          0
## 6       132        0    1        0   1  0    0           1          0
##   Consciousness
## 1             1
## 2             1
## 3             1
## 4             1
## 5             1
## 6             1

1.2 Creating a Two-Way Table

1.2.1 By Entering Each Cell

In this example, we want to create the two-way table below in R, call it GenderSeat.

  Front Middle Back
Female 37 91 22
Male 15 46 25

WE first create GenderSeat as a matrix by entering the values in the two-way table by rows.

GenderSeat = matrix(c(37, 91,22, 15, 46, 25), ncol = 3, byrow = TRUE)

Then we add column and row headings to the matrix

colnames(GenderSeat) = c("Front", "Middle", "Back")
rownames(GenderSeat) = c("Female", "Male")

Finally, we convert GenderSeat into a table.

GenderSeat = as.table(GenderSeat)

We can display the table.

GenderSeat
##        Front Middle Back
## Female    37     91   22
## Male      15     46   25

1.2.2 By Using Two Categorical Variables in a Dataset

table(row_var, col_var)

The code book for the dataset ICU an be found HERE.

SexRace = table(ICU$Sex,ICU$Race) ## Summarize the variables Sex and Race in the dataset ICU as a two-way table, with Sex in rows and Race in columns.
SexRace
##    
##       1   2   3
##   0 108  10   6
##   1  67   5   4

We can look at the code book to know what the codes for Sex and Race represent, and update the table.

colnames(SexRace) = c("white", "black", "other")
rownames(SexRace) = c("male", "female")
SexRace
##         
##          white black other
##   male     108    10     6
##   female    67     5     4

2 Exploratory Data Analysy

2.1 Summary statistics

2.1.1 One Quantitative Variable

We wil illustrate some R commands using the dataset ICU, the code book for which can be found HERE.

mean(ICU$HeartRate)
## [1] 98.925
sd(ICU$HeartRate)
## [1] 26.82962
median(ICU$HeartRate)
## [1] 96
summary(ICU$HeartRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39.00   80.00   96.00   98.92  118.25  192.00

Other functions include range, iqr, max, etc.

2.1.2 One Quantitative Variable Grouped by a Categorical Variable

The formula relating two variables has the form Response~Explanatory. In this case, since we are using the categorical variable to explain the differences in the quantitative variable, the formula has the form Quantitatve~Categorical.

The following finds the mean heart rate for each group of male and female patients.

mean(HeartRate ~ Sex, data=ICU)
##         0         1 
##  98.09677 100.27632

By referring to the code book, the output means that heart rates of male patients averaged about 98 beats per minute, while that of female patents averaged about 100 bpm.

2.1.3 Two Quantitative Variables

The formula relating two quantitative variables has the form Response~Explanatory.

cor(HeartRate ~ Systolic, data=ICU) ## Correlation coefficient r.
## [1] -0.05658246
cor(HeartRate ~ Systolic, data=ICU)^2 ## Coefficient of determination r^2
## [1] 0.003201575

To find the regression equation, we use lm, which stands for “linear model.”

lm(HeartRate ~ Systolic, data=ICU)
## 
## Call:
## lm(formula = HeartRate ~ Systolic, data = ICU)
## 
## Coefficients:
## (Intercept)     Systolic  
##   105.01907     -0.04607
## Another way to write the command is
##   lm(ICU$HeartRate ~ ICU$Systolic) 

The output means that the regression equation is

HeartRate = 105.01907-0.04607(Systolic) 

2.1.4 Two Catgorical Variables

We will use the following table as example. To see how to create a two-way table, see Section

GenderSeat
##        Front Middle Back
## Female    37     91   22
## Male      15     46   25

Add margins to the table.

addmargins(GenderSeat)
##        Front Middle Back Sum
## Female    37     91   22 150
## Male      15     46   25  86
## Sum       52    137   47 236

Table of percents:

totPercents(GenderSeat,1) ## Total percents. Second argument indicates the number of decimal places.
##        Front Middle Back Total
## Female  15.7   38.6  9.3  63.6
## Male     6.4   19.5 10.6  36.4
## Total   22.0   58.1 19.9 100.0
rowPercents(GenderSeat,0)
##        Front Middle Back Total Count
## Female    25     61   15   101   150
## Male      17     53   29    99    86
colPercents(GenderSeat,2)
##         Front Middle   Back
## Female  71.15  66.42  46.81
## Male    28.85  33.58  53.19
## Total  100.00 100.00 100.00
## Count   52.00 137.00  47.00

2.2 Graphical Displays

2.2.1 Mosaic Plot

Mosaic plots display conditional distributions associated with a two-way table of row/column percents. Let’s use the following table as example

GenderSeat
##        Front Middle Back
## Female    37     91   22
## Male      15     46   25

Below shows a table of row percents and its associated mosaic plot.

rowPercents(GenderSeat,1) 
##        Front Middle Back Total Count
## Female  24.7   60.7 14.7 100.1   150
## Male    17.4   53.5 29.1 100.0    86
mosaicplot(GenderSeat, main = "Conditional Distribution of Seating by Gender",
           xlab = "Seating", ylab = "Gender", dir = c("h","v"))  

Below shows a table of column percents and its associated mosaic plot.

colPercents(GenderSeat,2)
##         Front Middle   Back
## Female  71.15  66.42  46.81
## Male    28.85  33.58  53.19
## Total  100.00 100.00 100.00
## Count   52.00 137.00  47.00
mosaicplot(GenderSeat, main = "Conditional Distribution of Gender by Seating",
           xlab = "Seating", ylab = "Gender", dir = c("h","v"), sort=c(2,1))

3 Inference for Propotion

prop.test(x, n, p = NULL, alternative = c(“two.sided”, “less”, “greater”), conf.level = 0.95, data = NULL, success = NULL, …)

  • Default value of the null hypothesis is p=0.5.
  • Two-sided test is the default.
  • Just ignore the result of hypothesis test if you are looking for CI.
  • Use the default correct = FALSE to match the formula you learned in class.
  • In the output, X-squared is the square of the z-score, which is stored as statistic in the output list.

3.1 Summarized Data

3.1.1 One-Sample Inference

Suppose your data has 14 successes out of 50. The following performs a two-sided HT with null value 0.3.

prop.test(x=14, n=50, p=0.3, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  14 out of 50
## X-squared = 0.095238, df = 1, p-value = 0.7576
## alternative hypothesis: true p is not equal to 0.3
## 95 percent confidence interval:
##  0.1747417 0.4166512
## sample estimates:
##    p 
## 0.28

Save the result of the test:

result = prop.test(x=14, n=50, p=0.3, correct = FALSE)

You can see what values are store in result:

summary(result)
##             Length Class  Mode     
## statistic   1      -none- numeric  
## parameter   1      -none- numeric  
## p.value     1      -none- numeric  
## estimate    1      -none- numeric  
## null.value  1      -none- numeric  
## conf.int    2      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character

Access a stored value by using $ after object name, followed by the name of the value:

result$statistic
## X-squared 
## 0.0952381

To get the absolute value of z-score:

z=sqrt(result$statistic); names(z)="|Z-score|"; z  
## |Z-score| 
## 0.3086067

You have to choose the sign based on whether your sample estimte is above or below the null value. In this example, since p-hat (0.28) is less than null value (0.3), the z-score is really negative (z=-0.3086067).

You can get the confidence interval without the output of hypothesis test by using $conf.int. For CI, the null value for HT is irrelevant, so you can just leave it out.

prop.test(x=14, n=50, conf.level=0.99, correct = FALSE)$conf.int
## [1] 0.1499462 0.4616007
## attr(,"conf.level")
## [1] 0.99

The following performs a right-tail HT with the default null value of 0.5 by omitting it in the arguments of prop.test.

prop.test(x=14, n=50, alternative="great", correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  14 out of 50
## X-squared = 9.68, df = 1, p-value = 0.9991
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.1889395 1.0000000
## sample estimates:
##    p 
## 0.28

3.1.2 Two-Sample Inference

The following tests the alternative hypothesis that p1-p2 is greater than zero, with p-hat1 = 14/50 and p-hat2 = 30/100.

prop.test(x=c(14, 30), n=c(50, 100), alternative = "greater", correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  c(14, 30) out of c(50, 100)
## X-squared = 0.064322, df = 1, p-value = 0.6001
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.1488037  1.0000000
## sample estimates:
## prop 1 prop 2 
##   0.28   0.30

3.2 Raw Data

3.2.1 One-Sample Inference

The code book for the data loaded below can be found HERE.

The data is loaded directly from an URL into the data frame df. df$Sex refers to the variable sex. In the code book, males are coded as “0”. The argument success is used to identified which the sample poportion is studied. Thus, the example below is about proportion of males in the dataset, using null value of 50%.

prop.test(ICU$Sex, p=0.5, success = "0") ## For some reason, the argument "correction = FALSE" doesn't work when raw data is used.
## 
##  1-sample proportions test with continuity correction
## 
## data:  ICU$Sex  [with success = 0]
## X-squared = 11.045, df = 1, p-value = 0.0008893
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5485320 0.6867787
## sample estimates:
##    p 
## 0.62

3.2.2 Two-Sample Inference

The following tests to see if the proportion of Black patients among males is different than that among females. See code book.

prop.test(Race ~ Sex, data = ICU, success = "2")
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  tally(Race ~ Sex)
## X-squared = 0.012236, df = 1, p-value = 0.9119
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.06926372  0.09897510
## sample estimates:
##     prop 1     prop 2 
## 0.08064516 0.06578947

The problem with the above is that the output does not identify clearly which sample estimates is male or female. We can get around this problem by creating a table with Sex and Race and look at the conditional distribution of Race by Sex.

RaceSex = table(ICU$Race, ICU$Sex) ## Sex are in columns.
colPercents(RaceSex)
##        
##             0     1
##   1      87.1  88.2
##   2       8.1   6.6
##   3       4.8   5.3
##   Total 100.0 100.1
##   Count 124.0  76.0

We can see that, for the sample esimate in the output of prop.test, prop 1 with 0.0806 refers to proportion of Blacks among male (code 0) patients, matching the 8.1% in the table of column percents.

4 Inference for Mean

4.1 Summarized Data

tsum.test(mean.x, s.x = NULL, n.x = NULL, mean.y = NULL, s.y = NULL, n.y = NULL, alternative = “two.sided”, mu = 0, var.equal = FALSE, conf.level = 0.95)

4.1.1 One-Sample Inference

The following example matches the example in OLI on p186, where the sample mean is 13.5, sample sd is 3.2, and sample size is 45. The alternative hypothesis is that the population mean is greater than 12.

tsum.test(13.5, 3.2, 45, alternative = "greater", mu = 12)
## Warning in tsum.test(13.5, 3.2, 45, alternative = "greater", mu = 12):
## argument 'var.equal' ignored for one-sample test.
## 
##  One-sample t-Test
## 
## data:  Summarized x
## t = 3.1445, df = 44, p-value = 0.001489
## alternative hypothesis: true mean is greater than 12
## 95 percent confidence interval:
##  12.69848       NA
## sample estimates:
## mean of x 
##      13.5

4.1.2 Two-Sample Inference

The following example matches the example in OLI on p219 having the following sample statistics.

Age sample mean sample sd sample size
20-29 83.4 18.7 712
75+ 78.5 19.0 1001

We test with the alterntive hypothesis that younger men weigh on average more than older men.

tsum.test(83.4, 18.7, 712, 78.5, 19, 1001, alternative = "greater")
## 
##  Welch Modified Two-Sample t-Test
## 
## data:  Summarized x and y
## t = 5.3092, df = 1545.9, p-value = 6.305e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  3.381025       NA
## sample estimates:
## mean of x mean of y 
##      83.4      78.5

4.2 Raw Data

t.test(x, y = NULL, alternative = c(“two.sided”, “less”, “greater”), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, …)

t.test(formula, data, subset, na.action, …)

4.2.1 One-Sample Inference

\(H_0:\mu=80\)

\(H_a:\mu>80\)

t.test(ICU$HeartRate, alternative="greater", mu=80)
## 
##  One Sample t-test
## 
## data:  ICU$HeartRate
## t = 9.9755, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 80
## 95 percent confidence interval:
##  95.78989      Inf
## sample estimates:
## mean of x 
##    98.925

To find a 90% confidence interval, do a two sided t-test, and you can find the CI in the output. It won’t matter what the null value is, so you can skip it, which actually uses null value of zero.

t.test(ICU$HeartRate, conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  ICU$HeartRate
## t = 52.144, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##   95.78989 102.06011
## sample estimates:
## mean of x 
##    98.925

4.2.2 Two-Sample Inference

\(H_0:\mu_{male}=\mu_{female}\)

\(H_a:\mu_{male}\ne\mu_{female}\)

t.test(HeartRate ~ Sex, data=ICU)
## 
##  Welch Two Sample t-test
## 
## data:  HeartRate by Sex
## t = -0.55481, df = 157.04, p-value = 0.5798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.938963  5.579880
## sample estimates:
## mean in group 0 mean in group 1 
##        98.09677       100.27632

The problem with the above is that you have no control over which sex is the first or second group. To get around that, you can explicitly choose as in the following example:

\(H_0:\mu_{female}=\mu_{male}\)

\(H_a:\mu_{female}\gt\mu_{male}\)

The code book shows that 1 represents female and 0 represents male.

t.test(ICU$HeartRate[ICU$Sex == "1"], ICU$HeartRate[ICU$Sex == "0"], alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  ICU$HeartRate[ICU$Sex == "1"] and ICU$HeartRate[ICU$Sex == "0"]
## t = 0.55481, df = 157.04, p-value = 0.2899
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -4.320533       Inf
## sample estimates:
## mean of x mean of y 
## 100.27632  98.09677

4.2.3 Matched Pairs

The data for each matched pair is in the same row.

The code book for the data loaded below can be found HERE.

chimp=read.csv(url("http://www.chabotcollege.edu/faculty/mho/MTH43/data/ChimpanzeeCollaborate.csv"))
t.test(chimp$ColYes, chimp$ColNo, alternative="greater", paired = TRUE) ## The difference is the first argument minus the second argument.
## 
##  Paired t-test
## 
## data:  chimp$ColYes and chimp$ColNo
## t = 7.3703, df = 7, p-value = 7.663e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  9.193915      Inf
## sample estimates:
## mean of the differences 
##                  12.375

5 Inference using Chi Square

5.1 Goodness-of-Fit Test

The folowing example peforms the goodness-of-fit test on the second Learning by Doing titled Hypothesis Test about the Color Distribution for Plain M&Ms on p202 of OLI.

chisq.test(c(38,32,51,58,74,47), p=c(0.13,0.13,0.14,0.24,0.20,0.16))
## 
##  Chi-squared test for given probabilities
## 
## data:  c(38, 32, 51, 58, 74, 47)
## X-squared = 9.2203, df = 5, p-value = 0.1006

5.2 Test of Independence

5.2.1 Using Two-Way Table

First, get your two-way table into R. We will use the table GenderSeat that was created in the Section 1.2.

A table summary automatically performs a chi-square test

summary(GenderSeat)
## Number of cases in table: 236 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 7.474, df = 2, p-value = 0.02383

However, if you want to know the related information such as the chi-square contribution of each cell, you would nee to perform a chi-square test on the table and save the result for processing.

result = chisq.test(GenderSeat)

You can see the result of chisq.test is similar to applying summary to a table.

result
## 
##  Pearson's Chi-squared test
## 
## data:  GenderSeat
## X-squared = 7.4739, df = 2, p-value = 0.02383

The following shows the expected counts for each cell

result$expected
##           Front   Middle     Back
## Female 33.05085 87.07627 29.87288
## Male   18.94915 49.92373 17.12712

The following shows the standardized residuals for each cell, \(\frac{O-E}{\sqrt{E}}\).

result$residuals
##             Front     Middle       Back
## Female  0.6869302  0.4204836 -1.4404399
## Male   -0.9072125 -0.5553228  1.9023550

The following shows the chi square contribution for each cell, \(\frac{(O-E)^2}{E}\).

result$residuals^2
##            Front    Middle      Back
## Female 0.4718731 0.1768065 2.0748672
## Male   0.8230345 0.3083834 3.6189544

5.2.2 Using Two Categorical Variables

In the datase ICU, we are testing if the variables Sex and Race are independent.

result = chisq.test(ICU$Sex, ICU$Race)
## Warning in chisq.test(ICU$Sex, ICU$Race): Chi-squared approximation may be
## incorrect
result
## 
##  Pearson's Chi-squared test
## 
## data:  ICU$Sex and ICU$Race
## X-squared = 0.16169, df = 2, p-value = 0.9223

The following shows the expected counts for each cell

result$expected
##        ICU$Race
## ICU$Sex     1   2   3
##       0 108.5 9.3 6.2
##       1  66.5 5.7 3.8

The following shows the standardized residuals for each cell, \(\frac{O-E}{\sqrt{E}}\).

result$residuals
##        ICU$Race
## ICU$Sex           1           2           3
##       0 -0.04800154  0.22953904 -0.08032193
##       1  0.06131393 -0.29319774  0.10259784

The following shows the chi square contribution for each cell, \(\frac{(O-E)^2}{E}\).

result$residuals^2
##        ICU$Race
## ICU$Sex           1           2           3
##       0 0.002304147 0.052688172 0.006451613
##       1 0.003759398 0.085964912 0.010526316

6 Code Books for datasets.

Click on the name of the dataset below to see its code book. * chimp * ICU